Design, architect, and implement scalable, secure, and cost-efficient AWS infrastructure across multiple regions, aligned with the AWS Well-Architected Framework.
Build, maintain, and evolve all cloud infrastructure using Terraform, enforcing module reusability, remote state management, and IaC best practices across environments.
Own and optimize PostgreSQL databases, including performance, scalability, and reliability.
Deploy, manage, and optimize workloads on Kubernetes clusters using Helm and Kustomize.
Design, implement, and maintain CI/CD pipelines using GitHub Actions.
Lead DR planning, runbook creation, and failure scenario modeling — including database backup and recovery strategies.
Support and unblock engineering teams on infrastructure and database needs.
Operate and improve monitoring, logging, and alerting systems — including database-specific monitoring (query performance, replication health, connection saturation) — to ensure high availability and fast incident response.
Participate in and help improve the weekly on-call rotation.
Requirements
6–8+ years of professional experience in Cloud Engineering, DevOps, or SRE roles, with a proven track record operating highly scalable, high-availability systems in production.
Deep, hands-on experience with AWS core services (EKS, ECS, EC2, VPC, IAM, RDS, Amazon Aurora, S3, Route 53, CloudFront, ALB/NLB, etc.) in real production workloads.
Expert-level proficiency with Terraform, including module design, remote state management, and multi-environment/multi-region setups.
Strong PostgreSQL expertise in production, including query and index performance tuning, sharding strategies (e.g., application-level sharding or partitioning), replication setup and management (streaming, logical), connection pooling (PgBouncer), vacuum tuning, and planning/executing major version upgrades with minimal downtime.
Experience managing large-scale PostgreSQL databases (hundreds of GBs to TBs) under high-traffic workloads, with a solid understanding of how schema design, indexing, and partitioning decisions affect performance at scale.
Strong production experience operating and optimizing Kubernetes clusters (deployments, scaling, RBAC, networking, security policies, cluster upgrades).
Proven experience designing and maintaining CI/CD pipelines using GitHub Actions.
Solid experience with GitOps principles and tools; hands-on experience with ArgoCD is strongly preferred.
Strong understanding of networking fundamentals (DNS, VPC peering, Transit Gateway, VPN, load balancing) and cloud security best practices.
Experience with logging, monitoring, and alerting stacks (e.g., ELK, EFK, LGTM, CloudWatch) across multiple environments, including database-specific monitoring.
Proficiency in Bash and Python for automation and tooling.
Strong Git workflow knowledge, including branching strategies and code review practices.
Experience designing and implementing multi-region architectures with failover and DR strategies.