Design, build, and maintain multi-region AWS infrastructure using Terraform.
Operate and scale EKS clusters across production regions: autoscaling, node lifecycle, and workload health.
Manage networking across environments: VPC design, DNS, load balancing, and cross-region connectivity.
Support infrastructure changes, migrations, and expansions into new regions.
Help build and run incident management processes: severity definitions, escalation paths, and on-call practices.
Lead incident response, debugging, and root-cause analysis.
Write postmortems and drive the systemic reliability improvements they surface.
Improve observability across metrics, logging, tracing, and dashboards.
Provide security-conscious feedback on platform architecture decisions.
Own cloud IAM governance: roles, policies, and access boundaries across accounts and services.
Improve CI/CD pipelines and infrastructure validation.
Support engineers with infrastructure debugging, environment setup, and performance issues.
Contribute to tooling and automation in Python and Bash.
Requirements
5+ years in SRE, DevOps, or infrastructure engineering roles, with a track record of operating production systems across multiple regions.
Terraform experience: Modules, state management, and multi-environment patterns.
AWS depth: Solid experience across VPC, IAM, EKS, S3, and CloudWatch.
Kubernetes expertise: Cluster operations, autoscaling, RBAC, and Helm.
CI/CD and GitOps: Experience with GitHub Actions, ArgoCD, or similar workflows.
Networking fundamentals: CIDR, DNS, load balancing, VPN, and cross-region connectivity.
Observability: Experience with tooling such as Prometheus and Grafana.
Scripting: Comfort with Python and Bash for tooling and automation.
Cross-platform familiarity: Working knowledge of both Linux and Windows environments. Operational experience supporting Windows-based workloads is a meaningful advantage.
Pragmatism and ownership: Comfortable in a fast-moving startup with evolving priorities. You take ownership of systems while collaborating closely with other teams, and you're pragmatic about tradeoffs between speed, reliability, and complexity.