Own and evolve our AWS-based infrastructure, improving platform performance and availability today, and building toward deployable configurations that support enterprise customer environments tomorrow.
Own EKS cluster operations across production regions: node pool strategy, AMI lifecycle, autoscaling, and Kubernetes workload health.
Support the GitOps deployment pipeline: define, deploy, and manage applications across clusters using infrastructure-as-code.
Lead infrastructure deprecation and migration efforts with minimal disruption.
Own SLO measurement infrastructure; enable proactive triage of emerging issues before they impact customers.
Lead incident investigation, root cause analysis, and postmortems, driving systemic fixes rather than one-off patches.
Design and improve automated remediation systems to reduce MTTR.
Review and provide security-conscious feedback on platform architecture decisions.
Own cloud IAM governance: roles, policies, and access boundaries across accounts and services.
Lead compliance-adjacent work, including audit readiness, partner certification requirements, and responses to customer security questionnaires.
Partner with application development teams to build an inherently secure platform and drive next-generation deployment architecture.
Partner with customer teams to ensure availability for expected utilization.
Partner with Finance on cloud cost optimization: lifecycle policies, right-sizing, and spend visibility.
Support GPU and batch workloads in collaboration with simulation and ML engineering teams.
Improve CI/CD pipelines and automated infrastructure validation.
Support engineering teams with infra-side debugging, log analysis, and environment configuration.
Requirements
5+ years in SRE, DevOps, or infrastructure engineering roles.
Infrastructure-as-code proficiency: Terraform modules, state management, and multi-environment patterns.
Kubernetes: cluster operations, node pools, probes, cordoning, pod scheduling, RBAC, Helm, node autoscaling (Karpenter experience a plus); solid understanding of containerization and AMI lifecycle management.
CI/CD: experience with GitOps workflows and pipeline tooling (ArgoCD, GitHub Actions, Jenkins).