Design and maintain GitHub Actions reusable workflows across a multi-repository ecosystem
Own GitOps deployments through ArgoCD, including promotion workflows, sync policies, drift detection, and automated rollback strategies
Implement deployment safety mechanisms such as environment protections, concurrency rules, and verification gates
Operate and upgrade EKS clusters, including Karpenter provisioning, node groups, and critical cluster add-ons
Maintain Terraform-driven infrastructure and enforce PR-driven workflows through Atlantis
Define and maintain SLOs, SLIs, alerting rules, and monitoring dashboards across platform services
Lead incident response, coordinate recovery efforts, and execute structured post-incident reviews
Participate in an on-call rotation and contribute to improving operational processes
Operate and maintain HashiCorp Vault, including policies, authentication backends, and secret engines
Implement supply-chain security controls, including Trivy scanning, Cosign signing, SBOM generation, and OPA/Gatekeeper enforcement
Partner with Security Engineering on network policies, egress controls, and compliance standards
Automate repetitive tasks and proactively maintain runbooks to reduce operational risk
Use AI tools to improve infrastructure automation, documentation, and deployment safety validation
Collaborate with product teams to strengthen SLOs and deployment safety practices
Challenge technical assumptions and advocate for scalable, secure DevOps architectures
Requirements
Proven ownership of production-grade CI/CD pipelines using GitHub Actions reusable workflows and GitOps automation with ArgoCD
Expert-level Kubernetes and EKS operations, including node group management, Karpenter autoscaling, RBAC, PDBs, and topology constraints
Production-scale Terraform expertise, including module design, S3 + DynamoDB remote state, and PR-driven workflows via Atlantis
Strong reliability engineering experience, including SLO/SLI design, alerting strategies, dashboards, incident response, and post-incident reviews
Hands-on experience operating HashiCorp Vault, including auth backends, PKI, dynamic secrets, and audit logging
Experience implementing supply-chain security controls, including image scanning and signing, SBOM generation, and policy enforcement with OPA/Gatekeeper
Strong experience with observability stacks, including Prometheus, Grafana, Loki, Tempo, and Alertmanager
Experience with service mesh technologies such as Istio, including traffic management, mTLS, AuthorizationPolicies, and circuit breaking
Scripting ability using Python and Bash for automation and operational tooling
Active use of AI-assisted engineering tools such as Cursor, GitHub Copilot, or Claude Code to accelerate IaC development, incident response, and runbook generation
Strong communication skills, including the ability to brief VP-level stakeholders clearly and confidently during operational incidents
Advanced English proficiency, as you will work directly with US-based clients