Elios Talent is seeking a Senior DevOps / SRE Engineer to ensure platform reliability and manage CI/CD pipelines in a high-scale production environment. The role involves close collaboration with engineering teams to build and maintain cloud-native infrastructure, focusing on Kubernetes and reliability engineering best practices.
Responsibilities:
- Design, build, and maintain CI/CD pipelines using reusable GitHub Actions workflows
- Own GitOps workflows using ArgoCD, managing application promotion across environments
- Operate and upgrade Kubernetes clusters (EKS), including node groups, autoscaling, and cluster add-ons
- Manage infrastructure as code using Terraform, including PR-driven workflows and state management
- Define and maintain SLOs, alerting strategies, and observability dashboards across platform services
- Operate and maintain secrets management systems (HashiCorp Vault), including access policies and authentication methods
- Implement supply chain security controls, including image scanning, signing, SBOM generation, and policy enforcement
- Partner with security teams on network policies, egress controls, and compliance requirements
- Participate in on-call rotations and lead incident response and post-incident reviews
Requirements:
- 6+ years of experience in DevOps, SRE, Platform Engineering, or Production Operations
- Strong experience managing CI/CD pipelines, GitOps workflows, and Kubernetes in production environments
- Experience operating and scaling Kubernetes clusters (EKS preferred)
- Expertise in infrastructure as code (Terraform), including state management and automated deployment workflows
- Proven experience implementing observability and reliability practices (SLOs, alerting, dashboards, incident response)
- Experience with secrets management systems such as HashiCorp Vault
- Strong collaboration skills and the ability to support multiple engineering teams
Technical Skills:
- Kubernetes (cluster operations, autoscaling, RBAC, workload isolation, upgrades)
- GitOps (ArgoCD configuration, sync policies, rollback strategies)
- CI/CD (GitHub Actions, reusable workflows, deployment gates, secrets management)
- Terraform (modular design, state management, Atlantis workflows)
- Observability tools (Prometheus, Grafana, Loki, Tempo, Alertmanager)
- Service mesh (Istio, mTLS, traffic management, authorization policies)
- Autoscaling and provisioning (KEDA, Karpenter)
- Secrets management (HashiCorp Vault)
- Container and supply chain security (Trivy, Cosign, SBOMs, OPA/Gatekeeper)
- Scripting and automation (Python, Bash)
AI-Assisted Engineering:
- Experience leveraging AI tools to accelerate infrastructure development, CI/CD workflows, and operational processes
- Familiarity with AI-assisted incident response, log analysis, and runbook generation
- Ability to integrate AI-driven quality and security checks into delivery pipelines
Working Style:
- Strong ownership mindset over reliability, scalability, and system performance
- Focus on automation and eliminating manual operational work
- Ability to proactively identify and address reliability risks
- Clear and structured communication during incidents and operational events