Design, build, and maintain scalable, secure, and highly available infrastructure on AWS (EKS, EC2, RDS / Aurora Postgres, MSK, S3, VPC, IAM).
Manage and optimize Kubernetes clusters (EKS) across multiple environments, and deploy applications using Argo CD with GitOps best practices.
Implement and maintain CI/CD pipelines using GitHub Actions, including reusable workflows, build/push/scan flows for ECR, and frontend deployment pipelines.
Operate and tune Kafka-based event streaming on Amazon MSK for high-throughput, low-latency device data pipelines.
Define and manage Infrastructure as Code with Terraform and Terragrunt, with reusable modules, sensible environment separation, and review-friendly plans.
Manage identity and access across platforms with Auth0 / EntraID integrations, IAM roles for service accounts (IRSA), and short-lived credentials.
Build and maintain observability with Grafana, Loki, Prometheus / Mimir, and related tooling so on-call engineers can quickly find and fix issues.
Monitor and optimize infrastructure cost across environments, partnering with engineering teams on right-sizing, capacity planning, and waste reduction.
Partner with our Cloud Security team to enforce security standards, integrate with SIEM tooling, and respond to vulnerabilities and incidents.
Debug complex production issues across infrastructure, deployment, and networking layers, and turn the lessons learned into automation and runbooks.
Requirements
5+ years in DevOps, SRE, or Platform Engineering with production experience operating AWS infrastructure.
Deep hands-on experience administering Kubernetes (EKS or equivalent) and deploying via GitOps (Argo CD or Flux).
Proficiency with Infrastructure as Code using Terraform; comfort with Terragrunt or a similar wrapper.
Hands-on experience designing and maintaining CI/CD pipelines, preferably with GitHub Actions and reusable workflows.
Production experience operating distributed systems such as Kafka (MSK).
Strong understanding of networking, DNS, TLS, and security best practices, including IdP-driven access control (Auth0, EntraID, or similar).
Solid experience with monitoring and logging stacks such as Grafana, Loki, Prometheus, Mimir, or equivalents.
Ability to debug complex production issues across infrastructure, deployment, and networking layers.
Comfort working in Linux environments and strong scripting skills for automation (Python or Bash preferred).
Knowledge of version control workflows, automated testing, and release management.
Tech Stack
AWS
Cloud
Distributed Systems
DNS
EC2
Flux
Grafana
Kafka
Kubernetes
Linux
Postgres
Prometheus
Python
Terraform
Benefits
Health, Dental & Vision (Gold and Platinum, with some providers' plans fully covered)
Paid parental leave
Alternating days off (every other Monday)
“Off the Grid”, a two-week paid break each year for all employees