AWS, Grafana, Kubernetes, Terraform, AI, GitHub Actions, Pulumi, EKS, IAM, GitHub, CI/CD, Communication, Remote Work
About this role
Role Overview
Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives.
Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows.
Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts.
Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design.
Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation.
Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems.
Participate in the on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and to support service onboarding.
Contribute to runbooks, architecture documentation, and team processes.
Requirements
7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management.
Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi.
Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls.
Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early.
Excellent async written communication skills; comfortable working with a geographically distributed team.
Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency.
Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes.
Tech Stack
AWS
Grafana
Kubernetes
Terraform
Benefits
Fully remote work & schedule flexibility
Company-provided laptop
Annual bonus program
Monthly remote work stipend
Annual professional development stipend
Industry conferences
Company all-hands and team gatherings
24 days PTO per year (prorated)
Your birthday off
Year-end company shutdown
9 wellbeing days
Public holidays
Other paid leave
Quarterly wellbeing stipend for personal / family activities