Own and evolve our multi-tenant infrastructure on Kubernetes, including dedicated clusters per customer and the full tenant lifecycle (provisioning, scaling, migration, and offboarding).
Make our GitOps deployments faster and safer by improving the ArgoCD- and Helm-based pipeline that deploys 1,000+ applications across hundreds of tenants.
Replace manual infrastructure runbooks (for example, Kubernetes upgrades, Private Link setups, DR drills, cluster onboarding) with reliable automation using Infrastructure-as-Code and workflow engines.
Strengthen observability and efficiency by improving our logging, metrics, and alerting stack and using it to drive better reliability, visibility, and meaningful cloud cost reduction.
Lead customer-facing infrastructure work and incidents end to end, and turn what you learn into clear runbooks, dashboards, and Claude Skills that help both humans and AI agents operate the platform.
Requirements
7+ years in platform engineering, infrastructure, SRE, or backend systems at a SaaS company, with high ownership, strong written/async communication, and enthusiasm for AI-native development tools.
Deep hands-on experience operating Kubernetes in production: managing clusters, upgrades, networking and RBAC, and multi-tenant concerns, not just deploying apps.
Strong GitOps and Helm experience (ArgoCD or similar) at meaningful scale, including handling sync failures, drift, and chart complexity, and improving deployment safety and speed.
Production-quality infrastructure automation skills in Go or Python; familiarity with TypeScript is a plus.
Solid cloud and Infrastructure-as-Code foundation: deep experience with at least one major cloud (AWS, GCP, or Azure), plus a track record of designing, writing, and reviewing substantial Terraform or Crossplane modules.
Comfort debugging end to end across GitOps pipelines, Kubernetes, and cloud provider layers when deployments or tenants are stuck.
Experience with multi-tenant SaaS infrastructure, observability stacks (logs, metrics, traces, dashboards), and practical cloud cost optimisation (for example, autoscaling, instance strategy, or savings mechanisms). Exposure to workflow engines such as Temporal, or to internal self-service / developer platform tooling, is ideal.
Tech Stack
AWS
Azure
Cloud
Google Cloud Platform
Kubernetes
Python
Terraform
TypeScript
Go
Benefits
Competitive Compensation: We benchmark at the top of the market and keep compensation simple: strong base salary, performance‑based variable pay, and impact‑driven equity (for most roles), so your total rewards grow in step with the value you create over time.
AI Native Culture: Atlan is where AI-native builders come to build the systems the future of work will run on. AI isn’t an add-on; it’s woven into how we build, think, and work every day, empowering every Atlanian to move faster and create a bigger impact.
Health & Wellness: From Day‑1 health, dental, vision, and mental health to flexible health stipends, we design benefits offerings that lead in each country we're in.
Flexible Time Off & Leave Policies: We trust you to own your energy: flexible time off and modern leave policies so you can unplug properly, support yourself and your loved ones, and come back ready to make an impact.
Accelerated Growth & Learning: Develop at an uncommon velocity through cutting-edge tech, complex implementations, and an experienced team that values mastery.
Global, Remote-First, High-Trust: Work from anywhere with a diverse team across 15+ countries, in a trust-first, async environment that gives you true flexibility and ownership over how you work.