Klaviyo is a company that empowers creators to own their destiny by making first-party data accessible and actionable. As the Senior Manager of Production Infrastructure, you will lead teams responsible for compute runtimes, service networking, and observability, ensuring product engineers have a stable and efficient foundation to work from.
Responsibilities:
- Own and evolve platform primitives in scope (compute runtimes, service networking/ingress, observability) with clear APIs, SLOs, runbooks, and support tiers
- Lead by example technically: drive design reviews, review PRs, and author reference implementations, starter repos, and Terraform/Helm modules that demonstrate the golden path
- Deliver golden paths and self‑service scaffolding; reduce time‑to‑first‑service and lead time for changes
- Raise the bar on reliability: blameless incident response, alert hygiene, capacity planning, and on‑call health
- Be production‑close: participate in critical incident response and postmortems; trace issues across Kubernetes, service mesh, and data paths; convert learnings into durable fixes, guardrails, and policy‑as‑code
- Standardize observability end‑to‑end: expand OpenTelemetry adoption, define log/trace schemas, and make SLOs and error budgets first‑class in dashboards and alerts
- Evolve our Kubernetes and networking layers: plan cluster upgrades, right‑size node/Pod configs, harden ingress/gateway policies, and advance mTLS/service identity and traffic shaping
- Advance CI/CD and GitOps: ensure fast, safe deploys with progressive delivery, automatic rollbacks, and pre‑prod environments that mirror prod; enforce guardrails via policy‑as‑code
- Stand up a concise scorecard (SLO coverage, incident frequency/severity, lead time, MTTR, developer platform NPS, cost‑to‑serve) and drive consistent trend improvements
- Partner with Security, Data Platform, and Product to clarify ownership boundaries and enable safe, fast delivery
- Improve cost‑to‑serve via quotas, right‑sizing, and showback in partnership with Finance
- Transform workflows by putting AI at the center, building smarter systems and ways of working from the ground up; pilot AI‑assisted runbooks and incident summarization to shorten resolution time
Requirements:
- 7–10+ years in infrastructure, SRE, or platform engineering, with 3–5+ years leading teams (including managers or staff/lead ICs)
- Demonstrated SRE practices (SLI/SLO design, incident management, capacity planning) and experience with Kubernetes/container orchestration, service networking, IaC, and modern observability
- Technically credible and hands‑on: comfortable reading and discussing code (e.g., Go, Python, or Java), reviewing PRs, and writing small prototypes/tooling when it accelerates the team
- Fluent with Kubernetes internals (scheduling, autoscaling, resource management) and service networking (e.g., Envoy/Istio/Linkerd, API gateways)
- You operate the full observability stack (metrics, logs, traces, profiling) and instrument SLIs/SLOs using OpenTelemetry‑friendly patterns
- Automate by default: Terraform (or Pulumi), Helm/Kustomize, GitOps, CI/CD; you prefer guardrails and policy‑as‑code over manual gates
- You write crisp docs/diagrams and define platform contracts that hold up under scale
- You drive measurable developer velocity and reliability improvements and communicate progress with clarity
- You build inclusive, high‑trust teams and partner tightly across Security/Product/Finance
- You've already experimented with AI in work or personal projects and are eager to deepen your fluency responsibly
- Experience with platforms "as a product" (DX metrics, roadmaps), event‑driven architectures, and cost‑to‑serve optimization in high‑growth SaaS
- Experience contributing to platform code or tooling (e.g., base images, CLI/scaffolding, controllers/operators, admission/policy), multi‑cluster or multi‑region operations, and progressive delivery