Klaviyo is a company that empowers creators to own their destiny by making first-party data accessible and actionable. They are seeking a Principal Software Engineer to lead enterprise scalability initiatives, focusing on performance, reliability, and large-tenant readiness while driving architectural changes and optimizing systems across teams.

Responsibilities:

Define enterprise scalability fitness functions (latency/throughput/error rates) and a scorecard; align teams to SLOs and budgets
Design and implement sharding and partitioning strategies, caching and back‑pressure mechanisms, multi‑region readiness, and high‑volume migration paths
Build lightweight enablement: benchmarks, profiling harnesses, reproducible testbeds; pair with teams to land fixes
Lead scalability reviews and readiness gates that accelerate—not block—delivery; drive incident deep dives tied to systemic fixes
Communicate clearly to execs and engineers, tying technical work to business impact and customer outcomes
Integrate AI into scale and resiliency work—from proactive anomaly detection to synthetic load and guided runbooks—so performance improvements stick and incidents don’t repeat

Requirements:

Experience: 12+ years scaling multi‑tenant SaaS with a reputation for removing major bottlenecks and proving impact with data
Technical expertise: Performance engineering, capacity planning, sharding/partitioning, caching/back‑pressure, multi‑region readiness, and high‑volume migrations; you turn hotspots into robust patterns
AI tools & automation: You apply AI to scalability work (profiling assistance, workload modeling, synthetic traffic generation, anomaly detection, and runbook copilots), always with explicit guardrails and observability
Cross‑org influence: You align teams through fitness functions, scorecards, and readiness gates that accelerate—not block—delivery; you communicate tradeoffs crisply to execs and engineers
AI fluency: Curious, adaptable, and proactive in exploring AI that responsibly improves scale outcomes
Scale scorecard: Company‑wide fitness functions (latency/throughput/error rates) are adopted and reviewed regularly
High‑impact wins: 2–3 bottlenecks removed with documented, reproducible testbeds; pXX latencies and error rates improve on top enterprise workloads; repeat P0s trend down
AI‑assisted scale engineering: AI‑driven anomaly detection reduces alert noise while improving signal; generative load testing and copilot runbooks are used in release/readiness checks for the top critical services; time‑to‑isolate regressions drops 20–30%

Principal Software Engineer, Enterprise Scalability
