Klaviyo empowers creators to own their destiny by making first-party data accessible and actionable. The company is seeking a Principal Software Engineer to lead enterprise scalability initiatives. The role focuses on performance, reliability, and large-tenant readiness, and it drives architectural changes and system optimization across teams.
Responsibilities:
- Define enterprise scalability fitness functions (latency/throughput/error rates) and a scorecard; align teams to SLOs and budgets
- Design/implement sharding and partitioning strategies, caching/back‑pressure, multi‑region readiness, and high‑volume migration paths
- Build lightweight enablement: benchmarks, profiling harnesses, reproducible testbeds; pair with teams to land fixes
- Lead scalability reviews and readiness gates that accelerate—not block—delivery; drive incident deep dives tied to systemic fixes
- Communicate clearly to execs and engineers, tying technical work to business impact and customer outcomes
- Integrate AI into scale and resiliency work—from proactive anomaly detection to synthetic load and guided runbooks—so performance improvements stick and incidents don’t repeat
Requirements:
- Experience: 12+ years scaling multi‑tenant SaaS with a reputation for removing major bottlenecks and proving impact with data
- Technical expertise: Performance engineering, capacity planning, sharding/partitioning, caching/back‑pressure, multi‑region readiness, and high‑volume migrations; you turn hotspots into robust patterns
- AI tools & automation: You apply AI to scalability work (profiling assistance, workload modeling, synthetic traffic generation, anomaly detection, and runbook copilots), always with explicit guardrails and observability
- Cross‑org influence: You align teams through fitness functions, scorecards, and readiness gates that accelerate—not block—delivery; you communicate tradeoffs crisply to execs and engineers
- AI fluency: Curious, adaptable, and proactive in exploring AI that responsibly improves scale outcomes
Success measures:
- Scale scorecard: Company‑wide fitness functions (latency/throughput/error rates) are adopted and reviewed regularly
- High‑impact wins: 2–3 bottlenecks removed with documented, reproducible testbeds; pXX latencies and error rates improve on top enterprise workloads; repeat P0s trend down
- AI‑assisted scale engineering: AI‑driven anomaly detection reduces alert noise while improving signal; generative load testing and copilot runbooks are used in release/readiness checks for the top critical services; time‑to‑isolate regressions drops 20–30%
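
For candidates who want a concrete picture of the fitness functions and scorecard mentioned above, here is a minimal sketch of one possible shape. It is an illustration only, not Klaviyo's implementation: the `Slo` fields, the `score_service` helper, and every threshold are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical per-service targets; the values are placeholders."""
    p99_latency_ms: float
    max_error_rate: float
    min_throughput_rps: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def score_service(latencies_ms, error_count, window_seconds, slo):
    """Evaluate one service against its SLO.

    Returns a dict mapping each fitness function to True (pass) or False (fail).
    """
    total = len(latencies_ms)
    p99 = percentile(latencies_ms, 99)
    error_rate = error_count / total
    throughput = total / window_seconds
    return {
        "latency": p99 <= slo.p99_latency_ms,
        "errors": error_rate <= slo.max_error_rate,
        "throughput": throughput >= slo.min_throughput_rps,
    }


if __name__ == "__main__":
    # Assumed sample: 100 requests observed over a 10-second window.
    sample = [10.0] * 99 + [500.0]
    slo = Slo(p99_latency_ms=200.0, max_error_rate=0.02, min_throughput_rps=5.0)
    print(score_service(sample, error_count=1, window_seconds=10.0, slo=slo))
```

A scorecard in this style is one such row per service, reviewed on a regular cadence. Keeping each check a plain pass/fail makes it easy to reuse as a readiness gate.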