Lead and grow a group of highly independent engineers across the Reliability & Resilience and Developer Productivity teams; set org structure, hiring plan, and delivery goals.
Own the platform roadmap and execution for improvements in development velocity, iteration speed, platform availability, and deployment safety.
Build an industry-leading reliability practice: manage SLOs and error budgets, run incident response and postmortems, and prioritize resilience work across critical services.
Operate and evolve core platform services including API gateway, storage and caching infrastructure, secrets management, and observability.
Manage capacity and cost: forecasting, right-sizing, tuning, and spend governance tied to workload and growth plans.
Own key relationships with critical SaaS vendors supporting our platform stack, including evaluation, contracts/renewals, and operational integration.
Requirements
3+ years managing engineers (managing managers is a plus).
Hands-on technical depth in Kubernetes production operations and CI/CD systems.
Track record owning key platform dependencies such as API gateways, caches, and petabyte-scale KV stores and databases.
Demonstrated ownership of reliability programs: SLOs, error budgets, incident response, postmortems, and measurable reductions in downtime.
Proven ability to translate business goals into technical strategy and drive cross-org alignment.
8+ years building and operating large-scale distributed systems.
Track record of establishing trust, psychological safety, and clear expectations; skilled at timely, candid feedback.
Strong facilitator in technical conflict: you listen, synthesize, decide, and bring the team with you.
Tech Stack
Distributed Systems
Kubernetes
Benefits
Competitive compensation package, including equity.
Inclusive Healthcare Package.
Learn and Grow: We provide mentorship and send you to events that help you build your network and skills.
Flexible Time Off.
We provide the gear you need to do your job, plus a WFH budget to outfit your space as needed.