Temporal Technologies is a company focused on creating a reliable programming model for developers. It is seeking a Staff Software Engineer to lead reliability for Temporal Cloud: defining reliability expectations, running chaos testing, and mentoring other engineers to strengthen reliability practices.
Responsibilities:
- Own end-to-end reliability outcomes for Temporal Cloud operations, partnering across engineering, infrastructure, and product to drive measurable improvements
- Define, implement, and evolve reliability targets and associated practices, including alerting thresholds, operational readiness criteria, and escalation paths
- Plan and run gamedays to validate incident response, operational procedures, and cross-team coordination under realistic failure scenarios
- Build and scale a chaos testing program that exercises failure modes safely and drives remediation work that reduces real risk
- Define and maintain a reliability scorecard across services and key operational processes, and use it to prioritize reliability investments
- Lead load testing and performance testing efforts, including test design, tooling, and analysis of bottlenecks and capacity constraints
- Improve observability standards (metrics, logs, traces, dashboards) so reliability signals are consistent, actionable, and easy to audit
- Drive post-incident learning and corrective actions, ensuring fixes are durable and reduce recurrence risk over time
- Make system-level tradeoffs across reliability, performance, cost, and velocity, and document decisions clearly for long-term maintainability
- Mentor other engineers and raise the bar on reliability engineering practices across teams
Requirements:
- Strong computer science fundamentals, especially in distributed systems, concurrency, and performance
- Demonstrated ability to design and build complex systems that operate reliably under high load and partial failure
- Experience driving reliability improvements across multiple services, not just within a single codebase
- Hands-on experience with at least one of: gamedays, chaos testing, load testing, or building reliability scorecards
- Strong judgment in ambiguous situations, including the ability to prioritize reliability work based on risk and impact
- Excellent communication skills, including the ability to align multiple stakeholders on reliability goals, plans, and tradeoffs
- A collaborative mindset and a track record of mentoring and leveling up engineering practices
- Experience operating multi-tenant systems and designing protections against noisy-neighbor behaviors
- Deep expertise in observability (metrics design, tracing strategy, dashboard standards) and alert hygiene
- Experience building internal platforms or tooling that enables other teams to meet reliability standards
- Familiarity with workflow orchestration systems or durable execution platforms
- Open source contributions, especially in infrastructure or distributed systems