Temporal Technologies is a company focused on creating a reliable programming model for developers. It is seeking a Staff Software Engineer to lead reliability for Temporal Cloud: defining reliability expectations, running chaos tests, and mentoring other engineers to strengthen reliability practices.

Responsibilities:

Own reliability outcomes for operating Temporal Cloud end to end, partnering across engineering, infrastructure, and product to drive measurable improvements
Define, implement, and evolve reliability targets and associated practices, including alerting thresholds, operational readiness criteria, and escalation paths
Plan and run gamedays to validate incident response, operational procedures, and cross-team coordination under realistic failure scenarios
Build and scale a chaos testing program that exercises failure modes safely and drives remediation work that reduces real risk
Define and maintain a reliability scorecard across services and key operational processes, and use it to prioritize reliability investments
Lead load testing and performance testing efforts, including test design, tooling, and analysis of bottlenecks and capacity constraints
Improve observability standards (metrics, logs, traces, dashboards) so reliability signals are consistent, actionable, and easy to audit
Drive post-incident learning and corrective actions, ensuring fixes are durable and reduce recurrence risk over time
Make system-level tradeoffs across reliability, performance, cost, and velocity, and document decisions clearly for long-term maintainability
Mentor other engineers and raise the bar on reliability engineering practices across teams

Requirements:

Strong computer science fundamentals, especially in distributed systems, concurrency, and performance
Demonstrated ability to design and build complex systems that operate reliably under high load and partial failure
Experience driving reliability improvements across multiple services, not just within a single codebase
Hands-on experience with at least one of the following: gamedays, chaos testing, load testing, or building reliability scorecards
Strong judgment in ambiguous situations, including the ability to prioritize reliability work based on risk and impact
Excellent communication skills, including the ability to align multiple stakeholders on reliability goals, plans, and tradeoffs
A collaborative mindset and a track record of mentoring and leveling up engineering practices
Experience operating multi-tenant systems and designing protections against noisy-neighbor behaviors
Deep expertise in observability (metrics design, tracing strategy, dashboard standards) and alert hygiene
Experience building internal platforms or tooling that enables other teams to meet reliability standards
Familiarity with workflow orchestration systems or durable execution platforms
Open source contributions, especially in infrastructure or distributed systems

Staff Software Engineer - Reliability
