Temporal Technologies is the company behind an open-source programming model focused on improving developer experience and helping teams build reliable applications. The company is seeking a Staff Software Engineer to lead the Replication Foundations team, which is responsible for evolving Temporal's core replication stack and ensuring the high availability and scalability of its cloud services.
Responsibilities:
- Lead the design and implementation of core components of Temporal’s OSS replication stack, from initial design through rollout and long-term operational ownership
- Design and evolve replication protocols that power:
  - High Availability namespaces
  - Cross-cluster and cross-region replication
  - Migration between Temporal clusters (cloud ↔ self-hosted, cloud ↔ cloud)
- Build scalability and reliability capabilities such as:
  - Multi-cell namespaces
  - Protocols enabling a single namespace to span multiple clusters
  - Dynamic split/merge strategies based on usage, hot spots, and capacity needs
- Reason deeply about correctness: consistency models, ordering guarantees, idempotency, failure recovery, and safe rollouts of protocol changes
- Drive cross-team alignment with Cloud Enablement and other CGS teams to ensure OSS foundations support current and future cloud products
- Author high-quality design docs that clarify invariants, trade-offs, failure modes, and operational playbooks for complex changes
- Raise engineering standards through reviews, mentorship, and technical leadership—improving correctness testing, fault injection, and incident readiness
- Participate in on-call/incident response related to replication and core system behavior, helping build durable fixes and prevention mechanisms
Requirements:
- 10+ years building production systems, including significant experience with distributed systems and correctness-critical infrastructure
- Strong experience with replication, consistency, fault tolerance, and failure recovery in distributed environments
- Demonstrated ability to design and implement concurrent, correctness-critical systems with clear invariants and safety guarantees
- Proven track record of leading complex technical projects across teams—setting direction, driving execution, and landing changes safely in production
- Hands-on experience debugging complex production issues involving race conditions, data consistency, partial failures, and performance degradation
- Proficiency writing production-quality concurrent code, preferably in Go (Java/C++ or similar systems languages also welcome)
- Solid understanding of distributed systems fundamentals such as replication, sharding/partitioning, backpressure, failure detection, and durability mechanisms
- Ability to operate with high ownership and minimal oversight, balancing deep technical rigor with pragmatic delivery
- Curiosity and rigor in understanding how systems behave under stress, failure, and scale
- Experience designing or maintaining replication protocols or data-plane infrastructure
- Experience with multi-cluster or multi-region architectures (active-active / active-passive)
- Familiarity with database internals, log-based replication, or event-sourced systems
- Prior contributions to large OSS projects or distributed systems infrastructure