Temporal Technologies is the company behind an open-source programming model focused on improving developer experience and building reliable applications. It is seeking a Staff Software Engineer to lead the Replication Foundations team, which is responsible for evolving Temporal's core replication stack and ensuring the high availability and scalability of its cloud services.

Responsibilities:

Lead the design and implementation of core components of Temporal’s OSS replication stack, from initial design through rollout and long-term operational ownership
Design and evolve replication protocols that power High Availability namespaces, cross-cluster and cross-region replication, and migration between Temporal clusters (cloud ↔ self-hosted, cloud ↔ cloud)
Build scalability and reliability capabilities such as multi-cell namespaces, protocols enabling a single namespace to span multiple clusters, and dynamic split/merge strategies based on usage, hot spots, and capacity needs
Reason deeply about correctness: consistency models, ordering guarantees, idempotency, failure recovery, and safe rollouts of protocol changes
Drive cross-team alignment with Cloud Enablement and other CGS teams to ensure OSS foundations support current and future cloud products
Author high-quality design docs that clarify invariants, trade-offs, failure modes, and operational playbooks for complex changes
Raise engineering standards through reviews, mentorship, and technical leadership—improving correctness testing, fault injection, and incident readiness
Participate in on-call/incident response related to replication and core system behavior, helping build durable fixes and prevention mechanisms

Requirements:

10+ years building production systems, including significant experience with distributed systems and correctness-critical infrastructure
Strong experience with replication, consistency, fault tolerance, and failure recovery in distributed environments
Demonstrated ability to design and implement concurrent, correctness-critical systems with clear invariants and safety guarantees
Proven track record of leading complex technical projects across teams—setting direction, driving execution, and landing changes safely in production
Hands-on experience debugging complex production issues involving race conditions, data consistency, partial failures, and performance degradation
Proficiency writing production-quality concurrent code, preferably in Go (Java/C++ or similar systems languages also welcome)
Solid understanding of distributed systems fundamentals such as replication, sharding/partitioning, backpressure, failure detection, and durability mechanisms
Ability to operate with high ownership and minimal oversight, balancing deep technical rigor with pragmatic delivery
Curiosity and rigor in understanding how systems behave under stress, failure, and scale
Experience designing or maintaining replication protocols or data-plane infrastructure
Experience with multi-cluster or multi-region architectures (active-active / active-passive)
Familiarity with database internals, log-based replication, or event-sourced systems
Prior contributions to large OSS projects or distributed systems infrastructure

Staff Software Engineer, Replication Foundations
