Temporal Technologies is an open-source programming model company on a mission to simplify code and enhance application reliability. They are hiring a Senior Software Engineer to join the Cloud Enablement team, focusing on backend systems that power critical Temporal Cloud capabilities and ensure high availability and reliability.
Responsibilities:
- Design and implement backend features that apply and extend the Temporal OSS replication stack to new Temporal Cloud capabilities
- Contribute to Temporal Cloud High Availability features, including:
  - Namespace replication within and across regions and cloud providers
  - Monitoring replication health and lag
  - Supporting manual and automated failover workflows
- Build and improve namespace migration systems, including:
  - Migration of namespaces and workloads between self-hosted Temporal clusters and Temporal Cloud
  - Migration between Temporal Cloud environments or regions
  - Tooling that supports safe cutover, validation, and rollback
- Own medium-to-large features end-to-end, from design through production rollout and long-term maintenance
- Write clear design documentation describing system behavior, tradeoffs, and failure modes
- Ensure features are production-ready by delivering:
  - Service-level logs, metrics, and tracing
  - Alerts, dashboards, and operational runbooks
- Participate in operational ownership, including on-call rotations, incident response, and postmortems
- Collaborate with teammates to continuously improve reliability, operability, and development velocity
Requirements:
- Strong experience designing and building distributed backend systems with a focus on reliability and scalability
- Hands-on experience operating production systems, including debugging failures and improving observability
- Experience developing highly concurrent systems
- Demonstrated ability to write concurrent production code, preferably in Go (Java or similar languages also welcome)
- Solid understanding of failure modes, replication, and resiliency patterns in distributed systems
- Ability to independently drive work from problem definition to delivery, while collaborating closely with peers and stakeholders
- A mindset focused on building systems that are safe to operate, easy to reason about, and resilient to change
- Experience with replication, failover, or disaster recovery systems
- Experience designing or operating migration tooling for distributed systems
- Familiarity with cloud infrastructure and containerized environments (e.g., Kubernetes)