Temporal Technologies is the company behind an open-source programming model that aims to simplify code and make applications more reliable. They are seeking a Staff Software Engineer to lead the design and implementation of a new persistence layer for Temporal Visibility, focusing on scalability and performance while ensuring safe data migrations and operational reliability.
Responsibilities:
- Re-architect Temporal Visibility at scale
- Lead the design and implementation of a new persistence layer for Temporal Visibility, informed by its real-world access patterns (high-volume writes, time-based queries, filtering, sorting, and pagination across long-running workflows)
- Evaluate and select the most appropriate storage technologies (e.g., ClickHouse, Elasticsearch, or complementary systems), clearly articulating tradeoffs around indexing models, consistency, cost, latency, and operational complexity
- Design schemas, APIs, and query models that make Visibility both powerful and intuitive for customers
- Plan and execute online migrations of live Visibility data from existing persistence stores to new backends, at scale and without customer downtime
- Design dual-write, backfill, validation, and cutover strategies that prioritize correctness, observability, and rollback safety
- Build tooling and automation to validate data integrity and performance throughout the migration lifecycle
- Define and own SLOs for Visibility storage and query paths
- Profile hot paths, design benchmarks, and lead systematic performance tuning efforts
- Build operational playbooks, dashboards, and alerting that make the system understandable and debuggable for on-call engineers
- Lead incident reviews and reliability improvements related to persistence and indexing systems
- Break down large, ambiguous roadmap initiatives into concrete, executable phases
- Author and steward design docs and RFCs through review with peers and stakeholders
- Mentor and unblock other engineers working in the persistence and storage domain
- Partner closely with Server, Cloud, and Developer Experience teams to land features end-to-end
Requirements:
- 5 or more years of experience as an 'Arranger' and/or 'Builder/Enhancer' of highly scalable distributed systems
- Strong computer science fundamentals in distributed systems, including concurrency, consistency models, and failure modes
- Significant experience writing and operating concurrent production systems in Go, Java, or similar languages, at a high-end intermediate to expert level
- Hands-on experience designing, operating, and tuning ClickHouse and/or Elasticsearch, ideally in self-hosted environments (managed services are a strong plus)
- Experience building and running services on AWS. Bonus: Azure and/or GCP experience
- Demonstrated ability to lead large, multi-quarter technical initiatives, especially those involving core data infrastructure and live data migrations
- Prior contributions to Temporal, Cadence, or other workflow engines
- Deep expertise in storage internals (e.g., columnar stores, LSM trees, inverted indexes, transactional logs)
- Experience operating multi-region services with ≥99.99% uptime
- Strong background in operating and evolving Open Source systems
- Experience building Kubernetes controllers and/or CRDs