OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The role involves building and operating Habitat, a database platform that handles high-QPS, latency-sensitive workloads, requiring close collaboration with internal teams to enhance system performance and reliability.
Responsibilities:
- Design and build core abstractions spanning storage, caching, routing, CDC, and privacy enforcement
- Own a major surface area end to end, from product and API design to operational excellence
- Improve latency, correctness, and cost efficiency for real production workloads at massive scale
- Build strong instrumentation, debugging workflows, and developer-first tooling
- Collaborate closely with internal product and infrastructure teams to understand requirements and ship pragmatic solutions
- Participate in an on-call rotation and raise the bar on reliability while aggressively improving performance and usability
Requirements:
- 8+ years of industry experience building production software, including 3+ years leading large-scale, complex projects or technical initiatives as a tech lead or senior IC
- Strong passion for building distributed systems at scale, with a focus on reliability, scalability, security, and continuous improvement
- Proficiency in Rust and/or Python (Rust preferred for core systems work; Python commonly used for tooling, services, and ecosystem integration)
- Excellent communication skills, with the ability to build alignment and drive decisions across diverse technical and non-technical stakeholders
- A strong track record building and operating high-scale backend or data-intensive distributed systems in production
- Excellent systems judgment and the ability to make tradeoffs across latency, cost, correctness, and reliability
- Deep care for developer experience: simple abstractions, strong defaults, guardrails, and debuggability
- Comfort owning ambiguous problems end to end and driving roadmap plus execution
- Experience in one or more of: databases, caching systems, routing and load balancing, indexing and retrieval, CDC pipelines
- Deep experience with tail latency and global performance optimization (p95, p99, request steering, locality)
- Experience designing platform APIs consumed by many internal teams
- Experience with multi-region systems, consistency semantics, and failover design