Domino Data Lab is a software company that helps AI-driven organizations develop and operate advanced data science solutions. The company is seeking a Staff Platform Reliability Engineer to serve as the technical owner of its scale and reliability platform, keeping it reliable and aligned with evolving infrastructure needs while diagnosing and resolving performance issues.
Responsibilities:
- Serve as the technical owner of Tempest, Domino's scale and reliability platform, ensuring it remains reliable, extensible, and aligned with evolving infrastructure needs
- Diagnose and drive resolution of performance bottlenecks and resource misconfigurations surfaced by scale testing — working directly with platform and infrastructure teams to ship fixes, not just file tickets
- Deliver accurate, data-driven sizing recommendations for customer-facing documentation based on rigorous empirical testing across deployment sizes
- Strengthen observability across scale testing by improving Prometheus and New Relic instrumentation, making it faster to pinpoint root causes during and after multi-day load runs
- Establish and operationalize scale testing on cloud platforms, ensuring appropriate sizing and configuration guidance for this increasingly divergent product line
- Partner with platform teams to enable effective scale and reliability testing across additional cloud providers, helping position Domino for future multi-cloud success
- Increase the efficiency and leverage of a small team by building infrastructure automation that scales operationally as the product and customer base grow
Requirements:
- Background in SRE, platform engineering, or infrastructure with hands-on experience operating and troubleshooting distributed systems in production Kubernetes environments
- Strong proficiency in Python and comfort working in a large, modular codebase that spans orchestration, infrastructure automation, and systems integration
- Experience with observability stacks (Prometheus, Grafana, New Relic, or similar) — writing queries, building dashboards, and using metrics to diagnose performance and reliability issues at the systems level
- Demonstrated ability to go beyond detection to resolution: profiling services, identifying resource bottlenecks, and working with engineering teams to ship durable fixes
- Familiarity with performance and load testing methodologies (e.g., Locust, k6, or similar) as part of a broader infrastructure or reliability practice
- Clear ownership mindset — self-directed, accountable, and able to communicate priorities and status effectively in a remote, async environment