Domino Data Lab is a software company that helps AI-driven organizations develop and operate advanced data science solutions. The company is seeking a Staff Platform Reliability Engineer to serve as the technical owner of its scale and reliability platform, keeping it reliable and aligned with evolving infrastructure needs while diagnosing and resolving performance issues.

Responsibilities:

Serve as the technical owner of Tempest, Domino's scale and reliability platform, ensuring it remains reliable, extensible, and aligned with evolving infrastructure needs
Diagnose and drive resolution of performance bottlenecks and resource misconfigurations surfaced by scale testing — working directly with platform and infrastructure teams to ship fixes, not just file tickets
Deliver accurate, data-driven sizing recommendations for customer-facing documentation based on rigorous empirical testing across deployment sizes
Strengthen observability across scale testing by improving Prometheus and New Relic instrumentation, making it faster to pinpoint root causes during and after multi-day load runs
Establish and operationalize scale testing on cloud platforms, ensuring appropriate sizing and configuration guidance for this increasingly divergent product line
Partner with platform teams to enable effective scale and reliability testing across additional cloud providers, helping position Domino for future multi-cloud success
Increase the efficiency and leverage of a small team by building infrastructure automation that scales operationally as the product and customer base grow

Requirements:

Background in SRE, platform engineering, or infrastructure with hands-on experience operating and troubleshooting distributed systems in production Kubernetes environments
Strong proficiency in Python and comfort working in a large, modular codebase that spans orchestration, infrastructure automation, and systems integration
Experience with observability stacks (Prometheus, Grafana, New Relic, or similar) — writing queries, building dashboards, and using metrics to diagnose performance and reliability issues at the systems level
Demonstrated ability to go beyond detection to resolution: profiling services, identifying resource bottlenecks, and working with engineering teams to ship durable fixes
Familiarity with performance and load testing methodologies (e.g., Locust, k6, or similar) as part of a broader infrastructure or reliability practice
Clear ownership mindset — self-directed, accountable, and able to communicate priorities and status effectively in a remote, async environment
