Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
Drive mature DevOps/SRE practices, including incident response and post-incident reviews (PIRs), on-call readiness, runbooks, alerting, observability, and release/change management.
Establish reliability frameworks such as SLIs/SLOs and error budgets, and use them to guide prioritization and engineering trade-offs.
Provide visibility into system health through clear operational metrics and reliability reporting.
Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.
Influence product and system direction through design reviews, architectural discussions, and cross-team collaboration.
Share knowledge through clear, high-quality documentation and technical communication, both internally and, where appropriate, externally, to help teams build and operate systems more effectively.
As the reliability foundation matures, grow into broader application and product development leadership, contributing architectural and technical depth beyond operations.
Requirements
Strong experience with DevOps/SRE practices, including operating and evolving production systems at scale
Strong programming background in a modern language (we primarily use Python and Go, but prior experience with them is not required)
Experience designing, building, and operating large-scale distributed systems
Strong understanding of reliability engineering concepts (e.g. incident management, observability, and failure modes)
Experience with test automation, including performance and functional testing
Ability to influence engineering practices through clear technical communication, reviews, and collaboration
Strong interpersonal skills and ability to work effectively across teams
Familiarity with modern software engineering processes and delivery practices
Self-driven and comfortable operating with a high degree of autonomy and ambiguity
Bonus Points For
Experience with containerized and cloud-native systems (Docker, Kubernetes, AWS)
Familiarity with observability tooling and platforms (e.g. the Grafana stack)
Experience working with Python, Go, JavaScript, and/or Jsonnet
Experience building or operating event-driven or asynchronous systems
Experience defining or applying SLIs/SLOs, error budgets, or reliability metrics
Interest in, or experience with, building testing frameworks or developer tooling
Tech Stack
AWS
Cloud
Distributed Systems
Docker
Grafana
JavaScript
Kubernetes
Python
Go
Benefits
Equity
Bonus (if applicable)
30 days of annual leave, covering Grafana Shutdown Days