Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
Establish reliability frameworks such as SLIs/SLOs and error budgets, and use them to guide prioritization and engineering trade-offs.
Provide visibility into system health through clear operational metrics and reliability reporting.
Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.
Influence product and system direction through design reviews, architectural discussions, and cross-team collaboration.
Share knowledge through clear, high-quality documentation and technical communication, both internally and, where appropriate, externally, to help teams build and operate systems more effectively.
As the reliability foundation matures, grow into broader application and product development leadership, contributing architectural and technical depth beyond operations.
Requirements
Strong experience with DevOps/SRE practices, including operating and evolving production systems at scale
Strong programming background in a modern language (Python and Go are our primary languages, but prior experience with them is not required)
Experience designing, building, and operating large-scale distributed systems
Strong understanding of reliability engineering concepts (e.g. incident management, observability, and failure modes)
Experience with test automation, including performance and functional testing
Ability to influence engineering practices through clear technical communication, reviews, and collaboration
Strong interpersonal skills and ability to work effectively across teams
Familiarity with modern software engineering processes and delivery practices
Self-driven and comfortable operating with a high degree of autonomy and ambiguity
Tech Stack
Cloud
Distributed Systems
Python
Go
Benefits
100% Remote, Global Culture
Scaling Organization – Tackle meaningful work in a high-growth, ever-evolving environment.
Transparent Communication – Expect open decision-making and regular company-wide updates.
Innovation-Driven – Autonomy and support to ship great work and try new things.
Open Source Roots – Built on community-driven values that shape how we work.
Empowered Teams – High trust, low ego culture that values outcomes over optics.
Career Growth Pathways – Defined opportunities to grow and develop your career.
Approachable Leadership – Transparent execs who are involved, visible, and human.
Passionate People – Join a team of smart, supportive folks who care deeply about what they do.
In-Person Onboarding
We want you to thrive from day 1, so you will join your fellow new ‘Grafanistas’ in person to learn all about what we do and how we do it.
Balance is Key
We operate a global annual leave policy of 30 days. Three days of that entitlement are reserved for Grafana Shutdown Days so the team can really disconnect.