Grafana Labs is a remote-first, open-source powerhouse. The company is seeking a Staff Software Engineer to join the Grafana Cloud k6 squad, which builds and operates a performance testing product that ensures resilient, high-performing systems.
Responsibilities:
- Contribute hands-on to the codebase by designing and implementing production-quality software
- Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems
- Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability
- Help mature SRE practices, including incident response and post-incident reviews (PIRs), on-call readiness, runbooks, alerting, observability, and release/change management
- Establish reliability frameworks such as SLIs/SLOs and error budgets, and use them to guide prioritization and engineering trade-offs
- Provide visibility into system health through clear operational metrics and reliability reporting
- Participate in the on-call rotation as a primary escalation point and contribute to incident resolution
- Influence product and system direction through design reviews, architectural discussions, and cross-team collaboration
- Share knowledge through clear, high-quality documentation and technical communication, both internally and, where appropriate, externally, to help teams build and operate systems more effectively
- As the reliability foundation matures, grow into broader application and product development leadership, contributing architectural and technical depth beyond operations
Requirements:
- Strong programming background in a modern language (Python and Go are the primary languages, but prior experience with them is not required)
- Experience designing, building, and operating large-scale distributed systems
- Strong experience with SRE practices, including operating and evolving production systems at scale
- Strong understanding of reliability engineering concepts (e.g. incident management, observability, and failure modes)
- Strong experience defining or applying SLIs/SLOs, error budgets, or reliability metrics
- Experience with test automation, including performance and functional testing
- Ability to influence engineering practices through clear technical communication, reviews, and collaboration
- Strong interpersonal skills and ability to work effectively across teams
- Familiarity with modern software engineering processes and delivery practices
- Self-driven and comfortable operating with a high degree of autonomy and ambiguity
- Experience participating in blameless incident response and writing high-quality post-incident reviews
- Experience with containerized and cloud-native systems (Docker, Kubernetes, AWS)
- Familiarity with observability tooling and platforms (e.g. the Grafana stack)
- Experience working with Python, Go, JavaScript and/or Jsonnet
- Experience building or operating event-driven or asynchronous systems
- Interest in, or experience with, building testing frameworks or developer tooling