Grafana Labs is a remote-first, open-source powerhouse. The company is seeking a Staff Software Engineer to join the Grafana Cloud k6 squad, responsible for building and operating a performance testing product that ensures resilient, high-performing systems.

Responsibilities:

Contribute hands-on to the codebase by designing and implementing production-quality software
Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems
Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability
Help mature SRE practices, including incident response and post-incident reviews (PIRs), on-call readiness, runbooks, alerting, observability, and release/change management
Establish reliability frameworks such as SLIs/SLOs and error budgets, and use them to guide prioritization and engineering trade-offs
Provide visibility into system health through clear operational metrics and reliability reporting
Participate in the on-call rotation as a primary escalation point and contribute to incident resolution
Influence product and system direction through design reviews, architectural discussions, and cross-team collaboration
Share knowledge through clear, high-quality documentation and technical communication, both internally and, where appropriate, externally, to help teams build and operate systems more effectively
As the reliability foundation matures, grow into broader application and product development leadership, contributing architectural and technical depth beyond operations

Requirements:

Strong programming background in a modern language (Python and Go are the primary languages on the team, but prior experience with them is not required)
Experience designing, building, and operating large-scale distributed systems
Strong experience with SRE practices, including operating and evolving production systems at scale
Strong understanding of reliability engineering concepts (e.g. incident management, observability, and failure modes)
Strong experience defining or applying SLIs/SLOs, error budgets, or reliability metrics
Experience with test automation, including performance and functional testing
Ability to influence engineering practices through clear technical communication, reviews, and collaboration
Strong interpersonal skills and ability to work effectively across teams
Familiarity with modern software engineering processes and delivery practices
Self-driven and comfortable operating with a high degree of autonomy and ambiguity
Experience participating in blameless incident response and writing high-quality post-incident reviews
Experience with containerized and cloud-native systems (Docker, Kubernetes, AWS)
Familiarity with observability tooling and platforms (e.g. the Grafana stack)
Experience working with Python, Go, JavaScript and/or Jsonnet
Experience building or operating event-driven or asynchronous systems
Interest in, or experience with, building testing frameworks or developer tooling

Staff Software Engineer - Grafana Cloud k6 | USA | Remote
