Grafana Labs is a remote-first, open-source powerhouse with over 20 million users of its visualization tool. They are seeking a Senior Backend Engineer to operate and evolve multi-cloud streaming clusters and related database infrastructure, ensuring reliability, scalability, and performance of high-throughput systems.
Responsibilities:
- Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure
- Diagnosing and eliminating cross-layer failure modes (e.g., object storage latency, noisy neighbors, control-plane bottlenecks, query performance regressions)
- Designing safe upgrade and rollout strategies at scale
- Improving observability, automation, and operational ergonomics
- Partnering closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out, and query performance
- Working directly with distributed systems behavior, Kubernetes scheduling dynamics, storage engines, compression trade-offs, and similar concerns
- Serving as a primary escalation point and on-call for relevant incidents
- Owning the relationship with all system vendors, including WarpStream Labs and others
- Reviewing and defining SLOs for shared database infrastructure, proactively reducing error budget consumption through improvements to monitoring, automation, scaling strategies, and system design
- Improving the diagnosability of core streaming and database systems in production, where possible
- Implementing solutions that ensure reliability, scalability, and performance of high-throughput, multi-cloud infrastructure
- Developing fault-tolerant patterns that account for distributed system realities such as storage latency, partition imbalance, noisy neighbors, and control-plane dependencies
- Planning and executing safe upgrades and rollouts across dozens of production clusters
- Collaborating with database and platform engineering leaders to influence architecture, roadmap priorities, and long-term strategy
- Participating in PR review and contributing to design documents, automation, tooling, and code improvements that reduce operational risk
- Sharing best practices and distributed systems knowledge with partner teams
- Participating in incident response, from investigation through resolution and post-incident reviews (PIRs)
Requirements:
- 6+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles
- Experience operating distributed systems in production (e.g., streaming systems, analytical databases, large-scale storage backends), such as Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, or Cassandra
- Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
- Solid understanding of distributed systems design and large-scale system trade-offs
- Proficiency in at least one programming language (Go preferred, but not required)
- Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior
- Experience participating in blameless incident response and writing high-quality post-incident reviews
- Clear communicator who can collaborate across teams and work autonomously
- Curious, pragmatic, action-oriented, and kind (this is important!)