Grafana Labs is a remote-first, open-source powerhouse that helps companies manage their observability strategies. The Staff Software Engineer will be responsible for building and maintaining the Cloud Observability stack, enabling customers to collect and visualize metrics from various systems and applications.
Responsibilities:
- Design and implement high-quality, scalable integrations for various infrastructure components, applications, and data ingestion pipelines
- Create middleware components and libraries that simplify development and maintenance of observability solutions
- When necessary, represent Grafana Labs in open-source forums, working groups, and events
- Work with product, design, and docs teams to develop features that align with the wider product strategy and customer needs
- Lead the technical direction and vision of the team, contributing to strategic discussions and future development of observability solutions
- Work with other departments including Sales, Product, and Support teams to deliver a holistic product experience
- Take ownership of the services you’re running by deploying clean, well-tested code
- Embrace our open-source culture and contribute to other projects that may not directly fall within your team’s scope
Requirements:
- 8+ years of strong experience with at least one major programming language (Python, .NET, Java, Go, Rust, etc.)
- Demonstrated experience operating and monitoring high-scale production systems running on Kubernetes, including on-call participation, incident response, and postmortem practices
- Familiarity with observability tooling (e.g. Grafana)
- Strong understanding of time-series data, metrics cardinality challenges, and cost/performance tradeoffs and optimizations in observability systems
- Experience in a hands-on technical leadership role - setting technical direction, leading project teams, and influencing architectural decisions beyond your immediate team
- Deep understanding of distributed systems concepts including scalability, consistency, high availability, and failure modes in large-scale systems
- Experience writing clean, maintainable, robust, and performant software
- Experience delivering projects from start to finish in a self-driven manner
- Excellent problem-solving and debugging skills
- Strong mentoring and leadership skills
- Experience operating or scaling Prometheus in high-cardinality, multi-tenant environments
- Experience working with OpenTelemetry Collector pipelines or similar telemetry ingestion systems
- Certified Kubernetes Administrator (CKA), Certified Kubernetes Application Developer (CKAD), or any other Kubernetes-related certification from CNCF
- Experience developing Kubernetes operators, controllers, or custom resources
- Strong understanding of metrics collection, visualization, and alerting concepts
- Experience contributing to or maintaining open-source projects, with evidence of successful pull requests and community collaboration
- Experience designing and building observability backends for various systems and applications