Grafana Labs is a remote-first, open-source powerhouse with over 20 million users globally. They are seeking a Staff Software Engineer to enhance their Cloud Observability platform, focusing on metrics, logs, and traces integration, while collaborating with teams to improve infrastructure monitoring capabilities.
Responsibilities:
- Design and implement high-quality, scalable integrations for various infrastructure components, applications, and data ingestion pipelines
- Create middleware components and libraries that simplify development and maintenance of observability solutions
- When necessary, represent Grafana Labs in open source forums, working groups, and events
- Work with product, design, and docs teams to develop features that align with the wider product strategy and customer needs
- Lead the technical direction and vision of the team, contributing to strategic discussions and future development of observability solutions
- Work with other departments including Sales, Product, and Support teams to deliver a holistic product experience
- Take ownership of the services you’re running by deploying well-tested, clean code
- Embrace our open-source culture and contribute to other projects that may not directly fall within your team’s scope
Requirements:
- 8+ years of strong experience with at least one major programming language (Python, .NET, Java, Go, Rust, etc.)
- Demonstrated experience operating and monitoring high-scale production systems running on Kubernetes, including on-call participation, incident response, and postmortem practices
- Familiarity with observability tooling (e.g. Grafana)
- Strong understanding of time-series data, metrics cardinality challenges, and cost/performance trade-offs and optimizations in observability systems
- Experience in a hands-on technical leadership role - setting technical direction, leading project teams, and influencing architectural decisions beyond your immediate team
- Deep understanding of distributed systems concepts including scalability, consistency, high availability, and failure modes in large-scale systems
- Experience writing clean, maintainable, robust, and performant software
- Experience with delivering projects from start to finish in a self-driven manner
- Excellent problem-solving and debugging skills
- Strong mentoring and leadership skills
- Experience operating or scaling Prometheus in high-cardinality, multi-tenant environments
- Experience working with OpenTelemetry Collector pipelines or similar telemetry ingestion systems
- Certified Kubernetes Administrator (CKA), Certified Kubernetes Application Developer (CKAD), or any other Kubernetes-related certification from CNCF
- Experience developing Kubernetes operators, controllers, or custom resources
- Strong understanding of metrics collection, visualization, and alerting concepts
- Experience contributing to or maintaining open source projects, with evidence of successful pull requests and community collaboration
- Experience designing and building observability backends for various systems and applications