About this role
Role Overview
Design, build, and operate reconciliation systems, including the SSS backend, that track desired stack state and detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration
Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient
Improve operational efficiency by reducing deployment complexity (e.g., aiming for single-PR regional SSS deployments) and contributing to the Stack Config Reconciliation project
Manage rollout mechanisms for provisioned plugins, dashboards, data sources, Grafana versions, release channels, and stack-level configuration
Support new region and cluster rollouts, including the operational paths required to bring stacks online safely in new Grafana Cloud regions
Improve incident response and recovery paths for stack misalignment, reconciliation failures, plugin rollout issues, and Hosted Grafana integration failures
Partner with Product, Hosted Grafana, Infrastructure, Support, and adjacent AppCore squads on customer-impacting stack lifecycle work
Contribute to roadmap planning, technical design, OnCall improvements, and long-term simplification of stack operations
You will help own the production behavior of the systems you build. That includes improving runbooks, dashboards, alerts, reconciliation safety, rollout controls, and recovery procedures. You should be comfortable debugging across service boundaries and making careful changes in systems that affect customer stacks.
Requirements
You have at least 1 year of fully remote work experience
You have some experience working on a SaaS platform and are familiar with common distributed systems concepts (e.g., scalability, multi-tenancy, HA).
You have professional experience with Golang and are willing to work across both backend service and application code
You care deeply about developer and user experience and the quality of the products that you work on
You have some experience contributing to the delivery of projects, from initial brainstorming to shipping a product to the customer.
You write clean, well-tested software that other engineers can understand, operate, and maintain
You can take on well-defined tasks, break them down, and execute iteratively to deliver working solutions and gather feedback.
You are willing to collaborate across teams and ensure your work is aligned with the needs of other squads and external stakeholders.
You are familiar with Kubernetes in AWS, GCP, or Azure, and have exposure to infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
You have experience participating in blameless incident response and contributing to post-incident reviews.
Tech Stack
AWS
Azure
Cloud
Distributed Systems
Google Cloud Platform
Grafana
Kubernetes
Terraform
Go
Benefits
100% Remote, Global Culture
Scaling Organization – Tackle meaningful work in a high-growth, ever-evolving environment.
Transparent Communication – Expect open decision-making and regular company-wide updates.
Innovation-Driven – Autonomy and support to ship great work and try new things.
Open Source Roots – Built on community-driven values that shape how we work.
Empowered Teams – High trust, low ego culture that values outcomes over optics.
Career Growth Pathways – Defined opportunities to grow and develop your career.
Approachable Leadership – Transparent execs who are involved, visible, and human.
Passionate People – Join a team of smart, supportive folks who care deeply about what they do.
In-Person Onboarding
We want you to thrive from day 1 alongside your fellow new ‘Grafanistas’ and to learn all about what we do and how we do it.
Balance is Key
We operate a global annual leave policy of 30 days per annum. Three days of your annual leave entitlement are reserved for Grafana Shutdown Days to allow the team to really disconnect.