Role Overview

Own end-to-end development of multi-agent AI systems, from architecture and implementation through testing, deployment, and ongoing operation
Build modular, composable agentic systems using orchestration frameworks (LangChain, CrewAI, Anthropic MCP, or similar) that operate 24/7 across teams
Develop reusable agentic skills that agents invoke across interfaces (Slack, dashboards, internal apps, CLIs)
Implement observability and feedback loops including logging, performance metrics, prompt iteration, model evaluation, and cost management
Establish governance and compliance standards for AI workflows including access controls, audit trails, PII handling, and human-in-the-loop escalation paths
Build MCP servers, APIs, CLIs, and microservices connecting AI models to business systems (BigQuery, Slack, CRMs, email, calendars, analytics tools)
Architect data flows for retrieval-augmented generation (RAG), connecting LLMs to internal knowledge bases, customer data, and real-time business context
Build serverless or containerized services (GCP Cloud Functions, Cloud Run) that scale with usage and integrate with Grafana's cloud infrastructure
Partner with RevOps, Demand Generation, Regional Marketing, and SDR teams to scope high-impact automation problems, identify bottlenecks, and build solutions with measurable business outcomes
Design and deploy workflows using orchestration tools (n8n, Workato, or custom platforms) with CI/CD, testing, and production reliability standards
Build systems designed for self-service with documentation, playbooks, and enablement materials that let partner teams operate independently

Requirements

8+ years of software engineering experience with depth in backend development, systems integration, or data/analytics engineering
2+ years of hands-on experience applying LLMs/AI to production workflows, not just prototypes
Strong proficiency in Python and JavaScript/Node.js with Git-based workflows, code review practices, and testing discipline
Hands-on experience with LLM frameworks and patterns including prompt engineering, RAG, function calling/tool use, structured output parsing, and evaluation
Experience building and operating multi-agent systems at scale including agent decomposition, orchestration patterns (sequential chains, router/dispatcher, parallel fan-out), state management, and production monitoring
You diagnose business problems before writing code. You think in workflows and outcomes, not just functions.
Deep familiarity with Google Cloud Platform, BigQuery, and serverless/containerized services (Cloud Functions, Cloud Run)
Understanding of LLM failure modes and production mitigations including confidence thresholds, fallback logic, human escalation, and cost/latency management
Proven ability to identify high-leverage problems, push back on low-impact requests, and deliver end-to-end with minimal direction
Fluent with AI-assisted development tools (GitHub Copilot, Cursor, Claude Code). You use AI to build AI systems.
Clear technical communicator who can explain complex systems in simple terms to both engineers and business stakeholders

Tech Stack

BigQuery
Cloud
Google Cloud Platform
Grafana
JavaScript
Microservices
Node.js
Python

Benefits

100% Remote, Global Culture
Scaling Organization – Tackle meaningful work in a high-growth, ever-evolving environment.
Transparent Communication – Expect open decision-making and regular company-wide updates.
Innovation-Driven – Autonomy and support to ship great work and try new things.
Open Source Roots – Built on community-driven values that shape how we work.
Empowered Teams – High trust, low ego culture that values outcomes over optics.
Career Growth Pathways – Defined opportunities to grow and develop your career.
Approachable Leadership – Transparent execs who are involved, visible, and human.
Passionate People – Join a team of smart, supportive folks who care deeply about what they do.
In-Person Onboarding – We want you to thrive from day 1, learning alongside your fellow new ‘Grafanistas’ all about what we do and how we do it.
Balance is Key – We operate a global annual leave policy of 30 days per annum, 3 of which are reserved for Grafana Shutdown Days to allow the team to really disconnect. We will comply with local legislation where applicable.

Staff AI Engineer
