ClickHouse is a leading private cloud company recognized for its innovative approach to real-time analytics and AI workloads. It is seeking an AI Product Engineer to build agentic capabilities on its observability platform, with a focus on improving developer experience and incident investigation.

Responsibilities:

Build agents that investigate incidents. They surface anomalies, answer "why is production broken?", and use ClickStack as their substrate.
Write skills, not just prompts. Build a library of reusable skills that captures how our team debugs, finds root causes, writes ClickHouse queries, and runs incident response, so agents pick up the right playbook instead of starting from scratch
Own the agent stack end-to-end. Context engineering, tool design, evals, tracing, cost. You're responsible for whether the agent works in production
Make ClickStack a great place to run AI workloads. Build the MCP servers, SDKs, and integrations that let customers' agents read telemetry, take action, and stay observable themselves
Work in the open. Collaborate with OSS contributors and customers, debug their problems alongside them, and feed what you learn back into the product
Tackle the hard parts. Latency, cost, context window limits, eval coverage, hallucinations on real telemetry

Requirements:

5+ years of software engineering experience, including 1–2 years on LLM-powered systems or agents in production
Strong backend skills in TypeScript/Node.js and/or Python. Comfortable in both, even if one is primary
Hands-on experience building agents: multi-step tool use, planning, memory, error recovery. You've shipped them and dealt with the failure modes
Experience designing skills (Markdown-based workflow encodings, Anthropic-style or similar) and a clear view on when a skill, a tool, or both is the right fit
Experience with MCP: building servers, designing tools, and thinking through auth, scoping, and observability for agentic systems
Strong evals practice: golden sets, LLM-as-judge, regression detection
SQL proficiency — you can write ClickHouse queries directly
Comfort with Docker and Kubernetes
Active in open source and the developer community
Built or operated production agents in observability, incident response, or SRE
Strong opinions on agent observability — tracing, cost attribution, eval pipelines, OpenTelemetry for agents — and ideas on how to improve it
Experience with prompt caching, context compaction, or other techniques relevant to running agents on production telemetry volumes
Experience with columnar databases and event ingestion pipelines
Contributed to or maintained an open source AI/agent project
Familiarity with Go, Rust, or other systems languages for integrations and high-throughput infra

AI Product Engineer - ClickStack
