Ship the hardest implementation work yourself — the human-in-the-loop routing, the public/private gateway access controls, the early agent harnesses
Design and implement the human-in-the-loop routing system: queue mechanics, reviewer assignment, back-pressure handling, run resumption semantics
Implement the execution wrapper that enforces human-in-the-loop policies at execution time
Build the safeguards — refusal policies, prompt-injection protections, public/private MCP exposure controls — that make our agents safe to deploy at scale
Review PRs (human and code-agent-authored) at a depth that builds shared judgment about what good agent code looks like
Mentor engineers through hard implementation problems; close gaps in the team's shared knowledge

6+ years of professional Python, with deep production experience operating services, not just shipping them
2+ years operating LLM systems in production: prompt/context engineering, tool/function calling, structured outputs, RAG, evaluation, observability
Demonstrated experience implementing oversight mechanisms — human-in-the-loop routing, refusal policies, autonomy boundaries — in systems where the cost of an agent error is real
Strong written communication: you'll be authoring implementation specs that other engineers (and code agents) build against, and the spec is the work
Extensive knowledge of LangChain/LangGraph — or a comparable framework like AgentCore Strands, CrewAI, or Semantic Kernel — and a clear view of when to use which
Experience with LLM observability tools: Amazon CloudWatch, LangSmith, Langfuse, MLflow, or OpenTelemetry
Experience designing evaluation frameworks (RAGAS, DeepEval, LLM-as-judge, multi-turn regression)
Solid SQL, fluency with at least one cloud platform (AWS preferred), Git, Docker, and modern API frameworks
A hands-on disposition — you want to ship the hard parts yourself, not just write specs about them
Experience reviewing code authored by junior engineers, contractors, or AI agents — and giving feedback that produces better code next time
A considered view on the failure modes of overusing AI — cognitive offloading, organizational skill loss, agent-mediated drift in decision-making — and the conviction to design against them

Lead AI Engineer

Key skills