TrueFoundry is building foundational infrastructure for production AI systems. They are seeking a Senior AI/ML Engineer to design and own core components that enable enterprise customers to run production agentic AI safely and efficiently, focusing on orchestration, observability, and integration with upstream tooling.
Responsibilities:
- Architect and implement scalable agent orchestration patterns (graph-based executors, state management, multi-agent coordination) for production workloads
- Own critical integrations: model adapters, LLM gateway hooks, vector DBs, tools and external APIs, and the platform’s LLMOps flows
- Build and improve tracing, benchmarking, and observability for LLMs and agents: token/cost accounting, p95 latency, throughput, and correctness checks
- Drive the design of safety guardrails: moderation hooks, human-in-the-loop checkpoints, replayable audit trails, and policy enforcement
- Mentor junior engineers, run design reviews, and improve engineering practices (testing, CI/CD, chaos testing for agents)
- Work directly with strategic customers to prototype complex agentic solutions and translate them into product features