Arcana is building AI agents that synthesize information across heterogeneous sources and deliver structured, reasoned answers in real time. They are seeking a Senior AI Engineer to optimize inference pipelines, design agent architectures, and manage evaluation frameworks for their AI systems.
Responsibilities:
- Drive time to first token (TTFT) below 400 ms for multi-step agent pipelines
- Streaming optimization: first token to user while sub-agents are still running
- KV cache strategy, prompt compression, dynamic context window management
- Multi-provider routing: model selection by latency, cost, and task type across OpenAI, Anthropic, Gemini, and open-weight models
- Design and implement Plan-Execute-Synthesize pipelines that run sub-agents in parallel DAGs, not sequential chains
- Build reliable orchestration on top of Temporal: retries, timeouts, partial failure recovery, idempotency
- Structured output enforcement: JSON schema validation, retry loops on malformed LLM output, graceful degradation
- Tool call design: schemas that LLMs actually follow reliably across providers
- Own the eval framework end to end: ground truth datasets, automated scoring pipelines, regression detection on every PR
- LLM-as-judge pipelines for qualitative output assessment
- Latency regression testing: p50/p95/p99 tracked across every deployment
- Adversarial test case design: ambiguous queries, missing data, conflicting sources, malformed tool responses
- Model serving and cold start optimization
- Async worker architecture for parallel sub-agent execution
- Observability: trace every token, every tool call, every synthesis step
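To make the parallel sub-agent and partial failure items above concrete, here is a minimal sketch of one fan-out/fan-in stage, the simplest DAG shape. It uses plain Python `asyncio` rather than Temporal, and every name in it (`run_sub_agent`, `execute_stage`, the demo agents, the 50 ms timeout) is a hypothetical stand-in, not a description of Arcana's actual system.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple

# Hypothetical sub-agent signature: takes a query, returns a text finding.
SubAgent = Callable[[str], Awaitable[str]]


async def run_sub_agent(
    name: str, agent: SubAgent, query: str, timeout_s: float
) -> Tuple[str, Optional[str]]:
    """Run one sub-agent; map a timeout or error to None instead of raising."""
    try:
        return name, await asyncio.wait_for(agent(query), timeout=timeout_s)
    except Exception:
        # Covers timeouts and agent errors; task cancellation still propagates.
        return name, None


async def execute_stage(
    query: str, agents: Dict[str, SubAgent], timeout_s: float = 1.0
) -> dict:
    """Fan out to all sub-agents concurrently, then collect whatever succeeded."""
    results = await asyncio.gather(
        *(run_sub_agent(n, a, query, timeout_s) for n, a in agents.items())
    )
    findings = {n: r for n, r in results if r is not None}
    failed = sorted(n for n, r in results if r is None)
    # The synthesis step can degrade gracefully using only the findings it has.
    return {"findings": findings, "failed": failed}


# Demo sub-agents (assumptions for illustration only).
async def fast_agent(query: str) -> str:
    return f"fast:{query}"


async def slow_agent(query: str) -> str:
    await asyncio.sleep(5)  # Exceeds the demo timeout below.
    return "never returned"


async def broken_agent(query: str) -> str:
    raise ValueError("malformed tool response")


if __name__ == "__main__":
    demo = {"fast": fast_agent, "slow": slow_agent, "broken": broken_agent}
    print(asyncio.run(execute_stage("q", demo, timeout_s=0.05)))
```

Because the stage returns partial findings alongside a list of failed sub-agents, a caller can decide whether to synthesize from what it has, retry only the failures, or surface the gap to the user. A durable workflow engine would add retries and idempotency on top of this shape.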
Requirements:
- You've built something that runs in production at a meaningful scale and you understand why it's fast (or why it isn't)
- You've worked on inference pipelines where TTFT was the primary metric and you moved it meaningfully
- You've built multi-step agent systems, and you know where they break, not from reading papers but from watching them fail in production
- You've written eval harnesses from scratch and you have opinions about what makes a ground truth dataset actually useful
- You've debugged LLM non-determinism in production and built systems resilient to it
- You've worked with streaming LLM responses and built infrastructure around partial output handling
- Stack familiarity: Go, Python, Temporal, Kafka, PostgreSQL, Docker
This role is likely not a fit if:
- You've fine-tuned models but haven't shipped inference systems
- You've used LangChain/LlamaIndex but haven't built the layer underneath
- You have a strong ML research background but no systems exposure