Function Health is dedicated to empowering individuals to live healthier lives through innovative technology. The company is seeking a Staff AI Engineer to design and implement stateful multi-agent systems, integrating LLMs and multimodal models into production workflows while ensuring reliability and performance.
Responsibilities:
- Architect and build stateful, graph-based agent workflows with tool use, planning, and memory
- Integrate LLMs and multimodal models via structured I/O (JSON Schema, Pydantic validators) and function/tool calling
- Build high-reliability APIs and streaming services for real-time inference, speech, and vision
- Own production readiness: tracing, logging, metrics, rate limiting, circuit breakers, and SLOs
- Stand up eval pipelines: offline golden sets, LLM-as-judge with human rubrics, online A/B, and regression tests in CI
- Implement retrieval and memory: hybrid search, vector and graph retrieval, semantic caches, and long-horizon context
- Optimize cost/latency: model routing, prompt and tool selection, quantization, and KV cache/prefill strategies
- Lead cloud-native deployments on Kubernetes with GPU autoscaling, canary/shadow releases, and feature flags
- Partner cross-functionally to translate research into robust production systems and iterate quickly behind evaluation gates
- Mentor engineers through code reviews, design docs, and architecture decisions
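To illustrate the structured-I/O pattern named above (JSON Schema plus validation before tool execution), here is a minimal, stdlib-only sketch. The tool name, fields, and helper are hypothetical and not tied to any specific provider's function-calling API; a production system would typically use Pydantic or a full JSON Schema validator instead of this hand-rolled check.

```python
import json

# Hypothetical tool definition in JSON Schema form (illustrative only).
TOOL_SCHEMA = {
    "name": "search_records",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["query"],
    },
}

def validate_tool_call(raw: str, schema: dict) -> dict:
    """Parse a model-emitted tool call and check it against the schema.

    Raises ValueError on malformed JSON, missing required fields, or
    type mismatches -- the structured-I/O gate before tool execution.
    """
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call is not valid JSON: {exc}") from exc

    params = schema["parameters"]
    for field in params.get("required", []):
        if field not in args:
            raise ValueError(f"missing required field: {field}")

    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "object": dict, "array": list}
    for field, spec in params["properties"].items():
        if field in args and not isinstance(args[field], type_map[spec["type"]]):
            raise ValueError(f"field {field!r} should be {spec['type']}")
    return args

# A well-formed call passes validation and is safe to dispatch.
call = validate_tool_call('{"query": "cholesterol", "limit": 5}', TOOL_SCHEMA)
```

The point of the gate is that the agent never executes a tool on unvalidated model output; malformed calls fail fast and can be retried or surfaced in traces.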
Requirements:
- 2.5+ years building agentic AI systems; 6+ years as a full-stack or ML engineer building production backends or ML systems in Python, Go, or a similar language
- Fluency with agentic orchestration (e.g., LangGraph, PydanticAI, DSPy, LlamaIndex) and tool/function calling
- Experience integrating frontier LLMs and multimodal models via managed APIs or self-hosted serving
- Deep understanding of model serving and inference optimization (vLLM/Triton/TGI/SGLang, batching, KV cache reuse)
- Strong with API design and backend frameworks (FastAPI, Flask) and event-driven architectures
- Data systems expertise with PostgreSQL and Redis, including caching, token streaming, and throughput tuning
- Retrieval and memory: vector databases (pgvector, Pinecone, Weaviate, Milvus), hybrid search, and graph/knowledge storage
- Production evals: LLM-as-judge, human-in-the-loop, rubric design, and CI-integrated regression tests
- Observability and SRE: OpenTelemetry traces, metrics, structured logs, SLOs, dashboards, and on-call triage
- Cloud-native delivery: Kubernetes, Terraform, Docker, GPU scheduling/autoscaling on AWS or GCP
- CI/CD proficiency with GitHub Actions and test automation for prompts, tools, and agents
- Clear, concise communication and high ownership in fast-paced environments
- Real-time multimodal systems: streaming ASR, low-latency TTS, WebRTC, and vision pipelines
- Post-training/fine-tuning: DPO/ORPO, RLHF, preference data generation, and safety alignment
- RAG expertise beyond basics: Graph RAG, multi-hop retrieval, rerankers, query planning, and freshness policies
- Safety and governance: policy-as-code, red-teaming, PII handling, audit logs, and role-based tool authorization
- Regulated data experience (HIPAA, SOC 2, GDPR) and data residency controls
- Personalization at inference time, long-term memory agents, session state, and episodic memory stores
- Experience with consumer-scale AI apps, high-traffic systems, or on-device/edge acceleration (WebGPU)
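As a concrete sketch of the CI-integrated eval regression tests mentioned above: a minimal offline golden-set gate, assuming exact-match scoring for simplicity (a real pipeline would use rubric-based or LLM-as-judge scoring). All names, data, and the threshold here are hypothetical.

```python
# Offline golden set: fixed inputs with expected outputs (illustrative data).
GOLDEN_SET = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_model(prompt: str) -> str:
    # Stand-in for a real model or agent call.
    return {"2+2": "4", "capital of France": "Paris"}[prompt]

def regression_score(golden: list, model) -> float:
    """Fraction of golden cases the model answers exactly right."""
    hits = sum(model(case["input"]) == case["expected"] for case in golden)
    return hits / len(golden)

# CI gate: fail the build if accuracy regresses below the threshold.
PASS_THRESHOLD = 0.9
score = regression_score(GOLDEN_SET, run_model)
assert score >= PASS_THRESHOLD, f"eval regression: {score:.2f} < {PASS_THRESHOLD}"
```

Run as a normal test in GitHub Actions, this makes prompt or tool changes fail CI when they regress the golden set, which is what "iterate quickly behind evaluation gates" amounts to in practice.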