Enable fast, safe experimentation: Implement automated evaluation pipelines (offline and online) with golden sets, rubrics, and regression detection; a minimal sketch of such a regression gate follows below.
Support CI/CD for prompt and model changes, with rollback and approval gates.
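To make the evaluation-pipeline responsibility concrete, here is a minimal sketch of an offline golden-set regression gate, assuming a JSON-lines golden set and a hypothetical `score_output` rubric grader (an exact-match placeholder here; real pipelines often use an LLM judge or task-specific metrics). All names are illustrative, not an existing API.

```python
import json
import statistics

REGRESSION_THRESHOLD = 0.02  # assumed tolerance; tune per rubric

def score_output(output: str, reference: str) -> float:
    """Hypothetical rubric grader. Exact-match placeholder; a real
    pipeline might call an LLM judge or a task-specific metric."""
    return 1.0 if output.strip() == reference.strip() else 0.0

def evaluate(generate, golden_path: str) -> float:
    """Run a candidate `generate(prompt) -> str` over the golden set
    and return its mean rubric score."""
    scores = []
    with open(golden_path) as f:
        for line in f:
            case = json.loads(line)  # {"prompt": ..., "reference": ...}
            scores.append(score_output(generate(case["prompt"]), case["reference"]))
    return statistics.mean(scores)

def gate(candidate_score: float, baseline_score: float) -> bool:
    """Approval gate: block the prompt/model change if the candidate
    regresses beyond the threshold relative to the current baseline."""
    return candidate_score >= baseline_score - REGRESSION_THRESHOLD
```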
Collaborate cross-functionally: Partner with product pods to instrument RAG pipelines and manage prompt versioning.
Work with deep learning and data teams to integrate structured and unstructured retrieval into LLM workflows; a sketch of this merge appears below.
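A minimal sketch of what merging structured and unstructured retrieval can look like, using SQLite for the structured path and token overlap as a stand-in for embedding search. A production system would use an embedding model and a vector database; the schema and helper names here are assumptions.

```python
import sqlite3

def structured_lookup(conn: sqlite3.Connection, customer_id: int) -> str:
    """Structured retrieval: pull exact facts from a relational store."""
    row = conn.execute(
        "SELECT name, plan FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return f"Customer: {row[0]}, plan: {row[1]}" if row else "Customer: unknown"

def unstructured_lookup(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Unstructured retrieval placeholder: rank passages by token overlap.
    A real system would use embeddings plus a vector database."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:top_k]

def build_context(conn, customer_id: int, query: str, docs: list[str]) -> str:
    """Merge both retrieval paths into a single prompt context."""
    facts = structured_lookup(conn, customer_id)
    passages = "\n".join(unstructured_lookup(query, docs))
    return f"{facts}\n\nRelevant passages:\n{passages}\n\nQuestion: {query}"
```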
Optimize performance and cost: Profile latency and token usage, and evaluate caching strategies.
Build observability and monitoring for LLM calls, embeddings, and agent behaviors; a minimal instrumentation sketch follows.
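One way to approach the profiling and observability items above: a sketch of a decorator that wraps an LLM client call to record latency and rough token counts, and to serve an exact-match prompt cache. The whitespace token count and `print`-based logging are placeholders for a real tokenizer and metrics backend.

```python
import functools
import hashlib
import time

_cache: dict[str, str] = {}  # naive exact-match prompt cache

def observed(llm_call):
    """Wrap an LLM client call to log latency, approximate token
    counts, and cache hits. Token counting here is a whitespace
    approximation; use the model's tokenizer in practice."""
    @functools.wraps(llm_call)
    def wrapper(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in _cache:
            print(f"cache_hit=true key={key[:8]}")
            return _cache[key]
        start = time.perf_counter()
        response = llm_call(prompt)
        latency_ms = (time.perf_counter() - start) * 1000
        print(f"latency_ms={latency_ms:.1f} "
              f"prompt_tokens~={len(prompt.split())} "
              f"completion_tokens~={len(response.split())}")
        _cache[key] = response
        return response
    return wrapper
```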
Ensure reliability and safety: Implement guardrails (toxicity and PII filtering, jailbreak detection); a sketch of a simple PII guardrail follows below.
Maintain policy enforcement and audit logging for AI usage.
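A minimal sketch of a PII guardrail with an audit trail, under the assumption that redaction runs before any text reaches the model. The regex patterns are illustrative only; production guardrails would typically layer a dedicated classifier or moderation service on top.

```python
import re

# Illustrative PII patterns only; regexes alone are not sufficient
# for production policy enforcement.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with typed placeholders and return an
    audit trail of what was redacted, for policy logging."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            findings.append(label)
            text = pattern.sub(f"[{label.upper()}]", text)
    return text, findings

redacted, audit = redact_pii("Reach me at jane@example.com or 555-867-5309")
print(redacted)   # Reach me at [EMAIL] or [PHONE]
print(audit)      # ['email', 'phone']
```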
Requirements
5+ years of experience in applied ML, NLP, or ML infrastructure engineering
Strong coding skills in Python and experience with frameworks like LangChain, LlamaIndex, or Haystack
Solid understanding of retrieval-augmented generation (RAG), embeddings, vector databases, and evaluation methodologies; a minimal RAG sketch follows this list
Experience deploying models or AI systems in production environments (AWS, GCP, or Azure)
Familiarity with prompt management, LLM observability, and CI/CD automation for AI workflows
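For reference, a toy end-to-end illustration of the RAG concepts listed above: a bag-of-words stand-in for embeddings, cosine similarity for retrieval, and prompt augmentation. Every function here is a simplification; real systems use learned embeddings and a vector database.

```python
import math

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words 'embedding'; real systems would call an
    embedding model and store vectors in a vector database."""
    vec: dict[str, float] = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus passages by similarity to the query (the 'R' in RAG)."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: -cosine(q, embed(doc)))[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Augment the prompt with retrieved context before generation."""
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQ: {query}"
```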