AI, Machine Learning, ML, LLM, Large Language Models, Agentic
About this role
Role Overview
Design and ship production-grade machine learning systems powering conversational and agentic AI experiences
Build systems that interpret user intent, manage context across multi-turn interactions, and handle ambiguity reliably at scale
Develop and evolve agentic workflows including memory, context management, and multi-step tool orchestration
Create evaluation frameworks, including LLM-as-judge pipelines, to measure quality and guide iteration
Partner closely with product, engineering, and design to deliver seamless, user-facing experiences
Balance experimentation with production rigor, ensuring performance, latency, and reliability at Spotify scale
Continuously improve agent behavior through tight feedback loops between evaluation and real-world usage
Requirements
5+ years of experience building and shipping machine learning systems in production environments
You are experienced with large language models and have worked on real-world applications beyond experimentation, shipping and maintaining large-scale LLM systems
You have a deep understanding of the challenges in conversational or agentic systems, such as context handling and multi-step reasoning
You know how to evaluate ML systems rigorously and have experience designing metrics or evaluation pipelines
You are comfortable debugging complex interactions between models, tools, and system constraints such as latency
You care about building reliable, scalable systems that deliver high-quality user experiences
You enjoy working cross-functionally and contributing to a collaborative, inclusive team environment