Airbnb was born in 2007 and has grown to over 5 million hosts worldwide.

The Senior Staff Machine Learning Engineer will lead the technical direction for ML evaluation, focusing on the data flywheel powering CSxAI products, and will work closely with cross-functional teams to improve machine learning models and systems.
Responsibilities:
- Set technical direction and lead execution for ML evaluation and the end-to-end data flywheel powering CSxAI products
- Define how we measure quality, turn feedback into learning signals, and continuously improve models and products safely and efficiently
- Partner closely with product, engineering, design, and operations to build evaluation systems that are trusted, scalable, and actionable
- Work with large-scale structured and unstructured data; explore, experiment, build, and continuously improve machine learning models and pipelines for Airbnb product, business, and operational use cases
- Collaborate with cross-functional partners, including product managers, operations, and data scientists, to identify opportunities for business impact; understand, refine, and prioritize requirements for machine learning and drive engineering decisions
- Develop, productionize, and operate machine learning models and pipelines at scale, hands-on, for both batch and real-time use cases
- Leverage third-party and in-house machine learning tools and infrastructure to build reusable, differentiated, high-performing ML systems that enable fast model development, low-latency serving, and easy upkeep of model quality
- Define evaluation strategy and success metrics for GenAI systems, aligning offline evaluation with online business and customer experience outcomes
- Build and scale evaluation frameworks (golden sets, synthetic data, automated regressions, rubric-based grading, LLM-as-judge where appropriate) with strong controls for bias, drift, and reliability
- Design the data flywheel: instrumentation, feedback collection, data quality checks, labeling strategy, dataset versioning, and governance to support continuous improvement
- Lead cross-functional quality initiatives across product, ops, and engineering, driving clarity on what 'good' looks like and how teams act on evaluation results
- Develop and productionize pipelines for dataset creation, model monitoring, evaluation-at-scale, and continuous testing (pre-deploy and post-deploy)
- Drive technical decisions and architecture for evaluation and data infrastructure, balancing speed, rigor, cost, and safety
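To make the evaluation-framework responsibilities above concrete, here is a minimal sketch of rubric-based grading with a regression gate over a golden set. All names (`Example`, `judge`, `pass_rate`, `regression_gate`) and the sample data are hypothetical; a production system would replace the keyword-stub judge with a calibrated judge-model call and add controls for bias, drift, and reliability.

```python
# Minimal sketch: rubric-based evaluation over a golden set with a
# pre-deploy regression gate. Illustrative only; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Example:
    prompt: str
    response: str                      # candidate model's answer
    must_mention: list = field(default_factory=list)  # rubric facts

def judge(example: Example) -> bool:
    """Stub judge: pass iff every required fact appears in the response.
    A real system would use a judge model scored against this rubric."""
    text = example.response.lower()
    return all(fact.lower() in text for fact in example.must_mention)

def pass_rate(golden_set: list) -> float:
    """Fraction of golden-set examples the candidate passes."""
    return sum(judge(ex) for ex in golden_set) / len(golden_set)

def regression_gate(candidate_rate: float, baseline_rate: float,
                    tolerance: float = 0.02) -> bool:
    """Block a deploy that regresses quality by more than `tolerance`."""
    return candidate_rate >= baseline_rate - tolerance

golden = [
    Example("What is the refund window?",
            "Refunds are issued within 24 hours.", ["24 hours"]),
    Example("How do I contact support?",
            "Use the Help Center in the app.", ["Help Center"]),
]
rate = pass_rate(golden)
```

The gate-versus-baseline pattern is what turns offline evaluation into an actionable pre-deploy check rather than a dashboard number.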
Requirements:
- PhD in Computer Science, Mathematics, Statistics, or related technical field (or equivalent practical experience)
- 10+ years building, testing, and shipping ML/AI systems end-to-end, including 2+ years with GenAI/LLM systems in production
- 5+ years leading large, ambiguous technical initiatives as a senior IC, influencing roadmap and engineering/science direction across teams
- Deep expertise in evaluation methodology (offline/online alignment, metric design, human-in-the-loop evaluation, A/B testing, power analysis, regression testing)
- Hands-on experience with GenAI systems, including orchestration, retrieval, tool calling, and memory
- Experience building data pipelines and quality systems (labeling workflows, dataset curation, versioning, monitoring, and governance)
- Solid ML fundamentals and best practices (model selection, training/serving, monitoring, reliability, and model lifecycle management)
- Experience applying ML/AI to customer support workflows (e.g., agent assist, classification/routing, resolution recommendation, QA)
- Experience building robust evaluation platforms for agent behavior validation, safety/guardrails, and continuous improvement
- Proven ability to take evaluation and data flywheel work from incubation to production, iterating quickly while maintaining scientific rigor
- Strong curiosity and ability to absorb new techniques (e.g., judge models, preference optimization, synthetic data generation) and apply them pragmatically
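As an illustration of the power-analysis expertise listed above, a back-of-envelope sample-size calculation for an online A/B test on a pass-rate metric might look like the following. The formula is the standard normal approximation for a two-proportion test; the function name and the example rates are illustrative, not taken from this posting.

```python
# Sketch: samples per arm needed to detect a pass-rate lift in an A/B test,
# using the two-proportion z-test normal approximation (stdlib only).
from math import ceil
from statistics import NormalDist

def samples_per_arm(p_baseline: float, p_treatment: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate n per arm to detect p_baseline -> p_treatment."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)            # desired statistical power
    var = (p_baseline * (1 - p_baseline)
           + p_treatment * (1 - p_treatment))
    effect = abs(p_treatment - p_baseline)
    return ceil((z_alpha + z_beta) ** 2 * var / effect ** 2)
```

For example, detecting a lift from a 50% to a 60% pass rate at the default settings needs a few hundred samples per arm, while smaller lifts or higher power drive the requirement up sharply.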