Design and execute system-level measurement frameworks for foundational model improvements, spanning offline evaluation benchmarks, online A/B experiments, and longitudinal impact tracking across surfaces.
Define and own the success metrics that quantify foundational model value.
Build causal inference methodologies to isolate the incremental impact of individual model components within a complex, multi-model production system where changes co-occur and interact.
Work cross-functionally to build relationships, proactively communicate key findings, and collaborate closely with ML Engineers, Applied Scientists, and the Homefeed and Surface teams to ensure measurement rigor is embedded in every model launch.
Relentlessly focus on impact, whether through sharpening investment decisions with data, raising the bar for launch criteria, accelerating experimentation velocity, or surfacing hidden inefficiencies in the model ecosystem.
Requirements
5+ years of experience analyzing data in a fast-paced, data-driven environment with proven ability to apply scientific methods to solve real-world problems on web-scale data.
Strong interest in and hands-on experience with one or more of: ML system evaluation, recommender system measurement, A/B experimentation at scale, causal inference.
Deep familiarity with large-scale recommendation or ranking systems and their evaluation, including an understanding of how representation learning, retrieval, ranking, and re-ranking stages interact and compound in production.
Experience designing and executing A/B experiments for complex ML systems, including multi-surface holdouts, metric decomposition, long-run effect estimation, and interference/spillover mitigation.
Strong quantitative programming (Python) and data manipulation skills (SQL/Spark); experience with ML pipelines, feature stores, and large-scale experimentation platforms.
Ability to work independently, drive ambiguous projects end-to-end, and operate with high ownership in a fast-moving research-to-production environment.
Excellent written and verbal communication skills, with the ability to translate complex system-level findings into clear narratives for technical and non-technical partners, including leadership-level investment recommendations.
A team player eager to partner across teams to turn measurement insights into better models and faster launches.