Microsoft is a company dedicated to empowering individuals and organizations to achieve more. In this role, you will lead the human-in-the-loop evaluation program for M365 Copilot, ensuring it provides the best AI assistant experience by developing evaluation frameworks and managing workforce operations.
Responsibilities:
- Define what great looks like for human data—bringing your own knowledge and perspective on the best approaches to collecting, interpreting, and applying human feedback across product decisions
- Lead the human-in-the-loop evaluation program—designing evaluation frameworks, scorecards, and quality benchmarks that measure and continuously raise the bar on M365 Copilot's response quality, helpfulness, and user experience
- Build and manage evaluation workforce operations, including vendor partnerships, annotator onboarding, qualification, training, and continuous performance management
- Partner with data scientists and engineers to scope evaluation needs, define task instructions, calibrate annotators, and ensure evaluation data is reliable and repeatable
Requirements:
- Bachelor's Degree AND 8+ years of experience in product/service/program management or software development, OR equivalent experience
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
- 3+ years of human data or human-in-the-loop evaluation experience
- Experience at an AI research organization or AI data services provider
- 4+ years of experience taking a product, feature, or experience to market (e.g., design, addressing product-market fit, and launch of an internal tool/framework)
- 6+ years of experience improving product metrics for a product, feature, or experience in market (e.g., growing the customer base, expanding customer usage, reducing customer churn)
- Experience building and managing workforce programs, including vendor partnerships and annotation operations at scale
- Proficiency in evaluation pipeline design, annotation frameworks, and quality governance
- Track record of translating human feedback and evaluation signals into measurable product impact