Lead and grow a team of ML engineers focused on production ML systems
Lead model improvements in response to production issues, product feedback, and new research or platform advancements
Lead production release processes for ML services, including release planning, CI/CD, staged rollouts, and rollback procedures
Build and operate observability and on-call practices for ML features, including monitoring, alerting, dashboards, incident response, and post-incident reviews
Develop and maintain scalable evaluation frameworks, datasets, and automated regression tests to prevent quality regressions
Lead reliability, performance, and cost improvements for inference and serving, including capacity planning and meeting SLAs (latency, throughput, availability)
Partner with researchers, product, and platform teams to define quality bars and production readiness, including Trusted AI requirements
Establish and evolve production standards and governance across ML features (testing, evaluation methodology, release gates, model versioning and lineage)
Partner with platform and product teams to integrate ML capabilities into products
Requirements
BS/MS in Computer Science or Engineering, or equivalent experience
Experience building and operating software systems, including production ML systems
People leadership experience, or strong technical leadership experience (mentoring, setting direction, driving delivery)
Experience with cloud infrastructure (AWS, Azure, or GCP) and production observability
Experience with CI/CD, reproducible deployments, and operating services in production
Strong written communication and documentation skills