The Associated Press is an independent global news organization dedicated to factual reporting. The Machine Learning Operations Engineer will own the production lifecycle of machine-learning systems, ensuring reliable, secure, and cost-effective operation of ML workloads in production environments.

Responsibilities:

Design, deploy, and operate end‑to‑end production ML pipelines across Dev, QA, and Prod environments
Set up and manage AWS SageMaker pipelines, endpoints, and monitoring for large-scale inference workloads, including embedding generation, named entity recognition, reranking, and video processing
Own GPU and CPU infrastructure selection, scaling, and optimization, including instance benchmarking, autoscaling behavior, and load testing
Deploy, monitor, and operate inference services that support hundreds of thousands of queries per day across text, image, and video pipelines
Establish standardized ML deployment patterns at AP, including containerization and orchestration strategies; environment isolation (Dev / QA / Prod); and versioned promotion, rollback, and recovery mechanisms
Implement monitoring, alerting, drift detection, and evaluation metrics for production ML systems, tracking latency, error rates, throughput, and model/data drift
Enable A/B testing and controlled rollout strategies for ML models in production, in partnership with engineering and product teams
Partner closely with ML Engineers, Data Scientists, DevOps, and Platform teams to operationalize new models and pipeline improvements, promote systems across environments safely, and ensure deployments meet reliability, scale, and cost targets
Manage high-throughput I/O and data movement for large collections of media assets (text, images, video), avoiding CPU, network, and storage bottlenecks
Reduce operational risk by enforcing reproducibility, observability, security, and cost controls across all production ML systems

Requirements:

5+ years of experience deploying and operating ML inference systems in production
Strong experience with AWS SageMaker, including pipelines, endpoints, monitoring, and multi‑environment deployments
Expertise deploying ML models using PyTorch and TensorFlow from an operational and serving perspective
Proven experience with model deployment and orchestration, including containerized inference and autoscaling
Experience selecting, evaluating, and optimizing compute resources (GPU/CPU) for production ML workloads
Experience setting up monitoring, evaluation metrics, and A/B testing frameworks for ML systems in production
Ability to collaborate effectively with ML Engineers, Data Scientists, and platform teams in a shared ownership model
Operational experience supporting ML systems involving transformer‑based NLP models (e.g., BERT‑family models), computer vision models, and ranking and reranking systems
Familiarity with running systems that use common ML model types such as convolutional and feed‑forward neural networks, ranking algorithms, and approximate nearest neighbor methods (e.g., HNSW)
Experience running ML workloads over large‑scale text, image, and video datasets
