Build and maintain ML pipelines for training, evaluation, and deployment using tools such as Databricks, MLflow, Airflow, dbt, SageMaker, and Tecton
Support AI scientists by creating reproducible, containerized model training environments (on-demand and scheduled) and managing compute at scale (e.g., spot/GPU autoscaling)
Define and implement observability and alerting for ML systems (model drift, data quality, feature coverage, etc.)
Design and scale data ingestion and feature transformation flows using batch (e.g., Spark/BigQuery) and streaming (e.g., Kafka or equivalent) processing
Contribute to internal Python libraries and platform tooling that accelerate experimentation and deployment for all model teams
Ensure ML services are modular, testable, and monitored from day one
Explore and productionize LLM-based features (e.g., retrieval pipelines, prompt evaluation, model serving)
Requirements
Proven experience designing and deploying ML systems in production (5+ years in relevant roles)
Proficiency in Python, SQL, and orchestration tools (Airflow, Kubeflow, Dagster, etc.)
Experience with modern cloud platforms (preferably GCP or AWS), Kubernetes, and CI/CD workflows
Understanding of ML model lifecycles: training, validation, deployment, and monitoring