Architect and implement end-to-end ML systems (data pipelines, feature engineering, model training, deployment, and monitoring).
Design scalable, low-latency model serving infrastructure integrated with Kubernetes and cloud-native systems.
Build intelligent automation solutions including predictive autoscaling, anomaly detection, seasonality-aware forecasting, and capacity optimization.
Engineer safe and reliable ML-driven automation that operates in high-availability environments.
Own model lifecycle management, including validation, experiment tracking, model registry, monitoring, drift detection, and rollback strategies.
Collaborate closely with platform, SRE, and infrastructure teams to embed ML capabilities into production systems.
Drive engineering best practices around ML system reliability, observability, testing, and performance.
Contribute to architectural decisions and mentor engineers on ML systems design.
Requirements
Option 1: Bachelor's degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology, or a related field and 5 years' experience in an analytics-related field.
Option 2: Master's degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology, or a related field and 3 years' experience in an analytics-related field.
Option 3: 7 years' experience in an analytics or related field.
Strong proficiency in one or more programming languages commonly used in ML engineering, such as Python, Go, or Java.
Strong experience with ML frameworks such as Scikit-learn, PyTorch, TensorFlow, or similar.
Strong SQL skills and experience working with large-scale datasets.
Hands-on experience training, validating, and deploying machine learning models in production across domains such as recommendation systems, forecasting, anomaly detection, classification, or similar applied ML use cases.
Experience building and maintaining end-to-end ML pipelines (data ingestion, feature engineering, training, evaluation, deployment, monitoring).
Experience with model serving architectures (REST/gRPC inference services, batch inference, streaming inference).
Hands-on experience with ML lifecycle platforms such as MLflow, Ray, Kubeflow, Airflow, or similar orchestration systems.
Experience with experiment tracking, model registry, CI/CD for ML, feature management, and automated retraining workflows.
Experience designing robust evaluation frameworks for traditional ML systems (offline validation, backtesting, shadow testing, A/B testing, and production performance monitoring).
Strong experience working with observability data (metrics, logs, traces) and time-series analysis in distributed systems.
Hands-on experience deploying and operating ML systems on Kubernetes, including containerization using Docker.
Experience working with major cloud platforms (AWS, GCP, or Azure) and cloud-native services.
Tech Stack
Airflow
AWS
Azure
Cloud
Distributed Systems
Docker
Google Cloud Platform
gRPC
Java
Kubernetes
Python
PyTorch
Ray
Scikit-learn
SQL
TensorFlow
Go
Benefits
Health benefits include medical, vision and dental coverage.
Financial benefits include 401(k), stock purchase and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment.