About this role
Role Overview
Build and maintain reproducible model training workflows on AWS (SageMaker, S3, Glue, etc.), making retraining, rollback, and experimentation routine rather than heroic.
Deploy and operate real-time and batch inference services with full CI/CD pipelines, versioning, and safe rollout strategies (canary, shadow, A/B) so changes are deliberate and observable.
Instrument production models for performance, data drift, latency, and errors — and automate retraining triggers when models drift out of tolerance.
Maintain model lineage, auditability, and traceability to meet the compliance, governance, and reporting needs of the regulated gaming industry.
Enforce least-privilege IAM, encryption, and secure data access patterns across the entire ML platform.
Treat cost as a first-class engineering metric — right-size infrastructure, balance batch vs. real-time workloads, and continually reduce platform spend without sacrificing reliability.
Collaborate with engineers, data scientists, and product teams to translate business problems into ML solutions, communicate tradeoffs clearly, and iterate based on feedback.
Continuously explore new AWS services, ML frameworks, and deployment patterns to improve reliability, observability, and developer velocity on the ML platform.
Requirements
3+ years of experience in machine learning engineering, MLOps, or a closely related discipline.
Hands-on experience with AWS ML and data services: SageMaker (training, endpoints, pipelines), S3, Lambda, Step Functions, CloudWatch, and MWAA (Amazon Managed Workflows for Apache Airflow).
Experience working with time series data, including feature engineering, seasonality handling, and temporal train/test splits.
Strong Python skills and familiarity with common ML frameworks (scikit-learn, PyTorch, XGBoost, or equivalent).
Experience building and maintaining CI/CD pipelines for ML systems.
Demonstrated ability to monitor and debug production ML systems — latency, drift, errors, and data quality — and drive issues to root cause.
Comfort with SQL and working with structured data at scale.
Ability to work collaboratively across teams, assume positive intent, and communicate clearly with both technical and non-technical stakeholders.
Track record of self-directed learning and technical growth in areas like AWS, ML frameworks, or deployment patterns.