Harnham is a leading recruitment specialist in Data and AI, currently partnering with a top-tier gaming and interactive entertainment company. The Senior ML Platform Engineer will be responsible for designing and implementing systems for model deployment, monitoring, and lifecycle management, collaborating closely with research and product/game teams to enhance the machine learning platform.
Responsibilities:
- Design and implement ML inference infrastructure for real-time and nearline/batch model serving, including CPU/GPU-aware orchestration and automated deployment pipelines
- Partner with research and product/game teams to understand needs and build generalizable, reusable platform solutions
- Contribute to CI/CD workflows for ML artifacts, enabling fast iteration and safe promotion from development to production
- Build and maintain tooling for environment and dependency management (e.g., Poetry/Conda lock files, secure container image builds) to ensure reproducible ML runtimes
- Implement observability and monitoring capabilities (latency, resource utilization, drift detection, reliability signals)
- Support best practices for production deployment including multi-version models, blue/green rollouts, shadow deployments, and robust rollback strategies
- Continuously improve developer experience through thoughtful tooling, documentation, and iterative platform enhancements
- Contribute to long-term platform architecture and cross-team infrastructure initiatives
Requirements:
- 4+ years of software engineering experience, with substantial time in platform or infrastructure teams
- Experience building and operating distributed systems or production ML platforms, ideally with a focus on model serving
- Strong experience with cloud-native systems (Kubernetes, containerization, autoscaling, observability stacks)
- Hands-on experience with one or more inference serving frameworks (e.g., Triton, KServe/KFServing, TorchServe, BentoML, Seldon, or similar)
- Familiarity with GPU orchestration, performance tuning, and cost-aware scheduling
- Strong background in CI/CD automation, Infrastructure-as-Code (e.g., Terraform), and artifact management
- Strong Python ecosystem experience, including package management (Poetry/Conda), and awareness of security/vulnerability scanning practices
- Strong communication skills and the ability to collaborate effectively across teams