
Job title: Sr Software Dev Engineer/ML Ops
Duration: 12-month contract
Location: Remote
Qualifications:
How does this role fit within the team/department?
This role sits within the core MLOps and ML infrastructure systems that enable researchers and engineers. The focus is on ensuring scalable, efficient, and standardized ML workflows.
Overview of the team:
Platform team enabling research + engineering via shared ML systems and platforms.
Key team goals:
Scalable ML workflows and pipelines, ML infra, efficient automation of ML workflows, robust data sampling / feature generation platforms, standardized training and deployment, GKE training and serving infra, knowledge distillation pipelines, foundational training tools
Must-have skills/qualifications (technical, soft skills, certifications, tools):
Ideal experience level (years, leadership, industries):
5-10+ years in large-scale ML systems
Responsibilities:
Primary responsibilities (daily/weekly):
Scalable ML workflows and pipelines
ML infra
Efficient automation of ML workflows
Robust data sampling / feature generation platforms
Standardized training and deployment
GKE training and serving infra
Knowledge distillation pipelines
Foundational training tools
Desired personality or work style:
Ownership-driven, pragmatic, fast-moving, strong collaborator with research + eng
Key attributes or values sought in the candidate:
Reliability and consistency focus, scalability mindset, cost awareness, methodical approach
Challenges in hiring for this role in the past (if applicable):
Candidates need deep combined expertise in ML, MLOps/infra, large-scale systems, and distributed systems
What has worked well in hiring for similar roles?
Prioritizing hands-on MLEs with proven large-scale ML platform experience
Any additional details or red flags to note about the role or candidate?
Avoid pure DevOps profiles without ML training experience, and purely theoretical profiles
Key projects or initiatives for the role:
Scalable ML workflows and pipelines, ML infra, efficient automation of ML workflows, robust data sampling / feature generation platforms, standardized training and deployment, GKE training and serving infra, knowledge distillation pipelines, foundational training tools
Success metrics or KPIs for this role:
Time to market, training efficiency, infra uptime, cost optimization, pipeline reliability, onboarding speed
How is success measured?
Time to market, faster experimentation cycles, reduced cost, stable deployments, high platform adoption