AWS, Azure, Cloud, Google Cloud Platform (GCP), Python, PyTorch, TensorFlow, ML, Deep Learning, CI/CD, Remote Work
About this role
Role Overview
Build, maintain, and optimize production-grade ML pipelines, enabling seamless transitions from experimentation to production.
Define and implement strategies for model versioning, rollout, rollback, and lifecycle management to ensure robust and reproducible ML systems.
Define and enforce serving-layer SLAs covering latency, availability, GPU utilization, time to first token (TTFT), and inter-token latency (ITL), and build observability and alerting around them.
Apply software engineering best practices including testing, CI/CD integration, and reproducibility to ML workflows, improving iteration speed for ML engineers without compromising reliability.
Ensure ML systems are secure, cost-efficient, and scalable, partnering with DevOps on infrastructure standards while owning ML-specific operational concerns.
Collaborate cross-functionally with ML, Data, Product, and DevOps teams to translate ML requirements into production-ready systems and influence technical planning and roadmap decisions.
Requirements
Bachelor’s or Master’s degree in Computer Science, Data Science, or a related field, or equivalent experience.
5-8+ years of experience in Software Engineering, ML Engineering, Platform Engineering, or Infrastructure Engineering with direct ownership of production ML serving systems.
Hands-on experience deploying and maintaining LLMs and deep learning models in production environments.
Strong Python skills and software engineering fundamentals with infrastructure depth. Familiarity with ML frameworks (PyTorch, TensorFlow, or similar) is preferred.
Experience with cloud platforms such as AWS, GCP, or Azure, and familiarity with ML lifecycle tooling, including model registries and experimentation platforms.
Familiarity with inference optimization at the hardware and systems level – batching strategies, memory management, quantization tradeoffs, CPU/GPU interaction patterns.
Demonstrated ability to reason about tradeoffs between latency, cost, throughput, and reliability at both the systems and operational levels.
Experience in high-growth startup environments and an ability to thrive in a fast-paced, evolving technical landscape.