Build and automate cloud infrastructure provisioning, scaling, and deployments using industry-standard tools and infrastructure-as-code practices
Architect and implement end-to-end MLOps pipelines for packaging, deploying, and monitoring large-scale ML services
Build and integrate telemetry agents to capture operational, performance, and inference metrics across distributed ML services
Build backend dashboards and observability workflows that surface quality, performance, traffic, and reliability insights for ML services
Lead the development of Agentic Ops solutions to optimize large-scale ML production workflows, reduce MTTR, and increase service engineering productivity
Develop and maintain robust CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins) enabling automated model conversion, optimization (PTQ/QAT), and artifact packaging
Drive standards in reliability, cost optimization, and operational readiness across service deployments
Requirements
8+ years of experience in DevOps, SRE, or cloud infrastructure engineering roles
Demonstrated experience designing and managing MLOps lifecycles, including model deployment, inference optimization, and production monitoring
Strong knowledge of CI/CD and GitOps practices, and of tools such as Docker, Terraform, GitHub Actions, GitLab CI, or Jenkins
Hands-on expertise with Kubernetes orchestration, including workflow frameworks such as Kubeflow or Argo Workflows
Strong programming skills in Python, with experience building automation tooling for ML or DevOps workflows
Proficiency with observability and monitoring platforms (e.g., Prometheus, Grafana, Splunk, New Relic) for building reliable production systems
Experience optimizing distributed architectures for cost efficiency, reliability, and performance
Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow) and model optimization tools (e.g., ONNX, TensorRT, TFLite, AOT) is a strong plus