Sumo Logic, Inc. helps make the digital world secure, fast, and reliable by unifying critical security and operational data through its Intelligent Operations Platform. As a Senior Machine Learning Engineer - MLOps/LLMOps, you will design and build scalable infrastructure for ML and LLM systems, collaborating with teams to operationalize AI/ML solutions.
Responsibilities:
- Design and implement scalable MLOps/LLMOps platforms supporting the full ML lifecycle: data versioning, model training, evaluation, deployment, and monitoring
- Build and maintain CI/CD pipelines for ML models and LLM applications with automated testing, validation, and rollback capabilities
- Develop infrastructure-as-code (IaC) for reproducible, version-controlled ML environments
- Architect model serving infrastructure with auto-scaling, A/B testing, and canary deployment capabilities
- Build platforms for LLM fine-tuning, prompt management, and experimentation at scale
- Implement evaluation frameworks for LLM performance, quality, safety, and cost optimization
- Design and deploy enterprise-grade AI agents and copilots with robust monitoring and guardrails
- Establish LLM observability: token usage tracking, latency monitoring, prompt/response logging, and cost attribution
- Own uptime, reliability, and performance of ML/LLM services (SLIs/SLOs)
- Implement comprehensive monitoring, alerting, and incident response for ML systems
- Participate in on-call rotations and drive post-incident reviews to improve system resilience
- Build automation and tooling to reduce toil and accelerate ML development velocity
- Partner with ML Engineers and Data Scientists to translate research into production-ready systems
- Collaborate with platform and infrastructure teams on cloud architecture and resource optimization
- Mentor team members on MLOps best practices, production ML patterns, and operational excellence
- Drive technical decisions on tooling, frameworks, and architectural patterns
Requirements:
- Education: B.S./M.S./Ph.D. in Computer Science, Engineering, or related technical field
- Experience: 4+ years of software engineering experience with 2+ years focused on MLOps/LLMOps
- Production experience with ML model serving frameworks (e.g., TensorFlow Serving, TorchServe, Triton)
- Hands-on experience with ML experiment tracking and model registry tools (MLflow, Weights & Biases, Kubeflow)
- Proficiency in workflow orchestration (Airflow, Prefect, Kubeflow Pipelines, Metaflow)
- Experience with LLM deployment, fine-tuning, and evaluation frameworks (e.g., vLLM, LangChain, LlamaIndex)
- Knowledge of prompt engineering, RAG architectures, and LLM application patterns
- Familiarity with LLM observability tools (e.g., LangSmith, Arize, WhyLabs)
- Strong experience with major cloud providers (AWS, GCP, or Azure) and ML-specific services (SageMaker, Vertex AI, Azure ML, Bedrock)
- Proficiency in containerization and orchestration (Docker, Kubernetes) and infrastructure-as-code (Terraform, CloudFormation, Pulumi)
- Experience with microservices architecture and API development (REST, gRPC)
- Strong programming skills in Python, Terraform, and Helm; familiarity with Go, Java, or Rust is a plus
- Deep understanding of CI/CD practices and tools (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
- Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK)
- Track record of managing production systems with defined SLIs/SLOs
- Experience with on-call rotations, incident management, and reliability engineering practices
- Experience building internal ML platforms or developer tooling used by multiple teams
- Hands-on experience with distributed training frameworks (Ray, Horovod, DeepSpeed)
- Knowledge of model optimization techniques (quantization, distillation, pruning)
- Familiarity with feature stores (Feast, Tecton) and data versioning tools (DVC, lakeFS)
- Understanding of ML security best practices, model governance, and compliance requirements
- Experience with cost optimization and resource management for large-scale ML workloads
- Contributions to open-source MLOps/LLMOps projects
- Background in applied ML or data science with practical model development experience