Sumo Logic, Inc. empowers the people who power modern, digital business. As a Staff Machine Learning Engineer – AI Tech Lead, you will lead the design and delivery of advanced AI systems for the Security Operations Center (SOC), focusing on the evaluation and productionization of agentic AI technologies.
Responsibilities:
- Lead the technical evaluation and adoption of cutting-edge agentic AI platforms, including Anthropic (Claude), LangChain/LangGraph, AWS Bedrock, and other emerging agent frameworks, in partnership with fellow leaders and their teams
- Architect, prototype, and productionize multi-agent AI systems for Agentic SOC use cases, including detection, triage, investigation, and response workflows
- Own the design of core agent architecture components, including planning, execution, tool orchestration, memory, context engineering, and long-running agent workflows
- Lead AI agent evaluation systems, including offline and online evaluation pipelines, golden datasets, synthetic data generation, human- and LLM-based judging, and continuous quality monitoring
- Drive LLM fine-tuning and alignment efforts to improve domain-specific reasoning, accuracy, and reliability for security and observability use cases
- Design scalable LLMOps and AI agent infrastructure, including inference routing, latency optimization, cost control, and production observability for agent systems
- Partner with product, security, and data platform leadership and teams to deliver end-to-end AI agent capabilities from prototype to customer-facing production systems
- Provide technical direction and mentorship, in partnership with other leaders, for AI engineers working on agentic AI and LLM systems
- Define and implement best practices for AI safety, reliability, evaluation, and monitoring in production agentic systems
- Operate as a senior technical owner in ambiguous problem spaces: setting technical direction, breaking down complex problems, and driving delivery across teams
Requirements:
- B.Tech, M.Tech, or Ph.D. in Computer Science, Machine Learning, Data Science, or a related technical field
- 5+ years of hands-on industry experience building, operating, and leading production ML/AI systems, with demonstrated technical leadership and ownership
- Strong foundation in machine learning, distributed systems, data pipelines, and large-scale system design
- Deep industry understanding of LLMs, prompt engineering, context engineering, agentic AI design patterns, and reasoning workflows
- Strong proficiency in Python and modern ML/AI ecosystems
- Experience designing and operating evaluation frameworks for ML/LLM systems (offline + online)
- Proven ability to lead complex technical initiatives across teams and influence architecture decisions
- Excellent communication skills and ability to translate complex AI systems into business impact
- Hands-on experience building and scaling agentic AI systems or multi-agent architectures in production
- Experience with modern agent frameworks such as LangGraph, LangChain, CrewAI, or similar
- Experience with major foundation model platforms such as Anthropic, OpenAI, AWS Bedrock, or Vertex AI
- Experience with LLM fine-tuning pipelines (SFT, RLHF/RLAIF, preference learning, domain adaptation)
- Strong background in LLMOps, including inference optimization, latency/cost management, observability, and production monitoring
- Experience with ML infrastructure and tooling such as PyTorch, MLflow, Airflow, Docker, Kubernetes, and cloud platforms (AWS/GCP/Azure)
- Experience applying AI/ML to security, observability, or large-scale log/telemetry data is a strong plus