Sumo Logic, Inc. helps make the digital world secure, fast, and reliable by unifying critical security and operational data through its Intelligent Operations Platform. As a Staff Software Engineer on the Core AI Platform team, you will lead the design and development of the foundational platform that powers Dojo AI agents. The role covers both architecture and implementation, with a focus on secure, efficient interactions between agents and enterprise systems.

Responsibilities:

Architect MCP-first platforms: design scalable, fault-tolerant infrastructure for hosting and operating MCP servers; define standards for MCP server onboarding, versioning, and interoperability.
Build federated context systems: enable agents to retrieve and reason over context from multiple internal and external MCP servers; design secure, low-latency context propagation and caching strategies.
Lead agent-to-tool communication design: build resilient tool invocation frameworks that handle partial failures gracefully; ensure deterministic execution paths where possible in probabilistic AI systems.
Enable conversational agent ecosystems: architect integrations with Slack, Teams, and similar platforms for real-time agent interactions; design event-driven systems for message ingestion, agent response, and feedback loops.
Drive technical leadership: lead architecture and design reviews across AI, platform, and product teams; mentor engineers and establish best practices for building AI infrastructure.
Operate at scale: continuously improve platform scalability, reliability, latency, and cost efficiency; own production readiness, incident response patterns, and operational excellence.

Requirements:

B.S. in Computer Science or related discipline (M.S. preferred)
8+ years of experience building large-scale, distributed backend systems
Deep distributed systems expertise: microservices, async/event-driven systems, and fault-tolerant architectures
Strong backend programming skills: Java, Scala, Go, or Python with solid object-oriented design principles
Concurrency & async programming: multi-threading, non-blocking I/O, and message-driven architectures
API & protocol design: experience designing extensible APIs and protocol-based integrations
Production systems experience: operating 24x7 multi-tenant services with SLAs and on-call ownership
MCP (Model Context Protocol) expertise: hands-on experience building or operating MCP servers or similar agent protocols
Federated systems: experience integrating with external services across trust boundaries
Agent & LLM platforms: experience building AI agent infrastructure (LangChain, LangGraph, CrewAI, AutoGen, etc.)
AWS cloud-native: EC2, ECS/EKS, Lambda, SQS, DynamoDB, CloudWatch
Infrastructure as Code: Terraform, OpenAPI, CI/CD pipelines
Security: OAuth, token exchange, secrets management, and multi-tenant isolation
Tool calling / plugin systems: experience designing extensible tool registries or function-calling frameworks
Communication platforms: Slack, Microsoft Teams, or webhook-based event systems
Observability: distributed tracing, metrics, and structured logging (OpenTelemetry a plus)

Staff Software Engineer – Core AI Platform (MCP & Agent Infrastructure)
