Sumo Logic, Inc. helps make the digital world secure, fast, and reliable by unifying critical security and operational data through its Intelligent Operations Platform. As a Staff Software Engineer on the Core AI Platform team, you will lead the design and development of a foundational platform that powers Dojo AI agents, with a focus on architecture, implementation, and secure, efficient interactions with enterprise systems.
Responsibilities:
- Architect MCP-first platforms
  - Design scalable, fault-tolerant infrastructure for hosting and operating MCP servers
  - Define standards for MCP server onboarding, versioning, and interoperability
- Build federated context systems
  - Enable agents to retrieve and reason over context from multiple internal and external MCP servers
  - Design secure, low-latency context propagation and caching strategies
- Lead agent-to-tool communication design
  - Build resilient tool invocation frameworks that handle partial failures gracefully
  - Ensure deterministic execution paths where possible in probabilistic AI systems
- Enable conversational agent ecosystems
  - Architect integrations with Slack, Teams, and similar platforms for real-time agent interactions
  - Design event-driven systems for message ingestion, agent response, and feedback loops
- Drive technical leadership
  - Lead architecture and design reviews across AI, platform, and product teams
  - Mentor engineers and establish best practices for building AI infrastructure
- Operate at scale
  - Continuously improve platform scalability, reliability, latency, and cost efficiency
  - Own production readiness, incident response patterns, and operational excellence
Requirements:
- B.S. in Computer Science or related discipline (M.S. preferred)
- 8+ years of experience building large-scale, distributed backend systems
- Deep distributed systems expertise
  - Microservices, async/event-driven systems, and fault-tolerant architectures
- Strong backend programming skills
  - Java, Scala, Go, or Python with solid object-oriented design principles
- Concurrency & async programming
  - Multi-threading, non-blocking I/O, and message-driven architectures
- API & protocol design
  - Experience designing extensible APIs and protocol-based integrations
- Production systems experience
  - Operating 24x7 multi-tenant services with SLAs and on-call ownership
- MCP (Model Context Protocol) expertise
  - Hands-on experience building or operating MCP servers or similar agent protocols
- Federated systems
  - Experience integrating with external services across trust boundaries
- Agent & LLM platforms
  - Experience building AI agent infrastructure (LangChain, LangGraph, CrewAI, AutoGen, etc.)
- AWS cloud-native
  - EC2, ECS/EKS, Lambda, SQS, DynamoDB, CloudWatch
- Infrastructure as Code
  - Terraform, OpenAPI, CI/CD pipelines
- Security
  - OAuth, token exchange, secrets management, and multi-tenant isolation
- Tool calling / plugin systems
  - Experience designing extensible tool registries or function-calling frameworks
- Communication platforms
  - Slack, Microsoft Teams, or webhook-based event systems
- Observability
  - Distributed tracing, metrics, structured logging (OpenTelemetry a plus)