Docker, Inc. is a leading brand in developer tooling, trusted by millions of users worldwide. The company is seeking a Staff Software Engineer to join the Agentic Platform team, focusing on building the foundational infrastructure for AI-driven workflows and ensuring the reliability and scalability of agentic systems.
Responsibilities:
- Design and operate the core agent execution runtime responsible for scheduling, state management, and lifecycle management of long-running agentic workflows
- Build robust multi-agent coordination patterns: task handoff, agent memory (short-term and long-term), tool use, and workflow branching at scale
- Develop context window management strategies and session persistence layers for stateful agent interactions
- Build tooling for prompt engineering as a first-class engineering discipline — versioning, testing, and evaluation of prompts at scale
- Build platform capabilities that support developers working in AI-assisted coding workflows, including IDE integrations, local-first development environments, and fast iteration loops
- Own and operate Agentic Platform services in AWS or OCI: infrastructure provisioning, scaling, cost management, and reliability
- Provision and manage cloud infrastructure using Terraform; manage Kubernetes application packaging and deployment with Helm
- This role may require participation in a 24/7 on-call rotation for the Agentic Platform, with genuine pager responsibility for the services you build and operate
- Define and uphold SLOs; lead incident response and blameless post-mortems, and drive continuous reliability improvements
- Instrument systems for observability: distributed tracing, structured logging, metrics dashboards, and alerting
- As a Staff Engineer, partner with engineering leadership to set technical direction and serve as a guide and mentor as the team grows
- Drive architectural decisions that balance velocity with long-term maintainability across a distributed, cloud-native stack
- Collaborate cross-functionally with product managers, designers, and partner engineering teams to integrate agentic capabilities into the broader developer platform
- Contribute to a culture of engineering excellence through design reviews, RFC processes, and mentorship
Requirements:
- 8+ years of professional, hands-on, full-time software engineering experience in backend, infrastructure, or platform engineering
- Cloud Platform Expertise (AWS/OCI/Azure/GCP): Proven, hands-on experience operating production services in AWS or Oracle Cloud Infrastructure, covering compute, networking, managed services, IAM, and cost management. This is a must-have; the Agentic Platform is a cloud-native service running 24/7
- Service Ownership in a Cloud Setting: You have owned production services end-to-end — on-call, incident response, SLO definition, and post-mortems. You don't just build; you run what you build
- Distributed Systems Design: Deep understanding of fault tolerance, consistency, observability, and scalability in cloud-native environments
- Backend Engineering Proficiency: Strong proficiency in at least one backend language used for systems work — Go, Python, Rust, or Java
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- Go: Professional proficiency in Go — Docker's primary language for backend systems
- Infrastructure as Code: Experience with Terraform for cloud infrastructure provisioning and Helm for Kubernetes application packaging and deployment
- Data Infrastructure: Experience with PostgreSQL and Redis / Pub-Sub patterns for state management, caching, and event-driven agent workflows
- MCP & Agent Tooling: Experience with MCP (Model Context Protocol) server design and integration
- Container & Orchestration: Docker, Kubernetes, or equivalent — especially in the context of agent sandboxing and secure code execution environments
- AI-assisted development tools: Familiarity with Cursor, Claude Code, Copilot, Windsurf, etc. and the developer personas using them
- Agent Evaluation: Experience with LLM-as-judge frameworks, behavioral regression testing, and golden dataset management
- Agent Systems Experience: Hands-on experience building or operating AI agent systems — including multi-agent orchestration, tool use, memory systems, or agent evaluation frameworks
- Open Source: Contributions or community engagement on relevant open source projects