Docker, Inc. is a leading brand in developer tooling, trusted by millions of users worldwide. The company is seeking a Staff Software Engineer to join the Agentic Platform team, focusing on building the foundational infrastructure for AI-driven workflows and ensuring the reliability and scalability of agentic systems.

Responsibilities:

Design and operate the core agent execution runtime responsible for scheduling, state management, and lifecycle management of long-running agentic workflows
Build robust multi-agent coordination patterns: task handoff, agent memory (short-term and long-term), tool use, and workflow branching at scale
Develop context window management strategies and session persistence layers for stateful agent interactions
Build tooling for prompt engineering as a first-class engineering discipline — versioning, testing, and evaluation of prompts at scale
Build platform capabilities that support developers working in AI-assisted coding workflows, including IDE integrations, local-first development environments, and fast iteration loops
Own and operate Agentic Platform services in AWS or OCI: infrastructure provisioning, scaling, cost management, and reliability
Provision and manage cloud infrastructure using Terraform; manage Kubernetes application packaging and deployment with Helm
This role may require participation in a 24/7 on-call rotation for the Agentic Platform; carry genuine pager responsibility for the services you build and operate
Define and uphold SLOs; lead incident response and blameless post-mortems, and drive continuous reliability improvements
Instrument systems for observability: distributed tracing, structured logging, metrics dashboards, and alerting
As a Staff Engineer, partner with engineering leadership to set technical direction and serve as a guide and mentor as the team grows
Drive architectural decisions that balance velocity with long-term maintainability across a distributed, cloud-native stack
Collaborate cross-functionally with product managers, designers, and partner engineering teams to integrate agentic capabilities into the broader developer platform
Contribute to a culture of engineering excellence through design reviews, RFC processes, and mentorship

Requirements:

8+ years of professional, hands-on, full-time software engineering experience in backend, infrastructure, or platform engineering
Cloud Platform Expertise (AWS/OCI/Azure/GCP): Proven, hands-on experience operating production services in AWS or Oracle Cloud Infrastructure: compute, networking, managed services, IAM, and cost management. This is a must-have; the Agentic Platform is a cloud-native service running 24/7
Service Ownership in a Cloud Setting: You have owned production services end-to-end — on-call, incident response, SLO definition, and post-mortems. You don't just build; you run what you build
Distributed Systems Design: Deep understanding of fault tolerance, consistency, observability, and scalability in cloud-native environments
Backend Engineering Proficiency: Strong proficiency in at least one backend language used for systems work — Go, Python, Rust, or Java
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
Go: Professional proficiency in Go — Docker's primary language for backend systems
Infrastructure as Code: Experience with Terraform for cloud infrastructure provisioning and Helm for Kubernetes application packaging and deployment
Data Infrastructure: Experience with PostgreSQL and Redis / Pub-Sub patterns for state management, caching, and event-driven agent workflows
MCP & Agent Tooling: Experience with MCP (Model Context Protocol) server design and integration
Container & Orchestration: Docker, Kubernetes, or equivalent — especially in the context of agent sandboxing and secure code execution environments
AI-assisted development tools: Familiarity with tools such as Cursor, Claude Code, Copilot, and Windsurf, and with the developer personas who use them
Agent Evaluation: Experience with LLM-as-judge frameworks, behavioral regression testing, and golden dataset management
Agent Systems Experience: Hands-on experience building or operating AI agent systems — including multi-agent orchestration, tool use, memory systems, or agent evaluation frameworks
Open Source: Contributions or community engagement on relevant open source projects

Staff Software Engineer, Agentic Platform (West Coast)
