Docker, Inc. is a leading developer-tooling company trusted by millions of users. It is seeking a Staff Software Engineer to strengthen its infrastructure by building self-service capabilities and improving operational workflows, so that teams can provision resources and deploy applications efficiently.
Responsibilities:
- Take ambiguous infrastructure problems and turn them into proposals the org can rally around, then drive them through RFCs and architecture reviews across teams
- Design self-service capabilities and platform APIs (primarily in Go) for onboarding, provisioning, deployment, observability defaults, and day-2 operations, with contracts and docs that teams actually use
- Set delivery standards using Terraform, GitOps with Argo CD, progressive rollout, and good testing, including building the continuous-deployment flow we're missing today
- Evolve the multi-tenant EKS foundations toward better reliability, security, and scale at lower cost, including Envoy Gateway ingress, traffic routing, and the multi-region, cross-account connectivity we need
- Improve SLOs, alerting, and incident follow-up on Grafana Cloud so production gets safer and less dependent on heroics
- We're actively investing in AI-assisted and agentic workflows to cut operational toil, and we care that they stay safe, auditable, and human-reviewed. You'll help shape where these earn their place and where they don't, for example:
  - Alert enrichment and incident context-gathering: assembling the relevant signals, history, and runbook so the on-call engineer starts with context instead of a blank page
  - Runbook-assisted diagnosis and remediation recommendations, with a human in the loop on anything that changes production
  - Onboarding and readiness assistants that answer the questions our experts answer today
- Operational ownership is part of the job. You'll join the on-call rotation after onboarding and shadowing. As a Staff engineer, you'll also improve the health of on-call itself through better alerts, stronger runbooks, less toil, and blameless postmortems aimed at prevention
Requirements:
- 8+ years of professional, hands-on, full-time software engineering experience in backend, infrastructure, or platform engineering
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- Strong software engineering skills in Go or a similar language: design, testing, debugging, code review, and long-term maintainability
- A track record designing, shipping, and operating cloud services or infrastructure platforms in production. We hire for skill and impact, not years
- Deep expertise in at least one of: Kubernetes, networking, cloud platforms, reliability engineering, or developer platforms, plus solid Linux, networking, and production-ops fundamentals
- Experience setting technical direction and leading work that needs cross-team alignment
- Clear written and verbal communication in a remote environment (RFCs, design docs, incident writeups)
- EKS and ingress/CNI/service-mesh experience; observability with OpenTelemetry/Prometheus/Grafana; CI/CD and progressive delivery (GitHub Actions, Argo CD, canaries); experience leading migrations or adoption programs across teams