Docker, Inc. provides a leading platform for app development and is seeking a Software Engineer for its Infrastructure Platform team. The role focuses on building and operating cloud-native services that improve the developer experience through automation and self-service capabilities.
Responsibilities:
- Build and operate internal platform services and APIs in Go, including provisioning, quotas and policies, cost insights, and platform workflows
- Deliver golden paths for self-serve onboarding and day-2 operations, including access, deployment setup, observability defaults, and governance guardrails
- Partner with teams to drive adoption through clear docs, examples, and measurable outcomes
- Codify infrastructure with Terraform and GitOps practices, and contribute to platform tooling in Go
- Define and improve SLOs, alerting, and operational readiness
- Participate in incident response and preventive follow-ups
- Help standardize safe delivery patterns, including testing gates, canaries, and rollback triggers, so deployments are routine and low-risk
- Operate and scale multi-tenant EKS clusters, along with traffic and ingress systems, to deliver secure, reliable routing
- Evaluate and adopt improvements with a bias toward incremental rollout and measurable impact
- Build and iterate on agentic workflows that reduce operational toil, including triage support, context gathering, safe runbook execution, and remediation suggestions
- Integrate automation into delivery and operations in a way that is safe, observable, and auditable
- Join an on-call rotation after onboarding and shadowing, and participate in incident response during your shifts
Requirements:
- 4+ years of backend software engineering experience building large-scale cloud or distributed systems
- Strong software development skills in Go or a similar language, including design, testing, debugging, and code review
- Experience shipping and operating cloud services in production, typically 3+ years. We hire for skill and impact, not years alone
- Solid foundation in Linux, networking fundamentals, and cloud security
- Experience building operational automation, including AI-assisted or agentic workflows, with an emphasis on safety, guardrails, and auditability
- Clear written and verbal communication in a remote environment, including RFCs, incident writeups, and async collaboration
- Kubernetes and EKS experience, plus ingress, CNI, service mesh, and familiarity with L4 and L7 load balancing
- Observability tooling such as OpenTelemetry, Prometheus, and Grafana, plus alerting and SLO practice
- CI/CD and progressive delivery, including GitHub Actions or Argo CD, canaries, and automated rollback
- Cost optimization at scale, including FinOps and capacity modeling
- Distributed systems, containers, and Go-based platform tooling