The Home Depot is seeking a Senior Software Engineer for Site Reliability Engineering to build and operate internal platforms that ensure the reliability and observability of store systems. The role involves designing, developing, and maintaining tools for various development and reliability teams, focusing on automation and operational efficiency.
Responsibilities:
- Develops, tests, deploys, and maintains software with a clear understanding of the value the software is to provide; Takes on new opportunities and tough challenges with a sense of urgency, high energy, and enthusiasm; Consistently achieves results, even under tough circumstances; Develops test suites (functional, destructive, etc.) to enable successful, rapid deployment of code to production; Takes a broad view when approaching issues, using a global lens
- Learns through successful and failed experiments when tackling new problems; Actively seeks ways to grow and be challenged using both formal and informal development channels
- Collaborates with other team members in agile processes; Creates new and better ways for the organization to be successful; Works with the Product Team to ensure user stories are valuable, developer ready, easy to understand, and testable; Delivers multi-mode communications that convey a clear understanding of the unique needs of different audiences; Adapts approach and demeanor in real time to match the shifting demands of different situations; Relates openly and comfortably with diverse groups of people
- Helps grow junior engineers by providing guidance on modern software development frameworks and leading technical discussions
Requirements:
- Must be eighteen years of age or older
- Must be legally permitted to work in the United States
- 3 years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Infrastructure Engineering
- Hands-on experience with Google Cloud Platform (GCP), including GKE, GCS, BigQuery, Cloud Pub/Sub, Cloud Logging, IAM, and Workload Identity
- Strong Kubernetes experience: deploying and managing workloads on GKE or similar managed Kubernetes services, writing and debugging Helm charts, managing namespaces, RBAC, service accounts, and troubleshooting issues
- Experience with infrastructure-as-code tools, particularly Terraform for cloud resource management
- Proficiency in one or more of: Go, Python, JavaScript/TypeScript, YAML
- Experience with observability platforms: deploying, configuring, or operating log aggregation, distributed tracing, metrics, dashboarding, or continuous profiling
- Practical understanding of SLOs, SLIs, and error budgets
- Experience with synthetic monitoring or performance testing frameworks (k6, Playwright, Selenium, Locust, or similar)
- Familiarity with incident management and on-call practices: blameless post-mortems, runbook development, and incident communication
- Experience with CI/CD pipelines using GitHub Actions, Spinnaker, ArgoCD, or similar
- Experience with automation to reduce operational toil: building self-service tooling, writing scripts or bots to handle repetitive tasks, or developing internal developer platforms
- Experience writing clear technical documentation, runbooks, and onboarding guides
- Comfort working on a small team with broad ownership
- The knowledge, skills and abilities typically acquired through the completion of a bachelor's degree program or equivalent degree in a field of study related to the job
- Preferred: 3-5 years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Infrastructure Engineering
- Experience with other major cloud providers (AWS, Azure) is also valuable
- Familiarity with cdk8s (CDK for Kubernetes) or similar programmatic IaC tools is a plus
- Bonus if you've built or operated a synthetic testing platform rather than just consumed one
- Familiarity with AI-assisted development tools (GitHub Copilot, LLM-based automation, MCP servers) is a plus