Doghouse Recruitment is seeking a Senior/Staff Site Reliability Engineer to join their client's team building a cloud platform for high-throughput, compute-heavy workloads. The role involves owning production reliability, defining SLIs/SLOs, and improving deployment safety while working in a bare-metal environment.
Responsibilities:
- Define SLIs/SLOs
- Run error budget conversations
- Ship changes that reduce incidents and improve latency (p95/p99)
- Build automation to kill toil
- Improve deployment safety (canary/rollback)
- Turn observability into signal rather than noise
Requirements:
- Extensive Production Engineering experience running bare metal / on-prem / data center infrastructure (not public cloud only)
- Deep hands-on expertise in Linux systems debugging and performance (CPU, memory, IO, kernel-level behaviors)
- Strong understanding of networking (DNS/TCP/TLS, latency, packet loss, congestion, troubleshooting under load)
- Strong Kubernetes experience beyond manifests: scheduler behavior, autoscaling edge cases, kubelet pressure/evictions, etcd/control plane
- Experience with Terraform, Docker, Helm, and modern CI/CD practices
- Strong coding skills in Go and/or Python that go beyond automation scripting; real engineering capability is a must
- Experience in low-latency environments