Key skills

Production Engineering - Bare Metal Infrastructure, Linux Systems Debugging, Linux Performance Analysis, Networking - DNS, Networking - TCP, Networking - TLS, Networking - Latency, Networking - Packet Loss, Networking - Congestion, Kubernetes Scheduler Behavior, Kubernetes Autoscaling, Kubernetes Kubelet Pressure/Evictions, Kubernetes Etcd/Control Plane, Terraform, Docker, Helm, CI/CD Practices, Go Programming, Python Programming, Low Latency Environments, Python, Go, Kubernetes, CI/CD

About this role

Doghouse Recruitment is seeking a Senior/Staff Site Reliability Engineer to join their client's team building a cloud platform for high-throughput, compute-heavy workloads. The role involves owning production reliability, defining SLIs/SLOs, and improving deployment safety while working in a bare-metal environment.

Responsibilities:

Define SLIs/SLOs
Run error budget conversations
Ship changes that reduce incidents and improve latency (p95/p99)
Build automation to kill toil
Improve deployment safety (canary/rollback)
Turn observability into signal rather than noise

Requirements:

Extensive Production Engineering experience running bare metal / on-prem / data center infrastructure (not public cloud only)
Deep hands-on expertise in Linux systems debugging and performance (CPU, memory, IO, and kernel-level behaviors)
Strong understanding of networking (DNS/TCP/TLS, latency, packet loss, congestion, troubleshooting under load)
Strong Kubernetes experience beyond manifests: scheduler behavior, autoscaling edge cases, kubelet pressure/evictions, etcd/control plane
Experience with Terraform, Docker, Helm, and modern CI/CD practices
Strong coding skills in Go and/or Python that go beyond automation scripting; real engineering capability is a must
Experience in low-latency environments
