Doghouse Recruitment is a cloud technology company focused on AI infrastructure, enabling organizations to build and scale AI and ML solutions. They are seeking a Senior Software Engineer to build network automation and observability systems for their global GPU fleet. The role is fast-paced and involves working with cutting-edge technology.
Responsibilities:
- Build and maintain the services and tools that keep their global network of thousands of GPU nodes running smoothly
- Build tooling that sits between the network core and the cloud platform running on top
- Create monitoring and alerting that gives the team clear visibility and helps resolve issues faster
- Make network changes less risky through solid review processes and safeguards
- Work closely with network engineers and SREs to turn day-to-day pain points into reliable internal tools
Requirements:
- 10+ years of professional software engineering experience, or equivalent practical background
- Proficiency in Go, or a genuine willingness to switch to it; Python experience is also welcome
- You don't need to be a network expert, but a genuine interest in infrastructure and networking is expected
- Strong communication skills and the ability to work autonomously in a fast-paced, high-trust environment
Nice to have:
- Background in network engineering or SRE: someone who understands operational realities, not just code
- Experience at companies operating at hyperscale: Cloudflare, major cloud providers, or similar
- Familiarity with Prometheus-compatible monitoring stacks (e.g. VictoriaMetrics) or large-scale telemetry systems
- Exposure to Juniper or other vendor networking equipment
- Comfort debugging OSS projects and contributing fixes across languages