HappyRobot is the infrastructure for enterprises to build and orchestrate AI workforces. The Infrastructure Engineer will lead efforts to scale operational resilience, ensure system stability, and improve debugging workflows.
Responsibilities:
- Own the stability, observability, and debugging workflows that keep our systems running smoothly
- Be the go-to person for untangling complex failures in real time
- Design tools that turn chaos into clarity
- Help shift from reactive to proactive operations
- Reduce incident load, build internal tooling, and directly improve developer focus and system uptime
Requirements:
- 3+ years of hands-on experience debugging production systems (logs, traces, incidents, etc.)
- Strong problem-solving skills and ability to dive into unfamiliar backend codebases
- Strong Go and Kubernetes experience
- Familiarity with observability and monitoring tools (e.g., Datadog, Prometheus, Sentry)
- Clear, calm communication under pressure, especially during live incidents
- Experience working with distributed systems or services at scale
- Experience building or maintaining internal tooling for on-call teams or reliability workflows
- Familiarity with deployment pipelines, CI/CD, or infra-as-code
- Experience improving system observability (e.g., custom metrics, traces, log pipelines)