About this role

HappyRobot is the infrastructure for enterprises to build and orchestrate AI workforces. The Infrastructure Engineer will lead efforts in scaling operational resilience, ensuring system stability, and enhancing debugging workflows.

Responsibilities:

Own the stability, observability, and debugging workflows that keep our systems running smoothly
Be the go-to person for untangling complex failures in real time
Design tools that turn chaos into clarity
Help shift from reactive to proactive operations
Reduce incident load, build internal tooling, and directly improve developer focus and system uptime

Requirements:

3+ years of hands-on experience debugging production systems (logs, traces, incidents, etc.)
Strong problem-solving skills and ability to dive into unfamiliar backend codebases
Strong Go and Kubernetes experience
Familiarity with observability and monitoring tools (e.g., Datadog, Prometheus, Sentry)
Clear, calm communication under pressure, especially during live incidents
Experience working with distributed systems or services at scale
Experience building or maintaining internal tooling for on-call teams or reliability workflows
Familiarity with deployment pipelines, CI/CD, or infra-as-code
Experience improving system observability (e.g., custom metrics, traces, log pipelines)
