HiBob, a company focused on AI-driven operations, is seeking a Senior Site Reliability Engineer to enhance production stability and automation. The role involves collaborating with global DevOps teams to manage AWS/Kubernetes environments and improve operational excellence.
Responsibilities:
- Design, build, and operate production-grade Kubernetes infrastructure on AWS
- Develop AI agents to handle incidents and perform root cause analysis
- Build and maintain GitOps-based CI/CD pipelines using GitHub Actions and ArgoCD
- Develop internal DevOps tooling and developer self-service platforms
- Own monitoring, observability, and operational excellence using Datadog
- Collaborate with engineering teams to improve delivery speed and reliability
Requirements:
- 5+ years of experience as an SRE or DevOps Engineer (this is a hard requirement)
- Extensive experience managing live, high-traffic SaaS environments; developer-only backgrounds without ops experience will not be a fit
- Proven mastery of Kubernetes and AWS in production settings
- A strong understanding of or direct experience with AI/LLM technologies
- Hands-on experience with Datadog for monitoring and incident response
- Ability to work independently without direct daily oversight, managing production incidents and on-call responsibilities
- Located in the Eastern time zone (East Coast) to provide coverage overlap with our global teams
- Advanced proficiency in Python (preferred) or Go for automation