HiBob is seeking a Senior Site Reliability Engineer to bridge the gap between AI innovation and production stability. The role involves collaborating with global DevOps teams to automate workloads while ensuring the reliability of AWS/Kubernetes environments.
Responsibilities:
- Design, build, and operate production-grade Kubernetes infrastructure on AWS
- Develop AI agents to handle incidents and perform root cause analysis
- Build and maintain GitOps-based CI/CD pipelines using GitHub Actions and ArgoCD
- Develop internal DevOps tooling and developer self-service platforms
- Own monitoring, observability, and operational excellence using Datadog
- Collaborate with engineering teams to improve delivery speed and reliability
Requirements:
- 5+ years of experience as a Senior SRE or Production Engineer (this is a hard requirement)
- Deep Production Expertise: You must have extensive experience managing live, high-traffic SaaS environments; developer-only backgrounds without ops experience will not be a fit
- Cloud & Orchestration: Proven mastery of Kubernetes and AWS in production settings
- Coding/Scripting: Advanced proficiency in Python (preferred) or Go for automation; we need more than just Bash skills
- AI Knowledge: A strong understanding of or direct experience with AI/LLM technologies
- Observability: Hands-on experience with Datadog for monitoring and incident response
- Autonomy: Ability to work independently without direct daily oversight, managing production incidents and on-call responsibilities
- Time Zone: Located in the East Coast time zone to provide coverage overlap with our global teams