HiBob is seeking a Senior Site Reliability Engineer to bridge the gap between AI innovation and production stability. The role involves collaborating with global DevOps teams to automate workloads while ensuring the reliability of AWS/Kubernetes environments.

Responsibilities:

Design, build, and operate production-grade Kubernetes infrastructure on AWS
Develop AI agents to handle incidents and perform root cause analysis
Build and maintain GitOps-based CI/CD pipelines using GitHub Actions and ArgoCD
Develop internal DevOps tooling and developer self-service platforms
Own monitoring, observability, and operational excellence using Datadog
Collaborate with engineering teams to improve delivery speed and reliability

Requirements:

5+ years of experience as a Senior SRE or Production Engineer (this is a hard requirement)
Deep Production Expertise: You must have extensive experience managing live, high-traffic SaaS environments; developer-only backgrounds without ops experience will not be a fit
Cloud & Orchestration: Proven mastery of Kubernetes and AWS in production settings
Coding/Scripting: Advanced proficiency in Python (preferred) or Go for automation; we need more than just Bash skills
AI Knowledge: A strong understanding of, or direct experience with, AI/LLM technologies
Observability: Hands-on experience with Datadog for monitoring and incident response
Autonomy: Ability to work independently without direct daily oversight, managing production incidents and on-call responsibilities
Time Zone: Located in the US Eastern time zone (EST) to provide coverage overlap with our global teams

Senior Site Reliability Engineer - Remote EST
