HavocAI is a leader in collaborative autonomy, focusing on solving complex human problems through advanced technology. They are seeking a Senior Site Reliability Engineer to ensure the availability and performance of mission-critical services while collaborating with various teams to enhance operational maturity and reliability standards.

Responsibilities:

Design and evolve reliability architecture for distributed and cloud-hosted systems
Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
Partner with platform and application teams to design systems for reliability, scalability, and operability
Identify and mitigate systemic reliability risks across infrastructure and services
Lead incident response processes including on-call rotations, escalation, and post-incident reviews
Conduct root cause analysis for complex production incidents and drive long-term improvements
Improve operational readiness through runbooks, automation, and resilience testing
Reduce operational toil through tooling, automation, and process improvements
Design and maintain observability systems for metrics, logging, tracing, and alerting
Ensure services and data pipelines are observable, debuggable, and performant in production
Drive performance analysis and tuning across infrastructure and service layers
Build automation to improve system reliability, deployment safety, and recovery processes
Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
Support and improve Kubernetes-based environments and containerized workloads
Collaborate with security teams to ensure secure and resilient system design
Participate in disaster recovery planning and testing
Maintain strong operational practices around access control, secrets management, and change management

Requirements:

7+ years of experience in SRE, infrastructure, or systems engineering roles
Strong experience operating large-scale distributed production systems
Deep understanding of Linux systems, networking, and distributed systems fundamentals
Hands-on experience with Kubernetes and container orchestration
Programming or scripting experience in Go, Python, or similar languages
Experience designing and operating observability systems for production environments
Proven ability to lead incident response and reliability improvements
Strong communication skills and ability to collaborate across engineering teams
Must be a US citizen
Must be eligible to obtain a government security clearance, if required
Experience supporting autonomy, robotics, simulation, or real-time systems
Familiarity with AWS and large-scale cloud infrastructure
Experience with chaos engineering, fault injection, or resilience testing
Knowledge of CI/CD systems and progressive delivery practices
Experience working in high-reliability or safety-critical environments
