HavocAI is a leader in collaborative autonomy, focused on solving complex human problems through advanced technology. The company is seeking a Senior Site Reliability Engineer to ensure the availability and performance of mission-critical services and to work with other teams to raise operational maturity and reliability standards.
Responsibilities:
- Design and evolve reliability architecture for distributed and cloud-hosted systems
- Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
- Partner with platform and application teams to design systems for reliability, scalability, and operability
- Identify and mitigate systemic reliability risks across infrastructure and services
- Lead incident response processes including on-call rotations, escalation, and post-incident reviews
- Conduct root cause analysis for complex production incidents and drive long-term improvements
- Improve operational readiness through runbooks, automation, and resilience testing
- Reduce operational toil through tooling, automation, and process improvements
- Design and maintain observability systems for metrics, logging, tracing, and alerting
- Ensure services and data pipelines are observable, debuggable, and performant in production
- Drive performance analysis and tuning across infrastructure and service layers
- Build automation to improve system reliability, deployment safety, and recovery processes
- Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
- Support and improve Kubernetes-based environments and containerized workloads
- Collaborate with security teams to ensure secure and resilient system design
- Participate in disaster recovery planning and testing
- Maintain strong operational practices around access control, secrets management, and change management
Requirements:
- 7+ years of experience in SRE, infrastructure, or systems engineering roles
- Strong experience operating large-scale distributed production systems
- Deep understanding of Linux systems, networking, and distributed systems fundamentals
- Hands-on experience with Kubernetes and container orchestration
- Programming or scripting experience in Go, Python, or similar languages
- Experience designing and operating observability systems for production environments
- Proven ability to lead incident response and reliability improvements
- Strong communication skills and ability to collaborate across engineering teams
- Must be a US citizen
- Must be eligible to obtain a government clearance, if required
Preferred Qualifications:
- Experience supporting autonomy, robotics, simulation, or real-time systems
- Familiarity with AWS and large-scale cloud infrastructure
- Experience with chaos engineering, fault injection, or resilience testing
- Knowledge of CI/CD systems and progressive delivery practices
- Experience working in high-reliability or safety-critical environments