Peraton is a next-generation national security company that drives missions of consequence spanning the globe. They are seeking a Site Reliability Engineer, Manager, who will be responsible for ensuring the availability, reliability, and performance of complex systems in a multi-vendor environment while leading reliability initiatives and collaborating with multiple stakeholders.

Responsibilities:

Reliability Architecture and Automation: Design, implement, and oversee reliability frameworks, including SLOs, error budgets, and automated incident response systems. Develop and maintain CI/CD pipelines to ensure seamless deployment and procedural efficiency.
Observability and Monitoring: Lead the creation and enhancement of observability platforms using metrics, logging, and tracing tools. Use modern technologies such as OpenTelemetry, AI/ML for anomaly detection, and streaming data platforms to proactively detect and resolve issues.
Multi-Vendor Collaboration: Coordinate with external vendors and internal teams to integrate and manage diverse systems and tools. Ensure that consistent reliability standards and practices are maintained across different technology stacks and service providers.
Incident Management and Risk Mitigation: Drive incident response strategy by leading root cause analysis, post-mortem reviews, and continuous improvement efforts. Identify potential risks and implement mitigation strategies to prevent service disruptions.
Technical Leadership: Mentor site reliability and engineering teams, fostering a culture of reliability, automation, and continuous learning. Advocate for best practices in system design and reliability engineering.
Cross-Functional Partnership: Work closely with product development, DevOps, and security teams to integrate reliability into the software development lifecycle. Influence platform strategy and roadmap based on reliability insights.
Strategic Influence: Collaborate with senior stakeholders and vendors on long-term reliability goals. Prepare executive-level presentations that translate technical challenges into business impact.
Agile and DevOps Practices: Lead and refine agile workflows to enhance team productivity and reliability outcomes. Champion DevOps methodologies to align development and cloud services efforts.

Requirements:

Extensive experience (10+ years) in site reliability engineering or related roles, preferably in multi-vendor and complex environments
Deep knowledge of cloud-native infrastructure, container orchestration (e.g., Kubernetes), and automation tools such as Terraform, Ansible, or Chef
Proficiency in observability technologies such as Prometheus, Grafana, OpenTelemetry, and log aggregation systems
Strong programming and scripting skills for automation and tooling (Python, Go, or similar)
Expertise in defining and implementing SLIs, SLOs, and error budgets
Excellent communication skills for collaboration with diverse teams and external vendors
Proven ability to lead large-scale reliability initiatives and mentor engineering teams
Strategic thinker with a focus on aligning reliability engineering with business priorities and customer experience
U.S. Citizenship required
Ability to obtain agency clearance (public trust)
Top Secret clearance preferred
