Peraton is a next-generation national security company that drives missions of consequence spanning the globe. The company is seeking a Site Reliability Engineer, Manager to ensure the availability, reliability, and performance of complex systems in a multi-vendor environment, while leading reliability initiatives and collaborating with multiple stakeholders.
Responsibilities:
- Reliability Architecture and Automation: Design, implement, and oversee reliability frameworks, including SLOs, error budgets, and automated incident response systems. Develop and maintain CI/CD pipelines to ensure seamless deployment and procedural efficiency.
- Observability and Monitoring: Lead the creation and enhancement of observability platforms using metrics, logging, and tracing tools. Use modern technologies such as OpenTelemetry, AI/ML-based anomaly detection, and streaming data platforms to proactively detect and resolve issues.
- Multi-Vendor Collaboration: Coordinate with external vendors and internal teams to integrate and manage diverse systems and tools. Ensure consistent reliability standards and practices are maintained across different technology stacks and service providers.
- Incident Management and Risk Mitigation: Drive incident response strategy by leading root cause analysis, post-mortem reviews, and continuous improvement efforts. Identify potential risks and implement mitigation strategies to prevent service disruptions.
- Technical Leadership: Mentor site reliability and engineering teams, fostering a culture of reliability, automation, and continuous learning. Advocate for best practices in system design and reliability engineering.
- Cross-Functional Partnership: Work closely with product development, DevOps, and security teams to integrate reliability into the software development lifecycle. Influence platform strategy and roadmap based on reliability insights.
- Strategic Influence: Collaborate with senior stakeholders and vendors on long-term reliability goals. Prepare executive-level presentations that translate technical challenges into business impact.
- Agile and DevOps Practices: Lead and refine agile workflows to enhance team productivity and reliability outcomes. Champion DevOps methodologies to align development and cloud services efforts.
Requirements:
- Extensive experience (10+ years) in site reliability engineering or related roles, preferably in multi-vendor and complex environments
- Deep knowledge of cloud-native infrastructure, container orchestration (e.g., Kubernetes), and automation tools such as Terraform, Ansible, or Chef
- Proficiency in observability technologies such as Prometheus, Grafana, OpenTelemetry, and log aggregation systems
- Strong programming and scripting skills for automation and tooling (Python, Go, or similar)
- Expertise in defining and implementing SLIs, SLOs, and error budgets
- Excellent communication skills for collaboration with diverse teams and external vendors
- Proven ability to lead large-scale reliability initiatives and mentor engineering teams
- Strategic thinker with a focus on aligning reliability engineering with business priorities and customer experience
- U.S. Citizenship required
- Ability to obtain agency clearance (public trust)
- Top Secret clearance preferred