Peraton is a next-generation national security company focused on delivering trusted solutions and technologies. The company is seeking a Site Reliability Engineer, Supervisor, who will ensure the availability and performance of complex software systems, lead automation efforts, and collaborate with teams across the organization to improve system reliability.

Responsibilities:

Ensure high availability and responsiveness of services by designing and implementing monitoring, alerting, and automated remediation tools
Analyze system metrics and logs to identify areas for improvement and optimize system performance
Develop and maintain scripts, configuration management, and infrastructure-as-code to automate deployment, scaling, and management of infrastructure
Lead efforts to reduce toil through automation and reliability engineering best practices
Participate in on-call rotations to respond to incidents promptly
Conduct thorough root cause analysis and collaborate with engineering teams to implement preventive measures
Partner with software developers, product managers, and infrastructure teams to embed reliability into the software development lifecycle
Provide guidance on system architecture, capacity planning, and disaster recovery strategies
Mentor junior SREs and engineers on reliability engineering principles, tools, and technical excellence
Lead by example in coding standards, system design, and incident response
Articulate technical issues and reliability impacts to non-technical stakeholders
Drive alignment on priorities and continuous improvements across teams
Lead reliability-related projects and initiatives, managing timelines, resources, and stakeholder communication to deliver impactful results
Promote agile practices to enhance team efficiency
Advocate for continuous learning and process refinement in system reliability

Requirements:

6 years of experience, which may include lead experience
Strong software engineering background with proficiency in languages such as Python, Go, or similar
Deep understanding of distributed systems, cloud infrastructure (AWS, Azure, GCP), container orchestration (Kubernetes), and monitoring tools (Prometheus, Grafana, OpenTelemetry)
Experience defining and implementing SLOs, SLIs, and error budgets to measure and maintain service reliability
Excellent problem-solving skills with a proactive approach to incident prevention and resolution
Strong communication skills to effectively collaborate with diverse teams and present reliability insights
5+ years of experience in site reliability engineering, systems engineering, or related roles with a proven track record of delivering scalable, reliable systems
U.S. Citizenship required
Ability to obtain agency clearance (public trust)
Top Secret clearance preferred
