Peraton is a next-generation national security company focused on delivering trusted solutions and technologies. The company is seeking a Site Reliability Engineer, Supervisor, who will be responsible for ensuring the availability and performance of complex software systems, leading automation efforts, and collaborating with various teams to enhance system reliability.
Responsibilities:
- Ensure high availability and responsiveness of services by designing and implementing monitoring, alerting, and automated remediation tools
- Analyze system metrics and logs to identify areas for improvement and optimize system performance
- Develop and maintain scripts, configuration management, and infrastructure-as-code to automate deployment, scaling, and management of infrastructure
- Lead efforts to reduce toil through automation and reliability engineering best practices
- Participate in on-call rotations to respond to incidents promptly
- Conduct thorough root cause analysis and collaborate with engineering teams to implement preventive measures
- Partner with software developers, product managers, and infrastructure teams to embed reliability into the software development lifecycle
- Provide guidance on system architecture, capacity planning, and disaster recovery strategies
- Mentor junior SREs and engineers on reliability engineering principles, tools, and technical excellence
- Lead by example in coding standards, system design, and incident response
- Articulate technical issues and reliability impacts to non-technical stakeholders
- Drive alignment on priorities and continuous improvements across teams
- Lead reliability-related projects and initiatives, managing timelines, resources, and stakeholder communication to deliver impactful results
- Promote agile practices to enhance team efficiency
- Advocate for continuous learning and process refinement in system reliability
Requirements:
- 6 years of experience, which may include lead experience
- Strong software engineering background with proficiency in languages such as Python, Go, or similar
- Deep understanding of distributed systems, cloud infrastructure (AWS, Azure, GCP), container orchestration (Kubernetes), and monitoring tools (Prometheus, Grafana, OpenTelemetry)
- Experience defining and implementing SLOs, SLIs, and error budgets to measure and maintain service reliability
- Excellent problem-solving skills with a proactive approach to incident prevention and resolution
- Strong communication skills to effectively collaborate with diverse teams and present reliability insights
- 5+ years of experience in site reliability engineering, systems engineering, or related roles with a proven track record of delivering scalable, reliable systems
- U.S. Citizenship required
- Ability to obtain agency clearance (public trust)
- Top Secret clearance preferred