System One, a leader in delivering outsourced services and workforce solutions across North America, is seeking a Site Reliability Engineer (SRE) to support the SBA Disaster Lending Platform modernization effort. The role focuses on establishing SRE practices in AWS cloud environments, improving system resilience, and collaborating with cross-functional teams and stakeholders to strengthen operational excellence.
Responsibilities:
- Help establish and mature SRE practices within an Agile Scrum delivery environment
- Support system design reviews to identify reliability risks, failure points, scalability concerns, and opportunities for automation
- Improve operational readiness by contributing to code reviews, deployment reviews, monitoring practices, and reliability-focused engineering standards
- Support incident management activities, including troubleshooting, root-cause analysis, mitigation planning, and post-incident improvements
- Build and maintain automation to improve reliability, reduce manual effort, and support self-healing cloud infrastructure
- Support AWS cloud platform operations across monitoring, logging, security, scalability, and availability
- Work with CI/CD and Infrastructure as Code tools to support repeatable, secure, and reliable deployments
- Create and maintain clear technical documentation for systems, processes, runbooks, and operational procedures
- Collaborate with cross-functional teams and stakeholders to promote DevOps, automation, and reliability best practices
Requirements:
- Minimum of four years of experience supporting the reliability, scalability, security, and operational excellence of AWS cloud platforms
- Bachelor's degree, or four additional years of relevant experience in lieu of a degree
- Hands-on experience with CI/CD and Infrastructure as Code tools such as Terraform, Ansible Automation Platform, GitLab, Artifactory, and Packer
- Strong scripting and automation experience using Python, PowerShell, and Bash (Python preferred)
- Experience supporting Windows and Linux environments
- Strong understanding of networking concepts, cloud troubleshooting, monitoring, logging, and incident response
- Experience designing, deploying, or supporting cloud-based systems with a focus on reliability, scalability, security, and performance
- Knowledge of source control best practices
- Experience working in Agile delivery environments, including Scrum, Kanban, SAFe, or similar methodologies
- Strong analytical, troubleshooting, and problem-solving skills, including the ability to resolve complex technical issues in high-pressure situations
- Strong communication skills and the ability to collaborate effectively with technical teams, stakeholders, and cross-functional partners
- Must be authorized to work in the United States without sponsorship and able to obtain a Public Trust clearance
Preferred:
- Current or prior government contracting experience
- Red Hat, CompTIA, AWS, or related technical certifications
- Experience mentoring technical teams or helping promote DevOps/SRE practices across engineering groups