Ensure 99.99% uptime for customer-facing services by proactively monitoring and maintaining the health of supporting systems, contributing directly to customer satisfaction and trust.
Act in key support roles during major incidents (e.g., Sev0, Sev1) and participate in technical incident reviews for problem management.
Contribute to Problem Management by drafting and participating in Root Cause Analyses (RCAs) and handing them off to the Global Solutions team.
Ensure all work carried out by the Site Reliability team aligns with the company’s internal compliance policies and directives.
Collaborate with technical staff to solve complex technical issues and customer concerns.
Lead and mentor team members in staying abreast of industry innovations and technologies, and support the team’s development and growth.
Thrive in a fast-paced environment, solving sophisticated issues quickly and successfully balancing multiple priorities.
Automate the detection and resolution of recurring issues in the production environment.
Help create and improve current processes to reduce operational and engineering toil, including the implementation of AI-driven automation for routine tasks.

Citizenship: U.S. citizen (U.S. born or naturalized) who does not hold dual citizenship.
Education: Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field.
Experience: Systems engineering experience in an enterprise-scale internet service engineering or support role.
Technical Skills: Expertise in TCP/IP-related technologies (networking protocols, network programming, etc.).
Expertise in CLI enterprise support of Unix variants (Linux/Solaris/BSD), with significant exposure to Red Hat Enterprise Linux and Solaris.
Strong understanding of monitoring, security, and systems administration.
Experience provisioning, operating, and running AWS/C2S-based infrastructure and systems.
Proficiency in scripting with Python, Go, or other languages.
Communication: Strong written and oral communication skills.
Incident Management: Experience in Incident Management and a good understanding of ITIL service operations.
Availability: Ability to participate in a 24/7 on-call rotation supporting large data center operations and be available for shift work.

Site Reliability Engineer, GovCloud Incident Response

Key skills