University of Washington is a leading health system that includes a top-rated medical school and an internationally recognized research center. They are seeking a Site Reliability Engineer 3 to enhance the reliability of their infrastructure and applications while supporting the transition to cloud environments and modern orchestration platforms.
Responsibilities:
- Apply software engineering methodologies to infrastructure and operational systems to reduce toil and increase reliability
- Support day-to-day operations of public-facing web and mobile applications, including uptime management, vulnerability assessment, patching, and lifecycle maintenance
- Lead and support the transition toward container-based infrastructure and modern orchestration platforms
- Support strategic expansion of services into cloud environments where appropriate, balancing cost, security, compliance, and performance
- Participate in security reviews and implement required mitigations in alignment with institutional and regulatory standards
- Design and support highly available cloud-native application deployments
- Implement and maintain comprehensive monitoring and observability frameworks
- Develop and test disaster recovery and business continuity strategies
- Participate in an on-call rotation (approximately one weekend per month) to support critical services
Requirements:
- Bachelor's degree in Computer Science, Information Technology, Business Administration, or a related field, or an equivalent combination of education and experience
- 4+ years of professional experience in systems engineering, infrastructure operations, DevOps, Site Reliability Engineering, or a related field, which must include the following:
  - Supporting Linux-based server environments (e.g., RHEL or similar distributions)
  - Working within at least one major cloud environment (AWS, Azure, or Google Cloud)
  - Building, deploying, or maintaining containerized applications
  - Using version control systems (e.g., Git)
  - Supporting web application infrastructure (e.g., web servers, application stacks)
  - Supporting authentication or single sign-on (SSO) solutions
  - Supporting network or firewall configurations
  - Participating in production support and incident response activities
- Ability to troubleshoot infrastructure and application issues in production environments
- Ability to communicate effectively with technical and non-technical stakeholders
- Ability to manage multiple priorities in a team-based operational environment