SafeRide Health is a technology and services company dedicated to improving non-emergency medical transportation. The company is seeking an experienced Site Reliability Engineering Manager to lead a team focused on ensuring the reliability and scalability of user-facing services and production systems.
Responsibilities:
- Keeping systems and services running smoothly with minimal downtime by focusing on availability, reliability, and scalability
- Developing and maintaining tools and scripts to automate repetitive tasks such as deployments, configuration management, and monitoring
- Implementing and managing monitoring and alerting systems to provide visibility into system performance and quickly detect potential issues
- Responding to, diagnosing, and resolving system incidents, including conducting post-mortems to prevent future occurrences
- Monitoring system resource usage to forecast future needs and scale systems accordingly to handle increasing user load
- Collaborating with stakeholders to identify operational risks and implementing strategies to reduce their likelihood and impact
- Analyzing metrics from operating systems and applications to identify areas for performance improvement
- Providing direction to a team of direct reports and matrixed resources in alignment with Site Reliability objectives
- Managing the performance of SRE team members through regular 1:1s, coaching sessions, performance reviews, and, when necessary, performance management
Requirements:
- Minimum of 8 years of progressive experience in an IT, Software Engineering, Technology Operations, or Business Continuity role
- Minimum of 3 years of hands-on experience in a Site Reliability, DevOps, or IT Observability role
- Minimum of 2 years of direct supervisory experience leading technology professionals
- Demonstrated proficiency with production monitoring and alerting tools (Datadog is a major plus!)
- Basic proficiency working in a containerized AWS environment managed with infrastructure as code