SafeRide Health is a technology and services company dedicated to improving non-emergency medical transportation. The company is seeking a Site Reliability Engineer to develop processes that support software delivery excellence and operational discipline, ensuring high availability and reliability of user-facing services and production systems.
Responsibilities:
- Keeping systems and services running smoothly with minimal downtime by focusing on availability, reliability, and scalability
- Developing and maintaining tools and scripts to automate repetitive tasks such as deployments, configuration management, and monitoring
- Implementing and managing monitoring and alerting systems to provide visibility into system performance and quickly detect potential issues
- Responding to, diagnosing, and resolving system incidents, including conducting post-mortems to prevent future occurrences
- Monitoring system resource usage to forecast future needs and scale systems accordingly to handle increasing user load
- Collaborating with stakeholders to identify operational risks and implementing strategies to reduce their likelihood and impact
- Analyzing metrics from operating systems and applications to identify areas for performance improvement
Requirements:
- Minimum of 5 years of progressive experience in an IT, Software Engineering, Technology Operations, or Business Continuity role
- Minimum of 2 years of hands-on experience in a Site Reliability, DevOps, or IT Observability role
- Demonstrated proficiency with production monitoring and alerting tools (Datadog is a major plus!)
- Basic proficiency working in a containerized AWS environment managed with infrastructure as code
- Expertise in major cloud platforms such as AWS and Azure
- Deep knowledge of operating systems, networking, storage, and distributed systems
- Experience with tools for infrastructure as code (e.g., Terraform), containerization (e.g., Docker), and APM/monitoring (e.g., Prometheus, Datadog, New Relic, Grafana, Splunk)
- Proficiency in programming languages such as Python, Ruby, and JavaScript for developing automation and managing infrastructure
- Strong communication and collaboration skills to work effectively with development, operations, and other cross-functional teams