CentralReach is a leading provider of autism and IDD care software for Applied Behavior Analysis (ABA) and is seeking a Senior Site Reliability Engineer to enhance its cloud platforms. The role involves ensuring system reliability, performance, and efficiency while collaborating with software development teams and implementing modern reliability practices.
Responsibilities:
- Own availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; set and maintain SLOs, SLIs, and error budgets; create dashboards
- Analyze, troubleshoot, and resolve operational challenges that affect defined SLOs
- Manage site stability, performance, reliability, and maintain uptime for production environments
- Develop a fully automated multi-environment observability stack based on the existing system, and extend it to predict capacity needs from usage patterns
- Strive for automation to reduce toil and increase development velocity
- Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed
- Identify product architecture changes from a reliability, performance, and availability perspective, using a data-driven approach
- Document resolution runbooks and standard operating procedures
- Actively look for opportunities to improve the availability and performance of the system by applying insights from monitoring and observation
- Collaborate with software development teams on release management, help shape the future roadmap, and establish strong operational readiness across teams
- Implement reliability and observability tools (such as New Relic, Prometheus, and Grafana)
- Collaborate with the Security team and other platform engineering teams to build reliable, maintainable, and scalable solutions that improve our security posture
Requirements:
- Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider
- Solid experience with Monitoring/APM/Observability tools (Splunk, New Relic, etc.)
- Experience implementing observability plans around logs, metrics, and traces
- Experience in an agile development team developing software
- Experience with cloud infrastructure environments, preferably AWS, and infrastructure as code (Terraform, CloudFormation)
- Extensive experience with Docker, Kubernetes, Helm, CI/CD, and configuration management tools such as Ansible and Chef
- Strong experience with containerization technology and/or Kubernetes
- Experience with release automation, system administration, and configuration management
- Experience with programming languages (Java, Python, Go, etc.)
- Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts
- Strong interpersonal and teamwork skills, with the ability to set and enforce process and influence engineers who are not direct reports
- Strong analytical and programming skills (Python, Go, Java, etc.)
- Deep understanding of best practices for modern cloud security
- Proven experience building observability for security concerns, such as privilege escalations and bot detection