CentralReach is a leading provider of autism and IDD care software for Applied Behavior Analysis (ABA), and they are seeking a Senior Site Reliability Engineer to enhance their cloud platforms. The role involves ensuring system reliability, performance, and efficiency while collaborating with software development teams and implementing modern reliability practices.

Responsibilities:

Own availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; set and maintain SLOs, SLIs, and error budgets; create dashboards
Analyze, troubleshoot, and resolve operational challenges that affect defined SLOs
Manage site stability, performance, reliability, and maintain uptime for production environments
Develop a fully automated multi-environment observability stack based on the existing system and extend it to predict capacity needs based on the usage patterns
Strive for automation to reduce toil and increase development velocity
Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed
Identify changes to the product architecture from a reliability, performance, and availability perspective, using a data-driven approach
Document resolution runbooks and standard operating procedures
Actively look for opportunities to improve the availability and performance of the system by applying lessons learned from monitoring and observation
Collaborate with software development teams on release management, help shape the future roadmap, and establish strong operational readiness across teams
Implement reliability and observability tools (such as New Relic, Prometheus, and Grafana)
Collaborate with the Security team and other platform engineering teams to build reliable, maintainable, and scalable solutions that improve our security posture

Requirements:

Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider
Solid experience with monitoring/APM/observability tools (Splunk, New Relic, etc.)
Experience implementing observability plans around logs, metrics, and traces
Experience in an agile development team developing software
Experience with cloud infrastructure environments, preferably AWS, and infrastructure as code (Terraform, CloudFormation)
Extensive experience with Docker, Kubernetes, Helm, CI/CD, and configuration management tools such as Ansible and Chef
Strong experience with containerization technology and/or Kubernetes
Experience with release automation, system administration, and configuration management
Experience with programming languages (Java, Python, Go, etc.)
Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts
Strong interpersonal and teaming skills, including the ability to set and enforce process and to influence engineers who are not direct reports
Strong analytical and programming skills (Python, Go, Java, etc.)
Deep understanding of best practices for modern cloud security
Proven experience building observability for security concerns, such as privilege escalations and bot detection

Sr. Site Reliability Engineer, Security
