Five9 is a leading provider of cloud contact center software, bringing the power of cloud innovation to customers worldwide. The Site Reliability Engineer (SRE) will build and maintain highly reliable, scalable systems, focusing on automation, monitoring, and system reliability while collaborating with various teams to ensure service availability and performance.

Responsibilities:

Design and implement comprehensive dashboards covering both OS/platform-level and application-level monitoring
Establish and maintain SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets for the service
Build alerting systems and performance monitoring to proactively identify and resolve issues before they impact users
Participate in on-call rotations and lead incident response efforts, including post-mortem analysis and remediation
Maintain the official on-call routing
Assign application-level problems to the engineering team and track them
Maintain continuous integration and deployment pipelines, working with our cloud and on-premises deployment teams
Develop and maintain infrastructure using tools like Terraform, Ansible, or similar
Automate system configuration and ensure consistency across environments
Ensure security scanning systems are in place and review escalated vulnerabilities
Maintain proper authentication, authorization, and audit logging systems
Ensure systems meet regulatory requirements and industry standards
Participate in security incident response and remediation efforts
Monitor and optimize cloud resource usage and costs, watching for both planned and unplanned resource changes
Analyze usage patterns and plan for future capacity needs
Provide recommendations for cost-effective architecture and resource allocation
Build and maintain common services such as notification systems, caching layers, message queues, and third-party software stacks
Manage database reliability, performance, and scaling (where not handled by dedicated DB teams)
Implement and maintain service discovery, load balancing, and network policies
Create and maintain tools and platforms that improve developer productivity and system reliability

Requirements:

3+ years managing large-scale production environments
Comfortable with 24/7 on-call responsibilities and incident response
Strong Linux/Unix system administration skills
Understanding of TCP/IP, DNS, load balancing, and network security
Experience with SQL and NoSQL databases in production environments
Proficiency in at least two of: Python, Shell, PHP, Java, or similar languages
Experience with at least one of AWS, GCP, or Azure infrastructure and services
Hands-on experience with Docker, Kubernetes, and container orchestration
Experience with Prometheus, Grafana, ELK stack, or similar tools
Proficiency with Terraform, CloudFormation, or similar tools
Expert-level Git usage and collaborative development practices
Experience defining and maintaining service level objectives
Understanding of error budget concepts and implementation
Track record of identifying and eliminating repetitive manual work
Experience with performance testing and capacity management
Bachelor's degree in Computer Science, Engineering, or equivalent experience
Experience with microservices architecture and distributed systems
Knowledge of security best practices and compliance frameworks
Experience with chaos engineering and reliability testing
Previous experience in an SRE or DevOps role at a technology company
Contributions to open-source projects or technical communities
