Five9 is a leading provider of cloud contact center software, bringing the power of cloud innovation to customers worldwide. The Site Reliability Engineer (SRE) role involves building and maintaining highly reliable, scalable systems, focusing on automation, monitoring, and system reliability while collaborating with various teams to ensure service availability.
Responsibilities:
- Design and implement comprehensive dashboards
- Establish and maintain SLIs, SLOs, and error budgets for the service
- Build alerting systems and performance monitoring to proactively identify and resolve issues before they impact users
- Participate in on-call rotations and lead incident response efforts, including post-mortem analysis and remediation
- Maintain continuous integration and deployment pipelines in collaboration with our cloud and on-premises deployment teams
- Develop and maintain infrastructure using tools such as Terraform, Ansible, or similar
- Automate system configuration and ensure consistency across environments
- Ensure security scanning systems are in place and review escalated vulnerabilities
- Maintain proper authentication, authorization, and audit logging systems
- Ensure systems meet regulatory requirements and industry standards
- Monitor and optimize cloud resource usage and costs, watching for both planned and unplanned resource changes
- Analyze usage patterns and plan for future capacity needs
- Provide recommendations for cost-effective architecture and resource allocation
- Build and maintain common services such as notification systems, caching layers, message queues, and third-party software stacks
- Manage database reliability, performance, and scaling
- Implement and maintain service discovery, load balancing, and network policies
- Create and maintain tools and platforms that improve developer productivity and system reliability
Requirements:
- Production Systems: 3+ years managing large-scale production environments
- On-call Experience: Comfortable with 24/7 on-call responsibilities and incident response
- System Administration: Strong Linux/Unix system administration skills
- Networking: Understanding of TCP/IP, DNS, load balancing, and network security
- Database Systems: Experience with SQL and NoSQL databases in production environments
- Programming Languages: Proficiency in at least two of Python, Shell, PHP, Java, or similar languages
- Cloud Platforms: Experience with the infrastructure and services of at least one of AWS, GCP, or Azure
- Containerization: Hands-on experience with Docker and with container orchestration using Kubernetes
- Monitoring & Observability: Experience with Prometheus, Grafana, ELK stack, or similar tools
- Infrastructure as Code: Proficiency with Terraform, CloudFormation, or similar tools
- Version Control: Expert-level Git usage and collaborative development practices
- SLI/SLO Management: Experience defining and maintaining service level objectives
- Error Budget Policy: Understanding of error budget concepts and implementation
- Toil Reduction: Track record of identifying and eliminating repetitive manual work
- Capacity Planning: Experience with performance testing and capacity management
- Bachelor's degree in Computer Science, Engineering, or equivalent experience
- Experience with microservices architecture and distributed systems
- Knowledge of security best practices and compliance frameworks
- Experience with chaos engineering and reliability testing
- Previous experience in an SRE or DevOps role at a technology company
- Contributions to open-source projects or technical communities