Five9 is a leading provider of cloud contact center software, bringing the power of cloud innovation to customers worldwide. They are seeking a Site Reliability Engineer (SRE) to help build and maintain highly reliable, scalable systems, focusing on automation, monitoring, and system reliability. The role combines software engineering and operations expertise to ensure services meet reliability targets while enabling rapid development and deployment.
Responsibilities:
- Observability & Monitoring
- Dashboards & Metrics: Design and implement comprehensive dashboards covering both OS/platform-level and application-level monitoring, organized into primary indicators (RED: rate, errors, duration) and secondary indicators (USE: utilization, saturation, errors)
- Availability & Reliability: Establish and maintain SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets for the service
- Performance Monitoring: Build alerting systems and performance monitoring to proactively identify and resolve issues before they impact users
- Incident Response: Participate in on-call rotations and lead incident response efforts, including post-mortem analysis and remediation. Maintain the official on-call routing. Assign application-level problems to the engineering team and track them
- Infrastructure Automation & Deployment
- CI/CD Pipeline Management: Maintain continuous integration and deployment pipelines, working with our cloud and on-premises deployment teams
- Infrastructure as Code: Develop and maintain infrastructure using tools like Terraform, Ansible, or similar
- Configuration Management: Automate system configuration and ensure consistency across environments. Recommend and implement best practices for configuration control
- Security & Compliance
- Security Automation: Ensure security scanning systems are in place and review escalated vulnerabilities
- Access Control: Maintain proper authentication, authorization, and audit logging systems
- Compliance Reporting: Ensure systems meet regulatory requirements and industry standards
- Security Incident Response: Participate in security incident response and remediation efforts
- Cost Optimization
- Resource Management: Monitor and optimize cloud resource usage and costs, watching for both planned and unplanned resource changes
- Capacity Planning: Analyze usage patterns and plan for future capacity needs
- Cost Analysis: Provide recommendations for cost-effective architecture and resource allocation
- Right-sizing: Implement automated scaling and resource optimization strategies
- Common Services & Platform Engineering
- Shared Infrastructure: Build and maintain common services such as notification systems, caching layers, and message queues, as well as third-party software stacks
- Database Operations: Manage database reliability, performance, and scaling (where not handled by dedicated DB teams)
- Service Mesh & Networking: Implement and maintain service discovery, load balancing, and network policies
- Developer Tools: Create and maintain tools and platforms that improve developer productivity and system reliability
Requirements:
- 3+ years managing large-scale production environments
- Comfortable with 24/7 on-call responsibilities and incident response
- Strong Linux/Unix system administration skills
- Understanding of TCP/IP, DNS, load balancing, and network security
- Experience with SQL and NoSQL databases in production environments
- Proficiency in at least two of: Python, Shell, PHP, Java, or similar languages
- Experience with at least one of AWS, GCP, or Azure infrastructure and services
- Hands-on experience with Docker, Kubernetes, and container orchestration
- Experience with Prometheus, Grafana, ELK stack, or similar tools
- Proficiency with Terraform, CloudFormation, or similar tools
- Expert-level Git usage and collaborative development practices
- Experience defining and maintaining service level objectives
- Understanding of error budget concepts and implementation
- Track record of identifying and eliminating repetitive manual work
- Experience with performance testing and capacity management
- Bachelor's degree in Computer Science, Engineering, or equivalent experience
- Experience with microservices architecture and distributed systems
- Knowledge of security best practices and compliance frameworks
- Experience with chaos engineering and reliability testing
- Previous experience in an SRE or DevOps role at a technology company
- Contributions to open-source projects or technical communities