Redis is the company behind the product that powers fast applications worldwide. It is seeking a Site Reliability Engineer to work on large-scale systems, manage technical escalations, ensure system reliability, and collaborate with engineering teams.
Responsibilities:
- Handle Technical Escalations: Engage in complex troubleshooting and manage technical escalations within a Follow-the-Sun (FTS) support model, ensuring seamless global service coverage
- Ensure System Reliability: Leverage your software development and problem-solving expertise to create automation tools and runbooks, enhancing the reliability and stability of the Redis database on a leading cloud service provider
- Collaborate with Engineering Teams: Partner closely with engineering teams during service-impacting incidents, leading problem management efforts to maintain service continuity and stability
- Participate in On-Call Rotations: Be available for occasional weekend on-call shifts, providing critical support and ensuring service reliability
Requirements:
- B.S. in Computer Science, Information Technology, Software Engineering, or a related field
- At least 4 years of experience in infrastructure, CloudOps, or SRE roles
- At least 3 years of experience troubleshooting real-time production systems
- At least 2 years of hands-on experience with cloud infrastructure
- Strong working knowledge of Linux/Unix
- Deep understanding of networking (TCP/IP), with an emphasis on networking across cloud providers
- Experience with alerting and monitoring systems (Prometheus, Grafana, ELK, Splunk, etc.)
- Experience with scripting languages, including but not limited to Bash and Python
- Familiarity with source code version control tools such as Git, GitLab, SVN, etc.
- Experience with containerization technologies and concepts
- Experience using and maintaining deployment and configuration management tools (GitHub Actions, Jenkins, Ansible, Chef, etc.)
- Experience in analyzing and debugging production issues at scale
- Self-directed, ambitious, authentic, caring, and eager to learn new things
- Experience with 24/7 on-call duty, including availability for nights and weekends (follow-the-sun)
- FedRAMP certification is a plus
- Experience working with large-scale distributed systems
- Experience with NoSQL databases (especially Redis)
- Experience with infrastructure as code tools (Terraform, Pulumi, etc.)
- Linux and cloud certifications