Underdog is a rapidly growing sports company focused on enhancing the experience for sports fans. They are seeking a Senior Site Reliability Engineer to help define reliability and operational excellence as the company scales, focusing on incident response, observability, and system reliability.
Responsibilities:
- Own and maintain the incident response process, including defining procedures, tools, and best practices
- Guide teams in establishing and monitoring Service Level Objectives (SLOs), including setting up alerts and reporting systems
- Lead capacity planning initiatives, focusing on both short- and long-term scalability while optimizing costs
- Develop and implement disaster recovery plans, including regular testing and regulatory compliance
- Collaborate with teams on architecture decisions to ensure high availability and scalability
- Manage launch and event planning for high-traffic occasions, focusing on infrastructure preparation and capacity management (a.k.a. Launch Readiness)
- Act as an internal expert and consultant for monitoring tools like Datadog and PagerDuty and infrastructure like AWS and Kubernetes
- Emphasize automation and tooling to scale our workload
- Contribute across codebases in Ruby, Python, Go, TypeScript, Swift, and Kotlin as needed to support the initiatives described above
Requirements:
- A strong written and verbal communicator
- Collaborative by nature
- Someone who enjoys using research, data, and experiments to make decisions; you believe 'Hope is not a strategy.'
- You enjoy working directly with customers (generally engineers or other people inside the company)
- You think long-term about what is best for the business and its customers
- You are excited to take ownership
- You are very comfortable around an IDE and with working across multiple languages, multiple web application frameworks, AWS services, Kubernetes, and PostgreSQL
- You can work independently to learn new languages/technologies as needed
- You enjoy deploying changes to production quickly, multiple times a week if necessary
- Experience with PostgreSQL query optimization, including tuning autovacuum settings, table statistics, and different index types
- Experience with Redis / Valkey optimization
- Experience with Datadog or similar observability tools
- Experience working as a web application developer, frontend or backend, especially in React and Ruby on Rails
- Experience with AWS cost optimization
- Have read the Google SRE books or similar books, or have other forms of SRE training
- Actively leverage AI to augment your abilities and gain knowledge about domains of interest