Dice is the leading career destination for tech experts at every stage of their careers. It is seeking a Site Reliability Engineer for a fast-moving gaming technology company. The role focuses on ensuring the reliability, performance, and scalability of a real-money gaming platform while partnering closely with backend engineers to design resilient systems and maintain production health across a distributed architecture.
Responsibilities:
- 50% Infrastructure and Platform Ownership - reliability, deployment, configuration, and production readiness
- 30% Observability and Incident Management - monitoring systems, incident response, and SLO management
- 20% Engineering Partnership and Automation - collaborating with backend teams, reducing manual intervention, and optimizing operations
Requirements:
- 5+ years of experience in SRE, DevOps, or infrastructure engineering
- Strong experience with Kubernetes, Docker, and cloud platforms (Google Cloud Platform preferred)
- Deep knowledge of distributed systems and networking
- Experience building CI/CD pipelines and deployment automation
- Proficiency with observability tools, including Grafana, Prometheus, Tempo, and Loki
- Experience managing production incidents and reliability processes, including postmortems
- Strong troubleshooting and systems thinking skills
- Strong knowledge of microservices architecture
- Familiarity with Go
- Familiarity with service meshes such as Istio
- Familiarity with managing PostgreSQL at scale
- Experience defining and maintaining SLIs, SLOs, and error budgets aligned to contractual SLAs
- Background in optimizing cloud infrastructure usage and cost efficiency
- Experience managing secrets, environment configuration, and deployment safety in regulated or high-availability environments
- Prior experience in gaming, fintech, or other mission-critical real-money platforms