Dice is the leading career destination for tech experts at every stage of their careers and is seeking a Site Reliability Engineer on behalf of a fast-moving gaming technology company. The role focuses on ensuring the reliability, performance, and scalability of a real-money gaming platform while partnering closely with backend engineers to design resilient systems and maintain production health across a distributed architecture.

Responsibilities:

50% Infrastructure and Platform Ownership - reliability, deployment, configuration, and production readiness
30% Observability and Incident Management - monitoring systems, incident response, and SLO management
20% Engineering Partnership and Automation - collaborating with backend teams, reducing manual intervention, and optimizing operations

Requirements:

5+ years of experience in SRE, DevOps, or infrastructure engineering
Strong experience with Kubernetes, Docker, and cloud platforms (Google Cloud Platform preferred)
Deep knowledge of distributed systems and networking
Experience building CI/CD pipelines and deployment automation
Proficiency with observability tools including Grafana, Prometheus, Tempo, and Loki
Experience managing production incidents and reliability processes including postmortems
Strong troubleshooting and systems thinking skills
Strong knowledge of microservices architecture
Familiarity with Go
Familiarity with service meshes such as Istio
Familiarity with managing PostgreSQL at scale
Experience defining and maintaining SLIs, SLOs, and error budgets aligned to contractual SLAs
Background optimizing cloud infrastructure usage and cost efficiency
Experience managing secrets, environment configuration, and deployment safety in regulated or high-availability environments
Prior experience in gaming, fintech, or other mission-critical real-money platforms

Site Reliability Engineer / Remote
