Dice is the leading career destination for tech experts at every stage of their careers, and it is seeking a Site Reliability Engineer to join its Global SRE team. In this role, you will blend software engineering and systems engineering to ensure the reliability and efficiency of large-scale digital products.
Responsibilities:
- Ensure the reliability, availability, and resiliency of digital products by designing and operating fault-tolerant systems
- Partner with product and platform teams to define and improve service health using operational and customer-experience metrics
- Design, implement, and maintain monitoring, alerting, logging, and tracing solutions that provide real-time visibility into system behavior and customer experience
- Analyze system performance, scalability, and capacity, and drive optimizations to improve efficiency and stability in cloud environments
- Build automation and tooling to support deployments, scaling, incident response, and operational workflows
- Participate in an on-call rotation within a globally distributed team, leading incident response, troubleshooting production issues, conducting postmortems, and driving continuous improvement
- Collaborate with security and compliance partners to support secure, privacy-aware, and compliant operations
- Work closely with engineering teams to improve developer experience, operational maturity, and overall customer experience
Requirements:
- Experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
- Experience operating Kubernetes-based production systems
- Hands-on experience with AWS and infrastructure-as-code tools such as Terraform
- Experience designing and supporting CI/CD pipelines and automated deployments
- Proficiency in Python for automation, tooling, or backend services
- Solid understanding of distributed systems and networking concepts
- Experience with monitoring and observability platforms such as Datadog and CloudWatch