EPAM Systems is a leading global provider of digital platform engineering and development services. The company is seeking a Lead Site Reliability Engineer to drive system reliability, observability, and performance monitoring for mission-critical digital trading products.
Responsibilities:
- Define and implement a strategic reliability vision for the trading portfolio, covering infrastructure, network connectivity, application performance, and throughput
- Lead and oversee a team of site reliability engineers, providing technical direction, mentorship, and performance guidance
- Own and evolve the SLA/SLO/SLI framework, including error budgets and service health reporting
- Configure and optimize comprehensive monitoring and alerting systems across infrastructure and applications
- Drive observability best practices using APM and monitoring platforms (e.g., Dynatrace)
- Analyze application and infrastructure performance to isolate fault domains and determine root causes of critical incidents
- Lead major incident management, coordinate resolution efforts, and conduct blameless postmortems
- Participate in a 24x7x365 support rotation and ensure operational excellence across the team
- Identify automation opportunities to improve reliability, scalability, and operational efficiency
Requirements:
- 8+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
- Proven leadership experience (technical lead or team lead), with the ability to oversee and mentor engineers
- Strong hands-on experience with SLA/SLO/SLI definition, governance, and reporting
- Solid experience working in Microsoft Azure environments (IaaS, PaaS, networking, monitoring)
- Hands-on experience with Dynatrace (configuration, alerting, dashboards, performance analysis)
- Experience with observability, monitoring, and APM tools in production environments
- Ability to operate effectively under pressure in time-sensitive, high-impact environments