Avaya is an enterprise software leader that helps the world’s largest organizations and government agencies forge unbreakable connections. The company is seeking a Site Reliability Engineer (SRE) to drive stability, reliability, and performance across its Azure- and GCP-based platforms, with a focus on operational excellence and proactive incident management.
Responsibilities:
- Serve as a key member of the 24×7 on-call rotation, responding to and managing incidents across production and pre-production environments
- Lead incident bridges, coordinate root cause analysis (RCA), and ensure post-incident reviews drive systemic improvements
- Maintain clear communication with cross-functional teams and leadership during major incidents
- Build, tune, and maintain observability dashboards (Azure Monitor, GCP Operations Suite, Prometheus, Grafana, Datadog, Log Analytics)
- Perform deep-dive troubleshooting of application and service-level issues using distributed tracing and log analysis (Grafana, Datadog) to pinpoint root causes that lie beyond the infrastructure layer
- Define SLOs, SLIs, and error budgets to proactively identify and mitigate reliability risks before customer impact
- Integrate AI-Ops tools for anomaly detection, predictive alerting, and automated incident correlation
- Continuously enhance alert quality, reduce false positives, and automate runbooks for faster recovery
- Analyze incident trends to prevent recurring issues and support other teams in resilience engineering
Requirements:
- 5+ years of experience in Site Reliability, DevOps, Cloud Operations, or customer support roles
- Demonstrated experience in application-level troubleshooting: analyzing logs and traces to identify bugs, performance bottlenecks, and error conditions
- Expertise in Azure and GCP cloud operations and distributed system reliability
- Understanding of Terraform, Ansible, and CI/CD pipelines (Jenkins, GitHub Actions)
- Experience with observability and AI-Ops tools (Azure Monitor, GCP Operations Suite, Grafana, Prometheus, Datadog, etc.)
- Solid grasp of incident management frameworks (P1–P3 handling, RCA, PIRs, on-call rotations)
- Excellent analytical, troubleshooting, and communication skills