NTT DATA North America is a leading global innovator of business and technology services. The company is seeking an SRE Engineer / Site Reliability Engineer Specialist to manage observability and drive incident response, improving reliability and performance across its platforms.
Responsibilities:
- Own and manage observability using New Relic (APM, infrastructure monitoring, dashboards, alerting)
- Define and implement SLIs/SLOs and alerting strategies
- Drive incident response, root-cause analysis (RCA), and post-mortems
- Administer GitHub Enterprise (repos, branch protections, access control)
- Design and maintain GitHub Actions CI/CD pipelines for Java/.NET applications
- Support engineering teams with build, deployment, and pipeline reliability improvements
- Contribute to code quality and security practices in CI/CD pipelines
- Troubleshoot issues across application, infrastructure, and CI/CD layers
- Drive continuous reliability and performance improvements
- Leverage or support adoption of AI/automation in SRE workflows (alerting, incident triage, or productivity tools like Copilot)
Requirements:
- 8+ years of experience with SRE platforms
- 5+ years of hands-on experience with New Relic (or similar APM tools)
- 5+ years of experience applying SRE practices (SLIs/SLOs, alerting, incident management), with a strong understanding of each
- 5+ years of experience with GitHub Enterprise and GitHub Actions
- 5+ years of CI/CD pipeline experience for Java or .NET applications
- 5+ years of strong troubleshooting and root-cause analysis experience
- 5+ years of experience supporting production workloads (application and infrastructure)
- Basic exposure to AI-enabled tools (e.g., GitHub Copilot, observability insights, or automation tools)
- Advanced New Relic capabilities (synthetics, dashboards, query/SQL monitoring)
- Experience with JFrog Artifactory and/or Xray
- SonarQube integration for static code analysis
- GitHub Advanced Security (CodeQL, Dependabot, secret scanning)
- Experience supporting Angular or multi-stack pipelines
- Exposure to network monitoring
- Hands-on use of AI/ML for SRE or DevOps automation (alert noise reduction, anomaly detection)
- GitHub Copilot usage or governance
- Exposure to ServiceNow ITOM (event management, CMDB, discovery)
- Experience with Databricks CI/CD pipelines
- Familiarity with AIOps concepts (predictive alerting, intelligent incident response)