NTT DATA North America is a leading business and technology services provider, dedicated to fostering innovation and client success. The company is seeking a Site Reliability Engineer to ensure the reliability, scalability, and performance of mission-critical cloud platforms and applications across multi-cloud environments, and to implement automation and monitoring practices.
Responsibilities:
- Design, implement, and maintain highly available, scalable, and fault-tolerant systems across multi-cloud platforms
- Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure system reliability
- Perform root cause analysis (RCA) for incidents and implement corrective and preventive actions
- Develop and maintain disaster recovery (DR) and high availability (HA) strategies
- Collaborate with development teams to improve application resiliency and performance
- Develop and maintain Infrastructure-as-Code (IaC) using Terraform and GitHub for automated provisioning and deployment
- Automate operational tasks, including system scaling, patching, and configuration management
- Build CI/CD pipelines to improve deployment reliability and reduce manual intervention
- Ensure consistent and repeatable deployments across environments
- Design and implement observability solutions using Prometheus and Grafana for metrics, logging, and alerting
- Establish proactive monitoring and alerting strategies to detect and resolve issues before user impact
- Integrate monitoring solutions with ServiceNow for incident management and automated ticketing
- Analyze system performance and optimize resource utilization across cloud environments
- Lead incident response efforts, including triage, mitigation, and resolution of production issues
- Implement and improve incident management processes, including on-call rotations and escalation procedures
- Conduct post-incident reviews and drive continuous improvement initiatives
- Develop runbooks and operational playbooks for common failure scenarios
- Implement and enforce DevSecOps practices across all cloud environments
- Collaborate with security teams to ensure compliance with enterprise security standards, including identity and access management (IAM), secrets management (CyberArk), and secure access (Appgate)
- Support vulnerability management, patching, and system hardening efforts
- Ensure adherence to governance policies, including tagging, cost optimization (FinOps), and audit requirements
- Partner with cloud architects, developers, and platform engineers to improve system design and operational efficiency
- Evaluate and adopt new tools, technologies, and best practices to enhance reliability and performance
- Promote a culture of automation, reliability, and operational excellence
- Participate in knowledge sharing, training, and mentoring of team members
Requirements:
- 5+ years of experience in Site Reliability Engineering, DevOps, or cloud operations roles
- Hands-on experience with at least one major public cloud provider (Azure, AWS, GCP, or OCI); multi-cloud experience preferred
- Strong experience with Infrastructure-as-Code using Terraform; source control and CI/CD pipelines (GitHub); monitoring and observability tools (Prometheus, Grafana); and incident management and production support
- Experience with scripting or programming (Python, Bash, or similar)
- Strong understanding of system architecture, networking, and distributed systems
- Experience with REST APIs and automation frameworks
- Experience integrating ServiceNow with monitoring and incident workflows
- Familiarity with CyberArk and Appgate for secure access and credential management
- Experience with containerization and orchestration (Docker, Kubernetes)
- Knowledge of MuleSoft or API integration platforms
- Understanding of FinOps and cloud cost optimization strategies
- Relevant certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer, Google Professional Cloud DevOps Engineer)