Cognizant is seeking a highly skilled Multi‑Cloud Site Reliability Engineer (SRE) to design, build, and operate secure, resilient, and scalable cloud platforms. The role focuses on ensuring platform reliability, availability, performance, security, and cost efficiency through automation and best practices in Site Reliability Engineering.

Responsibilities:

Design, build, and operate enterprise‑scale multi‑cloud platforms, with Microsoft Azure as the primary cloud, ensuring high availability, scalability, security, and resilience across Azure, GCP, and AWS
Apply Site Reliability Engineering (SRE) principles by defining and managing SLIs, SLOs, and error budgets, driving proactive reliability improvements, capacity planning, and reduction of operational toil
Develop and maintain infrastructure‑as‑code (IaC) using Terraform as the standard, along with ARM/Bicep, GCP Deployment Manager, and AWS CloudFormation to enable consistent, automated, and repeatable multi‑environment deployments
Implement automation and DevOps practices using Python and PowerShell to enhance platform stability, enable self‑healing capabilities, streamline operations, and support CI/CD pipelines for infrastructure and platform services
Design and enforce cloud governance, security, and compliance controls, including identity and access management, policy guardrails, disaster recovery, and business continuity strategies across all cloud providers
Establish robust observability and incident management practices, including monitoring, logging, alerting, and root‑cause analysis, to ensure rapid detection and resolution of reliability and performance issues
Collaborate with architecture, security, application, and operations teams, clearly communicating complex technical concepts, producing high‑quality documentation, and serving as a technical advisor on cloud reliability and operational best practices

Requirements:

5+ years of hands-on experience with Microsoft Azure (primary platform)
3+ years of experience with Google Cloud Platform (GCP)
1+ year of experience with Amazon Web Services (AWS)
Proven experience operating production, business-critical cloud workloads in enterprise environments
Strong expertise in multi-cloud architecture and design principles
Deep knowledge of infrastructure-as-code and automation tooling
Solid experience with cloud networking (VNets/VPCs, routing, firewalls, load balancing), identity and access management (IAM), scripting and automation (Python, PowerShell), platform services (compute, storage, databases, messaging), Kubernetes and container platforms, and observability and monitoring tools
Strong understanding of fundamental IT operations
Working knowledge of ITIL principles, including incident, problem, and change management
Experience supporting 24x7 platforms with defined SLAs and operational processes
Experience designing enterprise cloud landing zones and reference architectures
Advanced experience with Terraform modules and multi-environment deployments
Knowledge of FinOps and cloud cost optimization strategies
Exposure to regulated or compliance-driven environments
Prior experience working in an SRE, Platform Engineering, or Cloud Center of Excellence (CCoE) team
