Cognizant is seeking a highly skilled Multi‑Cloud Site Reliability Engineer (SRE) to design, build, and operate secure, resilient, and scalable cloud platforms. The role focuses on ensuring platform reliability, availability, performance, security, and cost efficiency through automation and best practices in Site Reliability Engineering.
Responsibilities:
- Design, build, and operate enterprise‑scale multi‑cloud platforms, with Microsoft Azure as the primary cloud, ensuring high availability, scalability, security, and resilience across Azure, GCP, and AWS
- Apply Site Reliability Engineering (SRE) principles by defining and managing SLIs, SLOs, and error budgets, driving proactive reliability improvements, capacity planning, and reduction of operational toil
- Develop and maintain infrastructure‑as‑code (IaC) using Terraform as the standard, along with ARM/Bicep, GCP Deployment Manager, and AWS CloudFormation to enable consistent, automated, and repeatable multi‑environment deployments
- Implement automation and DevOps practices using Python and PowerShell to enhance platform stability, enable self‑healing capabilities, streamline operations, and support CI/CD pipelines for infrastructure and platform services
- Design and enforce cloud governance, security, and compliance controls, including identity and access management, policy guardrails, disaster recovery, and business continuity strategies across all cloud providers
- Establish robust observability and incident management practices, including monitoring, logging, alerting, and root‑cause analysis, to ensure rapid detection and resolution of reliability and performance issues
- Collaborate with architecture, security, application, and operations teams, clearly communicating complex technical concepts, producing high‑quality documentation, and serving as a technical advisor on cloud reliability and operational best practices
Requirements:
- 5+ years of hands-on experience with Microsoft Azure (primary platform)
- 3+ years of experience with Google Cloud Platform (GCP)
- 1+ year of experience with Amazon Web Services (AWS)
- Proven experience operating production, business-critical cloud workloads in enterprise environments
- Strong expertise in multi-cloud architecture and design principles
- Deep knowledge of infrastructure-as-code and automation tooling
- Solid experience with:
  - Cloud networking (VNETs/VPCs, routing, firewalls, load balancing)
  - Identity and access management (IAM)
  - Scripting and automation (Python, PowerShell)
  - Platform services (compute, storage, databases, messaging)
  - Kubernetes and container platforms
  - Observability and monitoring tools
- Strong understanding of fundamental IT operations
- Working knowledge of ITIL principles, including incident, problem, and change management
- Experience supporting 24x7 platforms with defined SLAs and operational processes
- Experience designing enterprise cloud landing zones and reference architectures
- Advanced experience with Terraform modules and multi-environment deployments
- Knowledge of FinOps and cloud cost optimization strategies
- Exposure to regulated or compliance-driven environments
- Prior experience working in an SRE, Platform Engineering, or Cloud Center of Excellence (CCoE) team