Lead Software Engineer, Cloud Site Reliability at iCert Global | JobVerse
iCert Global
Pune, Maharashtra, India
Full Time
3 hours ago
No Sponsorship
Key skills
Azure
Cloud
Distributed Systems
Docker
Grafana
Kubernetes
Prometheus
Python
ServiceNow
Terraform
Bash
PowerShell
Analytics
BI
Power BI
AKS
Helm
Azure Monitor
Datadog
Leadership
Communication
About this role
Role Overview
Lead 24x7 NOC operations, with mandatory rotational shifts, to ensure system availability and SLA adherence
Act as Major Incident Manager (P1/P2 incidents), driving triage, war room coordination, and stakeholder communication
Implement and enhance observability practices across logs, metrics, and traces
Work with tools like Datadog and Azure Monitor for monitoring and alerting
Drive proactive monitoring, alert tuning, anomaly detection, and AIOps initiatives
Manage Azure infrastructure and AKS clusters, including troubleshooting, scaling, and performance tuning
Build automation and self-healing workflows using Terraform, ARM, Helm, Power Automate, and scripting
Collaborate with engineering teams to improve reliability, deployment pipelines, and cloud-native architecture
Develop dashboards and reports using Power BI and ServiceNow
Handle monthly business reviews and leadership reporting
Mentor team members and drive process standardization and operational excellence
Requirements
7–12 years of experience in CloudOps / SRE / NOC environments (24x7 operations)
Strong expertise in Azure Infrastructure (VMs, Networking, Storage)
Hands-on experience with Azure Kubernetes Service (AKS), Kubernetes, Docker
Strong experience with monitoring and observability tools (Datadog, Azure Monitor, Prometheus, Grafana)
Proven experience in incident management and major incident handling, including monthly reporting
Experience with Infrastructure as Code (Terraform, ARM templates, Helm)
Scripting skills in PowerShell, Python, or Bash
Experience with ServiceNow (Incident, Problem, Change modules and dashboards)
Strong reporting and analytics experience with Power BI, plus exposure to tools such as Power Automate
Good understanding of distributed systems and cloud-native architecture
Tech Stack
Azure
Cloud
Distributed Systems
Docker
Grafana
Kubernetes
Prometheus
Python
ServiceNow
Terraform
Benefits
Health insurance
Retirement plans
Paid time off
Flexible work arrangements
Professional development