Define, measure, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for cloud services and infrastructure components.
Lead efforts to continuously improve system availability, fault tolerance, and disaster recovery capabilities.
Ensure proactive incident detection, efficient root cause analysis, and timely resolution of production incidents.
Drive automation efforts to reduce manual intervention and streamline cloud infrastructure management.
Design and implement monitoring, logging, and alerting solutions to track cloud infrastructure health, performance, and security.
Ensure that cloud infrastructure is built with security best practices in mind and meets all relevant compliance and regulatory requirements.
Work closely with development, DevOps, and operations teams to ensure cloud infrastructure aligns with application and business requirements.
Lead incident response for cloud infrastructure issues, ensuring that all incidents are managed effectively.

10 years of hands-on experience with cloud automation and configuration management tools (e.g., Terraform, AWS CloudFormation, Ansible, Puppet).
7+ years of experience in a Site Reliability Engineering (SRE), Infrastructure Engineering, or DevOps role, with at least 3 years in a technical leadership capacity.
Deep knowledge of cloud services and technologies (e.g., EC2, S3, Lambda, Kubernetes).
Proficiency in scripting or programming languages (e.g., Python, Go, Bash).
Experience with monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack).
Familiarity with Continuous Integration/Continuous Deployment (CI/CD) pipelines and cloud-native development practices.
Strong expertise in managing cloud infrastructure (AWS, Google Cloud, Azure) in production environments.
Experience with cloud-native architectures, microservices, and containerized environments (Kubernetes, Docker).
Strong understanding of cloud networking, storage, and compute services, on-premises infrastructure, and security best practices.
Strong knowledge of Linux administration and internals.
Effective communication skills, with the ability to translate technical concepts to non-technical stakeholders.

Lead Site Reliability Engineer

Key skills