Lead the execution and continuous improvement of SRE practices across assigned platforms and services, reinforcing a culture of reliability, efficiency, and operational ownership
Manage and evolve automation strategies that reduce operational toil, improve system reliability, and increase engineering productivity
Design, implement, and operate observability, monitoring, and alerting solutions that provide actionable insight into system health, availability, and performance
Own and lead high‑severity incident response for supported services, ensuring effective triage, coordination, root cause analysis, and completion of corrective and preventative actions
Analyze reliability, performance, and capacity metrics to identify risks, drive proactive improvements, and support long‑term system resilience
Partner with software engineering, product, and infrastructure teams to embed SRE principles throughout the development lifecycle and influence architecture and design decisions
Build, coach, and develop SRE managers and engineers, fostering technical excellence, career growth, and strong on‑call and operational practices
Support capacity planning, scalability assessments, and demand forecasting for critical systems and services
Ensure SRE processes, standards, and best practices are well documented, understood, and consistently applied
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
12+ years of overall engineering experience, including 5+ years in Site Reliability Engineering, DevOps, or a similar role
3+ years of experience leading engineering teams or managing senior technical contributors
Strong experience with observability and monitoring platforms such as Datadog, Prometheus, Dynatrace, Grafana, ELK, or similar
Proficiency in at least one programming language such as Python, Go, or Java
Hands‑on experience with cloud platforms (AWS, Azure, or GCP) and with containerization and container orchestration technologies (Docker, Kubernetes)
Solid working knowledge of AWS services such as VPC, EC2, ELB, ECS, EKS, Lambda, IAM, CloudWatch, S3, SQS, SNS, Route 53, and WAF
Experience with infrastructure‑as‑code and configuration management tools such as Terraform, Ansible, or equivalent
Strong troubleshooting and problem‑solving skills in distributed systems environments
Working knowledge of security best practices and operational risk management
Experience with resilience testing, chaos engineering, or failure‑injection techniques