Design and implement high-availability, fault-tolerant architectures across on-prem and cloud platforms (AWS)
Lead multi-region disaster recovery (DR) planning, implementation, and testing, including RTO/RPO definition and validation
Define and enforce SLOs, SLIs, and error budgets to balance reliability with delivery velocity
Drive self-healing automation and proactive remediation strategies
Build and maintain infrastructure using Terraform and configuration management tools (e.g., Chef)
Develop automation to eliminate manual operational tasks (toil reduction)
Create reusable modules, pipelines, and guardrails for standardized deployments
Automate certificate lifecycle management, key rotation, and security updates
Design and implement end-to-end observability (metrics, logs, traces, synthetic monitoring)
Build dashboards, alerts, and runbooks to enable fast detection and resolution of incidents
Improve signal-to-noise ratio in alerting to reduce operational fatigue
Perform root cause analysis (RCA) and lead post-incident reviews with actionable follow-ups
Engineer and operate platforms on AWS, including EKS, EC2, RDS/Aurora, Lambda, API Gateway, CloudFront, WAF, ALB/NLB, CloudWatch, X-Ray, IAM, and Secrets Manager
Lead cloud migrations and modernization initiatives, including legacy system refactoring