About this role

Respond to production incidents
Work with business partners to answer application-specific questions
Promote availability, resilience, and stability
Build, manage, and optimize resilient, scalable cloud platforms
Lead and execute cloud migration initiatives
Ensure high availability, scalability, fault tolerance, and disaster recovery requirements are met
Conduct root cause analysis (RCA) for critical incidents
Collaborate closely with development, infrastructure, security, and business teams
Analyze and reverse-engineer existing applications

Bachelor’s degree or higher in a technology-related field (such as Engineering, Computer Science, or Information Technology) required
Minimum 5 years of combined experience across Production Support, Application Development (Java), and Site Reliability Engineering (SRE)
3 years of hands-on experience with Amazon EKS and RDS
Experience implementing and maintaining CI/CD pipelines
Experience designing, implementing, and continuously improving observability solutions using tools such as Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, and Splunk
Working knowledge of SQL and relational databases (Oracle or other RDBMS) to support application troubleshooting
Certification in public cloud (AWS) or Kubernetes is a plus

Senior Site Reliability Engineer

Key skills