Respond to production incidents
Collaborate with business partners to answer application-specific questions
Work with product teams to promote availability, resilience, and stability
Proactively identify performance bottlenecks, capacity risks, and failure points; recommend and implement remediation strategies
Instrument applications and infrastructure to provide end-to-end visibility into system health, performance, and reliability
Lead incident response, providing rapid triage and resolution during production outages or performance degradation
Collaborate closely with development, infrastructure, security, and business teams to align operational and business objectives

Bachelor’s degree or higher in a technology-related field (such as Engineering, Computer Science, or Information Technology) required
Minimum 5 years of combined experience across Production Support, Application Development (Java), and Site Reliability Engineering (SRE), with a focus on system stability, scalability, and performance
3 years of hands-on experience with Amazon EKS and RDS
Lead and execute cloud migration initiatives, ensuring minimal downtime, performance optimization, and adherence to architectural best practices
Implement and maintain CI/CD pipelines to enable reliable, automated, and secure application deployments
Design, implement, and continuously improve observability solutions, including monitoring, logging, alerting, and distributed tracing, using tools such as Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, and Splunk
Conduct root cause analysis (RCA) for critical incidents and drive corrective and preventive actions

Senior Site Reliability Engineer

Key skills