Serve as a primary responder for production incidents, owning triage through resolution — including root cause analysis, infrastructure remediation, and order automation recovery.
Work directly alongside the Manager of Consumer Technology Site Reliability and the Helpdesk team to handle day-to-day triage and fix responsibilities.
Partner with development teams to evaluate production risk before deployment.
Monitor the production environment and proactively surface anomalies, enhancement opportunities, and risk areas to leadership.
Assist with data cleanup and order recovery operations following production incidents.
Support testing and validation of infrastructure changes prior to production deployment.

5+ years of experience in a Site Reliability Engineering, DevOps, or Production Support role at a software or e-commerce company.
Demonstrated ability to independently diagnose and resolve production incidents, including infrastructure-level failures (servers, queues, batch jobs, APIs).
Hands-on experience with AWS (EC2, CloudWatch, or equivalent) for day-to-day operational tasks.
Experience with Datadog, New Relic, PagerDuty, or equivalent platforms for monitoring, alerting, and incident detection.
Working knowledge of MySQL/relational databases for investigative queries and data validation.
Ability to read and analyze complex SQL queries to diagnose production data issues.
Familiarity with PHP, Python, Bash, or similar languages sufficient to read, debug, and modify production scripts and automation jobs.
Experience with Rundeck, cron, or equivalent batch job management and monitoring tools.

Senior Site Reliability Engineer

Key skills