Own and improve the reliability, stability, scalability, and performance of our core data platforms and services
Provide operational support for large-scale, distributed data systems, ensuring high availability and strong SLAs
Partner closely with full-stack, data, and platform engineering teams to deliver continuous improvements
Operate and support EMR and EMR Serverless (Python/Spark) workloads and data pipelines
Support and optimize Amazon Redshift and DynamoDB in high-throughput, production environments
Design, build, and evolve monitoring, alerting, and observability frameworks with a focus on symptoms, not just outages
Lead incident response, troubleshooting production issues across the full stack and coordinating with internal and external stakeholders
Perform root cause analysis (RCA) and readiness reviews; turn findings into durable fixes and automation
Create and maintain runbooks, SOPs, and operational documentation
Collaborate with engineering teams to optimize performance, reliability, and cost
Participate in an on-call rotation to respond to incidents impacting customer-facing systems
Recommend and influence the use of AWS managed services and architectural patterns
Continuously evaluate system performance, capacity, and cost to scale efficiently

4–6 years of experience building or operating systems across multiple architecture domains: application, data, integration, infrastructure, and security
4+ years of hands-on AWS experience, with strong production exposure to several of the following: Redshift, DynamoDB, EMR, EMR Serverless, EC2, S3, Lambda, Step Functions, EventBridge, RDS, IAM
Proven experience operating data platforms such as data lakes and data warehouses in production
Strong SQL skills and experience working with modern databases (e.g., Redshift, DynamoDB, Postgres, MySQL, Oracle)
4+ years of Python experience, including scripting, automation, or data workloads
Experience with CloudWatch, infrastructure monitoring, and alerting
Hands-on experience with incident management, uptime SLAs, and customer-impacting systems
Strong understanding of Git-based workflows (GitHub, Git Flow, or similar)
Experience working in Agile environments (Scrum / Kanban) using tools such as Jira and Confluence
Bachelor’s degree in Computer Science, Information Systems, Data/Analytics, or a related field; equivalent practical experience welcomed

Medical, dental, vision and life insurance
Retirement savings – 401(k) plan with generous company matching contributions (up to 6%)
Tuition reimbursement up to $5,250/year
Business-casual environment that includes the option to wear jeans
Generous paid time off upon hire – including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
Paid volunteer time – 16 hours per calendar year
Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play.

Site Reliability Engineer

Key skills