Own the reliability and stability of production data pipelines and data platform services
Diagnose and resolve data pipeline failures, delays, and data quality issues in production environments
Investigate issues across distributed data systems (e.g., Spark/EMR workloads, ingestion pipelines, warehouse performance)
Lead or support incident response, including triage, mitigation, and long-term resolution
Perform root cause analysis (RCA) and implement durable fixes to prevent recurrence
Define and improve data SLAs (freshness, latency, completeness) and ensure adherence
Design and enhance monitoring, alerting, and observability for data systems
Develop automation and tooling to reduce operational toil and improve system resilience
Contribute to disaster recovery (DR) and resiliency planning, including backup validation and recovery workflows
Partner with engineering teams to improve pipeline design, reliability, and operational readiness
Create and maintain runbooks, SOPs, and operational documentation
Participate in occasional off-hours support for production data systems when required
Requirements
Strong experience working with production data platforms in AWS environments
Prior experience building data pipelines and seeing them through to production, including exposure to real-world failures and operational challenges
Strong experience with Python and SQL in real data systems
Hands-on experience troubleshooting distributed data processing systems (e.g., Spark/EMR, Redshift, streaming systems)
Proven ability to debug and resolve production issues in data pipelines and data platforms
Experience with AWS data services (such as EMR, Redshift, DynamoDB, S3, or similar)
Experience handling production incidents and performing root cause analysis
Strong problem-solving mindset and ability to work through ambiguous production issues
Tech Stack
Amazon Redshift
AWS
DynamoDB
Python
Spark
SQL
Benefits
Medical, dental, vision and life insurance
Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
Tuition reimbursement up to $5,250/year
Business-casual environment that includes the option to wear jeans
Generous paid time off upon hire – including a paid time off program, ten paid company holidays, and three floating holidays each calendar year
Paid volunteer time – 16 hours per calendar year
Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.