Own the reliability and stability of production data pipelines and data platform services.
Define, improve, and enforce data SLAs/SLOs for batch and streaming products, including freshness, latency, and completeness.
Diagnose and resolve data pipeline failures, delays, and data quality issues in production environments.
Investigate issues across distributed data systems, including Spark/EMR workloads, ingestion pipelines, and warehouse performance.
Lead or support incident response, including triage, mitigation, and long-term resolution.
Perform root cause analysis and implement durable fixes to prevent recurrence.
Design and enhance monitoring, alerting, and observability for data systems.
Develop automation and tooling to reduce operational toil and improve system resilience.
Contribute to disaster recovery and resiliency planning, including backup validation and recovery workflows.
Partner with engineering teams to improve pipeline design, reliability, and operational readiness.
Create and maintain runbooks, standard operating procedures (SOPs), and operational documentation.
Participate in occasional off-hours support for production data systems when required.

Bachelor’s degree in Computer Science, Information Systems, Data Science, or a related field
5+ years of experience in data engineering or analytics platform roles, including 3+ years operating in a production cloud data warehouse environment such as Redshift or Snowflake
3+ years of experience building AWS data pipelines and supporting them through production, including exposure to real-world failures and operational challenges
3+ years of experience working with production data platforms in AWS environments, with a focus on anomaly detection, reconciliation, and end-to-end validation
3+ years of experience with Python and SQL in real data systems
Hands-on experience troubleshooting distributed data processing systems such as Spark/EMR, Redshift, and streaming systems
Proven ability to debug and resolve production issues in data pipelines and data platforms
Experience with AWS data services such as EMR, Redshift, DynamoDB, S3, or similar
Proven ability to handle production incidents and perform root cause analysis
Strong problem-solving mindset and ability to work through ambiguous production issues

Medical, dental, vision and life insurance
Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
Tuition reimbursement up to $5,250/year
Business-casual environment that includes the option to wear jeans
Generous paid time off upon hire – including a paid time off program, ten paid company holidays, and three floating holidays each calendar year
Paid volunteer time – 16 hours per calendar year
Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.

Data Reliability Engineer

Key skills