Empower is dedicated to transforming financial lives, offering a flexible work environment and celebrating internal mobility. The company is seeking a hands-on Data Reliability Engineer to ensure the reliability and operational excellence of its AWS-based data platform, with a focus on troubleshooting and improving production data systems.

Responsibilities:

Own the reliability and stability of production data pipelines and data platform services
Diagnose and resolve data pipeline failures, delays, and data quality issues in production environments
Investigate issues across distributed data systems (e.g., Spark/EMR workloads, ingestion pipelines, warehouse performance)
Lead or support incident response, including triage, mitigation, and long-term resolution
Perform root cause analysis (RCA) and implement durable fixes to prevent recurrence
Define and improve data SLAs (freshness, latency, completeness) and ensure adherence
Design and enhance monitoring, alerting, and observability for data systems
Develop automation and tooling to reduce operational toil and improve system resilience
Contribute to disaster recovery (DR) and resiliency planning, including backup validation and recovery workflows
Partner with engineering teams to improve pipeline design, reliability, and operational readiness
Create and maintain runbooks, SOPs, and operational documentation
Participate in occasional off-hours support for production data systems when required

Requirements:

Strong experience working with production data platforms in AWS environments
Prior experience building data pipelines and seeing them through to production, including exposure to real-world failures and operational challenges
Strong experience with Python and SQL in real data systems
Hands-on experience troubleshooting distributed data processing systems (e.g., Spark/EMR, Redshift, streaming systems)
Proven ability to debug and resolve production issues in data pipelines and data platforms
Experience with AWS data services (such as EMR, Redshift, DynamoDB, S3, or similar)
Experience handling production incidents and performing root cause analysis
Strong problem-solving mindset and ability to work through ambiguous production issues
Experience handling real-world data issues such as pipeline delays or failures
Experience with backfills and reprocessing
Experience with late-arriving or incomplete data
Experience improving observability and alerting specifically for data systems
Experience influencing or guiding data pipeline reliability and operational practices
Exposure to streaming/event-driven systems (Kafka, Kinesis, CDC patterns)
Experience with disaster recovery, backup validation, and resiliency testing
Strong communication during incidents with both technical and non-technical stakeholders
