Empower is dedicated to transforming financial lives and offers a flexible work environment that celebrates internal mobility. The company is seeking a hands-on Data Reliability Engineer to ensure the reliability and operational excellence of its AWS-based data platform, with a focus on troubleshooting and improving production data systems.
Responsibilities:
- Own the reliability and stability of production data pipelines and data platform services
- Diagnose and resolve data pipeline failures, delays, and data quality issues in production environments
- Investigate issues across distributed data systems (e.g., Spark/EMR workloads, ingestion pipelines, warehouse performance)
- Lead or support incident response, including triage, mitigation, and long-term resolution
- Perform root cause analysis (RCA) and implement durable fixes to prevent recurrence
- Define and improve data SLAs (freshness, latency, completeness) and ensure adherence
- Design and enhance monitoring, alerting, and observability for data systems
- Develop automation and tooling to reduce operational toil and improve system resilience
- Contribute to disaster recovery (DR) and resiliency planning, including backup validation and recovery workflows
- Partner with engineering teams to improve pipeline design, reliability, and operational readiness
- Create and maintain runbooks, SOPs, and operational documentation
- Participate in occasional off-hours support for production data systems when required
Requirements:
- Strong experience working with production data platforms in AWS environments
- Prior experience building data pipelines and seeing them through to production, including exposure to real-world failures and operational challenges
- Strong experience with Python and SQL in production data systems
- Hands-on experience troubleshooting distributed data processing systems (e.g., Spark/EMR, Redshift, streaming systems)
- Proven ability to debug and resolve production issues in data pipelines and data platforms
- Experience with AWS data services (such as EMR, Redshift, DynamoDB, S3, or similar)
- Experience handling production incidents and performing root cause analysis
- Strong problem-solving mindset and ability to work through ambiguous production issues
- Experience handling real-world data issues such as pipeline delays or failures
- Experience with backfills and reprocessing
- Experience with late-arriving or incomplete data
- Experience improving observability and alerting specifically for data systems
- Experience influencing or guiding data pipeline reliability and operational practices
- Exposure to streaming/event-driven systems (Kafka, Kinesis, CDC patterns)
- Experience with disaster recovery, backup validation, and resiliency testing
- Strong communication during incidents with both technical and non-technical stakeholders