Yahoo is a leading communication platform known for its Yahoo Mail service, which boasts hundreds of millions of users. The company is seeking a Senior Data Engineer to architect and maintain high-reliability data pipelines while collaborating with cross-functional teams to define the global data ontology and improve data infrastructure.
Responsibilities:
- Architect: Partner with Data Science, Product, and Engineering to define the global data ontology for Yahoo Mail and lead the technical roadmap for core datasets
- Build Scalable Systems: Design, build, and maintain high-reliability batch and streaming data pipelines that populate our mission-critical data lakehouse
- Innovate Tooling: Develop automated frameworks and self-service tools, leveraging autonomous AI agents and AI tooling, that streamline how users interact with data products across the company
- Data Governance: Establish standard methodologies for data operations, lifecycle management, and strict SLA management for all datasets within your ownership area
- Optimize Performance: Improve existing large-scale data infrastructure by applying advanced algorithmic concepts to optimize code and the underlying data system stacks
- Collaborate: Act as a data consultant for complex cross-functional projects, ensuring integrated solutions across Mail engineering teams
Requirements:
- BS/MS in Computer Science, Engineering, or a relevant technical field
- 6+ years of experience building scalable ETL/ELT pipelines using industry-standard orchestration (Airflow, Composer, or Oozie)
- Deep expertise in SQL, PySpark, or Scala
- Proven track record of managing multi-terabyte to petabyte-scale datasets and solving large-scale challenges (e.g., skew mitigation, data sketches, and accumulation patterns)
- Professional experience with at least one major cloud provider (GCP, AWS, or Azure) and a strong command of GitOps workflows (CI/CD, PRs)
- Experience working within GDPR and other data privacy frameworks
- Exceptional communication skills with the ability to prioritize tasks in a high-pressure, fast-paced environment
- 3+ years of experience with Google Cloud Platform (BigQuery, Dataproc, Dataflow, Composer, Looker)
- Experience building data features specifically optimized for Machine Learning models and AI applications