Yahoo serves as a trusted guide for hundreds of millions of people globally, helping them achieve their goals online through our portfolio of iconic products. The ideal candidate will have strong distributed systems knowledge and AI/ML experience to design, build, and optimize the scalable data pipelines and infrastructure that power advanced analytics and machine learning solutions.
Responsibilities:
- Apply software engineering expertise to build high-performance, scalable data warehouses
- Be excited to learn and to take ownership of large-scale projects spanning many tech stacks and environments
- Design, build, and launch efficient, reliable data pipelines that move and transform data at multi-petabyte scale using the latest technologies
- Build real-time ingestion and analytics pipelines capable of processing more than a million events per second and delivering insights at sub-second latency (a streaming pipeline sketch follows this list)
- Interact with product owners and end users to understand and address new business requirements as they emerge
- Design and audit processes that ensure delivery of high-quality data through rigorous QA checks (a data-quality check sketch also follows this list)
- Apply strong data modeling skills and an understanding of the nuances of the various dimension and metric types in warehouses
- Design workflows to ingest, load, and present new data sets for users
- Participate in the on-call rotation for production pipelines (typically a couple of times each quarter)
- Define and manage SLAs for all data sets in allocated areas of ownership
- Work with the production engineering / infrastructure team to drive resolution of production issues
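To make the streaming responsibility concrete, here is a minimal sketch of the kind of real-time pipeline this role builds, written with the Apache Beam Python SDK (which Dataflow executes). The Pub/Sub topic, BigQuery table, and event schema are hypothetical placeholders, not actual Yahoo systems.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def run():
    # streaming=True tells the runner (e.g. Dataflow) this pipeline is unbounded.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/events")  # hypothetical topic
            | "Parse" >> beam.Map(lambda raw: json.loads(raw.decode("utf-8")))
            | "KeyByUser" >> beam.Map(lambda event: (event["user_id"], 1))
            | "Window" >> beam.WindowInto(FixedWindows(60))  # 1-minute fixed windows
            | "CountPerUser" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(lambda kv: {"user_id": kv[0], "events": kv[1]})
            | "WriteCounts" >> beam.io.WriteToBigQuery(
                "example-project:analytics.user_event_counts",  # hypothetical table
                schema="user_id:STRING,events:INTEGER",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```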
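And a hedged sketch of a post-load QA check of the sort the data-quality bullet describes: compare the loaded row count to an expected count and bound the null rate on a key column. It uses the google-cloud-bigquery client library; the table and column names are illustrative, not prescribed.

```python
from google.cloud import bigquery


def check_load(client: bigquery.Client, table: str, key_col: str,
               expected_rows: int, max_null_rate: float = 0.01) -> None:
    """Fail loudly if the loaded table misses the expected row count
    or its key column has too many nulls. All names are hypothetical."""
    query = f"""
        SELECT COUNT(*) AS total,
               COUNTIF({key_col} IS NULL) AS null_keys
        FROM `{table}`
    """
    row = next(iter(client.query(query).result()))
    if row.total != expected_rows:
        raise ValueError(f"{table}: expected {expected_rows} rows, got {row.total}")
    if row.total and row.null_keys / row.total > max_null_rate:
        raise ValueError(f"{table}: null rate on {key_col} exceeds {max_null_rate:.0%}")


# Example usage (hypothetical table and key):
# check_load(bigquery.Client(), "warehouse.dim_user", "user_key", expected_rows=1_000_000)
```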
Requirements:
- BS/MS in Computer Science, Mathematics, or Statistics
- 4+ years of experience in relevant software development, including at least 2 years of professional Java and/or Python experience
- 2+ years of experience in the Big Data pipeline and analytics space, across multiple technology stacks
- 2+ years of experience designing, implementing, and maintaining custom ETL on Big Data stacks (Hadoop, MapReduce, Pig, Hive, AWS EMR, Apache Beam, Google Cloud Dataflow, BigQuery)
- Experience or familiarity with some of the following: Kafka, Storm, stream processing (Spark Streaming, Dataflow), Elasticsearch
- Experience designing, building, and maintaining scalable data pipelines and ETL processes that support machine learning and AI initiatives on Google Cloud Platform (GCP)
- Experience implementing and optimizing data storage using GCP services such as BigQuery, Cloud Storage, and Dataflow (a partitioning sketch follows this list)
- Ability to ensure data quality, integrity, and security throughout the data lifecycle
- Ability to collaborate with data scientists, analysts, and business stakeholders to understand data requirements and deliver actionable insights
- Experience monitoring, troubleshooting, and maintaining the health and performance of cloud-based data infrastructure
- Track record of automating manual, repetitive processes to improve efficiency and reduce errors
- Working knowledge of data governance and compliance best practices for protecting sensitive information and meeting regulatory standards
- Commitment to staying current with new GCP features, tools, and best practices to continuously enhance data management capabilities
- Habit of documenting solutions, processes, and architectural decisions to facilitate knowledge sharing and maintainability
- Experience with MapReduce or another parallel data processing system
- Experience with schema design and dimensional data modeling
- Comfortable writing complex SQL queries (a query sketch closes this section)
- Strong data mindset with a deep appreciation for analyzing data to identify product gaps and enhancements that improve user engagement and revenue growth
- Excellent communication skills, including the ability to tell insightful stories with data and to manage communication with internal teams and stakeholders
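As one illustration of the BigQuery storage optimization mentioned above, the sketch below creates a day-partitioned, clustered table with the google-cloud-bigquery client; partitioning and clustering prune the bytes a query has to scan. The project, dataset, and column names are hypothetical placeholders.

```python
from google.cloud import bigquery


def create_events_table(client: bigquery.Client) -> bigquery.Table:
    table = bigquery.Table(
        "example-project.warehouse.fact_ad_events",  # hypothetical table
        schema=[
            bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
            bigquery.SchemaField("user_key", "STRING", mode="REQUIRED"),
            bigquery.SchemaField("revenue_usd", "NUMERIC"),
        ],
    )
    # Partition by day on event_ts; cluster by user_key so filtered
    # queries scan only the relevant partitions and blocks.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="event_ts")
    table.clustering_fields = ["user_key"]
    return client.create_table(table)
```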
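And a sketch of the kind of "complex SQL" the last requirements have in mind: a star-schema query that joins a fact table to date and user dimensions and layers a window function over the aggregate, run through the same client library. All table and column names are invented for illustration.

```python
from google.cloud import bigquery

# Star-schema join plus a trailing 7-day window over the daily aggregate.
QUERY = """
SELECT
  d.calendar_date,
  u.country,
  SUM(f.revenue_usd) AS daily_revenue,
  SUM(SUM(f.revenue_usd)) OVER (
    PARTITION BY u.country
    ORDER BY d.calendar_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS trailing_7d_revenue
FROM `example-project.warehouse.fact_ad_events` AS f
JOIN `example-project.warehouse.dim_date` AS d USING (date_key)
JOIN `example-project.warehouse.dim_user` AS u USING (user_key)
GROUP BY d.calendar_date, u.country
ORDER BY u.country, d.calendar_date
"""


def main():
    client = bigquery.Client()
    for row in client.query(QUERY).result():
        print(row.calendar_date, row.country, row.trailing_7d_revenue)


if __name__ == "__main__":
    main()
```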