Yahoo is a leading company whose portfolio of products serves millions of users globally. They are seeking a Software Data Engineer II to build innovative data solutions and enhance their cloud-first data ecosystem, with a focus on designing robust data pipelines and collaborating with analytics teams.
Responsibilities:
- Design and develop robust, scalable, and resilient data pipelines using both streaming and batch processing technologies
- Collaborate with analytics and reporting teams to understand their needs and deliver well-structured data models
- Architect and implement a variety of data solutions by applying concepts such as Lambda, Kappa, and streaming architectures, as well as ETL/ELT patterns
- Propose, prototype, and implement new technologies and best practices to improve our data ecosystem's efficiency and reliability
- Contribute to the full data lifecycle, from data ingestion and transformation to modeling and warehousing
- Participate in company Data Working Groups to architect data strategy for search and help define and enforce best practices for data quality, governance, and architecture
Requirements:
- B.S. or M.S. in Computer Science (or equivalent experience)
- 3+ years of related industry experience
- Strong background in data engineering, with a proven track record of designing and building scalable data platforms
- Proactive self-starter who can take the lead on projects, sell new ideas, and drive them to completion with a high degree of autonomy
- Deep understanding of data modeling principles and modern data architecture concepts
- Excited by the challenge of working with diverse datasets in both batch and real-time streaming modes
- Strong collaborator with excellent communication skills, capable of working with both technical and non-technical stakeholders
- Fundamental belief in leveraging AI and machine learning as a core component of every data solution, from design to implementation
- Deep expertise in GCP and/or AWS data services
- Experience with modern cloud warehouses like BigQuery, Snowflake, Redshift, Databricks, or Dremio
- Proficiency with object storage (Google Cloud Storage, Amazon S3); relational, document, and wide-column databases; columnar formats (Parquet, ORC, etc.); schema formats (Avro, Protocol Buffers, etc.); and compression formats (gzip, Snappy, bzip2, etc.)
- Hands-on experience with distributed data processing engines such as Apache Flink, Apache Beam, Spark, or similar technologies
- Expertise in technologies like Cloud Functions, Cloud Run, Kubernetes, Dataflow, Dataproc, Dataplex, Glue, and EMR; transformation tools like dbt or Dataform; and orchestration tools like Airflow, Cloud Workflows, or Step Functions
- Practical experience with streaming platforms like Pub/Sub and Kafka