The New York Times is a world-renowned journalism organization committed to seeking the truth and helping people understand the world. The Times is seeking a Data Engineer to design and implement complex data pipelines, manage data storage across cloud platforms, and ensure high-quality data for analytics.
Responsibilities:
- Design, model, and implement complex ELT/ETL pipelines for the cleansed and curated data layers in the medallion architecture, taking full ownership of the data product's structure, partitioning, documentation, and performance characteristics
- Develop advanced data transformations using dbt (data build tool) for relational data modeling and PySpark for large-scale data processing within the Lakehouse, ensuring outputs meet strict Service Level Agreements and quality standards
- Collaborate across teams to define requirements and translate them into robust and scalable data models suitable for analytic consumption
- Manage the physical data storage across both GCP and AWS, selecting optimal file formats and designing efficient partitioning and clustering strategies
- Administer and tune Spark compute resources (e.g., Dataproc, EMR, or managed services) to optimize job execution time and cost
- Own core components of our centralized analytics environment, with a focus on Hex, its integrations, and the methods of data exposure and access control; support data activation strategies to ensure seamless data consumption by analytics tools
- Optimize user queries and access patterns to maintain platform performance and cost efficiency
- Implement centralized data quality checks and observability mechanisms within the data pipeline to proactively identify and resolve data issues
- Contribute to the implementation of metadata management, data lineage, and role-based access control (RBAC) initiatives across the Lakehouse environment
- Demonstrate support and understanding of our value of journalistic independence and a strong commitment to our mission to seek the truth and help people understand the world
Requirements:
- 2+ years of hands-on experience in a Data Engineering, Data Warehousing, Analytics Engineering, or equivalent role
- Proficiency in SQL and experience with complex, production-level data modeling (Kimball dimensional modeling, One Big Table (OBT), or Data Vault)
- Demonstrated experience designing, developing, and deploying end-to-end data products through the full Software Development Lifecycle
- Experience with a cloud data warehouse such as BigQuery
- Proficiency in Python for scripting and data manipulation, including knowledge of PySpark or other Spark APIs
- Familiarity with cloud services and data storage components in at least one major cloud provider (GCP or AWS)
- Experience with workflow orchestration tools (e.g., Airflow, Cloud Composer, or Prefect) and version control systems (Git)
- Experience operating in a dual-cloud environment (GCP/AWS)
- Experience with Infrastructure-as-Code (IaC) tools like Terraform
- Experience with open Lakehouse table formats such as Apache Iceberg or Delta Lake
- Familiarity with experimentation or A/B testing platforms and the data required to support them
- Experience upholding data product quality standards by integrating advanced testing, quality checks, and monitoring into the CI/CD pipeline