Responsibilities
Act as a hands-on technical lead who not only defines the architecture but also codes, deploys, and maintains scalable ETL pipelines and data structures
Spearhead the technical implementation of data ingestion for the Translational Data Lake, managing complex datasets (genomics, proteomics, imaging, lab data, etc.) in modern cloud architectures
Lead data engineering projects beyond the Data Lake, designing bespoke integration solutions for diverse scientific data sources across the Research organization
Design and script automated procedures that normalize unstandardized data from external vendors (CROs) into a structured Common Data Model (CDM)
Partner with various functions in Research and IT to align infrastructure with scientific needs, ensuring solutions are robust, FAIR-compliant, and scalable
Develop and communicate the technical vision for biomarker data integration and reuse
Architect and implement scalable ETL procedures, APIs, and front-end tools for data access and visualization
Engage stakeholders to gather requirements and incorporate feedback into design
Lead user acceptance testing (UAT) and ensure high-quality deliverables
Collaborate with IT and Translational leads to align infrastructure and governance processes
Champion FAIR principles and interoperability across translational and clinical programs
Requirements
Bachelor’s or master’s degree in Computer Science, Data Engineering, Bioinformatics, or a related field
8+ years of professional experience in data engineering or software architecture, with a focus on building production-grade data pipelines
Expert-level coding proficiency in Python with specific mastery of modern data engineering libraries (Pandas, PySpark, Dask, SQLAlchemy)
Advanced proficiency with SQL, workflow orchestration tools (Airflow, Dagster, or Prefect), and containerization (Docker/Kubernetes)
Deep experience with modern Data Lake and Lakehouse architectures (e.g., Microsoft Fabric, Databricks, Snowflake)
Solid understanding of data modeling, ETL processes, and schema design for complex datasets
Experience designing and deploying APIs for data access
Excellent communication skills to bridge the gap between IT infrastructure and scientific stakeholders
Tech Stack
Airflow
Azure
Cloud
Docker
ETL
Kubernetes
Pandas
PySpark
Python
SQL
Benefits
Health benefits including medical, dental, and vision
Company-matched 401(k)
Eligibility to participate in the Employee Stock Purchase Plan
Eligibility to earn commissions/bonuses based on company and individual performance