Definitive Healthcare is a healthcare data analytics company dedicated to providing actionable intelligence to its clients. The Data Engineer will build scalable data pipelines and manage complex healthcare datasets, ensuring data quality and integration across a wide range of sources.
Responsibilities:
- Develop and maintain robust data pipelines using Python, Spark, Databricks, SQL, and SSIS
- Implement and orchestrate ETL/ELT workflows using Apache Airflow and SSIS
- Build reliable, repeatable processes that support the ingestion and transformation of large healthcare datasets
- Integrate data from diverse sources (AWS, on‑prem, third‑party vendors) into our enterprise data platform
- Work with a wide range of file formats including CSV, XML, Parquet, Delta, and more
- Apply strong data quality, cleansing, and curation practices to ensure accuracy and consistency
- Optimize storage and compute resources for performance, cost, and scalability
- Automate observability and monitoring across data pipelines and workloads
- Implement and manage Unity Catalog for metadata, lineage, and access control
- Ensure adherence to data governance, security, and privacy standards
- Maintain clear documentation, data dictionaries, and lineage tracking
- Contribute to automation of data observability and governance workflows
- Tune and optimize Spark jobs for speed, reliability, and cost efficiency
- Diagnose and resolve performance bottlenecks across distributed systems
- Apply JVM tuning and Spark optimization techniques to improve throughput
- Support and enhance our Medallion architecture (bronze/silver/gold) to improve data quality and usability
- Ensure data is processed, enriched, and validated at each stage of the lifecycle
- Partner with data scientists, analysts, product teams, and business stakeholders to understand data needs
- Implement CI/CD pipelines to streamline deployment and testing of data assets
- Stay current with emerging technologies and bring forward recommendations to evolve our data platform
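The Medallion flow described above (bronze → silver → gold) can be sketched in plain Python. This is a minimal illustration only: the records, column names, and validation rules are hypothetical, and a production pipeline would implement these stages in Spark/Databricks rather than stdlib code.

```python
from collections import Counter

# Hypothetical raw ("bronze") records as they might land from a vendor feed.
bronze = [
    {"npi": "1234567890", "state": " MA", "claim_amount": "120.50"},
    {"npi": "1234567890", "state": "MA ", "claim_amount": "80.00"},
    {"npi": "", "state": "NH", "claim_amount": "55.25"},          # missing NPI
    {"npi": "9876543210", "state": "NH", "claim_amount": "bad"},  # unparsable amount
]

def to_silver(records):
    """Cleanse and validate bronze records: trim whitespace, enforce a
    non-empty NPI, and coerce claim_amount to float. Invalid rows are
    quarantined rather than silently dropped."""
    silver, rejects = [], []
    for row in records:
        cleaned = {k: v.strip() for k, v in row.items()}
        try:
            cleaned["claim_amount"] = float(cleaned["claim_amount"])
        except ValueError:
            rejects.append(row)
            continue
        if not cleaned["npi"]:
            rejects.append(row)
            continue
        silver.append(cleaned)
    return silver, rejects

def to_gold(silver):
    """Aggregate silver rows into a gold-layer summary: total claim
    amount per state."""
    totals = Counter()
    for row in silver:
        totals[row["state"]] += row["claim_amount"]
    return dict(totals)

silver, rejects = to_silver(bronze)
gold = to_gold(silver)
```

The key design point this sketch captures is that each layer adds guarantees: bronze is raw and append-only, silver is cleansed and typed, and gold is aggregated for consumption, with rejected rows surfaced for data-quality monitoring rather than discarded.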
Requirements:
- Strong programming experience in SQL, plus Python or Scala
- Hands‑on experience with Apache Spark and Databricks
- Experience with Apache Airflow or similar orchestration tools
- Knowledge of data cleansing, curation, and quality frameworks
- Familiarity with Unity Catalog or other metadata management tools
- Understanding of data governance, security, and compliance best practices
- Experience working with AWS cloud services
- Proficiency with CI/CD tools (Jenkins, GitLab CI, etc.)
- Experience tuning Spark jobs and JVM‑based applications
- Experience implementing or working within a Medallion architecture
- Strong analytical and problem‑solving abilities
- Excellent communication and cross‑functional collaboration skills
- Ability to work independently and within a team environment
- High attention to detail and commitment to quality
- AWS certifications (e.g., AWS Certified Data Analytics – Specialty)
- Experience with SQL and NoSQL databases
- Background in a fast‑paced, data‑centric SaaS or healthcare environment
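As one concrete illustration of the Spark/JVM tuning this role calls for, a `spark-submit` invocation might adjust shuffle parallelism, executor sizing, and garbage collection. The values below are placeholders to show the shape of such tuning, not recommendations; appropriate settings depend on cluster size and workload.

```shell
spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.sql.adaptive.enabled=true \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  job.py
```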