H1 is dedicated to improving healthcare access through data-driven insights and is seeking a Senior Data Engineer to enhance its data platform. The role involves designing and scaling data systems, developing extraction frameworks, and collaborating with cross-functional teams to deliver high-quality data solutions.
Responsibilities:
- Develop strategies and frameworks to capture web data at scale
- Design, develop, and maintain scalable data extraction frameworks that ingest structured and unstructured data from diverse sources
- Build and optimize robust ETL/ELT pipelines using big data technologies, especially Apache Spark on cloud platforms (preferably AWS EMR)
- Improve the efficiency, reliability, and performance of data processing systems through thoughtful design and continuous optimization
- Transform, clean, and normalize complex datasets for downstream use, ensuring high standards of data quality and consistency
- Partner with senior engineers to evolve H1’s data architecture and infrastructure in support of product and platform scalability
- Lead data integration efforts across multiple systems, ensuring accuracy and seamless collaboration across teams
- Monitor and troubleshoot data flows and pipelines, proactively identifying and resolving performance issues
- Maintain clear documentation of systems, workflows, and processes to promote transparency and operational excellence
- Participate in code reviews and promote a culture of engineering excellence, mentorship, and continuous improvement
- Collaborate closely with cross-functional teams to align technical execution with business goals
Requirements:
- 5+ years of professional experience in data engineering or software engineering, working with large-scale data systems and pipelines
- Strong proficiency in Python
- Proficiency in web scraping strategies and technologies: curl, network analysis, proxies, and Selenium/Playwright
- Strong SQL skills and experience with PostgreSQL
- Experience with big data tools like Apache Spark, particularly on cloud platforms, with a preference for AWS EMR
- Experience with Docker or other containerization technologies
- Familiarity with model training and fine-tuning, particularly in natural language processing (NLP) contexts
- Basic knowledge of networking, security, and encryption protocols such as HTTP, HTTPS, and TLS