Design, develop, optimize, and maintain scalable data pipelines and transformations using Databricks, Apache Spark, and SQL.
Implement data ingestion, transformation, and orchestration workflows to support batch and, where applicable, real-time processing.
Perform data quality assurance activities, including identifying and resolving inconsistencies in data flows, values outside legitimate ranges, and illogical data responses. Develop data quality reports and investigate and resolve data anomalies and errors using a combination of software packages, including SAS, Excel, and other software as warranted.
Use technical expertise, initiative, creativity, critical thinking, and strong communication and interpersonal skills daily to solve data quality problems in support of technical development efforts.
Implement data quality controls to ensure accuracy, completeness, and reliability of datasets.
Document data pipelines, transformations, business rules, and data dependencies using appropriate technical documentation methods (e.g., data flow diagrams, data dictionaries).
Serve as liaison and coordinate with a multi-disciplinary team.
Collaborate with the program team to identify opportunities for process improvement, make strategic adjustments, and exploit opportunities that maximize programmatic impact.
Communicate data issues, risks, and remediation approaches clearly to technical and non-technical team members.
Requirements
Must be able to obtain a Public Trust clearance
Bachelor’s degree in Computer Science, Data Engineering, Information Systems, or a related technical field (or equivalent experience)
Demonstrated experience as a Data Engineer in a production environment
Strong hands-on experience with Databricks, including Spark-based data processing
Proficiency in SQL and at least one programming language such as Python
Excellent communication skills, including listening, writing, and interacting comfortably with scientists, epidemiologists, informaticians, and developers
Experience supporting analytics, reporting, or machine learning workloads