Altarum is building the future of data and AI infrastructure for public health and is looking for a Principal Data Engineer – ML Platforms to help lead the way. In this role, you will design, build, and operationalize modern data and ML platform capabilities that power analytics, evaluation, AI modeling, and interoperability across all Altarum divisions.
Responsibilities:
- Design and operate a modern, cloud-agnostic lakehouse architecture using object storage, SQL/ELT engines, and dbt
- Build CI/CD pipelines for data, dbt, and model delivery (GitHub Actions, GitLab, Azure DevOps)
- Implement MLOps systems: MLflow (or equivalent), feature stores, model registry, drift detection, automated testing
- Engineer solutions in AWS and AWS GovCloud today, with portability to Azure Government or GCP
- Use Infrastructure-as-Code (Terraform, CloudFormation, Bicep) to automate secure deployments
- Build scalable ingestion and normalization pipelines for healthcare and public health datasets, including:
  - FHIR R4 / US Core (strongly preferred)
  - HL7 v2 (strongly preferred)
  - Medicaid/Medicare claims & encounters (strongly preferred)
  - SDOH & geospatial data (preferred)
  - Survey, mixed-methods, and qualitative data
- Create reusable connectors, dbt packages, and data contracts for cross-division use
- Publish clean, conformed, metrics-ready tables for Analytics Engineering and BI teams
- Support Population Health in turning evaluation and statistical models into pipelines
- Define SLOs and alerting; instrument lineage & metadata; ensure ≥95% of data tests pass
- Perform performance and cost tuning (partitioning, storage tiers, autoscaling) with guardrails and dashboards
- Build production-grade pipelines for risk prediction, forecasting, cost/utilization models, and burden estimation
- Develop ML-ready feature engineering workflows and support time-series/outbreak detection models
- Integrate ML assets into standardized deployment workflows
- Build ingestion and vectorization pipelines for surveys, interviews, and unstructured text
- Support RAG systems for synthesis, evaluation, and public health guidance
- Enable Palladian Partners with secure, controlled-generation environments
- Translate R/Stata/SAS evaluation code into reusable pipelines
- Build templates for causal inference workflows (DID, AIPW, CEM, synthetic controls)
- Support operationalization of ARA’s applied research methods at scale
- Implement Model Context Protocol (MCP) and fairness/explainability tooling (SHAP, LIME)
- Ensure compliance with HIPAA, 42 CFR Part 2, IRB/DUA constraints, and NIST AI RMF standards
- Enforce privacy-by-design: tokenization, encryption, least-privilege IAM, and VPC isolation
- Develop runbooks, architecture diagrams, repo templates, and accelerator code
- Pair with data scientists, analysts, and SMEs to build organizational capability
- Provide technical guidance for proposals and client engagements
Requirements:
- 7–10+ years in data engineering, ML platform engineering, or cloud data architecture
- Expert in Python, SQL, dbt, and orchestration tools (Airflow, Glue, Step Functions)
- Deep experience with AWS + AWS GovCloud
- CI/CD and IaC experience (Terraform, CloudFormation)
- Familiarity with MLOps tools (MLflow, SageMaker, Azure ML, Vertex AI)
- Ability to operate in regulated environments (HIPAA, 42 CFR Part 2, IRB)
- Experience with FHIR, HL7, Medicaid/Medicare claims, and/or SDOH datasets
- Experience with Databricks, Snowflake, Redshift, or Synapse
- Experience with event streaming (Kafka, Kinesis, Event Hubs)
- Feature store experience
- Familiarity with observability tooling (Grafana, Prometheus, OpenTelemetry)
- Experience optimizing BI datasets for Power BI