Acumenz Consulting is seeking an AI/ML Data Engineer to be the sole data engineering owner within the AI Insight and Next Best Action engine team. The role covers the full data platform lifecycle: designing and operating data ingestion, transformation, and storage while ensuring compliance with HIPAA regulations.
Responsibilities:
- Own and operate the raw ingestion pipeline: stream Health Data, Local Temp/News, Benefits Data, and Click-Thru External Files (CSV/Parquet) via Kafka and Dataflow Streaming Ingest into Cloud Storage Raw Buckets and BigQuery ingestion datasets
- Build and maintain the transformation pipeline: execute ETL/ELT jobs using Dataflow and Dataproc (Apache Spark / DataSpark) to produce cleaned and normalized BigQuery datasets from raw sources
- Drive normalization to curated data marts: produce denormalized 360-degree data views in BigQuery across Rx, benefits, behavioural, and clinical signals for downstream Feature Store ingestion
- Own the Feature Store population pipeline: ingest Behaviour Signals, Insurance Coverage Signals, Clinical Signals, Engagement Signals, Rx Signals, and Contextual Features from curated BigQuery data marts into the GCP Feature Store
- Design and maintain ML training dataset pipelines for Next Best Action (NBA) and AI Insight models: offline batch paths, online serving feature paths, training/eval splits, and dataset versioning
- Integrate Adobe Analytics event data as a behavioural signal source, aligning it with clinical and benefits data for multi-source model training
- Operate and tune Dataproc Spark Jobs (DataSpark) for large-scale feature engineering and model training data preparation
- Monitor feature freshness, training data drift, and model data quality in partnership with Vertex AI pipelines
- Implement and monitor data quality checks, SLAs, and alerting across all pipeline stages, including schema validation and anomaly detection
- Manage schema evolution, partitioning strategies, and cost optimization for BigQuery tables
- Design and implement disaster recovery (DR) zones for the data platform: define RTO/RPO targets, configure cross-region BigQuery dataset replication, set up Cloud Storage DR buckets, replicate Feature Store snapshots, and document and test failover runbooks
- Implement data archival strategies across all tiers: design lifecycle policies for BigQuery table expiration and Cloud Storage object tiering (Nearline/Coldline/Archive), archive historical ML training snapshots, and ensure HIPAA-mandated retention windows are met for PHI-regulated datasets
- Enforce HIPAA-compliant PHI handling across all pipeline and ML stages: apply PHI data classification, implement field-level encryption and masking for Protected Health Information, apply de-identification techniques (Safe Harbor or Expert Determination) before model training, manage access controls and audit logging per the HIPAA Security Rule, and ensure PHI is never written to unencrypted storage or non-compliant destinations
- Collaborate with backend domain engineers to define Kafka event schemas that feed ML operations
- Document pipeline architecture, data contracts, Feature Store definitions, training data lineage, DR runbooks, and archival policies
Requirements:
- 7–9 years of hands-on data engineering or ML data engineering experience in a production GCP environment
- Strong proficiency in Python, Java, or Node.js for pipeline development, feature engineering scripts, and automation
- Strong hands-on experience with BigQuery (partitioning, clustering, cost management, complex SQL, ML-optimized table design)
- Proficiency with Apache Kafka for real-time streaming ingestion
- Experience with Dataflow (Apache Beam) for both streaming and batch pipelines
- Proficiency with Apache Spark (PySpark or Scala); DataSpark experience a strong plus
- Solid familiarity with the GCP ecosystem: Cloud Storage, Pub/Sub, Dataproc, Cloud Composer/Airflow
- Experience building ML training pipelines and Feature Stores (GCP Feature Store preferred); understanding of the ML lifecycle including feature engineering, data versioning, and train/eval splits
- Experience with Vertex AI Pipelines or similar MLOps tooling
- Demonstrated experience designing disaster recovery zones and failover strategies for cloud data platforms: cross-region replication, RTO/RPO definition, and DR testing
- Experience with data archival design: BigQuery table lifecycle management, Cloud Storage tiered storage policies, and long-term retention for regulated datasets