Acumenz Consulting is seeking an AI/ML Data Engineer to be the sole data engineering owner within the AI Insight and Next Best Action (NBA) engine team. The role covers designing and operating the full data platform lifecycle, including data ingestion, transformation, and storage, while ensuring compliance with HIPAA regulations.

Responsibilities:

Own and operate the raw ingestion pipeline: stream Health Data, Local Temp/News, Benefits Data, and Click-Thru External Files (CSV/Parquet) via Kafka and Dataflow Streaming Ingest into Cloud Storage Raw Buckets and BigQuery ingestion datasets
Build and maintain the transformation pipeline: execute ETL/ELT jobs using Dataflow and Dataproc (Apache Spark / DataSpark) to produce cleaned and normalized BigQuery datasets from raw sources
Drive normalization to curated data marts: produce denormalized 360-degree data views in BigQuery across Rx, Benefits, behavioural, and clinical signals for downstream Feature Store ingestion
Own the Feature Store population pipeline: ingest Behaviour Signals, Insurance Coverage Signals, Clinical Signals, Engagement Signals, Rx Signals, and Contextual Features from curated BigQuery data marts into the GCP Feature Store
Design and maintain ML training dataset pipelines for NBA and AI Insight models: offline batch paths, online serving feature paths, training/eval splits, and dataset versioning
Integrate Adobe Analytics event data as a behavioural signal source, aligning it with clinical and benefits data for multi-source model training
Operate and tune Dataproc Spark Jobs (DataSpark) for large-scale feature engineering and model training data preparation
Monitor feature freshness, training data drift, and model data quality in partnership with Vertex AI pipelines
Implement and monitor data quality checks, SLAs, and alerting across all pipeline stages, including schema validation and anomaly detection
Manage schema evolution, partitioning strategies, and cost optimization for BigQuery tables
Design and implement disaster recovery (DR) zones for the data platform: define RTO/RPO targets, configure cross-region BigQuery dataset replication, set up Cloud Storage DR buckets, replicate Feature Store snapshots, and document and test failover runbooks
Implement data archival strategies across all tiers: design lifecycle policies for BigQuery table expiration, Cloud Storage object tiering (Nearline/Coldline/Archive), archive historical ML training snapshots, and ensure HIPAA-mandated retention windows are met for PHI-regulated datasets
Enforce HIPAA-compliant PHI handling across all pipeline and ML stages: apply PHI data classification, implement field-level encryption and masking for Protected Health Information, apply de-identification techniques (Safe Harbor or Expert Determination) before model training, manage access controls and audit logging per the HIPAA Security Rule, and ensure PHI is never written to unencrypted storage or non-compliant destinations
Collaborate with backend domain engineers to define Kafka event schemas that feed ML operations
Document pipeline architecture, data contracts, Feature Store definitions, training data lineage, DR runbooks, and archival policies

Requirements:

7–9 years of hands-on data engineering or ML data engineering experience in a production GCP environment
Strong proficiency in Python, Java, or Node.js for pipeline development, feature engineering scripts, and automation
Strong hands-on experience with BigQuery (partitioning, clustering, cost management, complex SQL, ML-optimized table design)
Proficiency with Apache Kafka for real-time streaming ingestion
Experience with Dataflow (Apache Beam) for both streaming and batch pipelines
Proficiency with Apache Spark (PySpark or Scala); DataSpark experience a strong plus
Solid familiarity with the GCP ecosystem: Cloud Storage, Pub/Sub, Dataproc, Cloud Composer/Airflow
Experience building ML training pipelines and Feature Stores (GCP Feature Store preferred); understanding of the ML lifecycle including feature engineering, data versioning, and train/eval splits
Experience with Vertex AI Pipelines or similar MLOps tooling
Demonstrated experience designing disaster recovery zones and failover strategies for cloud data platforms: cross-region replication, RTO/RPO definition, and DR testing
Experience with data archival design: BigQuery table lifecycle management, Cloud Storage tiered storage policies, and long-term retention for regulated datasets
