DEPLOY is delivering a phased data strategy and AI-enablement program for a client with extensive datasets in Databricks. The Data Engineer will profile those datasets, document their schemas, assess data quality, and build the data pipelines that support the AI-driven data platform.
Responsibilities:
- Profile all 30+ datasets in Databricks: table structures, row counts, data types, distributions, refresh patterns (a profiling sketch follows this list)
- Document schemas with inferred relationships and primary/foreign key candidates
- Assess data quality across dimensions: completeness, consistency, accuracy, freshness
- Analyze historical data behavior to determine which datasets follow snapshot vs. overwrite load patterns
- Support API and integration mapping, including testing data extraction capabilities
- Build standardized ingestion framework and data pipelines (Phase 2)
- Implement data quality gates with automated validation and alerting (Phase 2)
- Support workflow integration, feature engineering pipelines, and ML data products (Phases 3-4)
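To give candidates a concrete feel for the Phase 1 profiling work, here is a minimal sketch of one pass over a dataset in a Databricks notebook. It is an illustration only: it assumes the built-in `spark` session and Delta tables, and the schema name `main.client_raw` is a hypothetical placeholder.

```python
from pyspark.sql import functions as F

# `spark` is the session Databricks provides in every notebook.
# "main.client_raw" is a hypothetical placeholder for the client schema.
SCHEMA = "main.client_raw"

for row in spark.sql(f"SHOW TABLES IN {SCHEMA}").collect():
    table = f"{SCHEMA}.{row.tableName}"
    df = spark.table(table)

    # Structure and volume: row count plus column names and types
    print(table, df.count(), dict(df.dtypes))

    # Completeness: per-column null ratio as a first data-quality signal
    df.select([
        (F.count(F.when(F.col(c).isNull(), c)) / F.count(F.lit(1))).alias(c)
        for c in df.columns
    ]).show(truncate=False)

    # Refresh pattern: Delta history shows whether loads append,
    # overwrite, or merge (the snapshot vs. overwrite question above)
    spark.sql(f"DESCRIBE HISTORY {table}") \
        .select("timestamp", "operation").show(5, truncate=False)
```

In practice the results would be written to a profiling table rather than printed, but the shape of the work is the same.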
Requirements:
- Strong SQL and Python skills
- Experience with Databricks (notebooks, Spark SQL, Delta Lake)
- Hands-on experience with data profiling, data quality assessment, and technical documentation
- ETL/ELT pipeline development experience
- Comfort working in locked-down enterprise environments with restricted internet access
- Comfort with messy data: you'll be making sense of datasets that have limited or no documentation
- Eager to learn AI tooling
- Financial services, lending, or banking data experience
- Experience with Medallion Architecture (bronze/silver/gold patterns; a bronze-to-silver sketch follows this list)
- Familiarity with Power BI as a downstream consumer
- Experience working within VDI-based access environments
- Experience with modern AI tool sets
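For context on the Medallion Architecture item and the Phase 2 quality gates, here is a hedged sketch of a bronze-to-silver promotion guarded by simple validation rules. Every table name, column, and rule below is hypothetical; a real gate would also record metrics and trigger alerting.

```python
from pyspark.sql import functions as F

# `spark` is the Databricks notebook session; all names below are hypothetical.
bronze = spark.table("main.bronze.loans")  # raw lending data as ingested

# Gate 1: halt promotion if the primary-key candidate is not unique
if bronze.count() != bronze.select("loan_id").distinct().count():
    raise ValueError("loan_id is not unique; halting bronze -> silver promotion")

# Gate 2: quarantine rows failing basic validity rules instead of dropping them
valid = bronze.filter(F.col("principal") > 0)
quarantine = bronze.filter((F.col("principal") <= 0) | F.col("principal").isNull())

valid.write.format("delta").mode("append").saveAsTable("main.silver.loans")
quarantine.write.format("delta").mode("append").saveAsTable("main.silver.loans_quarantine")
```

Quarantining failed rows rather than silently dropping them keeps failures visible, which is the point of a quality gate.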