FEI Systems is dedicated to creating innovative technology solutions that enhance the delivery of health and human services. The company is seeking a Data Engineer to support its Machine Learning and AI initiatives, with a focus on maintaining high-quality data in its cloud-based platform for model training and deployment.
Responsibilities:
- Design, build, and maintain scalable data pipelines supporting ML/AI workloads
- Engineer pipeline patterns including full loads, incremental loads, change-based loads, and slowly changing dimensions
- Ensure pipelines are reliable, performant, secure, and maintainable; troubleshoot and monitor pipelines within an AWS ecosystem
- Perform data transformations in Snowflake using SQL and native Snowflake features
- Design and optimize schemas, tables, views, and materialized views for ML/AI consumption
- Support AWS-native data lake patterns using S3, Glue, Athena, Apache Iceberg, and S3 Tables
- Perform data cleansing, normalization, and enrichment to support ML model development
- Design and implement feature engineering pipelines including aggregation and transformation
- Ensure consistency, reuse, and versioning of features across models and use cases
- Support feature store patterns to enable feature discoverability and reuse
- Collaborate with ML engineers and data scientists to operationalize features into training pipelines
- Support model training workflows, including dataset preparation and scheduled refreshes
- Ensure training datasets and features are reproducible, traceable, and auditable
- Integrate data pipelines into CI/CD workflows; support version control, testing, and deployment of data assets
- Monitor pipeline health, data freshness, and downstream impact on ML/AI systems
Requirements:
- 5+ years of hands-on data engineering experience in a cloud environment
- Strong proficiency in Python for data processing and pipeline development
- Advanced skills in SQL with hands-on Snowflake transformation experience
- Experience with ELT pipeline design, schema optimization, performance tuning, and cost management in Snowflake
- Experience with querying, data modeling, and analytics in PostgreSQL; familiarity with SQL Server to PostgreSQL migration is a plus
- Familiarity with AWS services including S3, Glue, Athena, and managed relational databases (e.g., Aurora, RDS)
- Familiarity with Apache Iceberg / S3 Tables and open table format ecosystems
- Experience with streaming ingestion tools (e.g., Kinesis, Kafka, or equivalent)
- Experience with workflow orchestration tools (e.g., Airflow, Step Functions, or equivalent)
- Experience with full loads, incremental loads, append-only pipelines, change-based processing, and slowly changing dimensions (SCDs)
- Experience with data validation, reconciliation, error handling, and restart/recovery patterns
- Experience with data modeling for analytics, ML/AI, and downstream application use cases
- Ability to evaluate pipeline design trade-offs across performance, cost, reliability, and maintainability
- Structured SDLC experience with CI/CD pipelines for data and ML workflows
- Experience with API-based and event-driven data integration patterns
- Experience in distributed data processing environments
- Understanding of data requirements for ML/AI workloads
- Experience preparing training datasets and features from enterprise data lakes
- Familiarity with reproducibility, dataset versioning, and data lineage concepts
- Familiarity with GenAI concepts relevant to data engineering, such as embedding pipelines, vector databases, retrieval-augmented generation (RAG) data flows, or prompt-driven data processing
- Awareness of data security and privacy considerations when working with LLMs
- Bachelor's degree in Computer Science, Data Engineering, Information Systems, or a related technical field; equivalent professional experience will be considered