AbsenceSoft is transforming the employee experience with technology designed for HR professionals. They are seeking an AI Data Engineer to design and manage the data pipelines and infrastructure that support intelligent AI applications, collaborating closely with data scientists and product teams to ensure a high-performance data architecture.
Responsibilities:
- Design, build, and maintain data pipelines for structured, unstructured, and semi-structured data sources
- Develop and optimize data models, ETL processes, and batch/streaming data infrastructure
- Partner with data scientists to support training, evaluation, and deployment of ML and LLM models
- Implement scalable architectures for embeddings, vector databases, and retrieval pipelines
- Enable real-time and offline analytics workflows using best-in-class data engineering practices
- Ensure data quality, lineage, observability, and governance across all data products
- Deploy secure, cloud-native data infrastructure (AWS, Azure, GCP) for high-volume AI workloads
- Contribute to the design of feature stores and MLOps platforms for continuous learning and model updates
- Collaborate on Responsible AI workflows to ensure compliant data usage and access controls
- Continuously evaluate new tools and technologies for improving performance, reliability, and agility
Requirements:
- 5+ years of experience as a Data Engineer building large-scale, production-grade data pipelines
- Strong command of SQL, Python, and distributed data processing frameworks (Spark, Flink, Beam)
- Hands-on experience with ETL/ELT tools and orchestration systems (Airflow, dbt, Prefect, Dagster)
- Familiarity with cloud-native data platforms (Snowflake, BigQuery, Redshift, Databricks)
- Experience supporting ML/AI workloads and collaborating with model development teams
- Knowledge of vector databases (FAISS, Pinecone, Weaviate) and embeddings management
- Understanding of data privacy, access control, and compliance in regulated environments
- Proficiency in modern DevOps tooling for data infrastructure (Docker, Terraform, CI/CD)
- Ability to work autonomously and thrive in a fast-paced, collaborative environment
Tech Stack:
- Cloud: AWS (Redshift, S3, Lambda), Azure (Data Lake, Synapse), GCP (BigQuery, Cloud Functions)
- Streaming: Kafka, Kinesis, Pub/Sub, Spark Streaming, Apache Flink
- Workflow Tools: dbt, Airflow, Dagster, Prefect
- Storage & Processing: Snowflake, Databricks, Parquet, Delta Lake
- Vector Search: FAISS, Pinecone, Weaviate, txtai