Leidos, a company focused on digital modernization, is seeking a highly skilled Unstructured Data Engineer to lead the design and implementation of unstructured data pipelines. The role involves transforming raw unstructured content into AI-ready data products and optimizing enterprise AI applications.
Responsibilities:
- Design, build, and manage end-to-end RAG pipelines for enterprise AI applications
- Lead preprocessing of unstructured data, including discovery, classification, cleansing, redaction, and metadata enrichment
- Develop and optimize document chunking, embedding, and vectorization strategies for structured and unstructured datasets
- Coordinate ingestion of curated datasets into vector databases and AI platforms
- Package curated unstructured datasets as governed, reusable data products for enterprise consumption
- Define and implement metadata tagging strategies to align with Collibra governance standards
- Partner with Data Governance and Data Quality teams to ensure AI-ready data meets enterprise standards for lineage, classification, and compliance
- Evaluate and optimize embedding models, retrieval strategies, and indexing performance
- Monitor and tune RAG pipeline performance, including latency, retrieval accuracy, and cost efficiency
- Implement automation for document ingestion, transformation, and publishing workflows
- Support integration with enterprise AI platforms (e.g., ChatGPT Enterprise, AskSage, Moveworks)
- Conduct cost analysis and capacity planning for vector storage and processing workloads
- Provide technical guidance on AI data readiness and unstructured data lifecycle management
- Design, implement, and optimize enterprise-grade RAG and prompt engineering frameworks, including context engineering strategies (chunking, metadata enrichment, semantic filtering, dynamic context management) to improve retrieval accuracy, grounding, and response quality
- Develop and maintain scalable multi-modal data pipelines that ingest, preprocess, embed, and integrate text, documents, images, audio, and structured data into governed vectorized data products consumable by enterprise AI platforms
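To make the chunking, embedding, and retrieval duties above concrete, here is a minimal, dependency-free sketch of a RAG-style ingest-and-retrieve loop. It is illustrative only: the function names (`chunk_text`, `embed`, `retrieve`) are invented for this example, and a term-frequency `Counter` stands in for a real dense embedding model and vector database.

```python
import math
from collections import Counter

def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word-window chunks (a common RAG strategy)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # slide the window, keeping some overlap
    return chunks

def embed(text):
    """Toy 'embedding': a bag-of-words term-frequency vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=2):
    """Rank indexed chunks by similarity to the query (the 'R' in RAG)."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

# Ingest: chunk the corpus and store (chunk, vector) pairs -- a toy vector index.
corpus = ("Vector databases store embeddings for similarity search. "
          "Chunking splits long documents into retrievable passages.")
index = [(c, embed(c)) for c in chunk_text(corpus, max_words=8, overlap=2)]
print(retrieve("how are documents split into passages", index, k=1))
```

In production the toy pieces would be swapped for a real embedding model and a vector store such as those named in the requirements (e.g., FAISS or OpenSearch), but the ingest → chunk → embed → index → retrieve shape is the same.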
Requirements:
- Bachelor's degree in Computer Science, Data Engineering, AI/ML, or a related field, and 8+ years of relevant experience
- Hands-on experience designing and implementing RAG architectures in production environments
- Experience working with unstructured data (PDFs, documents, email, transcripts, images with OCR, etc.)
- Strong proficiency in Python and experience with NLP/LLM frameworks (e.g., LangChain, LlamaIndex, Hugging Face, OpenAI APIs)
- Experience with vector databases (e.g., Pinecone, Weaviate, FAISS, OpenSearch, Azure AI Search)
- Experience implementing document chunking, embedding generation, and similarity search
- Understanding of metadata modeling and governance principles
- Experience building scalable data pipelines in cloud environments (AWS, Azure, or GCP)
- Hands-on experience with prompt engineering, evaluation metrics, and context window optimization
- Strong understanding of multi-modal data processing and pipeline engineering
- Strong knowledge of API integration and microservices architecture
- US Citizenship is required
- Experience with Ohalo Data xRay or similar unstructured data discovery and redaction platforms
- Experience aligning RAG pipelines with enterprise Data Governance frameworks (e.g., Collibra)
- Familiarity with data classification, CUI/PII handling, and redaction controls
- Experience packaging datasets as governed data products with defined SLAs and stewardship
- Experience integrating AI-ready datasets into enterprise tools such as ChatGPT Enterprise, AskSage, or similar AI copilots
- Understanding of model evaluation metrics for retrieval quality (precision, recall, MRR, hallucination reduction)
- Experience working in regulated or government environments
- Familiarity with MLOps practices and AI lifecycle management
- Experience optimizing infrastructure costs for embedding and vector storage workloads
- Awareness of AI/ML lifecycle management practices, including model evaluation, monitoring, versioning, governance, and responsible AI considerations in production environments
- Familiarity with Model Context Protocol (MCP) concepts and agentic architectures, including tool orchestration, memory management, and multi-step reasoning workflows
- Exposure to Knowledge Graph and graph database technologies (e.g., Neo4j, RDF/SPARQL, property graphs) and their application in semantic search, entity resolution, and AI context enhancement
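The retrieval-quality metrics named in the requirements (precision, recall, MRR) are straightforward to compute. As a hedged sketch, here is mean reciprocal rank over a small batch of queries; the doc ids and relevance sets are made up for illustration.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: average of 1/rank of the first relevant item per query (0 if none found)."""
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Three queries; each row is the retriever's ranked doc ids, paired with its relevant set.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d8", "d5", "d6"]]
relevant = [{"d1"}, {"d2"}, {"d6"}]
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/1 + 1/3) / 3
```

Tracking MRR alongside precision and recall during pipeline tuning shows whether changes to chunking or embedding strategy actually move relevant passages toward the top of the ranking.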