Leidos, a company focused on digital modernization, is seeking a highly skilled Unstructured Data Engineer to lead the design and implementation of unstructured data pipelines. The role involves transforming raw unstructured content into AI-ready data products and optimizing enterprise AI applications.
Responsibilities:
- Design, build, and manage end-to-end RAG pipelines for enterprise AI applications
- Lead preprocessing of unstructured data, including discovery, classification, cleansing, redaction, and metadata enrichment
- Develop and optimize document chunking, embedding, and vectorization strategies for structured and unstructured datasets
- Coordinate ingestion of curated datasets into vector databases and AI platforms
- Package curated unstructured datasets as governed, reusable data products for enterprise consumption
- Define and implement metadata tagging strategies to align with Collibra governance standards
- Partner with Data Governance and Data Quality teams to ensure AI-ready data meets enterprise standards for lineage, classification, and compliance
- Evaluate and optimize embedding models, retrieval strategies, and indexing performance
- Monitor and tune RAG pipeline performance, including latency, retrieval accuracy, and cost efficiency
- Implement automation for document ingestion, transformation, and publishing workflows
- Support integration with enterprise AI platforms (e.g., ChatGPT Enterprise, AskSage, Moveworks)
- Conduct cost analysis and capacity planning for vector storage and processing workloads
- Provide technical guidance on AI data readiness and unstructured data lifecycle management
- Design, implement, and optimize enterprise-grade RAG and prompt engineering frameworks, including context engineering strategies (chunking, metadata enrichment, semantic filtering, dynamic context management) to improve retrieval accuracy, grounding, and response quality
- Develop and maintain scalable multi-modal data pipelines that ingest, preprocess, embed, and integrate text, documents, images, audio, and structured data into governed vectorized data products consumable by enterprise AI platforms
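To make the chunking, embedding, and retrieval duties above concrete, here is a minimal, dependency-free sketch of a RAG-style ingest-and-retrieve loop. It is illustrative only: the function names (`chunk_text`, `embed`, `retrieve`) are invented for this example, and a term-frequency `Counter` stands in for a real dense embedding model and vector database.

```python
import math
from collections import Counter

def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word-window chunks (a common RAG strategy)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # slide the window, keeping some overlap
    return chunks

def embed(text):
    """Toy 'embedding': a bag-of-words term-frequency vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=2):
    """Rank indexed chunks by similarity to the query (the 'R' in RAG)."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

# Ingest: chunk the corpus and store (chunk, vector) pairs -- a toy vector index.
corpus = ("Vector databases store embeddings for similarity search. "
          "Chunking splits long documents into retrievable passages.")
index = [(c, embed(c)) for c in chunk_text(corpus, max_words=8, overlap=2)]
print(retrieve("how are documents split into passages", index, k=1))
```

In production the toy pieces would be swapped for a real embedding model and a vector store such as those named in the requirements (e.g., FAISS or OpenSearch), but the ingest → chunk → embed → index → retrieve shape is the same.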
Requirements:
- Bachelor's degree in Computer Science, Data Engineering, AI/ML, or a related field, and 8+ years of relevant experience
- Hands-on experience designing and implementing RAG architectures in production environments
- Experience working with unstructured data (PDFs, documents, email, transcripts, images with OCR, etc.)
- Strong proficiency in Python and experience with NLP/LLM frameworks (e.g., LangChain, LlamaIndex, Hugging Face, OpenAI APIs)
- Experience with vector databases (e.g., Pinecone, Weaviate, FAISS, OpenSearch, Azure AI Search)
- Experience implementing document chunking, embedding generation, and similarity search
- Understanding of metadata modeling and governance principles
- Experience building scalable data pipelines in cloud environments (AWS, Azure, or GCP)
- Hands-on experience with prompt engineering, evaluation metrics, and context window optimization
- Strong understanding of multi-modal data processing and pipeline engineering
- Strong knowledge of API integration and microservices architecture
- US Citizenship is required
- Experience with Ohalo Data xRay or similar unstructured data discovery and redaction platforms
- Experience aligning RAG pipelines with enterprise Data Governance frameworks (e.g., Collibra)
- Familiarity with data classification, CUI/PII handling, and redaction controls
- Experience packaging datasets as governed data products with defined SLAs and stewardship
- Experience integrating AI-ready datasets into enterprise tools such as ChatGPT Enterprise, AskSage, or similar AI copilots
- Understanding of model evaluation metrics for retrieval quality (precision, recall, MRR, hallucination reduction)
- Experience working in regulated or government environments
- Familiarity with MLOps practices and AI lifecycle management
- Experience optimizing infrastructure costs for embedding and vector storage workloads
- Awareness of AI/ML lifecycle management practices, including model evaluation, monitoring, versioning, governance, and responsible AI considerations in production environments
- Familiarity with Model Context Protocol (MCP) concepts and agentic architectures, including tool orchestration, memory management, and multi-step reasoning workflows
- Exposure to Knowledge Graph and graph database technologies (e.g., Neo4j, RDF/SPARQL, property graphs) and their application in semantic search, entity resolution, and AI context enhancement
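The retrieval-quality metrics named in the requirements (precision, recall, MRR) are straightforward to compute. As a hedged sketch, here is mean reciprocal rank over a small batch of queries; the doc ids and relevance sets are made up for illustration.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: average of 1/rank of the first relevant item per query (0 if none found)."""
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Three queries; each row is the retriever's ranked doc ids, paired with its relevant set.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d8", "d5", "d6"]]
relevant = [{"d1"}, {"d2"}, {"d6"}]
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/1 + 1/3) / 3
```

Tracking MRR alongside precision and recall during pipeline tuning shows whether changes to chunking or embedding strategy actually move relevant passages toward the top of the ranking.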