Guidehouse is seeking a Data Infrastructure Engineer to build and operate the data platform that powers AI/ML analytics modules. The role involves designing and implementing scalable data ingestion pipelines, robust ETL/ELT processes, and a modern data lake on AWS, while ensuring data governance and quality.
Responsibilities:
- Build & Operate Data Pipelines (Batch + Streaming)
- Design and implement batch and streaming ingestion from APIs, relational databases, file drops, event streams, and external partners. Build and optimize ETL/ELT pipelines to produce curated, analytics-ready datasets for reporting and ML consumption. Implement incremental processing patterns, change data capture (CDC) approaches where appropriate, and data contract standards (see the pipeline sketch after this list).
- Deliver a Modern Lakehouse (Data Lake / Delta Lake)
- Build and manage a scalable lakehouse on AWS object storage (e.g., S3) using open table/file formats and delta/lakehouse concepts (e.g., ACID tables, schema evolution, time travel patterns). Optimize performance and cost through partitioning, compaction, lifecycle policies, and efficient compute/storage usage. Establish environment standards for dev/test/prod and consistent promotion across stages.
- Metadata, Governance, Lineage & Quality (Trust Layer)
- Implement a managed metadata repository for dataset cataloging, ownership, glossary/definitions, tagging, and discoverability. Enable end-to-end lineage (source → transformations → consumption) to support auditability and impact analysis. Implement governance controls including policy-based access, data classification, retention, and secure data handling. Build operational data quality checks (freshness, completeness, validity, anomaly detection) and publish SLAs/SLOs (see the data quality sketch after this list).
- AWS Automation + CI/CD for Data Pipelines
- Implement automated cloud provisioning in AWS using Infrastructure as Code (IaC) for consistent environments and secure-by-default baselines (see the IaC sketch after this list). Build and enhance CI/CD for data pipelines, including automated tests, validation gates, promotion workflows, and rollback strategies. Improve observability with metrics/logs/alerts, dashboards, runbooks, and incident response readiness.
- Cross-Team Collaboration & Documentation
- Work closely with engineering, security, networking, and application teams to support mission needs and delivery timelines. Maintain high-quality engineering documentation including SOPs, system diagrams, and secure configuration baselines. Summarize and present findings and recommendations, both in writing and verbally, to technical and non-technical stakeholders.
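For illustration only, a minimal sketch of the incremental/CDC upsert and compaction patterns described above, assuming PySpark with Delta Lake on S3; the bucket paths, table, and columns are hypothetical:

```python
# Sketch: CDC-style upsert of changed records into a curated Delta table on S3.
# Assumes the delta-spark package; paths and columns are illustrative only.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cdc-upsert-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Changed rows landed by the ingestion job (hypothetical landing path).
changes = spark.read.parquet("s3://example-bucket/landing/orders/2024-06-01/")

target = DeltaTable.forPath(spark, "s3://example-bucket/curated/orders")

# MERGE gives ACID upsert semantics: update matched keys, insert new rows.
(
    target.alias("t")
    .merge(changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Periodic small-file compaction, scoped to recent partitions (OPTIMIZE is
# available in recent Delta Lake releases and on Databricks).
spark.sql(
    "OPTIMIZE delta.`s3://example-bucket/curated/orders` "
    "WHERE order_date >= '2024-06-01'"
)
```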
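In the same spirit, a minimal sketch of operational data quality checks (freshness and completeness) over a curated table; thresholds and column names are illustrative:

```python
# Sketch: freshness and completeness checks against a curated Delta table.
# Table path, columns, and thresholds are illustrative only.
import datetime as dt

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("s3://example-bucket/curated/orders")

# Freshness: the newest record must be no older than 24 hours.
latest = df.agg(F.max("updated_at").alias("latest")).collect()[0]["latest"]
freshness_ok = latest is not None and (
    dt.datetime.utcnow() - latest <= dt.timedelta(hours=24)
)

# Completeness: a key business column must be at least 99% non-null.
total = df.count()
non_null = df.filter(F.col("customer_id").isNotNull()).count()
completeness_ok = total > 0 and (non_null / total) >= 0.99

if not (freshness_ok and completeness_ok):
    # In a real pipeline this would publish metrics and trigger an alert.
    raise ValueError(
        f"DQ check failed: freshness_ok={freshness_ok}, "
        f"completeness_ok={completeness_ok}"
    )
```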
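And a minimal sketch of secure-by-default provisioning with IaC, assuming the AWS CDK for Python (aws-cdk-lib); the stack and bucket names are placeholders:

```python
# Sketch: a curated-zone S3 bucket with encryption, TLS-only access, and
# public access blocked. Assumes aws-cdk-lib v2; names are placeholders.
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct


class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        s3.Bucket(
            self,
            "CuratedBucket",
            encryption=s3.BucketEncryption.S3_MANAGED,           # encrypted at rest
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,  # no public access
            enforce_ssl=True,                                    # TLS-only policy
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,
        )


app = App()
# One stack per environment; dev/test/prod would differ only in configuration.
DataLakeStack(app, "DataLakeStack-dev")
app.synth()
```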
Requirements:
- Must be able to OBTAIN and MAINTAIN a Federal or DoD 'PUBLIC TRUST'; candidates must obtain approved adjudication of their PUBLIC TRUST prior to onboarding with Guidehouse. Candidates with an ACTIVE PUBLIC TRUST or SUITABILITY are preferred.
- Bachelor's degree in Engineering, IT, Computer Science, or related field (or equivalent experience)
- Minimum of FOUR (4) years of experience building production data pipelines and/or data platforms
- Strong experience implementing data ingestion and ETL/ELT workflows, including data modeling and transformation best practices
- Hands-on experience building a data lake / delta lake (lakehouse) on AWS (or equivalent cloud) using object storage and modern table formats/patterns
- Proficiency in SQL and one programming language commonly used for data engineering (Python preferred; Scala/Java acceptable)
- Experience with metadata management and governance: cataloging, lineage, ownership, access controls, classification and policy enforcement
- Experience implementing automated AWS provisioning using IaC and operating across multiple environments
- Experience building or operating CI/CD pipelines for data workflows (testing, packaging, deployment automation, environment promotion)
- Solid security fundamentals: IAM/least privilege, encryption, secrets management, secure SDLC practices
- Hands-on experience with Databricks
- Hands-on experience utilizing modern DevOps practices, including tools like Git, Terraform, Jenkins, AWS CodePipeline, and Docker
- Experience utilizing AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Cursor, Kiro) to safely accelerate implementation while maintaining strict code quality through testing, code reviews, and security practices
- Knowledge graph and Graph RAG experience (see the retrieval sketch below), including:
  - Graph modeling and ontology/taxonomy alignment
  - Entity resolution and relationship extraction
  - Hybrid retrieval approaches combining graph traversal with semantic/vector search to improve grounding and explainability
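For illustration only, a minimal sketch of hybrid retrieval that combines vector similarity with graph traversal, using networkx and NumPy over toy entities and embeddings; a production Graph RAG system would use a real graph store and an embedding model:

```python
# Sketch: semantic (vector) search finds entry entities, then graph traversal
# pulls their neighborhood as explainable grounding context. Toy data only.
import networkx as nx
import numpy as np

# Toy knowledge graph: entities as nodes, typed relationships as edges.
kg = nx.DiGraph()
kg.add_edge("Contract-123", "Agency-A", relation="awarded_by")
kg.add_edge("Contract-123", "Dataset-X", relation="produces")
kg.add_edge("Dataset-X", "Pipeline-7", relation="ingested_by")

# Toy embeddings; in practice these come from an embedding model.
embeddings = {
    "Contract-123": np.array([0.9, 0.1, 0.0]),
    "Agency-A":     np.array([0.2, 0.8, 0.1]),
    "Dataset-X":    np.array([0.1, 0.2, 0.9]),
    "Pipeline-7":   np.array([0.0, 0.3, 0.8]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_retrieve(query_vec: np.ndarray, top_k: int = 1, hops: int = 2):
    # 1) Vector search: rank entities by similarity to the query embedding.
    seeds = sorted(embeddings, key=lambda e: cosine(query_vec, embeddings[e]),
                   reverse=True)[:top_k]
    # 2) Graph traversal: expand each seed to its n-hop neighborhood and
    #    return the relationships as explainable grounding context.
    context = []
    for seed in seeds:
        nearby = nx.single_source_shortest_path_length(kg, seed, cutoff=hops)
        for src, dst, data in kg.out_edges(nearby.keys(), data=True):
            context.append((src, data["relation"], dst))
    return seeds, context

seeds, context = hybrid_retrieve(np.array([0.85, 0.15, 0.05]))
print("seeds:", seeds)      # most similar entities
print("context:", context)  # (subject, relation, object) triples
```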