UT MD Anderson is a leading cancer center focused on innovative healthcare solutions and is seeking a Senior Data Engineer to architect and build data infrastructure for AI and machine learning applications. The role involves designing scalable data pipelines, ensuring data quality, and collaborating with various stakeholders to enhance cancer care through responsible AI innovation.
Responsibilities:
- Build and Scale AI/ML Data Pipelines
- Design, implement, and maintain batch and streaming pipelines for ML training, deployment, inference, and monitoring using Azure, Dataiku, and open-source tools
- Data, Feature, and Vector Store Engineering
- Deploy and manage raw data, feature, and vector stores to enable fast, reliable access for production AI/ML systems
- Automate Infrastructure and Deployments
- Use Infrastructure-as-Code (IaC) and CI/CD workflows to automate deployments, improving reliability and efficiency
- Ensure Data Quality and Trust
- Implement validation, lineage, anomaly detection, and drift monitoring to deliver accurate, compliant data
- Security and Compliance by Design
- Enforce encryption, RBAC, tokenization, and audit logging to ensure HIPAA/HITRUST compliance while enabling scalable AI operations
- Collaborate and Lead
- Partner with data engineers, ML engineers, data scientists, and clinical stakeholders to deliver scalable AI solutions. Mentor team members and drive best practices in data engineering
- Own and Operate
- Manage pipelines and infrastructure end-to-end, including monitoring, alerting, incident management, and continuous improvement
- Other Duties
- Perform additional tasks as assigned to support departmental goals
Requirements:
- Bachelor's degree
- Five years of relevant information technology experience. May substitute required education with years of related experience on a one-to-one basis. With preferred degree, three years of experience required
- Expert in Python, SQL, Spark, and modern data engineering frameworks; proficient in Azure services, IaC tools (Terraform, Bicep), and CI/CD workflows
- Experienced in designing and managing feature and vector stores, batch and streaming pipelines, and high-throughput data architectures for AI/ML systems
- Familiar with HL7, FHIR, and DICOM standards and skilled in handling EHR, imaging, and clinical datasets with de-identification and compliance
- Strong understanding of HIPAA/HITRUST requirements and ability to implement encryption, RBAC, and audit logging
- Capable of mentoring team members, driving best practices, and partnering with clinicians, data scientists, and IT teams to deliver impactful solutions
- Adept at troubleshooting complex data challenges, optimizing performance, and exploring emerging technologies for scalable AI operations
- Able to clearly document processes and present technical concepts to both technical and non-technical audiences
- Master's-level degree (preferred)
- Must obtain at least one Epic Data Model certification (Clinical, Access, or Revenue) issued by Epic within 180 days of date of entry into job
- Any of the following: Azure Data Engineer Associate (DP-203), Epic Cogito Certification, HIPAA Privacy & Security Certification, HL7/FHIR Certification
- Healthcare experience in the AI/ML space is a must; two years of industry experience in a Senior Data Scientist role; knowledge of data privacy, security, and HIPAA compliance in healthcare