HCA Healthcare is redefining patient care through its Digital Transformation and Innovation team, which applies AI and digital technologies to that goal. The Lead Machine Learning Engineer will build and scale the core platform infrastructure for enterprise ML and Generative AI initiatives, keeping the resulting AI solutions high-performing and resilient.
Responsibilities:
- Architect and build scalable ML/LLM platform infrastructure to support multi-team AI development and deployment
- Design and maintain robust Python-based SDKs, reusable frameworks, and internal AI tooling
- Lead end-to-end MLOps and LLMOps platform implementation (CI/CD, model registry, feature store integration, evaluation pipelines)
- Build standardized GenAI application frameworks (RAG orchestration, prompt pipelines, evaluation harnesses, guardrails)
- Develop scalable model serving and inference infrastructure optimized for latency, throughput, and cost efficiency
- Implement enterprise-grade observability (logging, tracing, monitoring, drift detection, usage tracking)
- Deploy and manage AI workloads on GCP using containerized, Kubernetes-based, and serverless architectures
- Establish governance, security, and compliance standards for ML and LLM systems
- Drive infrastructure-as-code and automation practices for reproducible AI environments
- Partner with security, data, and DevOps teams to ensure reliability and platform resilience
Requirements:
- Bachelor's degree
- Expert-level Python proficiency with experience building production SDKs and scalable backend systems
- Strong experience designing and scaling GenAI platforms (RAG systems, vector DBs, embedding pipelines, LLM orchestration layers)
- Deep hands-on experience with MLOps and LLMOps tooling (Vertex AI, MLflow, Kubeflow, model registries, CI/CD automation)
- Strong understanding of API design, microservices architecture, and scalable backend engineering
- Master's degree (preferred)
- 7+ years of software engineering experience focused on ML/AI engineering or platform engineering
- Strong background in GCP (GKE, Cloud Run, Vertex AI, BigQuery, Pub/Sub, IAM, networking)
- Experience implementing monitoring/observability frameworks (Prometheus, OpenTelemetry, logging pipelines)
- Deep hands-on experience with Terraform or other IaC tools