Partner with ML research, data engineering, and application teams to translate requirements into reliable, secure, and cost-effective platform capabilities.
Lead design reviews, RFCs, and proof-of-concepts; mentor team members on cloud, Kubernetes, and data best practices.
Own incident response for platform components and drive continuous improvement through automation and standards.
Design and implement secure, scalable, multi-cloud (GCP + AWS) configurations.
Establish and maintain infrastructure as code (IaC) standards with Terraform.
Lead cloud-to-cloud data migrations, including secure transfer planning, checksum/manifest validation, parallelization, and cutover strategy.
Implement robust ingestion pipelines for medical images and metadata into structured data stores with schema management, versioning, and data lineage.
Optimize storage tiers and caching strategies for high-throughput image workloads.
Establish cost observability with budgets, alerts, showback/chargeback, and automated idle resource cleanup.
Own permissions and access management across clouds.
Plan and execute the wind-down of and exit from prior cloud providers: data egress, dependency mapping, app cutover, contract/savings plan termination, and archival with retention policies.
Stand up and maintain managed ML platforms (Vertex AI) or managed Kubernetes clusters (GKE/EKS) with CI/CD for pipelines, images, and deployments.
Partner with data/ML teams to codify data management practices: versioned datasets, reproducible preprocessing, clear lineage, and documentation.
Requirements
7+ years in DevOps/SRE/Platform roles, including multi-cloud (AWS/Azure/GCP) experience
Deep proficiency with Terraform, CI/CD (GitHub Actions/GitLab/CodeBuild/Cloud Build), and Kubernetes (EKS/GKE)
Hands-on experience with GPU workloads for ML training/inference and object storage patterns for large image datasets
Proven track record in cloud-to-cloud data migration, structured data ingestion (e.g., BigQuery/Redshift/Postgres), and schema management and data governance