GE HealthCare, a global leader in ultrasound medical devices and solutions, is seeking a Lead DevOps Engineer to architect and operate multi-cloud infrastructure for ML research and production software in medical imaging. The role combines hands-on engineering with technical leadership, with a focus on data governance, security, and platform reliability.
Responsibilities:
- Partner with ML research, data engineering, and application teams to translate requirements into reliable, secure, and cost-effective platform capabilities
- Lead design reviews, RFCs, and proof-of-concepts; mentor team members on cloud, Kubernetes, and data best practices
- Own incident response for platform components and drive continuous improvement through automation and standards
- Design and implement secure, scalable, multi-cloud (GCP + AWS) configurations
- Establish and maintain infrastructure as code (IaC) standards with Terraform
- Lead cloud-to-cloud data migration (e.g., GCS ↔ S3) including secure transfer planning, checksum/manifest validation, parallelization, and cutover strategy
- Implement robust ingestion pipelines for medical images and metadata into structured data stores (e.g., BigQuery/Redshift/Postgres) with schema management, versioning, and data lineage
- Create tools/services for dataset definition, preprocessing, curation, de-identification, and data quality checks
- Architect and manage GPU/CPU clusters for distributed training and batch inference using managed services (e.g., SageMaker) and/or Kubernetes (EKS with autoscaling)
- Optimize storage tiers (S3/GCS, Glacier/Archive, Filestore/FSx, EBS/Persistent Disk) and caching strategies for high-throughput image workloads
- Establish cost observability (per team/project/workload) with budgets, alerts, showback/chargeback, and automated idle resource cleanup
- Right-size compute/storage, leverage reserved/committed usage, spot/preemptible strategies, and data lifecycle policies
- Partner with ML teams to optimize training job efficiency (e.g., mixed precision, checkpointing strategies, data locality, sharding) and autoscaling
- Own permissions and access management across clouds (AWS IAM, GCP IAM) with least privilege, role/attribute-based access, and service identities
- Implement secrets management (e.g., AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) and key management (KMS)
- Support compliance and security controls relevant to healthcare/PHI (e.g., HIPAA, SOC 2): encryption in transit/at rest, audit logging, VPC Service Controls, private endpoints, and incident response runbooks
- Plan and execute the wind-down of and exit from prior cloud providers: data egress, dependency mapping, app cutover, contract/savings plan termination, and archival with retention policies
- Validate post-migration integrity and performance; document the final state and reduce operational surface area
- Stand up and maintain managed ML platforms (Vertex AI, SageMaker) or managed Kubernetes clusters (GKE/EKS) with CI/CD for pipelines, images, and deployments
- Provide platform abstractions (templates, Helm charts, Terraform modules) for ML engineering and app teams to self-serve safely
- Partner with data/ML teams to codify data management practices: versioned datasets, reproducible preprocessing, clear lineage, and documentation
- Build internal tools/CLIs to automate data prep, dataset validation, and catalog updates; integrate with governance/catalog platforms where applicable
Requirements:
- 7+ years in DevOps/SRE/Platform roles, including multi-cloud (AWS/Azure/GCP) experience
- Deep proficiency with Terraform, CI/CD (GitHub Actions/GitLab/CodeBuild/Cloud Build), and Kubernetes (EKS/GKE)
- Hands-on experience with GPU workloads for ML training/inference and object storage patterns for large image datasets
- Proven track record in data migration (cloud-to-cloud), structured data ingestion (e.g., BigQuery/Redshift/Postgres), and schema/governance
- Strong security mindset: IAM, secrets, KMS, network isolation, private endpoints, encryption, auditability
- Demonstrated cost optimization (FinOps) across compute/storage/networking with measurable savings
- Excellent cross-functional communication; ability to lead architectural direction and mentor engineers
- Experience with Vertex AI and/or SageMaker
- Knowledge of medical imaging formats (DICOM), de-identification, and regulated environments (HIPAA, SOC 2)
- Observability stacks: Cloud Monitoring/Logging, Prometheus/Grafana, OpenTelemetry
- Container security and supply chain: SBOMs, image signing (Cosign), policy enforcement (OPA/Gatekeeper)
- Proven ability to sunset legacy environments and perform compliant archival and data retention
- Scripting and tooling in Python; CLIs and SDK automation for AWS/GCP