Ford Motor Company is seeking an HPC - AI/ML Platform Engineer to join their team responsible for engineering and operating large-scale GPU and compute platforms. The role focuses on building reliable, scalable GPU platforms and assisting internal users in running AI/ML and high-performance workloads on Kubernetes and related infrastructure.

Responsibilities:

Design, implement, and support GPU/Kubernetes clusters and supporting infrastructure
Support customers running AI/ML training, simulation, and HPC workloads
Develop automation and tooling for cluster provisioning, configuration management, and platform operations
Collaborate with application and research teams to optimize workloads running on GPU infrastructure
Implement monitoring, observability, and performance tuning across GPU and compute platforms
Troubleshoot infrastructure issues across compute, networking, and container platforms, including occasional on-call support
Contribute to platform reliability, scalability, and operational best practices
Produce clear technical documentation and operational runbooks

Requirements:

5+ years of Linux systems engineering or infrastructure experience
2+ years working with container platforms such as Kubernetes or OpenShift
Familiarity with Kubernetes GPU scheduling and related tooling
Familiarity with CI/CD pipelines and platform engineering practices
Experience operating compute infrastructure for high-performance workloads or large distributed systems
Strong scripting or programming skills (Python, Bash, or similar)
Experience building infrastructure automation and operational tooling
Strong troubleshooting and problem-solving skills across complex infrastructure systems
Ability to communicate clearly with both platform engineers and application teams
Demonstrated ability to manage multiple technical initiatives simultaneously
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience
Experience with observability platforms such as Prometheus, Grafana, or similar
Experience with infrastructure automation tools (Ansible, Terraform, etc.)
Experience with high-speed networking technologies such as InfiniBand or RDMA
