Ford Motor Company is seeking an HPC - AI/ML Platform Engineer to join the team that engineers and operates its large-scale GPU and compute platforms. The role focuses on building reliable, scalable GPU platforms and helping internal users run AI/ML and high-performance workloads on Kubernetes and related infrastructure.
Responsibilities:
- Design, implement, and support GPU/Kubernetes clusters and supporting infrastructure
- Support customers running AI/ML training, simulation, and HPC workloads
- Develop automation and tooling for cluster provisioning, configuration management, and platform operations
- Collaborate with application and research teams to optimize workloads running on GPU infrastructure
- Implement monitoring, observability, and performance tuning across GPU and compute platforms
- Troubleshoot infrastructure issues across compute, networking, and container platforms, including occasional on-call support
- Contribute to platform reliability, scalability, and operational best practices
- Produce clear technical documentation and operational runbooks
Requirements:
- 5+ years of Linux systems engineering or infrastructure experience
- 2+ years working with container platforms such as Kubernetes or OpenShift
- Familiarity with Kubernetes GPU scheduling and related tooling
- Familiarity with CI/CD pipelines and platform engineering practices
- Experience operating compute infrastructure for high-performance workloads or large distributed systems
- Strong scripting or programming skills (Python, Bash, or similar)
- Experience building infrastructure automation and operational tooling
- Strong troubleshooting and problem-solving skills across complex infrastructure systems
- Ability to communicate clearly with both platform engineers and application teams
- Demonstrated ability to manage multiple technical initiatives simultaneously
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience
- Experience with observability platforms such as Prometheus, Grafana, or similar
- Experience with infrastructure automation tools (Ansible, Terraform, etc.)
- Experience with high-speed networking technologies such as InfiniBand or RDMA