General Motors is a leading automotive company focused on innovation and safety. It is seeking a Senior AI/ML Capacity and Performance Engineer to support the development of autonomous vehicles by optimizing ML infrastructure and collaborating with cross-functional teams.
Responsibilities:
- Strategic Infrastructure Development: Adopt and run AV models to support GM’s long-term GPU system strategy and "evergreen" infrastructure roadmap
- Performance Optimization: Conduct deep-dive analyses of production workloads to identify bottlenecks and propose high-impact optimization strategies
- Cross-Functional Collaboration: Partner with AI/ML Research, Infrastructure Engineering, and Cloud Vendors to spearhead projects that enhance engineering velocity and cost-efficiency
- Proactive System Scaling: Identify opportunities for architectural improvements to ensure the scalability and reliability of large-scale ML training and inference environments
Requirements:
- 5+ years of professional experience in high-scale infrastructure or ML systems
- Bachelor's Degree in Computer Science, a related technical field, or equivalent practical experience
- Expert-level coding skills in Python and the ability to architect/debug within the PyTorch ecosystem
- Proven track record of resolving performance issues within large-scale distributed production environments
- Deep understanding of distributed systems, specifically modern ML system design and high-performance computing (HPC)
- Hands-on experience with Kubernetes for orchestrating complex workloads
- Technical proficiency with Nvidia DCGM, nvidia-smi, and Grafana for real-time telemetry and observability
- Extensive experience working within major cloud ecosystems (AWS, GCP, or Azure)
Preferred Qualifications:
- 8+ years of relevant industry experience
- Working knowledge of enterprise-grade Nvidia GPU architectures, including H100, B200, and GB200
- Experience deploying and scaling open-source models via the Hugging Face ecosystem
- Proficiency in BigQuery for large-scale data analysis and reporting
- Practical experience utilizing Nvidia Nsight and Nsight Compute for kernel-level performance tuning
- Strong technical communication skills with the ability to translate complex infrastructure needs into actionable business insights