NVIDIA is a leading technology company known for its innovation in GPU technology and AI computing. It is seeking a Senior AI and ML HPC Cluster Engineer to lead the design and implementation of GPU compute clusters for demanding workloads, improve automation, and support researchers' needs.
Responsibilities:
- Provide leadership and strategic guidance on the management of large-scale HPC systems including the deployment of compute, networking, and storage
- Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions
- Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud
- Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet users' evolving needs
- Support our researchers in running their workloads, including performance analysis and optimization
- Conduct root cause analysis and suggest corrective actions
- Proactively find and fix issues before they occur
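The "scalable automation" responsibility above typically includes programmatically generating job-scheduler submissions. As a minimal sketch only: the function name, the "gpu" partition, and the chosen resource flags are illustrative assumptions, since real clusters define their own partitions and resource syntax.

```python
from textwrap import dedent

def render_sbatch(job_name: str, nodes: int, gpus_per_node: int,
                  command: str, partition: str = "gpu") -> str:
    """Render a minimal Slurm batch script for a multi-node GPU job.

    Hypothetical helper for illustration; partition and flag choices
    vary from cluster to cluster.
    """
    return dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name={job_name}
        #SBATCH --partition={partition}
        #SBATCH --nodes={nodes}
        #SBATCH --ntasks-per-node={gpus_per_node}
        #SBATCH --gpus-per-node={gpus_per_node}
        srun {command}
        """)

# Example: a 4-node job with 8 GPUs per node.
print(render_sbatch("resnet-train", nodes=4, gpus_per_node=8,
                    command="python train.py"))
```

Templating scripts this way, rather than hand-editing them per job, is one common path toward the reproducible, scalable automation the role describes.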
Requirements:
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- 5+ years of experience designing and operating large-scale compute infrastructure
- Experience with advanced AI/HPC job schedulers such as Slurm, K8s, PBS, RTDA, or LSF
- Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
- Solid understanding of cluster configuration management tools such as Ansible, Puppet, or Salt
- In-depth understanding of container technologies such as Docker, Singularity, Podman, Shifter, and Charliecloud
- Proficiency in Python programming and Bash scripting
- Applied experience with AI/HPC workflows that use MPI
- Experience analyzing and tuning performance for a variety of AI/HPC workloads
- Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields
- Background with NVIDIA GPUs, CUDA Programming, NCCL and MLPerf benchmarking
- Experience with Machine Learning and Deep Learning concepts, algorithms and models
- Familiarity with InfiniBand, including IPoIB and RDMA
- Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
- Familiarity with deep learning frameworks like PyTorch and TensorFlow