NVIDIA has been transforming computer graphics and accelerated computing for over 25 years. They are looking for a software engineer to develop datacenter-scale performance modeling and prediction tools for AI researchers running AI workloads on GPU clusters.
Responsibilities:
- Build performance modeling and prediction tools for AI workloads at datacenter scale
- Develop production tools and workflows used by multiple teams, both within NVIDIA and at its customers
- Automate workflows, including searching for the most efficient configurations over millions of parameters
- Partner with HW and SW architects to propose new features or improve existing ones based on real-world use cases
Requirements:
- BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development
- Strong software skills in design, coding (C++ and Python), analysis, and debugging
- Good understanding of Deep Learning frameworks like PyTorch and TensorFlow, and of distributed training and inference
- Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage and networking
- Experience with NVIDIA GPUs, CUDA Programming, and Networking
- Motivated self-starter with strong problem-solving skills and customer-facing communication skills
- Passion for continuous learning. Ability to work concurrently with multiple global groups
- Proven SW engineering experience in deploying SW at datacenter scale
- Solid experience in large AI job performance analysis for training/inference workloads
- Knowledge of Linux device drivers and/or compiler implementation
- Knowledge of GPU and/or CPU architecture and general computer architecture principles