NVIDIA has been transforming computer graphics and accelerated computing for over 25 years. The company is looking for a software engineer to develop datacenter-scale performance modeling and prediction tools for AI researchers running AI workloads on GPU clusters.

Responsibilities:

Build performance modeling and prediction tools for AI workloads at datacenter scale
Develop production tools and workflows used by multiple teams within NVIDIA and at its customers
Automate workflows, including searching for the most efficient configurations over millions of parameters
Partner with HW and SW architects to propose new features or improve existing features based on real-world use cases

Requirements:

BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development
Strong software design, coding (C++ and Python), analytical, and debugging skills
Good understanding of Deep Learning frameworks like PyTorch and TensorFlow, and of distributed training and inference
Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage and networking
Experience with NVIDIA GPUs, CUDA Programming, and Networking
Motivated self-starter with strong problem-solving skills and customer-facing communication skills
Passion for continuous learning. Ability to work concurrently with multiple global groups
Proven SW engineering experience in deploying SW at datacenter scale
Solid experience in performance analysis of large AI jobs for training/inference workloads
Knowledge of Linux device drivers and/or compiler implementation
Knowledge of GPU and/or CPU architecture and general computer architecture principles

Senior Datacenter Performance Model Engineer
