NVIDIA has been transforming computer graphics and accelerated computing for over 25 years, focusing on AI to define the next era of computing. They are seeking a Senior Systems Software Engineer to drive innovation in GPU performance at scale, contributing to advanced computing hardware and software while collaborating with various teams to enhance workflows and solutions.

Responsibilities:

Lead the implementation of performance practices in large-scale GPU infrastructure, delivering powerful tools, methodologies, and flows to validate and improve multiple datacenter products concurrently
Align next-generation AI workloads with next-generation datacenter builds for NVIDIA GPUs, CPUs, and networking hardware. Engage early with HW/FW/SW/platform internal and customer teams
Develop engineering solutions that continuously track the performance of AI workloads in evolving environments and quickly surface improvements and regressions
Decompose high-complexity performance or stability issues into minimal reproduction cases, working towards identifying the root cause
Collaborate with various SW and FW teams (BMC/SBIOS/OS/drivers, etc.) to develop outstanding methods and tools. Analyze, debug, and resolve critical firmware and software issues to achieve the highest AI workload performance at scale

Requirements:

Proven understanding of accelerated computing software stacks (CUDA)
Experience with modern cloud and container-based enterprise computing architectures, with Slurm preferred
Strong programming and scripting experience in C/C++/Python/Bash
Deep expertise in systems architecture and the impact of various components on performance
Experience with container technology and Linux-based OSes, with Docker preferred
Experience supporting high-performance computing or deep learning in engineering or academic research communities
Strong teamwork and communication skills, coupled with results-focused analytical abilities
BS in Engineering, Mathematics, Physics, or Computer Science (or equivalent experience) with 8+ years of applicable experience; MS or PhD desirable
End-to-end GPU performance engineering from the profiler to systems analysis
Linux systems programming and optimization experience
Exposure to virtualization techniques and cloud platform solutions
Experience with scheduling and resource management systems
Experience with large-scale HPC environments

Senior Systems Software Engineer - GPU Performance at Scale
