NVIDIA has been transforming computer graphics and accelerated computing for over 25 years and is now focused on AI to define the next era of computing. The company is seeking a Senior Systems Software Engineer to drive innovation in GPU performance at scale, contributing to advanced computing hardware and software while collaborating with multiple teams to improve workflows and solutions.
Responsibilities:
- Lead the implementation of performance practices in large-scale GPU infrastructure, delivering powerful tools, methodologies, and flows to validate and improve multiple datacenter products concurrently
- Align next-generation AI workloads with next-generation datacenter builds for NVIDIA GPUs, CPUs, and networking hardware, engaging early with internal and customer HW/FW/SW/platform teams
- Develop engineering solutions that continuously track the performance of AI workloads in evolving environments and quickly surface improvements and regressions
- Decompose high-complexity performance or stability issues into minimal reproduction cases, working towards identifying the root cause
- Collaborate with SW and FW teams (BMC, SBIOS, OS, drivers, etc.) to develop outstanding methods and tools. Analyze, debug, and resolve critical firmware and software issues to achieve the highest AI workload performance at scale
Requirements:
- Proven understanding of accelerated computing software stacks (CUDA)
- Experience with modern cloud and container-based enterprise computing architectures, with Slurm preferred
- Strong programming and scripting experience in C/C++/Python/Bash
- Deep expertise in systems architecture and the impact of various components on performance
- Experience with container technology and Linux-based OSes, with Docker preferred
- Experience supporting high-performance computing or deep learning in engineering or academic research communities
- Strong teamwork and communication skills, coupled with results-focused analytical abilities
- BS in Engineering, Mathematics, Physics, or Computer Science (or equivalent experience); MS or PhD desirable with 8+ years of applicable experience
- Experience with end-to-end GPU performance engineering, from profiling to systems analysis
- Linux systems programming and optimization experience
- Exposure to virtualization techniques and cloud platform solutions
- Experience with scheduling and resource management systems
- Experience with large-scale HPC environments