NVIDIA has been a leader in computer graphics and accelerated computing for over 25 years and is now expanding into AI. This role involves engineering solutions for efficient resource management and job scheduling in large datacenter clusters, and collaborating with various teams to optimize performance and drive innovation.

Responsibilities:

Provide engineering solutions and prototypes to enable efficient resource management and job scheduling for large-scale clusters
Drive next-generation requirements and features for schedulers in at-scale clusters
Build and maintain technical relationships with internal and external engineering teams
Assist system architects and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology
Be an internal reference for scheduling and resource management concepts and methodologies among the NVIDIA technical community
Test, evaluate, and benchmark new technologies and products and work with vendors, partners and peers to improve functionality and optimize performance

Requirements:

BS, MS, or PhD in Engineering, Mathematics, Physics, Computer Science, or equivalent experience
12+ years of experience designing and running scheduling and resource management systems in large datacenter/AI/HPC solutions
Knowledge of and experience with resource management / scheduling code bases: SLURM preferred, or other implementations (LSF, SGE, Torque...)
Proven understanding of performance clusters, infrastructure and workload patterns
Experience using and installing Linux-based server platforms
C/Python/Bash/Lua programming/scripting experience
Experience working with engineering or academic research community supporting HPC or deep learning
Strong teamwork skills and strong verbal and written communication skills
Experience with HPC cluster administration for AI
Experience deploying containerized services
Experience with orchestrators (e.g. Kubernetes)
Demonstrated work with open-source software: building, debugging, patching, and contributing code
Experience tuning memory, storage, and networking settings for performance on Linux systems
Exposure to monitoring and telemetry systems

Senior HPC Scheduler Engineer
