NVIDIA has been a leader in computer graphics and accelerated computing for over 25 years and is now extending that work into AI. This role involves engineering solutions for efficient resource management and job scheduling in large datacenter clusters, and collaborating with various teams to optimize performance and drive innovation.
Responsibilities:
- Provide engineering solutions and prototypes to enable efficient resource management and job scheduling for large-scale clusters
- Drive next-generation requirements and features for schedulers in at-scale clusters
- Build and maintain technical relationships with internal and external engineering teams
- Assist system architects and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology
- Be an internal reference for scheduling and resource management concepts and methodologies among the NVIDIA technical community
- Test, evaluate, and benchmark new technologies and products and work with vendors, partners and peers to improve functionality and optimize performance
Requirements:
- BS, MS, or PhD in Engineering, Mathematics, Physics, Computer Science, or equivalent experience
- 12+ years of experience designing and running scheduling and resource management systems in large datacenter/AI/HPC solutions
- Knowledge of and experience with resource management / scheduling code bases: Slurm preferred; other implementations (LSF, SGE, Torque...) also considered
- Proven understanding of performance clusters, infrastructure and workload patterns
- Experience using and installing Linux-based server platforms
- C/Python/Bash/Lua programming/scripting experience
- Experience working with engineering or academic research community supporting HPC or deep learning
- Strong teamwork skills and strong verbal and written communication skills
- Experience with HPC cluster administration for AI
- Experience deploying containerized services
- Experience with orchestrators (e.g. Kubernetes)
- Demonstrated work with open-source software: building, debugging, patching, and contributing code
- Experience tuning memory, storage, and networking settings for performance on Linux systems
- Exposure to monitoring and telemetry systems