NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. NVIDIA is seeking an expert System Software Engineer with a strong background in Linux systems programming to help design and maintain Slurm, troubleshoot complex system issues, and provide technical support. The role combines software development with system-level troubleshooting in a collaborative environment.
Responsibilities:
- Maintain, improve, and optimize software components written in C
- Develop and maintain system-level and application-level code for Slurm, including networking, system-level, and device-level components
- Debug and troubleshoot complex Slurm issues related to reliability and performance
- Write clean, maintainable, and well-documented code that adheres to industry standards
- Collaborate with cross-functional teams including Operations, Infrastructure, and Deployment
- Provide direct technical support to internal teams or external customers
- Develop automated tests to ensure software reliability and prevent regressions
- Stay current with best practices in C programming, compilers, build systems, and related technologies
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
- 5+ years of professional experience in C development
- Strong understanding of memory management, pointers, data structures, and algorithms
- Experience with debugging tools such as GDB and with performance profiling
- Solid understanding of Linux kernel interfaces, system calls, and file systems, plus experience with build tooling such as Automake
- Understanding of software development lifecycles and agile methodologies
- Strong problem-solving and analytical skills
- A strong focus on quality and reliability
- Experience with containers and GPU technologies
- Curious, self-motivated, and eager to learn new technologies
- Experience with other low-level languages in addition to C
- Background in system administration or High Performance Computing
- Experience with Slurm Workload Manager or other HPC scheduling systems
- Knowledge of operating system internals or hardware-software interaction
- Contributions to open-source C projects are a plus