NVIDIA, a leader in accelerated computing, is seeking a Senior Software Engineer to develop and implement CUDA Core Libraries for GPU computing. The role involves working on C++ and Python libraries, optimizing GPU algorithms, and improving the developer experience for CUDA users.
Responsibilities:
- Develop and implement CUDA Core Libraries in C++ and/or Python, including parallel algorithms and idiomatic language bindings for core CUDA functionality
- Compose, optimize, and evolve GPU algorithms and APIs, from high-level interfaces down to low-level performance tuning involving memory, parallelism, and synchronization
- Own features end-to-end: design, implementation, testing, benchmarking, documentation, and long-term maintenance
- Improve developer experience across the stack: CI, tests, benchmarks, packaging, examples, and docs
- Collaborate with senior CUDA engineers in design reviews, code reviews, and open-source-style workflows
- Engage with real users through issues, performance investigations, and API feedback
Requirements:
- BS, MS, or PhD in Computer Science, Computer Engineering, or a related field, or equivalent experience
- At least 8 years of related development experience
- Strong programming skills in C++, Python, or both, with proven interest in systems-level software (performance, memory, concurrency, API design)
- Solid understanding of modern C++ (templates, generics, standard library) and/or Python library development and packaging
- Practical experience with parallel or heterogeneous programming (CUDA, OpenMP, GPU-accelerated Python, or similar)
- Experience contributing to production software or open-source libraries, including testing, profiling, and code review
- Ability to work independently, scope problems, and drive projects to completion
- Clear written communication for technical design and documentation
- Comfort navigating large, multi-language codebases (C++, Python, CMake, Pixi, CI systems)
- Strong understanding of CPU/GPU architecture and how hardware details affect performance
- Hands-on experience with CUDA C++, CUDA Python, PyTorch, JAX, Numba, CuPy, or similar GPU-accelerated stacks
- Familiarity with Thrust, CUB, libcudacxx, or other modern C++/GPU libraries
- Experience with compiler infrastructure or tooling (LLVM, Clang tooling, MLIR)
- Demonstrated interest in developer tools, library design, and making other developers faster