Develop and implement CUDA Core Libraries in C++ and/or Python, including parallel algorithms and idiomatic language bindings for core CUDA functionality.
Compose, optimize, and evolve GPU algorithms and APIs, from high-level interfaces down to low-level performance tuning involving memory, parallelism, and synchronization.
Own features end-to-end: design, implementation, testing, benchmarking, documentation, and long-term maintenance.
Improve developer experience across the stack: CI, tests, benchmarks, packaging, examples, and docs.
Collaborate with senior CUDA engineers in design reviews, code reviews, and open-source-style workflows.
Engage with real users through issues, performance investigations, and API feedback.
Requirements
BS, MS, or PhD in Computer Science, Computer Engineering, or a related field, or equivalent experience.
8+ years of related development experience.
Strong programming skills in C++, Python, or both, with proven interest in systems-level software (performance, memory, concurrency, API design).
Solid understanding of modern C++ (templates, generics, the standard library) and/or Python library development and packaging.
Practical experience with parallel or heterogeneous programming (CUDA, OpenMP, GPU-accelerated Python, or similar).
Experience contributing to production software or open-source libraries, including testing, profiling, and code review.
Ability to work independently, scope problems, and drive projects to completion.
Clear written communication for technical design and documentation.