Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We are seeking a GPU Software Engineer with deep expertise in CUDA programming, GPU architecture, and high-performance computing to design and optimize compute-intensive workloads on modern accelerator hardware.
Responsibilities:
- Design and implement high-performance CUDA kernels for compute-intensive workloads across AI and HPC use cases
- Profile and optimize GPU code using tools such as Nsight Systems, Nsight Compute, and other CUDA profilers
- Tune memory access patterns, occupancy, register usage, and shared memory utilization for peak performance
- Develop highly optimized libraries for linear algebra, attention, and other ML primitives
- Optimize multi-GPU and multi-node training using NCCL, RDMA, and high-performance networking
- Implement custom operators and fused kernels in PyTorch, JAX, or Triton
- Collaborate with ML engineers to identify performance bottlenecks in training and inference pipelines
- Develop benchmarks and regression tests to safeguard performance over time
- Evaluate new GPU architectures and feature sets, and advise on adoption strategy
- Contribute to compiler-level optimizations for tensor programs where appropriate, working between ML frameworks and accelerator code generation to unlock performance that framework-level tuning alone cannot reach
- Optimize memory hierarchy usage across HBM, L2, shared memory, and registers
- Implement mixed-precision and quantized compute paths that maximize accelerator throughput while keeping numerical error within bounds acceptable for the target workloads
- Document performance characteristics, design decisions, and tuning playbooks for internal teams
- Stay current with developments in GPU architecture, the CUDA platform, and emerging accelerator technologies
Requirements:
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
- Six or more years of experience in GPU programming and performance engineering
- Deep expertise in CUDA C/C++ and GPU programming models
- Strong understanding of modern GPU architectures, memory hierarchies, and execution models
- Hands-on experience profiling and optimizing GPU workloads in production
- Familiarity with NCCL, MPI, and high-performance interconnect technologies
- Experience integrating custom kernels into ML frameworks
- Strong C++ skills and familiarity with modern systems programming practices
- Solid grounding in linear algebra and numerical methods
- Strong communication and collaboration skills with research and engineering teams
- Experience with Triton, CUTLASS, or other GPU kernel authoring frameworks
- Familiarity with TensorRT, FasterTransformer, or vLLM internals
- Exposure to compiler infrastructure such as LLVM or MLIR
- Open-source contributions to GPU or ML performance libraries
- Experience with large-scale distributed training infrastructure