Modular is on a mission to revolutionize AI infrastructure by rebuilding the AI software stack. The Senior AI Kernel Engineer will lead the design and optimization of high-performance kernels for large-scale AI inference on GPUs, collaborating closely with compiler and runtime teams to improve performance and efficiency.
Responsibilities:
- Design, implement, and optimize performance-critical kernels for AI inference workloads (e.g., GEMM, attention, communication, fusion)
- Lead kernel-level optimization efforts across single-GPU, multi-GPU, and heterogeneous hardware environments
- Make informed trade-offs between latency, throughput, memory footprint, and numerical precision
- Drive adoption of new hardware features (e.g., Tensor Cores, asynchronous execution, advanced memory spaces)
- Analyze performance using profilers, hardware counters, and microbenchmarks; translate insights into concrete improvements
- Work closely with compiler and runtime teams to influence code generation, scheduling, and kernel fusion strategies
- Review code and mentor other engineers on kernel design, performance tuning, and best practices
- Contribute to technical roadmaps and long-term performance strategy for AI inference
Requirements:
- 5+ years of experience in performance-critical systems or kernel development (or equivalent depth of expertise)
- Strong proficiency in C/C++ and low-level programming
- Extensive hands-on experience with GPU kernel programming (CUDA, HIP, or equivalent)
- Deep understanding of GPU architecture, including memory hierarchies, synchronization, and execution models
- Proven track record of delivering measurable performance improvements in production systems
- Strong problem-solving skills and ability to work independently on complex, ambiguous performance challenges
- Experience with PTX, assembly-level tuning, or code generation frameworks (e.g., Triton)
- Experience optimizing distributed or multi-GPU inference pipelines
- Familiarity with custom AI accelerators or domain-specific hardware
- Understanding of modern AI models (e.g., transformers, LLMs, diffusion) from a systems and performance perspective
- Contributions to open-source kernel libraries, compilers, or performance tools
- Experience collaborating directly with hardware or compiler teams