Optimize emerging AI inference workloads such as Large Language Models (LLMs) and diffusion models on GPUs
Develop and optimize graph-based compilation flows (e.g., MLIR/LLVM) for neural network workloads
Write and tune performance-critical GPU kernels and runtime code in C++ or parallel programming languages
Identify and resolve bottlenecks across compiler, runtime, and kernel layers
Profile, benchmark, and characterize AI workloads to validate performance gains
Collaborate with hardware, driver, and framework teams on hardware/software co-optimization
Requirements
Bachelor's degree in Computer Science or a related field with 4+ years of relevant experience, OR a Master's degree with 2+ years of relevant experience
Strong C++ development and debugging skills
Solid understanding of GPU or AI accelerator architectures
Hands-on experience running modern neural network architectures for inference on hardware accelerators
Preferred Qualifications
PhD with 1+ years of relevant experience
Experience optimizing end-to-end real-world AI workloads
Familiarity with OpenVINO or other AI inference frameworks
Knowledge of neural network optimization techniques and performance tradeoffs
Experience across multiple layers of the AI software stack, including AI inference engines or runtimes, graph compilers (e.g., MLIR/LLVM), and GPU kernels or other performance-critical compute code