Work with the inference team to improve serving latency and throughput
Bring up support for new models and state-of-the-art inference optimizations or quantization schemes
Optimize inference across the entire stack, from GPU kernels to serving endpoints
Requirements
Strong engineering track record with proven knowledge of fundamentals and programming languages (multi-threaded programming, networking, compilation, systems programming, etc.)
Pursuing a Master's or PhD in Computer Science with a focus on performance-related subjects (HPC, Compilers, Distributed Systems)
Experience with ML frameworks (Torch, JAX)
Experience with GPU programming (CUDA, Triton)
Experience with High-Performance Computing (OpenMPI)
Tech Stack
Distributed Systems
Benefits
Unfortunately, we cannot provide housing.
Unfortunately, we cannot provide health insurance for interns. Full-time employees receive full health insurance and benefits.
There is no limit on the number of full-time offers. All outstanding performers will be given a full-time offer!