SambaNova is a pioneering company in the generative AI space, providing a full-stack platform for organizations deploying AI at scale. The role involves optimizing and scaling advanced foundation models, bridging deep learning and systems engineering to deliver exceptional AI inference performance.
Responsibilities:
- Bring up and optimize cutting-edge foundation models (e.g., DeepSeek, Llama, Qwen, and others) on the SambaNova platform through the SambaNova software stack
- Profile and enhance model performance across the compiler, runtime, and hardware layers to achieve state-of-the-art (SOTA) throughput and latency
- Collaborate with machine learning, compiler, runtime, and hardware teams to deliver co-designed, high-performance AI applications
- Integrate the latest advances in model architecture, quantization, scheduling, and memory optimization from both academia and industry
- Develop robust, scalable, and efficient end-to-end inference solutions aligned with customer needs
- Identify performance bottlenecks and propose dataflow or scheduling optimizations for both single-node and distributed systems
Requirements:
- Bachelor's or higher degree in computer science, electrical engineering, or a related field (e.g., applied mathematics, physics, or statistics)
- 3+ years of experience in one or more of the following areas: deep learning model development and performance optimization; compiler, runtime, or kernel-level optimization; software–hardware co-design or systems performance tuning
- Proficiency in Python or C++, with strong foundations in algorithms, data structures, and numerical computing
- Experience with at least one major ML framework (PyTorch, TensorFlow, or JAX)
- Demonstrated ability to analyze and optimize performance in real-world ML pipelines
- Hands-on experience with LLM or multimodal model training and inference
- Background in large-scale distributed training, continuous batching, and high-throughput inference systems
- Familiarity with quantization, graph optimization, kernel fusion, and model partitioning
- Experience with frameworks such as DeepSpeed, Megatron, vLLM, or TensorRT
- Strong GPU programming skills (CUDA, Triton, or OpenCL); experience with cuDNN, cuBLAS, or similar libraries is a plus
- Knowledge of memory hierarchy optimization, caching, and scheduling for large-scale model execution
- Publication record or open-source contributions in ML systems or performance optimization is a plus