Together AI is a research-driven artificial intelligence company focused on optimizing AI systems. The role involves designing and developing distributed inference engines for large language models, emphasizing performance and scalability.
Responsibilities:
- Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models
- Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism, for high-performance serving
- Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability
- Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators
- Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize end-to-end (E2E) model serving pipelines
Requirements:
- 3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing
- Familiar with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, or TGI (Text Generation Inference))
- Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compilers, model quantization, or GPU cluster scheduling
- Deep understanding of KV cache systems like Mooncake, PagedAttention, or custom in-house variants
- Proficient in Python and C++/CUDA for high-performance deep learning inference
- Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization
- Knowledge of inference optimization techniques, such as workload scheduling, CUDA graphs, compilation, and efficient kernels
- Strong analytical problem-solving skills with a performance-driven mindset
- Excellent collaboration and communication skills across teams
- Experience in developing software systems for large-scale data center networks with RDMA/RoCE
- Familiar with distributed filesystems (e.g., 3FS, HDFS, Ceph)
- Familiar with open-source distributed scheduling/orchestration frameworks, such as Kubernetes (K8s)
- Contributions to open-source deep learning inference projects