Together AI is a research-driven artificial intelligence company focused on optimizing AI systems. The role involves designing and developing distributed inference engines for large language models, emphasizing performance and scalability.

Responsibilities:

Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models
Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism, for high-performance serving
Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability
Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators
Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize end-to-end model serving pipelines

Requirements:

3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing
Familiarity with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, TGI (Text Generation Inference))
Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compilers, model quantization, or GPU cluster scheduling
Deep understanding of KV cache systems like Mooncake, PagedAttention, or custom in-house variants
Proficient in Python and C++/CUDA for high-performance deep learning inference
Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization
Knowledge of inference optimization techniques, such as workload scheduling, CUDA graphs, compilation, and efficient kernels
Strong analytical problem-solving skills with a performance-driven mindset
Excellent collaboration and communication skills across teams
Experience in developing software systems for large-scale data center networks with RDMA/RoCE
Familiarity with distributed file systems (e.g., 3FS, HDFS, Ceph)
Familiarity with open-source distributed scheduling/orchestration frameworks, such as Kubernetes (K8s)
Contributions to open-source deep learning inference projects

LLM Inference Frameworks and Optimization Engineer
