About this role
Role Overview
Build and maintain robust inference engines using tools such as vLLM, TGI, and NVIDIA Triton, ensuring high performance at scale (see the vLLM sketch after this list).
Optimize deployments to deliver low-latency AI serving for multiple business applications.
Profile and optimize models for specialized hardware backends, including NVIDIA GPUs, Apple Silicon, and AI accelerators.
Collaborate with hardware teams to maximize utilization and performance across various computational environments.
Design and implement auto-scaling architectures for online and batch inference pipelines, leveraging Kubernetes.
Establish robust observability frameworks to monitor performance metrics against service-level agreements.
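For orientation, below is a minimal sketch of the kind of engine-level work described above, using vLLM's offline batch API; the model name and prompts are placeholder assumptions, and a production deployment would instead expose the engine through vLLM's OpenAI-compatible server behind the auto-scaling and observability layers listed here.

```python
# Minimal vLLM batch-inference sketch (assumes `pip install vllm` and a CUDA GPU).
# Model name and prompts are illustrative placeholders, not a project configuration.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the benefits of continuous batching in one sentence.",
    "Explain PagedAttention to a new engineer in two sentences.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# LLM() loads the weights and builds the KV-cache manager; generate() runs
# continuous batching over all prompts in a single call.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.prompt)
    print(out.outputs[0].text.strip())
```

The same engine is typically served online through vLLM's OpenAI-compatible server, which is what the Kubernetes auto-scaling and SLA monitoring described above would sit in front of.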
Requirements
Proficiency in programming languages such as Python, C++, Rust, or Golang for high-performance AI workflows.
Proven hands-on experience with tools like vLLM, TensorRT, Llama.cpp, and Ollama for inference development and optimization.
Strong familiarity with infrastructure technologies, including Docker, Kubernetes, and cloud platforms such as AWS, GCP, and Azure.
Comprehensive understanding of GPU and AI hardware, including techniques for profiling and optimizing performance on accelerators such as NVIDIA GPUs and TPUs.
Prior experience deploying Large Language Models (LLMs) with advanced techniques such as speculative decoding or PagedAttention.
Contributions to open-source inference libraries or hardware-level kernel development (e.g., CUDA, Triton kernels).
Background in MLOps or SRE roles focused on high-performance AI endpoints and reliability during demand surges.
Proficiency in designing scalable, high-throughput inference systems that stay reliable under traffic bursts (see the load-probe sketch after this list).
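To illustrate the traffic-burst and reliability themes above, here is a hypothetical load probe against an OpenAI-compatible completions endpoint such as the one vLLM or TGI exposes; the endpoint URL, model name, and request counts are assumptions for illustration, not a prescribed setup.

```python
# Hypothetical burst-latency probe against an OpenAI-compatible endpoint.
# URL, model name, and request volumes are assumptions for illustration only.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local vLLM/TGI server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"          # placeholder model name
CONCURRENCY = 32
TOTAL_REQUESTS = 128

def one_request(i: int) -> float:
    """Send one completion request and return its wall-clock latency in seconds."""
    payload = {
        "model": MODEL,
        "prompt": f"Request {i}: reply with a short greeting.",
        "max_tokens": 32,
    }
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=payload, timeout=60)
    resp.raise_for_status()
    return time.perf_counter() - start

# Fire a burst of concurrent requests and collect per-request latencies.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(TOTAL_REQUESTS)))

# statistics.quantiles with n=100 yields 99 cut points: index 49 is p50, 94 is p95.
pcts = statistics.quantiles(latencies, n=100)
print(f"p50 = {pcts[49]:.3f}s  p95 = {pcts[94]:.3f}s  max = {max(latencies):.3f}s")
```

In practice a dedicated load-testing tool would replace this hand-rolled loop, but the p50/p95 figures it reports are the same metrics the SLA-oriented observability work above is meant to track.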