Tags: AWS, Azure, C++, Cloud, Docker, Google Cloud Platform (GCP), Kubernetes, Python, Rust, Go, C, AI, Large Language Models, Llama, Ollama, MLOps, Remote Work
About this role
Role Overview
Build and maintain robust inference engines using tools like vLLM, TGI, and NVIDIA Triton, ensuring high performance at scale.
Optimize deployments to deliver low-latency AI serving solutions across multiple business applications.
Profile and optimize models for specialized hardware backends, including NVIDIA GPUs, Apple Silicon, and AI accelerators like TPUs and LPUs.
Collaborate with hardware teams to maximize utilization and performance across various computational environments.
Design and implement auto-scaling architectures for online (real-time) and batch inference pipelines, leveraging Kubernetes for inference routing and orchestration.
Ensure software solutions are optimized for peak performance during traffic spikes, maintaining reliability and scalability.
Establish robust observability frameworks to monitor Time to First Token (TTFT), tokens per second, and memory bandwidth utilization against service-level agreements (SLAs).
Build and execute performance and load testing suites to identify bottlenecks and ensure consistent reliability at scale.
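For context on the serving metrics named above, Time to First Token (TTFT) and decode throughput (tokens per second) can be measured from a streaming response along these lines. This is a minimal sketch against a simulated token stream; `measure_stream` and `fake_stream` are illustrative names, not part of vLLM, TGI, or Triton.

```python
import time

def measure_stream(token_iter):
    """Compute TTFT and decode tokens/sec from a streaming token iterator."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if first is None:
            first = now  # timestamp of the first generated token
        count += 1
    end = time.perf_counter()
    ttft = first - start if first is not None else float("nan")
    decode_time = end - first if first is not None else 0.0
    # Throughput over the decode phase (tokens after the first one).
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, tps

def fake_stream(n=5, delay=0.01):
    # Stand-in for a model's streaming output: one token per interval.
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.0f} tok/s")
```

In production these measurements would be taken at the serving layer (e.g., from server-sent events) and exported to the observability stack for comparison against SLAs.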
Requirements
Programming Languages: Proficiency in Python, C++, Rust, or Go for high-performance AI workflows.
Inference Tools: Proven hands-on experience with tools such as vLLM, TensorRT, llama.cpp, and Ollama for inference development and optimization.
Infrastructure Expertise: Strong familiarity with infrastructure technologies, including Docker, Kubernetes, and cloud platforms such as AWS, GCP, and Azure.
Hardware Optimization Expertise: Deep understanding of GPU and AI accelerator hardware, including profiling and performance-optimization techniques for NVIDIA GPUs and TPUs.
Preferred Experience: Prior experience deploying Large Language Models (LLMs) with advanced techniques like Speculative Decoding or PagedAttention.
Contributions to open-source inference libraries or hardware-level kernel development (e.g., CUDA, Triton kernels).
Background in MLOps or SRE roles focused on high-performance AI endpoints and reliability during demand surges.
Experience designing scalable, high-throughput inference systems that remain reliable under traffic bursts.