GMI Cloud is a fast-growing AI infrastructure company specializing in advanced GPU compute services and AI model inference solutions. The company is seeking a Machine Learning Engineer to build its inference optimization team and to drive the research and productionization of advanced techniques for LLM serving performance.
Responsibilities:
- Drive frontier research and engineering in LLM inference optimization, building GMI’s industry-leading capabilities in performance, efficiency, and scalability
- Develop next-generation optimization strategies for large-scale LLM serving across model execution, runtime systems, and production inference platforms
- Advance state-of-the-art techniques in quantization and precision optimization to improve throughput, latency, memory efficiency, and cost-performance across modern GPU systems
- Push the frontier of speculative decoding and related acceleration methods, including both systems and model-level approaches for faster generation
- Lead innovation in KV cache and memory optimization, improving long-context serving efficiency, memory utilization, and multi-tenant performance
- Develop advanced architectures for prefill/decode disaggregation and other distributed inference optimization strategies for large-scale production environments
- Drive system-level optimization across scheduling, batching, routing, gateway orchestration, adapter serving, and end-to-end inference efficiency
- Build scalable optimization frameworks, performance methodologies, and engineering practices that allow GMI to stay ahead of the industry as models, hardware, and serving patterns evolve
- Turn cutting-edge optimization ideas into production-ready capabilities that improve real-world customer workloads across latency, throughput, quality, and cost
- Collaborate closely with platform, infrastructure, and product teams to make inference optimization a core technical advantage of GMI Cloud
Requirements:
- Strong hands-on experience with LLM inference systems and performance optimization
- Solid understanding of inference metrics and tradeoffs, including TTFT (time to first token), ITL (inter-token latency), throughput, goodput, tail latency, GPU utilization, memory efficiency, and quality/cost
- Experience with one or more modern serving stacks such as SGLang, vLLM, TensorRT-LLM, Triton Inference Server, or similar systems
- Deep familiarity with GPU-based inference, model serving architecture, and production bottlenecks around compute, memory bandwidth, KV-cache behavior, and scheduling
- Strong experimentation skills: able to design benchmarks, interpret results, debug regressions, and produce actionable conclusions rather than isolated microbenchmark wins
- Comfortable working across research-style validation and production engineering, with a bias toward measurable impact in real customer scenarios
- Strong coding and systems skills in Python, with practical experience in profiling, observability, and performance debugging
- Clear communication skills and the ability to explain technical tradeoffs to both engineers and cross-functional stakeholders
- 1+ years of hands-on experience in LLM inference optimization, ML systems optimization, or closely related areas
- Experience working on optimization for large-scale model serving, such as latency reduction, throughput improvement, memory efficiency, or cost-performance tuning
- Familiarity with one or more major areas of inference optimization, including quantization, speculative decoding, KV cache optimization, prefill/decode disaggregation, or system-level serving optimization
- Experience with modern LLM serving stacks, GPU inference systems, or production ML infrastructure is a strong plus