GMI Cloud is a fast-growing AI infrastructure company specializing in advanced GPU compute services and AI model inference solutions. The company is seeking a Machine Learning Engineer to build its inference optimization team and to drive the research and productionization of advanced techniques for LLM serving performance.
Responsibilities:
- Drive frontier research and engineering in LLM inference optimization, building GMI’s industry-leading capabilities in performance, efficiency, and scalability
- Develop next-generation optimization strategies for large-scale LLM serving across model execution, runtime systems, and production inference platforms
- Advance state-of-the-art techniques in quantization and precision optimization to improve throughput, latency, memory efficiency, and cost-performance across modern GPU systems
- Push the frontier of speculative decoding and related acceleration methods, including both systems and model-level approaches for faster generation
- Lead innovation in KV cache and memory optimization, improving long-context serving efficiency, memory utilization, and multi-tenant performance
- Develop advanced architectures for prefill/decode disaggregation and other distributed inference optimization strategies for large-scale production environments
- Drive system-level optimization across scheduling, batching, routing, gateway orchestration, adapter serving, and end-to-end inference efficiency
- Build scalable optimization frameworks, performance methodologies, and engineering practices that allow GMI to stay ahead of the industry as models, hardware, and serving patterns evolve
- Turn cutting-edge optimization ideas into production-ready capabilities that improve real-world customer workloads across latency, throughput, quality, and cost
- Collaborate closely with platform, infrastructure, and product teams to make inference optimization a core technical advantage of GMI Cloud
Requirements:
- Strong hands-on experience with LLM inference systems and performance optimization
- Solid understanding of inference metrics and tradeoffs, including TTFT (time to first token), ITL (inter-token latency), throughput, goodput, tail latency, GPU utilization, memory efficiency, and quality/cost
- Experience with one or more modern serving stacks such as SGLang, vLLM, TensorRT-LLM, Triton Inference Server, or similar systems
- Deep familiarity with GPU-based inference, model serving architecture, and production bottlenecks around compute, memory bandwidth, KV-cache behavior, and scheduling
- Strong experimentation skills: able to design benchmarks, interpret results, debug regressions, and produce actionable conclusions rather than isolated microbenchmark wins
- Comfortable working across research-style validation and production engineering, with a bias toward measurable impact in real customer scenarios
- Strong coding and systems skills in Python, with practical experience in profiling, observability, and performance debugging
- Clear communication skills and the ability to explain technical tradeoffs to both engineers and cross-functional stakeholders
- 1+ years of hands-on experience in LLM inference optimization, ML systems optimization, or closely related areas
- Experience working on optimization for large-scale model serving, such as latency reduction, throughput improvement, memory efficiency, or cost-performance tuning
- Familiarity with one or more major areas of inference optimization, including quantization, speculative decoding, KV cache optimization, prefill/decode disaggregation, or system-level serving optimization
- Experience with modern LLM serving stacks, GPU inference systems, or production ML infrastructure is a strong plus