Red Hat is the world’s leading provider of enterprise open source software solutions and is seeking a Senior Machine Learning Engineer focused on distributed vLLM infrastructure. The role involves designing, developing, and maintaining distributed inference infrastructure while collaborating with cross-functional teams on the challenges of scalable inference systems and Kubernetes-native deployments.
Responsibilities:
- Contribute to the design, development, and testing of new features and solutions for Red Hat AI Inference
- Innovate in the inference domain by participating in upstream communities
- Design, develop, and maintain distributed inference infrastructure leveraging Kubernetes APIs, operators, and the Gateway API Inference Extension for scalable LLM deployments
- Develop and maintain system components in Go and/or Rust to integrate with the vLLM project and manage distributed inference workloads
- Develop and maintain KV cache-aware routing and scoring algorithms to optimize memory utilization and request distribution in large-scale inference deployments
- Enhance the resource utilization, fault tolerance, and stability of the inference stack
- Develop and test various inference optimization algorithms
- Actively participate in technical design discussions and propose innovative solutions to complex challenges for high impact projects
- Contribute to a culture of continuous improvement by sharing recommendations and technical knowledge with team members
- Collaborate with product management, other engineering teams, and cross-functional partners to analyze and clarify business requirements
- Communicate effectively to stakeholders and team members to ensure proper visibility of development efforts
- Mentor and coach a distributed team of engineers
- Provide timely and constructive code reviews
- Represent RHAI in external engagements including industry events, customer meetings, and open source communities
Requirements:
- Strong proficiency in Python and Go, or similar languages
- Experience with cloud-native Kubernetes networking and service mesh technologies such as Istio, Cilium, Envoy (including WASM filters), and CNI plugins
- A solid understanding of Layer 7 networking, HTTP/2, gRPC, and the fundamentals of API gateways and reverse proxies
- Knowledge of serving runtime technologies for hosting LLMs, such as vLLM, SGLang, and TensorRT-LLM
- Excellent written and verbal communication skills, capable of interacting effectively with both technical and non-technical team members
- Experience providing technical leadership in a global team
- Autonomous work ethic and the ability to thrive in a dynamic, fast-paced environment
- Strong proficiency in Rust, C, or C++
- Working knowledge of high-performance networking protocols and technologies including UCX, RoCE, InfiniBand, and RDMA is a plus
- Deep experience with the Kubernetes ecosystem, including core concepts, custom APIs, operators, and the Gateway API Inference Extension for GenAI workloads
- Experience with GPU performance benchmarking and profiling tools such as NVIDIA Nsight, or with distributed tracing libraries and techniques such as OpenTelemetry
- Experience in writing high performance code for GPUs and deep knowledge of GPU hardware
- Strong understanding of computer architecture, parallel processing, and distributed computing concepts
- Bachelor's degree in computer science or a related field is an advantage, though we prioritize hands-on experience
- Active engagement in the ML research community (publications, conference participation, or open source contributions) is a significant advantage