Job Description -

We are urgently looking to onboard a top-tier On-Premises LLM Inference & GPU Systems Engineer for a project with one of our premium clients. Candidates must have deep, hands-on experience in on-premises LLM inference and GPU systems engineering.

Key Requirements:

Experience: 10+ years of total experience is mandatory.
Location: Local to Charlotte, NC only. There are no relocation or remote options for this role.
Interview Process: Candidates must be available for a face-to-face interview at the client's office. Please only submit candidates who are 100% comfortable with an in-person interview.
Onsite Policy: Must be comfortable working on-site as per client requirements.

A LinkedIn profile with photo, created before 2020, is required.

A copy of the driver's license (DL) and visa is required.

The visa must be genuine.

Important Note: Please avoid submitting junior or unrelated profiles. We are looking for strong, hands-on professionals who can lead the technical direction of AI products.

Job Description:

We are seeking an AI Infrastructure Runtime Engineer to build and maintain large-scale on-prem LLM infrastructure. This is an enterprise private GenAI environment running on NVIDIA H200 GPU clusters and an OpenShift AI deployment ecosystem. You will manage production inference internally, including self-hosting open-source LLMs like Llama. We are focused exclusively on inferencing; this role involves no model training infrastructure or fine-tuning pipelines.

Key Responsibilities:

NVIDIA GPU Runtime Optimization: Drive runtime efficiency for the token generation pipeline, specifically prefill/decode optimization and KV cache management.
Inference Serving: Deploy and manage inference engines including vLLM and TensorRT-LLM.
Hardware Utilization: Tune GPU throughput, batching strategies, and latency. Orchestrate workloads using RunAI and Kubernetes GPU orchestration.
Model Lifecycle Management: Oversee the complete Hugging Face model lifecycle, including model onboarding, deployment, and retirement.
Platform Operations: Operate and maintain the OpenShift AI ecosystem as the primary container platform for GenAI workloads.

Required Qualifications:

5+ years of experience as an LLM Systems Engineer or AI Infrastructure Runtime Engineer.
5+ years of hands-on experience with NVIDIA H200 clusters and runtime optimization techniques (KV cache, prefill/decode).
3+ years of experience with OpenShift AI and GPU orchestration tools like RunAI.
Strong experience with modern inference frameworks, specifically vLLM and TensorRT-LLM.
Proven track record managing the Hugging Face deployment lifecycle.
