Job Description -
We are urgently looking to onboard an On-Premises LLM Inference & GPU Systems Engineer for a project with one of our premium clients. We are specifically seeking high-caliber professionals with deep, hands-on experience running LLM inference on on-premises GPU systems.
Key Requirements:
- Experience: 10+ years of total experience is mandatory.
- Location: Local to Charlotte, NC only. There are no relocation or remote options for this role.
- Interview Process: Candidates must be available for a face-to-face interview at the client's office. Please only submit candidates who are 100% comfortable with an in-person interview.
- Onsite Policy: Must be comfortable working on-site as per client requirements.
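- LinkedIn: Profile must include a photo and must have been created before 2020.
- Documents: Copies of the driver's license (DL) and visa are required.
- Visa: Must be genuine.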
Important Note: Please avoid submitting junior or unrelated profiles. We are looking for strong, hands-on professionals who can lead the technical direction of AI products.
Job Description:
We are seeking an AI Infrastructure Runtime Engineer to build and maintain large-scale on-prem LLM infrastructure. The environment is a private enterprise GenAI platform running on NVIDIA H200 GPU clusters with an OpenShift AI deployment ecosystem. You will manage production inference in-house, including self-hosting open-source LLMs such as Llama. We are focused exclusively on inference; this role involves no model training infrastructure or fine-tuning pipelines.
Key Responsibilities:
- NVIDIA GPU Runtime Optimization: Drive runtime efficiency across the token generation pipeline, specifically prefill/decode optimization and KV cache management.
- Inference Serving: Deploy and manage inference engines including vLLM and TensorRT-LLM.
- Hardware Utilization: Tune GPU throughput, batching strategies, and latency. Orchestrate GPU workloads using RunAI and Kubernetes.
- Model Lifecycle Management: Oversee the complete Hugging Face model lifecycle, including model onboarding, deployment, and retirement.
- Platform Operations: Operate and maintain the OpenShift AI ecosystem as the primary container platform for GenAI workloads.
Required Qualifications:
- 5+ years of experience as an LLM Systems Engineer or AI Infrastructure Runtime Engineer.
- 5+ years of hands-on experience with NVIDIA H200 clusters and runtime optimization techniques (KV cache, prefill/decode).
- 3+ years of experience with OpenShift AI and GPU orchestration tools such as RunAI.
- Strong experience with modern inference frameworks, specifically vLLM and TensorRT-LLM.
- Proven track record of managing the Hugging Face model deployment lifecycle.