HCLTech is a leading company in Digital, Engineering, and Cloud capabilities, and they are seeking an experienced AI Infrastructure Engineer (L3) to enhance their AI Engineering capability. The role involves designing and operating high-performance GPU environments for AI workloads, requiring deep expertise in various technologies and systems.
Responsibilities:
- Deploy, configure & manage GPU platforms (NVIDIA A100/H100/L40, AMD Instinct, TPU)
- Troubleshoot GPU issues — hardware failures, thermal throttling, NVLink/PCIe topology, driver conflicts
- Maintain GPU software stacks: CUDA, cuDNN, TensorRT, Drivers & Firmware
- Performance optimization for AI workloads & Linux systems (Ubuntu, RHEL, Rocky)
- Manage HPC storage systems: BeeGFS, Ceph, Lustre, high‑throughput NFS
- Operate high‑speed networks: InfiniBand, RoCE, RDMA, NVLink
- Administer Kubernetes GPU clusters (GPU Operator, device plugins, MIG)
- Support MLOps and orchestration tools: Kubeflow, Ray, MLflow, Slurm/PBS
- Optimize distributed training (NCCL, PyTorch Distributed, Horovod, DeepSpeed)
- Work with inference runtimes: vLLM, Triton, TensorRT‑LLM
- Build automation with Terraform, Helm, Kustomize & GitOps (ArgoCD/Flux)
- Implement observability: Prometheus, Grafana, DCGM, OpenTelemetry
- Lead deep‑dive RCA across GPU, network, storage, and orchestration layers
- Provide L3 expertise and collaborate with L1/L2, OEMs & tech partners
- Drive reliability, patching, hotfixes & production readiness
Requirements:
- 10–12 Years Minimum experience
- Bachelor's in CS, Engineering, IT or related field
- 8–12 years overall infra/platform engineering experience
- 4–6 years specialized experience in AI/ML infrastructure
- Strong background in large‑scale GPU systems & distributed environments
- Advanced Kubernetes & containerization expertise
- Understanding of AI workloads & MLOps practices
- NVIDIA Certified Associate – AI Infrastructure
- NVIDIA NPN Certification
- NVIDIA Base Command Manager
- AWS Solutions Architect Associate
- CKA – Certified Kubernetes Administrator
- CKAD – Kubernetes Application Developer