Assisting with the deployment, debugging, and efficiency optimization of AI workloads on large-scale NVIDIA platforms.
Identifying hardware issues, tracking them through the bug process, and keeping customers updated on progress.
Benchmarking new framework features, analyzing performance, and sharing actionable insights with both customers and internal teams.
Working directly with external customers/partners to solve cluster performance and stability issues, identify bottlenecks, and implement effective solutions.
Building expertise and guiding customers in scaling workloads efficiently and reliably on the latest generation of NVIDIA GPUs.
Collaborating with AI factory deployment teams to ensure reference architectures (RAs) and Blueprints are accurately followed and implemented.
Requirements
BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Physics, or other Engineering fields, or equivalent experience.
10+ years of experience in designing, managing, and supporting large-scale hybrid networks.
Experience with scripting is helpful.
Strong programming skills in at least one of the following languages: C, C++, or Python.
Practical experience identifying and resolving bottlenecks in large-scale training workloads or parallel applications.
Proven understanding of CPU and GPU architectures, CUDA, parallel filesystems, and high-speed interconnects.
Experience working with large compute clusters and an understanding of their internal scheduling and resource management mechanisms (e.g., SLURM or cloud-based clusters).
System-level understanding of server/rack-level architecture, BMCs, PCIe devices, network adapters, the Linux OS, and kernel drivers.
Excellent communication and liaison skills for working with customers, partners, and internal teams.