QumulusAI is building the next generation of AI infrastructure and is seeking an HPC-focused Site Reliability Engineer. The role involves managing high-performance computing workloads, optimizing GPU clusters, and ensuring maximum performance for distributed training and inference.
Responsibilities:
- Architect, deploy, and manage HPC job scheduling and resource management systems, primarily Slurm, across large-scale GPU clusters (H100, B200)
- Optimize NCCL configuration, collective communication patterns, and multi-node GPU interconnect performance across high-speed Ethernet and InfiniBand fabrics
- Design and implement cluster provisioning and management workflows for GPU nodes including driver stacks, CUDA toolkits, and container runtimes (Enroot, Pyxis, Singularity)
- Develop monitoring and profiling frameworks for HPC workloads including GPU utilization, memory bandwidth, interconnect throughput, and job performance analytics
- Collaborate with customers and internal teams to troubleshoot distributed training issues, performance bottlenecks, and multi-node communication failures
- Evaluate, integrate, and support HPC ecosystem tools including MPI implementations, NVIDIA DCGM, nvidia-smi, and container-native HPC platforms
- Build automation for cluster health checks, GPU validation testing, and large-scale burn-in processes
- Participate in on-call rotations, incident response, and broader engineering work needed to maintain and advance the platform
- Deliver outstanding service in direct customer engagements, including technical consultation, troubleshooting, and onboarding support
- Develop documentation, runbooks, and best practices for HPC operations and customer onboarding
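To give a flavor of the health-check automation described above, here is a minimal sketch of parsing `nvidia-smi` CSV query output to flag unhealthy GPUs. The query fields are real `nvidia-smi --query-gpu` fields, but the thresholds and the sample output are illustrative placeholders, not a fleet validation policy.

```python
import csv
import io

# Illustrative thresholds only; real burn-in criteria come from fleet policy.
MAX_TEMP_C = 85
MAX_ECC_ERRORS = 0

def parse_gpu_health(nvsmi_csv: str) -> list[int]:
    """Flag unhealthy GPU indices from the output of:
    nvidia-smi --query-gpu=index,temperature.gpu,ecc.errors.corrected.volatile.total \
               --format=csv,noheader,nounits
    """
    unhealthy = []
    for row in csv.reader(io.StringIO(nvsmi_csv)):
        index, temp, ecc = (field.strip() for field in row)
        if int(temp) > MAX_TEMP_C or int(ecc) > MAX_ECC_ERRORS:
            unhealthy.append(int(index))
    return unhealthy

# Canned 8-GPU sample: GPU 2 is overheating, GPU 5 shows corrected ECC errors.
sample = "0, 64, 0\n1, 61, 0\n2, 91, 0\n3, 66, 0\n4, 63, 0\n5, 65, 3\n6, 62, 0\n7, 60, 0\n"
print(parse_gpu_health(sample))  # → [2, 5]
```

In practice a check like this would run as a Slurm prolog or node health-check script, draining nodes whose GPUs fail validation before jobs land on them.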
Requirements:
- 5+ years in HPC operations, GPU cluster management, or high-performance infrastructure engineering
- Deep hands-on experience with Slurm (installation, configuration, scheduling policies, accounting, and multi-cluster federation)
- Strong understanding of NCCL, collective communication operations (all-reduce, all-gather), and GPU-to-GPU communication topologies
- Experience managing NVIDIA GPU infrastructure at scale: driver lifecycle, CUDA, cuDNN, TensorRT, and DCGM
- Proficiency with HPC container runtimes: Enroot, Pyxis, Singularity/Apptainer, or equivalent
- Working knowledge of MPI (OpenMPI, MVAPICH2) and distributed computing frameworks
- Familiarity with high-performance networking for HPC: lossless Ethernet (RoCE), InfiniBand, GPUDirect RDMA, and adaptive routing
- Linux systems expertise including kernel tuning for HPC (hugepages, NUMA, CPU pinning, IRQ affinity)
- Experience with AI/ML training frameworks at scale (PyTorch DDP, DeepSpeed, Megatron-LM, JAX)
- Knowledge of parallel filesystem technologies (Lustre, GPFS/Spectrum Scale, BeeGFS)
- Familiarity with Kubernetes-based HPC platforms (Volcano, Run:ai, or Kubeflow)
- Experience with NVIDIA Base Command Manager, Bright Cluster Manager, or similar HPC management suites
- Background in benchmarking HPC systems (HPL, NCCL-tests, MLPerf)
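As context for the NCCL and fabric requirements above, a minimal sketch of setting NCCL tuning knobs before launching a distributed job. The environment variable names are real NCCL settings; the values (HCA name, GID index, interface name) are placeholders that depend entirely on the cluster's fabric, not recommendations.

```python
import os

def apply_nccl_env(ib_hca: str = "mlx5_0", debug: bool = False) -> dict:
    """Set illustrative NCCL environment variables for a multi-node job.

    Values here are fabric-dependent placeholders, not tuned defaults.
    """
    settings = {
        "NCCL_IB_HCA": ib_hca,         # restrict NCCL to a specific IB/RoCE adapter
        "NCCL_IB_GID_INDEX": "3",      # GID index for RoCEv2 (varies per fabric)
        "NCCL_SOCKET_IFNAME": "eth0",  # interface for bootstrap / out-of-band traffic
        "NCCL_DEBUG": "INFO" if debug else "WARN",
    }
    os.environ.update(settings)
    return settings
```

A launcher (e.g. a Slurm batch script or `torchrun` wrapper) would typically export these before initializing the process group, then verify the chosen transport in the `NCCL_DEBUG=INFO` startup log.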