QumulusAI is building the next generation of AI infrastructure and is seeking an HPC-focused Site Reliability Engineer. The role involves managing high-performance computing workloads, optimizing GPU clusters, and ensuring maximum performance for distributed training and inference.
Responsibilities:
- Architect, deploy, and manage HPC job scheduling and resource management systems, primarily Slurm, across large-scale GPU clusters (H100, B200)
- Optimize NCCL configuration, collective communication patterns, and multi-node GPU interconnect performance across high-speed Ethernet and InfiniBand fabrics
- Design and implement cluster provisioning and management workflows for GPU nodes including driver stacks, CUDA toolkits, and container runtimes (Enroot, Pyxis, Singularity)
- Develop monitoring and profiling frameworks for HPC workloads including GPU utilization, memory bandwidth, interconnect throughput, and job performance analytics
- Collaborate with customers and internal teams to troubleshoot distributed training issues, performance bottlenecks, and multi-node communication failures
- Evaluate, integrate, and support HPC ecosystem tools including MPI implementations, NVIDIA DCGM, nvidia-smi, and container-native HPC platforms
- Build automation for cluster health checks, GPU validation testing, and large-scale burn-in processes
- Participate in on-call rotations, incident response, and broader engineering work needed to maintain and advance the platform
- Deliver outstanding service in direct customer engagements, including technical consultation, troubleshooting, and onboarding support
- Develop documentation, runbooks, and best practices for HPC operations and customer onboarding
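To give a flavor of the health-check automation described above, here is a minimal sketch of parsing `nvidia-smi` CSV query output to flag unhealthy GPUs. The query fields are real `nvidia-smi --query-gpu` fields, but the thresholds and the sample output are illustrative placeholders, not a fleet validation policy.

```python
import csv
import io

# Illustrative thresholds only; real burn-in criteria come from fleet policy.
MAX_TEMP_C = 85
MAX_ECC_ERRORS = 0

def parse_gpu_health(nvsmi_csv: str) -> list[int]:
    """Flag unhealthy GPU indices from the output of:
    nvidia-smi --query-gpu=index,temperature.gpu,ecc.errors.corrected.volatile.total \
               --format=csv,noheader,nounits
    """
    unhealthy = []
    for row in csv.reader(io.StringIO(nvsmi_csv)):
        index, temp, ecc = (field.strip() for field in row)
        if int(temp) > MAX_TEMP_C or int(ecc) > MAX_ECC_ERRORS:
            unhealthy.append(int(index))
    return unhealthy

# Canned 8-GPU sample: GPU 2 is overheating, GPU 5 shows corrected ECC errors.
sample = "0, 64, 0\n1, 61, 0\n2, 91, 0\n3, 66, 0\n4, 63, 0\n5, 65, 3\n6, 62, 0\n7, 60, 0\n"
print(parse_gpu_health(sample))  # → [2, 5]
```

In practice a check like this would run as a Slurm prolog or node health-check script, draining nodes whose GPUs fail validation before jobs land on them.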
Requirements:
- 5+ years in HPC operations, GPU cluster management, or high-performance infrastructure engineering
- Deep hands-on experience with Slurm (installation, configuration, scheduling policies, accounting, and multi-cluster federation)
- Strong understanding of NCCL, collective communication operations (all-reduce, all-gather), and GPU-to-GPU communication topologies
- Experience managing NVIDIA GPU infrastructure at scale: driver lifecycle, CUDA, cuDNN, TensorRT, and DCGM
- Proficiency with HPC container runtimes: Enroot, Pyxis, Singularity/Apptainer, or equivalent
- Working knowledge of MPI (OpenMPI, MVAPICH2) and distributed computing frameworks
- Familiarity with high-performance networking for HPC: lossless Ethernet (RoCE), InfiniBand, GPUDirect RDMA, and adaptive routing
- Linux systems expertise including kernel tuning for HPC (hugepages, NUMA, CPU pinning, IRQ affinity)
- Experience with AI/ML training frameworks at scale (PyTorch DDP, DeepSpeed, Megatron-LM, JAX)
- Knowledge of parallel filesystem technologies (Lustre, GPFS/Spectrum Scale, BeeGFS)
- Familiarity with Kubernetes-based HPC platforms (Volcano, Run:ai, or Kubeflow)
- Experience with NVIDIA Base Command Manager, Bright Cluster Manager, or similar HPC management suites
- Background in benchmarking HPC systems (HPL, NCCL-tests, MLPerf)
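As context for the NCCL and fabric requirements above, a minimal sketch of setting NCCL tuning knobs before launching a distributed job. The environment variable names are real NCCL settings; the values (HCA name, GID index, interface name) are placeholders that depend entirely on the cluster's fabric, not recommendations.

```python
import os

def apply_nccl_env(ib_hca: str = "mlx5_0", debug: bool = False) -> dict:
    """Set illustrative NCCL environment variables for a multi-node job.

    Values here are fabric-dependent placeholders, not tuned defaults.
    """
    settings = {
        "NCCL_IB_HCA": ib_hca,         # restrict NCCL to a specific IB/RoCE adapter
        "NCCL_IB_GID_INDEX": "3",      # GID index for RoCEv2 (varies per fabric)
        "NCCL_SOCKET_IFNAME": "eth0",  # interface for bootstrap / out-of-band traffic
        "NCCL_DEBUG": "INFO" if debug else "WARN",
    }
    os.environ.update(settings)
    return settings
```

A launcher (e.g. a Slurm batch script or `torchrun` wrapper) would typically export these before initializing the process group, then verify the chosen transport in the `NCCL_DEBUG=INFO` startup log.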