Nebius is leading a new era in cloud computing to serve the global AI economy. The Senior HPC Cluster Engineer will be responsible for enhancing and optimizing the core components of the cloud platform, focusing on GPU computing and InfiniBand networks.
Responsibilities:
- Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments
- Troubleshooting issues related to GPUs and InfiniBand networks, identifying their root causes, and proposing corrective actions
- Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM
- Enhancing automation systems for proactive monitoring, detection, and resolution of issues in GPU and InfiniBand environments
- Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation
Requirements:
- 5+ years of professional experience in system-level software development (focused on performance optimization and low-level programming)
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning)
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python)
- Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking
- Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads)
- Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication
- Background in Software-Defined Networking (SDN) and experience with HPC cluster networking
- Understanding of QEMU/KVM virtualization and experience managing virtualized environments
- Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems
- Familiarity with collective communication libraries like MPI and NCCL for distributed computing