Nebius is leading a new era in cloud computing to serve the global AI economy. They are seeking a Senior HPC Cluster Engineer to develop their cutting-edge hyperscaler platform, focusing on GPU computing and InfiniBand networks.

Responsibilities:

Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments
Troubleshooting issues related to GPUs and InfiniBand networks, identifying their root causes, and proposing corrective actions
Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM
Enhancing automation systems that proactively monitor, detect, and resolve issues in GPU and InfiniBand environments
Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation

Requirements:

5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming
3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning)
In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems
Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python)
Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking
Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads)
Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication
Background in Software-Defined Networking (SDN) and experience with HPC cluster networking
Understanding of QEMU/KVM virtualization and experience managing virtualized environments
Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems
Familiarity with collective communication libraries like MPI and NCCL for distributed computing
