Nebius is leading a new era in cloud computing to serve the global AI economy. They are seeking a Senior HPC Cluster Engineer to develop their cutting-edge hyperscaler platform, focusing on GPU computing and InfiniBand networks.
Responsibilities:
- Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments
- Troubleshooting issues in GPUs and InfiniBand networks, identifying their root causes, and proposing corrective actions
- Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM
- Enhancing automation systems for proactive monitoring, detection, and resolution of issues in GPU and InfiniBand environments
- Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation
Requirements:
- 5+ years of professional experience in system-level software development (focused on performance optimization and low-level programming)
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning)
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python)
- Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking
- Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads)
- Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication
- Background in Software-Defined Networking (SDN) and experience with HPC cluster networking
- Understanding of QEMU/KVM virtualization and experience managing virtualized environments
- Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems
- Familiarity with collective communication libraries like MPI and NCCL for distributed computing