Nebius is leading a new era in cloud infrastructure for the global AI economy. They are seeking a Senior Software Systems Engineer to enhance and optimize the core components of their Cloud platform, focusing on GPU computing and InfiniBand networks. The role involves tuning performance, analyzing issues, integrating new hardware, and enhancing the automation systems that monitor and resolve issues in complex environments.
Responsibilities:
- Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments
- Troubleshooting issues related to GPUs and InfiniBand networks, identifying their root causes, and proposing corrective actions
- Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM
- Enhancing automation systems that proactively monitor, detect, and resolve issues in GPU and InfiniBand environments
- Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation
Requirements:
- 5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning)
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python)
- Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking
- Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads)
- Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication
- Background in Software-Defined Networking (SDN) and experience with HPC cluster networking
- Understanding of QEMU/KVM virtualization and experience managing virtualized environments
- Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems
- Familiarity with collective communication libraries like MPI and NCCL for distributed computing