Nebius is leading a new era in cloud computing to serve the global AI economy. They are seeking a Lead Systems HPC Engineer to play a key role in building their hyperscaler platform, focusing on performance optimization of large-scale GPU clusters across hardware and software.
Responsibilities:
- Focus on understanding system behavior across multiple layers, identifying performance bottlenecks, and driving improvements that shape how our clusters are built, operated, tuned, and validated
- Investigate and troubleshoot performance issues in GPU clusters under real workloads (training and inference)
- Evaluate and integrate new hardware, system configurations, and tuning approaches across the software stack
- Support complex performance-related escalations from internal teams and customers
- Work closely with infrastructure, software engineering and hardware vendor teams (e.g. NVIDIA, Mellanox, Intel)
- Contribute to hardware and cluster qualification (acceptance), ensuring systems meet performance expectations
Requirements:
- 5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning)
- In-depth understanding of server architecture, including PCIe devices, NICs, the Linux OS and kernel, and high-performance computing (HPC) systems
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python)