Nebius is leading a new era in cloud computing to serve the global AI economy. They are seeking a Lead Systems HPC Engineer to play a key role in building their hyperscaler platform, focusing on performance optimization of large-scale GPU clusters across hardware and software.

Responsibilities:

Focus on understanding system behavior across multiple layers, identifying performance bottlenecks, and driving improvements that shape how our clusters are built, operated, tuned, and validated
Investigate and troubleshoot performance issues in GPU clusters under real workloads (training and inference)
Evaluate and integrate new hardware, system configurations, and tuning approaches across the software stack
Support complex performance-related escalations from internal teams and customers
Work closely with infrastructure, software engineering, and hardware vendor teams (e.g. NVIDIA, Mellanox, Intel)
Contribute to hardware and cluster qualification (acceptance), ensuring systems meet performance expectations

Requirements:

5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming
3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning)
In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems
Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python)
