Nebius is leading a new era in cloud infrastructure for the global AI economy, building a full-stack AI cloud platform. The Lead Software Systems Engineer - GPU Performance will analyze and optimize the performance of large-scale GPU clusters, working across core components and collaborating with various engineering teams.

Responsibilities:

Focus on understanding system behavior across multiple layers, identifying performance bottlenecks, and driving improvements that shape how our clusters are built, operated, tuned, and validated
Investigate and troubleshoot performance issues in GPU clusters under real workloads (training and inference)
Evaluate and integrate new hardware, system configurations, and tuning approaches across the software stack
Support complex performance-related escalations from internal teams and customers
Work closely with infrastructure, software engineering, and hardware vendor teams (e.g., NVIDIA, Mellanox, Intel)
Contribute to hardware and cluster qualification (acceptance), ensuring systems meet performance expectations

Requirements:

5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming
3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning)
In-depth understanding of server architecture, including PCIe devices, NICs, the Linux OS and kernel, and high-performance computing (HPC) systems
Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python)
