Nebius is leading a new era in cloud infrastructure for the global AI economy, building a full-stack AI cloud platform. The Lead Software Systems Engineer - GPU Performance will analyze and optimize the performance of large-scale GPU clusters, working across core components and collaborating with various engineering teams.
Responsibilities:
- Focus on understanding system behavior across multiple layers, identifying performance bottlenecks, and driving improvements that shape how our clusters are built, operated, tuned, and validated
- Investigate and troubleshoot performance issues in GPU clusters under real workloads (training and inference)
- Evaluate and integrate new hardware, system configurations, and tuning approaches across the software stack
- Support complex performance-related escalations from internal teams and customers
- Work closely with infrastructure, software engineering and hardware vendor teams (e.g. NVIDIA, Mellanox, Intel)
- Contribute to hardware and cluster qualification (acceptance), ensuring systems meet performance expectations
Requirements:
- 5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning)
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python)