Nebius is leading a new era in cloud computing to serve the global AI economy, creating tools and resources that help customers transform industries. The company is seeking a Senior Hardware Support Engineer to ensure production hardware reliability in large-scale data center environments and to act as the senior escalation point for complex hardware issues.
Responsibilities:
- Leading root cause analysis for complex hardware and firmware failures across production fleets
- Aggregating recurring problems and error patterns to identify systemic reliability issues
- Acting as the senior escalation point for hardware-related incidents impacting availability or performance
- Coordinating with vendors to drive timely diagnostics, RMAs, firmware fixes, and corrective actions
- Partnering with internal engineering teams to validate fixes and prevent recurrence
- Performing hardware and firmware validation before fleet-wide rollout
- Driving structured incident investigations using established IT problem management methodologies
- Supporting on-site teams with technical coordination during critical hardware events
- Improving hardware observability, failure tracking, and reporting processes
- Contributing to long-term hardware reliability strategy and fleet-wide stability improvements
Requirements:
- Strong hands-on expertise with server hardware in data center or large-scale production environments
- Proven experience performing root cause analysis of hardware and firmware failures
- Deep understanding of server components (CPU, memory, storage, networking, power, BMC) and failure modes
- Experience working directly with hardware vendors and engineering teams to resolve production issues
- Structured problem-solving skills using formal IT or incident management methodologies
- Strong analytical capabilities and ability to interpret logs, telemetry, and error patterns
- Experience coordinating technical activities with on-site operations teams
- Ability to manage multiple concurrent investigations with production impact
- Clear written and verbal communication skills in cross-functional environments
- Experience in GPU-dense, AI, or high-performance computing environments
- Exposure to firmware lifecycle management and large-scale rollout validation
- Familiarity with Linux-based production systems and infrastructure tooling
- Experience improving fleet-wide hardware reliability metrics at scale