NVIDIA has been transforming computer graphics and computing for over 25 years and is seeking a Principal Engineer to architect and scale next-generation diagnostic systems for Cloud Service Providers (CSPs). The role involves defining the technical roadmap, leading cross-functional development, and mentoring engineering teams to deploy robust diagnostic frameworks for AI accelerator products.

Responsibilities:

Define the technical strategy for and drive the development of NVIDIA’s Data Center diagnostic systems, orchestrating large-scale stress testing for CPUs, GPUs, networking, memory, and high-speed interconnects
Mentor and grow engineering teams, providing technical leadership and encouraging a culture of innovation and excellence
Drive root-cause analysis of systemic failures that span multiple hardware and software domains
Partner with CSPs to diagnose and address scalability challenges within their unique data center infrastructures

Requirements:

Bachelor's degree in Computer Science/Engineering, Electrical Engineering, or a related field (or equivalent experience)
15+ years of system software experience on highly resilient distributed systems, with programming experience in C++ or Python
Deep systems knowledge of x86/ARM architectures, Linux OS internals, firmware (UEFI/BIOS), Redfish, HMC and BMC protocols, and platform security
Consistent track record of technical leadership, including leading project teams and setting technical direction
Expertise in software testing methodologies with an automation-led, AI-first approach to ensuring software quality

Principal System Software Engineer - Data Center MODS
