NVIDIA has been transforming computer graphics and computing for over 25 years and is seeking a Principal Engineer to architect and scale next-generation diagnostic systems for Cloud Service Providers (CSPs). This role involves defining the technical roadmap, leading multi-functional development, and mentoring engineering teams to deploy robust diagnostic frameworks for AI accelerator products.
Responsibilities:
- Define the technical strategy for, and drive the development of, NVIDIA’s Data Center diagnostic systems, orchestrating large-scale stress testing for CPUs, GPUs, networking, memory, and high-speed interconnects
- Mentor and grow engineering teams, providing technical leadership and encouraging a culture of innovation and excellence
- Drive root-cause analysis of systemic failures that span multiple hardware and software domains
- Partner with CSPs to diagnose and address scalability challenges within their unique data center infrastructures
Requirements:
- Bachelor's degree in Computer Science/Engineering, Electrical Engineering, or a related field (or equivalent experience)
- 15+ years of system software experience working on highly resilient distributed systems, with programming experience in C++ or Python
- Deep systems knowledge of x86/ARM architectures, Linux OS internals, firmware (UEFI/BIOS), Redfish, HMC, BMC protocols and platform security
- Consistent track record of technical leadership, including leading project teams and setting technical direction
- Expertise in software testing methodologies with an automation-led, AI-first approach to ensuring software quality