Principal Systems Engineer – At-Scale at NVIDIA | JobVerse
Principal Systems Engineer – At-Scale
NVIDIA
Santa Clara, California, United States of America
Full Time
2 days ago
$272,000 - $431,250 USD
Visa Sponsor
Key skills
Linux
Lua
Python
C
Bash
AI
Deep Learning
Analytics
Communication
About this role
Role Overview
Deploy strategies to collect and analyze debugging and anomaly signals from large fleets of clusters to improve quality and customer experience.
Build and expand debugging tools to identify, diagnose, and recover out-of-service systems, growing customer-available capacity.
Author and deploy "fault signatures" and automated recovery rules.
Lead cross-team task forces to address undefined failure modes in high-value AI/GPU systems, cutting backlogs through data-driven isolation.
Leverage AI, analytics, and efficiency tools to scale debug efforts, turning manual triage into productized, automated code.
Act as a technical leader and cultural anchor.
Mentor junior and senior engineers.
Encourage organizational health initiatives.
Promote innovation through hackathons and sharing sessions.
Requirements
15+ years of experience debugging systems at scale, including components of large fleets.
BS/MS in Computer Science or a related field (or equivalent experience).
Proven understanding of performance clusters, infrastructure, and workload patterns.
Knowledge and experience with telemetry and at-scale analytics for large platforms.
Experience installing and using fleets of Linux-based server platforms.
C/Python/Bash/Lua programming/scripting experience.
Experience working with the engineering or academic research community in support of performance engineering or deep learning.
Strong teamwork skills and strong verbal and written communication skills.
Tech Stack
Linux
Lua
Python
Benefits
equity
benefits