NVIDIA is hiring experienced software engineers to help scale up its AI Infrastructure. The role involves designing and developing a scalable platform for GPU clusters, ensuring production AI clusters run reliably, and collaborating with teams across the organization.

Responsibilities:

You will be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads
Designing and developing a massively distributed, scalable platform used to identify, diagnose, and remediate non-performant GPU assets
Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. Evaluating system failures and improving services based on a well-defined incident management process
Working across all of our product stack: React, Web Components, TypeScript, Golang, PostgreSQL, Temporal, Bazel, Kubernetes

Requirements:

Direct experience in a software engineering role within a highly technical organization with demonstrable impact from your work
Highly motivated with strong communication skills; able to work successfully with multi-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies
5+ years in a similar role, with experience on large-scale production systems
Experience with common software engineering principles, tools and techniques
You possess a BS in Computer Science or Engineering or equivalent experience
6+ years of experience doing full-stack engineering
3+ years building and shipping consumer-facing products
Proficiency in React, TypeScript/JavaScript, and Golang
Proficiency with a SQL database
Technical competency in managing and automating large-scale distributed systems independent of cloud providers
Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Base Command Manager)
Empathy for users, attention to detail, and a passion for creating world-class user experiences
Prior experience with asynchronous workflows and/or event-driven architecture
Proven operational excellence in maintaining reliable and performant infrastructure
A good understanding of how to use LLMs responsibly and the perils of blindly consuming their output

Senior Full Stack Software Engineer - DGX Cloud
