NVIDIA is hiring experienced software engineers to help scale up its AI Infrastructure. The role involves designing and developing a scalable platform for GPU clusters, ensuring production AI clusters run reliably, and collaborating with teams across the organization.
Responsibilities:
- You will be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads
- Designing and developing a massively distributed, scalable platform that will be used to identify, diagnose, and remediate non-performant GPU assets
- Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. Evaluating system failures and improving services based on a well-defined incident management process
- Working across our entire product stack: React, Web Components, TypeScript, Golang, PostgreSQL, Temporal, Bazel, Kubernetes
Requirements:
- Direct experience in a software engineering role within a highly technical organization with demonstrable impact from your work
- Highly motivated, with strong communication skills; you can work successfully with multi-functional teams, principals, and architects, and coordinate effectively across organizational boundaries and geographies
- 5+ years in a similar role, with experience on large-scale production systems
- Experience with common software engineering principles, tools, and techniques
- You possess a BS in Computer Science or Engineering or equivalent experience
- 6+ years of experience doing full-stack engineering
- 3+ years building and shipping consumer-facing products
- Proficiency in React, TypeScript/JavaScript, and Golang
- Proficiency with a SQL database
- Technical competency in managing and automating large-scale distributed systems independent of cloud providers
- Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Base Command Manager)
- Empathy for users, attention to detail, and a passion for creating world-class user experiences
- Prior experience with asynchronous workflows and/or event-driven architecture
- Proven operational excellence in maintaining reliable and performant infrastructure
- A good understanding of how to use LLMs responsibly and the perils of blindly consuming their output