NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. In this role, you will improve AI cluster resiliency, design AIOps-based solutions, develop automated workflows, and deliver technical presentations, collaborating closely with customers and internal teams.
Responsibilities:
- Gather and understand internal and external customer requirements to improve AI cluster resiliency, and design AIOps-based solutions that address those needs
- Develop automated workflows for issue detection and root cause analysis, and collaborate closely with operators to debug complex, full-stack AI cluster problems
- Deliver compelling technical presentations and lead hands-on demos or training
- Handle evaluation deployments (POC/POV) and ensure smooth, reliable installations by staying engaged and supportive throughout the customer journey
Requirements:
- Bachelor of Science or equivalent experience
- 8+ years of networking experience in enterprise or service provider environments, with strong hands-on expertise in routing and switching
- Proficient in scripting and automation using Python or similar languages, with strong Linux expertise
- Proven experience in Systems Engineer or SRE roles, working directly with customers to resolve issues and ensure their success
- Exceptional oral, written, and presentation skills for clearly communicating complex technical topics
- Demonstrated ability to collaborate effectively across teams, partnering with operations, engineering, and product development
- Experience with data center infrastructure and cloud architectures
- Background in network performance monitoring or observability
- Previous experience working at a technology start-up