NVIDIA has been transforming computer graphics and accelerated computing for over 25 years and is seeking a Senior Systems Engineer in Artificial Intelligence Operations. The role involves improving AI cluster resiliency, developing automated workflows, and delivering technical presentations while collaborating with various teams and customers.
Responsibilities:
- Gather and understand internal and external customer requirements to improve AI cluster resiliency, and design AIOps-based solutions that address these needs
- Develop automated workflows for issue detection and root cause analysis, and collaborate closely with operators to debug sophisticated, full-stack AI cluster problems. Apply the findings to drive product improvements
- Deliver compelling technical presentations and lead hands-on demos or training. Handle evaluation deployments (POC/POV) and ensure smooth, reliable installations by staying engaged and encouraging throughout the customer journey
Requirements:
- Bachelor of Science or equivalent experience
- 8+ years of networking experience in enterprise or service provider environments, with strong hands-on expertise in routing and switching
- Proficient in scripting and automation using Python or similar languages, with strong Linux expertise
- Proven experience in Systems Engineer or SRE roles, working directly with customers to resolve issues and ensure their success
- Exceptional oral, written, and presentation skills for clearly communicating complex technical topics
- Demonstrated ability to collaborate effectively across teams, partnering with operations, engineering, and product development
- Experience with data center infrastructure and cloud architectures
- Background in network performance monitoring or observability
- Previous experience working at a technology start-up