NVIDIA is a leader in groundbreaking developments in Artificial Intelligence and High-Performance Computing. They are seeking passionate software support engineers to partner closely with internal customers, providing support for cloud platforms and improving user experiences.
Responsibilities:
- Partner with multiple internal teams to provide Tier 1 support for complex cloud platforms
- Define and improve operational workflows (runbooks, escalation paths, support processes)
- Triage customer issues, investigate their root causes, and escalate as needed
- File bugs and report issues while working closely with the Site Reliability team
- Build tooling to improve the customer support process and visibility
- Deeply understand user workloads and use cases
- Relay feedback from internal teams to engineering and develop solutions that help those teams succeed
- Be part of an on-call rotation to support production systems
Requirements:
- BS/MS degree in Computer Science or a related field (or equivalent experience)
- 2+ years of experience supporting distributed software systems and end-user software platforms, plus experience with Linux
- Experience with Kubernetes, AWS, Azure, OCI, and GCP
- Background in infrastructure, networking, storage, and DevOps scripting/tooling
- Understanding of data storage technologies (databases, file, block, blob)
- Customer service/support experience
- Willingness to work up and down the stack as well as across multiple teams
- Strong troubleshooting and communication skills
- Experience with MLOps workflows or ML infrastructure
- Familiarity with GPU workloads or distributed training systems
- Previous experience with SLURM or HPC
- Strong drive to work with internal customers and make them successful
- Drive to improve processes, backed by strong organizational skills