NVIDIA is a leading technology company focused on building AI factories worldwide. They are seeking a Senior Site Reliability Engineer to architect and automate the lifecycle of next-generation, open-source-driven datacenters, ensuring resilience and scalability in GPU-accelerated computing.
Responsibilities:
- Running commissioning and provisioning for GPU systems
- Managing the firmware versions of equipment and components, and communicating the supported versions across the organization
- Maintaining tight SLOs for efficiency, performance, and availability throughout Day-2 operations
- Monitoring the hardware state of the cluster, finding bottlenecks and hot spots, and helping users consistently attain peak performance
- Triaging hardware break-fix issues and continuously improving the process using open-source break-fix solutions
- Collaborating with programming and technical divisions to define and implement repeatable procedures
- Developing and implementing operations strategy and processes, maintaining consistency with SLAs across critical infrastructure
- Developing and applying procedures and quality controls that minimize downtime, striving for continuous uptime
- Feeding requirements to software and hardware teams
- Creating documentation that the ecosystem can use to run its own AI data centers
Requirements:
- BS or MS degree in Computer Engineering, Computer Science, or a related field (or equivalent experience), with 10+ years of relevant work experience
- Experience managing GPU fleets
- 10+ years of expertise in improving data center operations or critical infrastructure
- Expertise in BMS and power management
- Background in working with provisioning, commissioning, and configuration management solutions
- Experience working with Packer and developing QCOW2 images
- Background in coordinating with remote hands
- Experience working with datacenter inventory management systems like NetBox, Nautilus, or others
- Proven track record of working with multiple teams to achieve operational excellence for an organization
- Experience driving reliability with robust processes, rapid field response, and recovery
- Experience with automated break-fix solutions at scale
- Familiarity with operating a message bus and workflow engine
- Hands-on experience with zero-touch provisioning solutions for the network and host