NVIDIA, a leader in AI technology, is seeking a Senior Data Center Connectivity Engineer to translate product reference architectures into physical builds for its AI Factory and research clusters. The role leads engineering support for cluster build-outs, optimizes layouts, and develops connectivity solutions for large-scale AI deployments.
Responsibilities:
- Own the development of connectivity reference designs based on requirements from cluster architecture, network engineering, infrastructure software and product hardware teams
- Create comprehensive documentation, including detailed rack elevations, network architecture diagrams, and cabling point-to-point lists; support projects throughout the design and deployment phases
- Serve as the primary engineering support, closely collaborating with deployment and field teams to ensure successful cluster build-out and operation
- Strategically co-design the cluster with power and cooling infrastructure teams, ensuring a thorough understanding of all facility requirements (architecture, power, cooling)
- Work with hardware, network, and security teams to translate software stack requirements into physical requirements: hardware selection, fault domains, and network architecture
- Develop new solutions and products in the connectivity space to accelerate the deployment of large-scale AI Factories
Requirements:
- 12+ years in a connectivity, network architecture, or engineering role within a hyperscale cloud provider, large-scale enterprise data center, or High-Performance Computing (HPC) environment
- BA or BS (or equivalent experience)
- Consistent record of designing, deploying, and operating network fabrics for thousands of GPU/CPU nodes
- Deep expertise in high-speed interconnect technologies, including InfiniBand, RoCE, and RDMA
- Proven experience designing connectivity solutions for high-density GPU clusters (100kW+ per rack) and understanding the unique front-end and back-end requirements for AI training vs. inference
- Deep understanding of data center infrastructure, including rack power/cooling, cable management, and physical density constraints
- Demonstrated ability to lead multidisciplinary teams and complete sophisticated technical initiatives
- Deep expertise with NVIDIA's compute and network product families and deployment standards
- Comfortable operating at the intersection of network engineering, MEP systems, and the Infrastructure-as-a-Service software layer
- Experienced with field deployments and/or global reference design documentation, ideally both