Runpod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full‑stack AI applications. The Engineering Manager, Datacenter Network Engineering, will lead a team responsible for designing, deploying, and operating Runpod's global datacenter and backbone network, with a focus on network architecture and team leadership.
Responsibilities:
- Lead the Datacenter Networking Team: Manage and grow a team of network engineers responsible for datacenter fabrics, interconnects, and global WAN connectivity. Provide mentorship, technical guidance, and clear ownership boundaries
- Own Datacenter Network Architecture: Define and evolve network designs for GPU-heavy clusters, including spine-leaf topologies, ECMP routing, and high-bandwidth east-west traffic patterns
- High-Performance GPU Networking: Oversee design and operation of InfiniBand and RoCE-based fabrics supporting distributed training and inference workloads. Ensure performance, loss characteristics, and congestion control meet AI workload requirements
- Encapsulation & Overlay Protocols: Guide implementation and operation of overlay and encapsulation technologies such as VXLAN, EVPN, Geneve, or similar, enabling scalable multi-tenant isolation and flexible network provisioning
- Global WAN & Backbone Connectivity: Lead strategy and execution for global WAN connectivity, including private backbone links, IX connectivity, and hybrid connectivity with cloud providers and partners
- Reliability & Operations: Establish operational best practices for monitoring, capacity planning, change management, incident response, and post-mortems across the network stack
- Cross-Functional Collaboration: Partner closely with Infrastructure, SRE, Hardware, and Product Engineering teams to ensure network capabilities align with platform and customer requirements
- Vendor & Partner Management: Work with hardware vendors, colocation providers, and transit partners on network design, procurement, deployment timelines, and escalations
- Security & Segmentation: Ensure network designs support secure isolation, DDoS resilience, and compliance requirements without compromising performance
Requirements:
- Engineering Leadership Experience: 3+ years managing network or infrastructure engineering teams, with experience scaling teams and systems in production environments
- Datacenter Networking Expertise: 8+ years designing and operating large-scale datacenter networks, including spine-leaf architectures, BGP-based routing, and high-throughput fabrics
- Encapsulation & Overlays: Strong hands-on experience with VXLAN/EVPN or equivalent encapsulation protocols, including control-plane and data-plane considerations
- High-Performance Networking: Proven experience with InfiniBand and/or RoCE, including congestion management, lossless Ethernet concepts, and performance tuning for GPU workloads
- Global WAN Experience: Deep familiarity with global WAN technologies, including private backbone design, inter-region connectivity, routing policy, and traffic engineering
- Linux & Network OS Fluency: Comfortable working with Linux-based systems, network operating systems, and automation tooling
- Operational Excellence: Strong background in network observability, incident management, capacity forecasting, and change control
- Communication & Leadership: Clear written and verbal communication skills, with the ability to align stakeholders and lead teams through complex technical challenges
- Successful completion of a background check
Preferred Qualifications:
- Experience operating networks for GPU clusters, HPC environments, or AI/ML platforms
- Familiarity with RDMA tuning, NCCL traffic patterns, and distributed training communication models
- Experience with automation frameworks and network-as-code (e.g., Terraform, Ansible, internal tooling)
- Background in multi-region or multi-cloud networking architectures
- Experience working in high-growth or hyperscale infrastructure environments