Design, deploy, and operate large-scale RoCEv2 data center networks supporting AI and ML clusters ranging from thousands to more than 100,000 GPUs
Own congestion management and performance tuning across RDMA fabrics, including PFC, ECN, and DCQCN, in production environments
Implement and maintain automation, validation, and observability tooling using Python, Ansible, Terraform, and modern DevOps workflows
Ensure high availability and reliability across multi-tenant environments by leading operational excellence, incident response, and continuous improvement
Requirements
Bachelor of Science in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience
Deep experience with RDMA and RoCEv2 in large-scale production data centers supporting AI or HPC workloads
Strong Arista expertise, including EOS, hardware platforms, and the operation of high-speed Ethernet fabrics
Proven knowledge of congestion management and performance tuning using PFC, ECN, and DCQCN
Hands-on experience with 400G and 800G optics and cabling, including AEC, AOC, DAC, and structured cabling in dense environments
Automation and operations mindset, with experience using Python, Ansible, Terraform, Git, and observability tooling in always-on production systems