IREN is a leading AI Cloud Service Provider, delivering large-scale GPU clusters for AI training and inference. The Senior Network Engineer is responsible for designing, operating, and optimizing high-performance data center networks that power large-scale AI/ML and cloud infrastructure.
Responsibilities:
- Act as a Tier 3 (L3) escalation point for complex network incidents, providing deep technical troubleshooting and root cause analysis beyond Tier 1/2 support
- Partner with Network Operations (Tier 1/2) teams to resolve escalated issues, improve runbooks, and reduce recurring incidents through automation and systemic fixes
- Lead resolution of critical, high-impact network incidents affecting AI/ML workloads, data center operations, and customer environments
- Develop and refine operational procedures, playbooks, and escalation paths to enhance efficiency across Tier 1/2/3 support models
- Identify opportunities to shift-left operational knowledge by mentoring junior engineers and improving observability and tooling
- Maintain accurate network documentation, diagrams, and design standards, supporting compliance and audit requirements
- Optimize network performance for distributed training workloads, including tuning for large flows, incast, and burst traffic patterns
- Support RDMA/RoCEv2 or InfiniBand environments (if applicable), including PFC, ECN, and lossless fabric tuning
- Automate network provisioning, configuration management, and remediation using Python, Ansible, Terraform, or similar tools
Requirements:
- Proven experience operating as a Tier 3 / escalation engineer handling complex, high-impact production issues
- Deep expertise in Layer 2/3 networking (VLANs, VXLAN, EVPN), routing protocols (BGP, OSPF, IS-IS), and data center fabric design (Clos, ECMP)
- Strong hands-on experience with major networking vendors (e.g., Cisco, Juniper, Arista) and/or open networking platforms
- Experience with network automation and infrastructure-as-code, including scripting in Python and using tools like Ansible or Terraform
- Solid understanding of network performance tuning, including buffer management, queueing, QoS, and congestion control
- Experience with network observability tools (SNMP, streaming telemetry, NetFlow/sFlow, syslog, packet capture/analysis)
- Experience with cloud and hybrid networking, including VPC design, peering, and connectivity to public cloud providers
- Bachelor's degree in Computer Science, Network Engineering, Information Technology, or related field (or equivalent practical experience)
- Familiarity with high-performance networking concepts: RDMA/RoCEv2 (preferred), InfiniBand (nice to have), and DPU/SmartNIC architectures (nice to have)