Fluidstack is building the infrastructure for abundant intelligence, partnering with top AI labs and enterprises. They are seeking a Network Engineer, Operations & Repair to ensure network reliability through incident response and operational excellence, while also providing support for datacenter deployments and managing regional operations.
Responsibilities:
- Serve as the primary network operations contact for a datacenter region
- Own network health, respond to incidents escalated from NOC, and ensure fabrics run reliably
- Build deep knowledge of your region's network topology, common failure modes, and operational characteristics
- Handle network incidents escalated from Tier 1 NOC during your coverage window
- Troubleshoot complex issues across physical and logical layers, coordinate with other engineers for follow-the-sun coverage, and drive incidents to resolution
- Lead incident response when you're the subject matter expert
- Coordinate with onsite hardware repair teams on incidents escalated and assigned to you
- Support the RMA case process and escalations with supplier support teams
- Build and support per-region dashboards and multi-region aggregate observability
- Manage field testing of repair and other operations processes and automation, providing visibility and feedback to the partners developing the tooling
- Provide operational support for datacenter deployments and expansions in your region
- Partner with Deployment teams on turn-up activities, validate production readiness, and ensure smooth handovers from deployment to operations
- Build and execute operational runbooks for both repair and non-repair activities
- Identify gaps in runbooks, document lessons learned, and provide feedback to the Operations lead on runbook improvements
- Build relationships with onsite DC Operations teams, structured cabling vendors, and hardware logistics partners
- Serve as the network engineering liaison for your datacenter region
- Communicate clearly about network status, planned maintenance, and operational issues
Requirements:
- 5-8 years in network engineering with significant hands-on operational experience
- Basic SQL and dashboard experience with Grafana, Tableau, or similar query/dashboard services
- Deep experience operating modern datacenter networks including EVPN/VXLAN, BGP, Clos topologies, and high-radix switching
- Proven ability to lead incident response, perform systematic troubleshooting, and drive issues to resolution
- You understand how to build relationships with onsite teams, coordinate physical infrastructure work, and represent network engineering in a field environment
- You can troubleshoot with imperfect information, make pragmatic decisions under time pressure, and prioritize based on business impact
- You're productive working remotely but understand that datacenter operations sometimes require hands-on presence
- Experience operating AI/ML or HPC fabrics with RDMA (RoCEv2), lossless Ethernet (PFC, ECN), or high-performance networking
- You've been a site lead, campus engineer, or regional operations lead before
- Hands-on experience coordinating hardware repairs, RMAs, and physical infrastructure work
- Familiarity with network monitoring platforms, alerting systems, and telemetry collection
- Basic scripting or automation experience (Python, Ansible) for operational tasks
- Experience working in distributed operations teams with follow-the-sun coverage models