Design, implement, and optimize high-performance, low-latency network architectures for HPC environments, including InfiniBand, RoCE, and high-speed Ethernet.
Collaborate with HPC, DevOps, and AI research teams to integrate networking solutions with compute clusters, storage systems, and cloud platforms.
Troubleshoot and resolve complex network issues to minimize downtime and maximize performance.
Follow escalation procedures and ensure solutions are provided in a timely manner. Ensure escalation is progressing accordingly with the given severity.
Monitor network performance, capacity, and security, implementing improvements as needed.
Stay updated with emerging HPC networking technologies and best practices, and drive their adoption within Mistral.
Develop and maintain documentation for network architectures, configurations, and operational procedures.
Requirements
Proficiency in HPC networking protocols (InfiniBand, RoCE, TCP/IP, MPLS).
Hands-on experience with network hardware (switches, routers, NICs) from vendors like Mellanox, Cisco, or Arista.
Knowledge of network automation tools (Ansible, Python scripting).
Familiarity with HPC environments, parallel computing, and distributed systems.
Experience with network security best practices.
Strong problem-solving and analytical skills.
Ability to thrive in a fast-paced, collaborative environment.
Excellent communication skills (English required; French is a plus).
Teaching and documentation skills to ensure knowledge is archived and distributed to team members.