Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers. The Principal Supercomputing Operations Engineering Manager owns the operational strategy and organizational execution for interconnect fabric reliability across flagship AI supercomputing environments, and leads the teams that operate InfiniBand and GPU interconnect fabrics to keep them reliable and performant.
Responsibilities:
- Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
- Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
- Provide senior technical leadership and executive decision-making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
- Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
- Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and by ensuring adoption of authoritative troubleshooting guides (TSGs), playbooks, and escalation frameworks
- Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
- Sponsor and prioritize investments in automation, telemetry, diagnostics, and tooling that improve detection, observability, debuggability, and time to mitigation across the fleet
Requirements:
- Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
- Bachelor's Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
- Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
- 4+ years people management experience
- 6+ years of experience operating large-scale distributed systems, high-performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
- Demonstrated experience leading engineering teams responsible for mission-critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
- Strong hands-on background in operating and debugging interconnect fabrics or similarly complex infrastructure supporting large-scale compute workloads
- Solid Linux systems knowledge with experience reasoning across operating systems, drivers, services, and hardware layers
- Proven ability to make high-impact technical and organizational decisions under ambiguity while balancing availability, risk, long-term correctness, and business impact