Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud native supercomputers. The Principal Supercomputing Operations Engineering Manager is responsible for the operational strategy and organizational execution for interconnect fabric reliability across flagship AI supercomputing environments, leading teams that operate InfiniBand and GPU interconnect fabrics to ensure reliability and performance.

Responsibilities:

Own and drive the end to end operational strategy for InfiniBand and GPU interconnect fabric reliability across large scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
Provide senior technical leadership and executive decision making during high severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade offs while ensuring effective execution through the team
Ensure consistent, high quality incident response, root cause analysis, and post incident follow through across the organization, with a strong emphasis on systemic prevention over one off fixes
Drive operational excellence by defining reliability models, failure domains, and long term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet

Requirements:

Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include, but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Bachelor's Degree in Computer Science OR related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python OR equivalent experience
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
4+ years people management experience
6+ years of experience operating largescale distributed systems, highperformance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
Demonstrated experience leading engineering teams responsible for mission critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
Strong hands-on background in operating and debugging interconnect fabrics or similarly complex infrastructure supporting largescale compute workloads
Solid Linux systems knowledge with experience reasoning across operating systems, drivers, services, and hardware layers
Proven ability to make highimpact technical and organizational decisions under ambiguity while balancing availability, risk, longterm correctness, and business impact

Principal Supercomputing Operations Engineering Manager

Key skills

About this role

Responsibilities:

Requirements: