Microsoft is on a mission to build the next generation of distributed AI supercomputers. As a Principal Software Engineering Manager, you will lead a team developing foundational components of Azure’s AI networking infrastructure to improve the performance and reliability of AI platforms.
Responsibilities:
- Hire, manage, and grow a high-performing team of software engineers, fostering a culture of excellence, inclusion, and innovation
- Lead the design and development of large-scale distributed systems and services that power Azure’s AI infrastructure
- Drive engineering planning and execution while ensuring alignment with organizational OKRs and long-term strategy
- Establish lean, scalable, and efficient processes that promote innovation and engineering rigor
- Deliver best-in-class engineering by ensuring services and components are modular, secure, reliable, diagnosable, observable, and reusable
- Improve test coverage, automation, and integration testing to proactively identify and resolve reliability gaps
- Ensure live-site reliability and service health through robust monitoring, telemetry, and automation
- Collaborate across Microsoft and partner organizations to deliver cohesive, end-to-end infrastructure solutions
- Apply data-driven insights to optimize performance, scalability, and customer satisfaction
- Champion Microsoft’s culture by modeling, coaching, and caring—nurturing diversity, inclusion, and continuous growth for your team and peers
Requirements:
- Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position requires passing the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Preferred qualifications:
- Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
- 4+ years people management experience
- 10+ years of professional software design and development experience in large-scale distributed systems
- Experience building and operating networking infrastructure for hyperscale datacenters or AI clusters
- Hands-on experience with networking technologies in AI-specific hardware (e.g., InfiniBand, RoCE, MRC, NVLink)
- In-depth understanding of networking protocols (e.g., Ethernet, TCP/IP, RDMA, gRPC) and distributed systems
- Familiarity with network virtualization, software-defined networking (SDN), or network performance tuning
- Familiarity with AI accelerators such as GPUs (NVIDIA, AMD) or TPUs, and how they interact with networking infrastructure
- Experience with telemetry and observability tools for network monitoring at scale
- Background in building scalable and fault-tolerant systems in large, distributed environments