Microsoft is looking for a Product Manager II to join its Azure High-Performance Computing/Artificial Intelligence team. The role drives networking for large AI training supercomputers and works with a range of stakeholders to maximize operational uptime and AI workload throughput.
Responsibilities:
- Drive, track, and publish success criteria for backend networking of ultra-large-scale AI supercomputers. Your primary objective, shared with colleagues and partner teams, is to drive maximum operational uptime and AI workload throughput of some of the largest supercomputers on the planet.
- Identify leading and/or unique points of failure affecting your primary goal and associated KPIs, and drive remediations and roadmap changes to address those issues
- Work across and build trust among a V-team of supercomputing product groups, datacenter site operators, quality control specialists, vendors, business leaders, and customers to achieve your objectives
Requirements:
- Bachelor's Degree AND 2+ years of experience in product/service/project/program management or software development OR equivalent experience
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
Preferred qualifications:
- Bachelor's Degree AND 5 years of experience in product/service/project/program management or software development OR equivalent experience
- 5+ years of experience operating production supercomputers
- 5+ years of experience improving product metrics for a product, feature, or experience in a market (e.g., growing customer base, expanding customer usage, avoiding customer churn)
- Familiarity with RoCE v2, InfiniBand, UCX, MPI, NCCL, RCCL, and distributed memory compute workloads
- Ability to work overlapping hours with East Coast teams (EST)