Microsoft is a leading technology company, and they are seeking a Senior Software Engineer to join their Azure High Performance Computing & AI Engineering team. The role involves designing and developing capabilities for monitoring and operating supercomputers at scale, while also creating data pipelines for actionable alerts to enhance customer satisfaction.
Responsibilities:
- Contributes to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers
- Manages operations of supercomputers by responding quickly to mitigate issues
- Implements systemic solutions and mitigations to more complex issues impacting performance or functionality of supercomputers
- Reviews and writes incident postmortems and presents insights that drive changes to reduce or eliminate incidents
- Independently improves troubleshooting guides (TSGs), wikis, tests, and telemetry, adding comprehensive observability and monitoring capabilities
- Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of supercomputers while also driving consistency in monitoring and operations at scale
Requirements:
- Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
- Bachelor's Degree in Computer Science OR related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python OR Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
- Experience diagnosing and troubleshooting GPU-based systems such as H100 or A100, or networking technologies such as InfiniBand or Ethernet
- Experience with large-scale data pipelines using tools such as Prometheus and Grafana