Microsoft is a leading technology company dedicated to empowering individuals and organizations. As a Principal Site Reliability Engineer, you will lead initiatives to ensure high-quality handling of high-severity incidents across Microsoft M365 services, collaborating with various stakeholders to improve incident response and recovery processes.
Responsibilities:
- Own execution quality for Substrate high-severity incidents, ensuring clear command, decisive leadership, and forward momentum during high-impact events
- Act as the senior incident leader or sponsor for long-running, high-stakes, or cross-service incidents, ensuring alignment on impact, risk, and recovery priorities
- Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
- Ensure high-quality post-incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
- Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
- Coach and help develop a team of Site Reliability Engineers serving as incident responders
- Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
- Help hire and grow senior talent capable of operating as trusted leaders in high-pressure, executive-visible situations
- Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
- Communicate clearly and credibly with senior leadership during customer-impacting events
Requirements:
- Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position requires passing the Microsoft Cloud background check upon hire/transfer and every two years thereafter
- Doctorate Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 12+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
- 7+ years technical experience working with large-scale cloud or distributed systems
- Experience building or scaling incident response programs at organizational or enterprise scope
- Background in SRE, production engineering, or platform reliability roles
- Track record of reducing customer impact through improved incident handling, tooling, or prevention
- Experience operating in follow-the-sun or globally distributed incident response models
- Proven experience leading teams through high-severity production incidents in large, distributed systems
- Strong understanding of incident management, reliability engineering, and live site operations at scale
- Ability to drive clarity, accountability, and results in ambiguous, time-critical situations