Microsoft is a leading technology company dedicated to empowering individuals and organizations. As a Principal Site Reliability Engineer, you will lead initiatives to ensure high-quality handling of high-severity incidents across Microsoft M365 services, collaborating with stakeholders to improve incident response and recovery processes.

Responsibilities:

Own execution quality for Substrate high-severity incidents, ensuring clear command, decisive leadership, and forward momentum during high-impact events
Act as the senior incident leader or sponsor for long-running, high-stakes, or cross-service incidents, ensuring alignment on impact, risk, and recovery priorities
Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
Ensure high-quality post-incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
Coach and help develop a team of Site Reliability Engineers serving as incident responders
Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
Help hire and grow senior talent capable of operating as trusted leaders in high-pressure, executive-visible situations
Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
Communicate clearly and credibly with senior leadership during customer-impacting events

Requirements:

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position requires passing the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Preferred: Doctorate Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 12+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
7+ years technical experience working with large-scale cloud or distributed systems
Experience building or scaling incident response programs at organizational or enterprise scope
Background in SRE, production engineering, or platform reliability roles
Track record of reducing customer impact through improved incident handling, tooling, or prevention
Experience operating in follow-the-sun or globally distributed incident response models
Proven experience leading teams through high-severity production incidents in large, distributed systems
Strong understanding of incident management, reliability engineering, and live site operations at scale
Ability to drive clarity, accountability, and results in ambiguous, time-critical situations
