Microsoft is a leading technology company focused on empowering individuals and organizations. It is seeking a Principal Site Reliability Engineer to lead high-severity incident management for Microsoft M365 Substrate Core services, ensuring effective incident handling and operational governance.
Responsibilities:
- Own execution quality for Substrate high-severity incidents, ensuring clear command, decisive leadership, and forward momentum during high-impact events
- Act as the senior incident leader or sponsor for long-running, high-stakes, or cross-service incidents, ensuring alignment on impact, risk, and recovery priorities
- Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
- Ensure high-quality post-incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
- Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
- Coach and help develop a team of Site Reliability Engineers serving as incident responders
- Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
- Help hire and grow senior talent capable of operating as trusted leaders in high-pressure, executive-visible situations
- Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
- Communicate clearly and credibly with senior leadership during customer-impacting events
Requirements:
- Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
- OR equivalent experience
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Preferred Qualifications:
- Doctorate Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 12+ years technical experience in software engineering, network engineering, or systems administration
- 7+ years technical experience working with large-scale cloud or distributed systems
- Experience building or scaling incident response programs at organizational or enterprise scope
- Background in SRE, production engineering, or platform reliability roles
- Track record of reducing customer impact through improved incident handling, tooling, or prevention
- Experience operating in follow-the-sun or globally distributed incident response models
- Proven experience leading teams through high-severity production incidents in large, distributed systems
- Strong understanding of incident management, reliability engineering, and live site operations at scale
- Ability to drive clarity, accountability, and results in ambiguous, time-critical situations