Microsoft is a leading technology company that empowers individuals and organizations to achieve more. As a Senior Site Reliability Engineer in the Foundation Incident Response team, you will ensure service resilience, prevent outages, and drive improvements through data-driven practices and automation.
Responsibilities:
- Incident management excellence: Lead high-severity incident response, debug complex issues, and drive incidents to resolution with clear communication and ownership. Ensure high-quality postmortem reports are created and enforce repair-item SLAs
- Improve observability: Enhance telemetry, alerting, and dashboards using One Microsoft tooling to provide actionable insights and reduce detection time
- Define and measure reliability: Partner with engineering teams to establish and track SLIs/SLOs for critical scenarios
- Live site health reviews: Lead and facilitate live site health review meetings, translating business requirements into metrics and action
- Engineering for prevention: Translate learnings into proactive tests, product fixes, rollout guardrails, and automation that reduce risk and improve service health
- Reliability drills: Design and execute drills to simulate product failures, validate resilience and recovery, and develop resilience strategies
- Define policy: Draft process and policy documentation for how the organization prepares for, responds to, and prevents incidents
Requirements:
- Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration
- OR equivalent experience
Preferred qualifications:
- Doctorate Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 6+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration
- 3+ years technical experience working with large-scale cloud or distributed systems