Microsoft is seeking a Site Reliability Engineer II to join their M365 Sovereign Clouds organization, which focuses on delivering secure productivity solutions for critical customers. The role involves automating operations, building disaster recovery capabilities, and improving service delivery for distributed systems at scale.

Responsibilities:

Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s). Notifies product teams and owners to major customer impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Communicates details and resolutions through post-mortem reports and review meetings
Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale
Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations. Monitors the impact of changes on operations metrics (e.g., Time-to-X)
Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features, leveraging the artificial intelligence (AI) and machine learning (ML) capabilities. Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams
Independently creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability and operability of one or more platforms, systems, or products operating at scale
Shares insights and best practices via documented artifacts that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams
Engages with product engineering teams by participating code/design reviews, regular meetings, on-call rotations and incident responses throughout product development and operations cycles. Utilizes technical knowledge of systems/platforms and insights drawn from product engineering teams, security best practices, artificial intelligence (AI)/machine learning (ML), and telemetry analyses to suggest potential improvements in code base and designs across components and features of one or more products

Requirements:

Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
2+ years technical experience working with large-scale cloud or distributed systems
Candidates must have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance
Ability to meet Microsoft, customer and/or government security screening requirements are required pre-offer and post-hire for this role

Site Reliability Engineer II - CTJ -TS/SCI

Key skills

About this role

Responsibilities:

Requirements: