Microsoft is seeking a Site Reliability Engineer II to join their M365 Sovereign Clouds organization, which focuses on delivering secure productivity solutions for critical customers. The role involves automating operations, building disaster recovery capabilities, and improving service delivery for distributed systems at scale.
Responsibilities:
- Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s). Notifies product teams and owners to major customer impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Communicates details and resolutions through post-mortem reports and review meetings
- Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale
- Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations. Monitors the impact of changes on operations metrics (e.g., Time-to-X)
- Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features, leveraging the artificial intelligence (AI) and machine learning (ML) capabilities. Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams
- Independently creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability and operability of one or more platforms, systems, or products operating at scale
- Shares insights and best practices via documented artifacts that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams
- Engages with product engineering teams by participating code/design reviews, regular meetings, on-call rotations and incident responses throughout product development and operations cycles. Utilizes technical knowledge of systems/platforms and insights drawn from product engineering teams, security best practices, artificial intelligence (AI)/machine learning (ML), and telemetry analyses to suggest potential improvements in code base and designs across components and features of one or more products
Requirements:
- Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience
- 2+ years technical experience working with large-scale cloud or distributed systems
- Candidates must have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance
- Ability to meet Microsoft, customer and/or government security screening requirements are required pre-offer and post-hire for this role