Drive enterprise problem investigations arising from major incidents and proactive analysis, working in close partnership with Digital Product and Engineering teams to identify true root causes and prevent recurrence.
Analyze incident, problem, and availability data to identify systemic risks, recurring failure patterns, and reliability gaps, translating insights into actionable improvement opportunities for Digital Products.
Partner with Digital Product and Engineering teams to strengthen service resilience, including improvements to monitoring, alerting, recovery, and preventative controls that reduce customer impact.
Use learnings from problem investigations to influence improvements in automated service restoration and operational readiness, maintaining a strong focus on availability outcomes.
Contribute to Major Incident Management and Retro activities when required, providing investigative insight, historical context, and problem-oriented thinking during high-severity events.
Continuously improve problem management practices, tooling, and ways of working, partnering with Digital Product and Engineering teams to embed learning and prevention and drive meaningful, lasting change.
Requirements
Minimum 2 years’ experience in a Problem Management, Incident Management, or Service Operations role within a production operations or service environment.
Hands-on experience with ServiceNow for Problem and Incident Management.
Demonstrated experience driving structured problem investigations for major or high-impact incidents, including root cause identification, documented causal analysis, and seeing corrective or preventative actions through to completion.
Experience using data visualization or reporting tools (e.g., Power BI, ServiceNow).