Own the end-to-end Problem Management lifecycle in line with ITIL best practice: problem detection, logging, categorisation, prioritisation, investigation, resolution, and closure
Maintain and govern the Problem Record backlog in Jira Service Management, ensuring all records are accurate, prioritised, and progressing toward resolution
Define and enforce the standards for problem identification, including criteria for reactive problem management (post-incident) and proactive problem management (trend analysis and risk identification)
Manage the Known Error Database (KEDB), ensuring it is current, accurate, and actively used by L1/L2 support teams to improve first-contact resolution
Lead and facilitate structured root cause analysis (RCA) sessions following major and recurring incidents, using recognised methodologies (e.g. 5 Whys, Fishbone/Ishikawa, fault tree analysis)
Produce high-quality Problem Records and RCA reports that clearly articulate the root cause, contributing factors, timeline, and recommended corrective/preventative actions
Ensure RCA outputs translate into tracked, accountable action plans with clear owners, timelines, and success criteria
Challenge superficial root cause findings and push for systemic, durable fixes rather than symptomatic workarounds
Analyse incident, change, and event data to proactively identify trends, recurring issues, and systemic risks before they become major incidents
Collaborate with Observability and Platform teams to use monitoring signals, error budgets, and service level objective (SLO) breach data as early-warning inputs to the problem management process
Communicate problem status, known errors, and risk exposure clearly to technical and non-technical stakeholders, including engineering leads and senior management
Produce regular problem management reporting, including metrics such as number of open problems by age and severity, incident recurrence rate, time to root cause, and percentage of problems with preventative actions closed on time
Present insights and trends to the Director of Application Operations and wider PETO leadership to inform prioritisation decisions and continuous improvement initiatives
Work closely with Incident Management to ensure seamless handoff from major incidents into the problem management process
Partner with L2.5/L3 engineering teams to coordinate investigation effort, agree timelines, and remove blockers to root cause resolution
Integrate problem management activity into the Service Catalogue and Jira Service Management workflows, ensuring service ownership and escalation paths are respected
Continuously assess and improve the Problem Management process itself, maturing capability over time and aligning with evolving ITIL and organisational standards
Build and maintain problem management documentation, templates, and guidance to enable consistent, high-quality practice across the PETO organisation
Support the development of L2 team capability in recognising and logging potential problems, contributing to the team's progression toward greater autonomy
Requirements
Solid, demonstrable experience in an ITIL-aligned Problem Management role, ideally within a fast-paced, product-led technology organisation
Strong working knowledge of ITIL Problem Management practices (ITIL 4 Foundation certification or above preferred), including the distinction between reactive and proactive problem management and the role of the KEDB
Hands-on experience facilitating RCA sessions using structured methodologies (5 Whys, Fishbone, fault tree analysis, etc.) and translating findings into actionable improvement plans
Experience working with Jira Service Management or a comparable ITSM platform to manage problem records, workflows, and reporting
Ability to analyse incident and operational data to identify trends and systemic issues, with experience using dashboards or reporting tools to communicate findings
Strong written and verbal communication skills, with the ability to produce clear RCA reports and updates for both technical audiences and senior non-technical stakeholders
Collaborative working style with experience engaging engineering, infrastructure, and operations teams in problem investigation and resolution
Familiarity with Agile ways of working and the ability to integrate ITIL practices within a modern, product-centric engineering environment
Experience with observability and monitoring tooling (e.g. Datadog, Grafana, PagerDuty) as inputs to proactive problem management (desirable)
Understanding of SLOs, error budgets, and their relationship to operational risk and problem prioritisation (desirable)
Experience contributing to or maintaining a knowledge base (e.g. Confluence), including runbooks and known error documentation (desirable)
Exposure to cloud-native application architectures and API-first platforms (desirable)
ITIL 4 Specialist or Practitioner certification in relevant practices (e.g. Problem Management, Incident Management) (desirable)
Experience with operational metrics and reporting frameworks, including DORA metrics or similar (desirable)
Tech Stack
Cloud
Grafana
ITSM
Benefits
Annual Wellness Bonus
Monthly Edenred Electronic Food Voucher
Udemy: Access for your professional development
Flexible holiday plan and other leave benefits
Book Benefit: Professional development books and an additional annual budget for fiction books of your choice