Serve as the primary production incident facilitator, ensuring seamless coordination during all production-impacting events.
Act as the first point of contact for production incidents, including providing or managing after-hours support as required.
Initiate, manage, and lead production incident bridge calls, ensuring all necessary stakeholders (IT, Operations, Quality, Engineering, Vendors) are engaged.
Manage Operations Center staff and IT Operations Management (ITOM) processes, ensuring that monitoring systems predict and detect potential incidents and that operations staff are trained and empowered to resolve issues; guide the evolution and expansion of ITOM systems.
Lead, mentor, and develop teams responsible for Operations Center and ITOM processes.
Ensure production incident tickets are created, assigned, tracked, updated, and closed accurately and comprehensively.
Coordinate and deliver clear and timely communications to business partners, leadership, and production support teams during incidents.
Ensure that alerts, notifications, and communication channels are monitored for critical production issues, and initiate an immediate response when one is detected.
Participate in Change, Problem, and Incident Review meetings to analyze trends, review stability metrics, and recommend improvements.
Lead or support creation of Root Cause Analysis (RCA) reports and corrective action plans following major incidents.
Drive continuous improvement initiatives to enhance production reliability, resilience, and operational efficiency.
Requirements
5+ years of experience in production operations, IT operations, or incident management.
Strong understanding of incident response practices and ability to perform under pressure in high-impact scenarios.
Experience coordinating cross-functional response teams and managing real-time production events.
Experience managing ITOM platforms (Datadog or similar).
Strong communication skills with the ability to provide clear direction and timely updates.
Familiarity with ITSM tools, production monitoring systems, and collaboration platforms such as Microsoft Teams.
Ability to interpret operational data, identify patterns, and drive corrective actions.
Preferred Experience
Experience in highly regulated industries or in environments with significant operational or customer impact.
Background with ITIL Incident, Problem, and Change Management frameworks.
Experience leading high-severity incident bridges or production war room calls.
Knowledge of operational risk, compliance, or systems reliability engineering practices.
Experience managing global teams and 24x7 operations.