Lead the operational management of high-severity incidents and customer escalations across the OneStream Cloud platform.
Serve as the central coordination point during critical incidents, ensuring appropriate teams are engaged and resolution efforts remain focused and efficient.
Act as the incident manager during major incidents, maintaining situational awareness, coordinating response activities, and ensuring accountability for resolution actions.
Facilitate incident response calls, coordinate technical teams, and maintain executive-level communication during major incidents.
Clearly identify and assign resolution ownership to reduce ambiguity during incidents.
Ensure customers receive timely updates, clear communication, and strong ownership throughout the escalation lifecycle.
Own the operational incident lifecycle, including incident declaration, coordination, escalation, communication, and post-incident review.
Drive root cause analysis (RCA) processes and ensure corrective and preventative actions are implemented and tracked to completion.
Track and manage escalated issues to resolution while identifying patterns, systemic risks, and recurring operational gaps.
Develop and improve incident management frameworks, escalation procedures, severity definitions, and operational runbooks.
Partner with cross-functional teams to reduce recurring incidents through automation, resiliency improvements, and architectural enhancements.
Monitor escalation metrics and operational KPIs, including MTTR, incident frequency, and customer impact.
Lead post-incident reviews and drive accountability for operational improvements.
Own and drive measurable incident outcomes, including reductions in MTTR and in the number of recurring incidents.
Requirements
Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related technical field, or equivalent professional experience.
5+ years of experience in cloud operations, incident management, site reliability engineering, or technical escalation management.
Proven experience coordinating and managing high-severity incidents in production cloud or SaaS environments.
Strong understanding of cloud infrastructure, distributed systems, networking fundamentals, and enterprise SaaS operations.
Experience coordinating cross-functional technical teams during complex production incidents.
Demonstrated experience operating incident management platforms used to coordinate major incident response (e.g., PagerDuty, Opsgenie, or ServiceNow).
Experience using observability and monitoring tools to support incident diagnosis and response.
Demonstrated ability to communicate effectively with both technical teams and executive stakeholders during high-impact situations.
Strong analytical and problem-solving skills, with the ability to drive root cause analysis and systemic resolution of operational issues.