Own incident response end-to-end. Serve as incident commander on high-impact P1/P2 incidents, lead communication, coordinate engineering and business stakeholders, run blameless postmortems, and drive resulting action items to closure.
Architect and continuously refine the observability stack. Build and tune dashboards, alerts and synthetic monitoring that give real-time visibility into system and end-user experience health.
Drive a shift from reactive to proactive. Identify what is breaking before it breaks. Use telemetry, anomaly detection, and trend analysis to surface systemic issues; partner with technical stakeholders to eliminate them at the source.
Hit the operational metrics that matter. Ensure tight SLA adherence and continuous improvement across both end-user support and production incidents.
Be the senior technical escalation. You will personally dig into logs, queries, and infrastructure to unblock the team when needed.
Drive automation and self-service that eliminates repeat tickets and matures our knowledge management strategy.
Implement AI and automated solutions to improve the quality and speed of operations.
Manage defects and enhancements, resolve incidents, and prevent systemic issues through a structured partnership with Product Management, Engineering, and Operations.
Strengthen the feedback loop from support into the product roadmap.
Own and advance internal IT & Product support services (service desk, endpoint support, access management, etc.).
Foster a culture of accountability and ownership. Set the tone for technical excellence and user obsession across the org.
Requirements
8+ years of experience leading distributed teams, including managers, in high-growth SaaS and API-first environments.
Seasoned incident commander with hands-on experience leading P1/P2 response, running postmortems, and driving systemic fixes that prevent recurrence.
Proven track record of reshaping support operations through AI, automation, and intelligent tooling.
Strong technical proficiency in languages and tools such as SQL, NQL, DDSQL, PowerShell, Python, Terraform, Bash/shell scripting, and REST APIs.
Solid understanding of Kubernetes and of AWS services including S3, CloudWatch, EKS, ECS, EC2, Lambda, and WorkSpaces.
Deep knowledge of support, DEX, observability, and CRM platforms such as ServiceNow, Salesforce, Datadog, and Nexthink.
Excellent leadership and decision-making skills, plus project management experience and the ability to execute across multiple priorities.
Exceptional communication skills with an ability to translate technical concepts into business value for executives and non-technical stakeholders.
Familiarity with ITIL or similar frameworks, with experience optimizing incident, change, and problem management at scale.
Tech Stack
AWS
EC2
Kubernetes
Python
ServiceNow
SQL
Terraform
Benefits
Considerable employer contributions for health, dental, and vision programs
Generous PTO, paid holidays, and paid parental leave