Backblaze is the object storage leader in the open cloud movement and is seeking a Strategic Ops Engineer III to join its team. The role focuses on managing incidents, problems, and changes while leveraging AI/ML for operational improvements and ensuring service reliability.

Responsibilities:

Lead and govern the end-to-end incident management lifecycle, including detection, triage, escalation, and resolution
Drive major incident management (MIM) processes and communications
Improve MTTR (Mean Time to Resolution) through automation and process optimization
Establish and maintain incident response playbooks and runbooks
Maintain and improve intelligent heatmaps leveraging AI/ML to identify recurring technical themes and prioritize long-term remediation
Implement trend analysis and proactive problem identification using observability data and AI
Track and manage problem records to closure
Govern change management processes, including leading the Change Advisory Board (CAB), to ensure safe, compliant, and low-risk deployments
Define and enforce change policies, risk assessments, and approval workflows
Drive continuous improvement in release and deployment practices
Maintain a strong understanding of system architecture and monitoring strategies, identifying gaps and opportunities for improvement
Partner with engineering teams to improve system resilience and performance
Reduce alert fatigue by improving signal-to-noise ratio in monitoring systems
Leverage AI/ML for anomaly detection, predictive alerting, and automated root cause analysis
Implement AI-driven solutions to optimize incident response and operational workflows
Analyze large-scale operational data to identify patterns and recommend improvements

Requirements:

5+ years of experience in IT Operations, SRE, or similar roles
Strong expertise in Incident, Problem, and Change Management (ITIL or similar frameworks)
Proven experience in governing and optimizing operational processes
AI & Data Expertise: Strong knowledge of AI/ML concepts, including anomaly detection, predictive analytics, and data modeling
AIOps Experience: Hands-on experience with AIOps platforms or building AI-driven operational solutions (event correlation, alert prioritization)
ITIL certification (Foundation or higher)
Proficiency with platforms such as Jira, ServiceNow (SNOW), FireHydrant, and Moogsoft
Experience working in high-availability, large-scale environments
