Backblaze, the object storage leader in the open cloud movement, is seeking a Strategic Ops Engineer III to join its team. The role focuses on managing incidents, problems, and changes while leveraging AI/ML for operational improvements and ensuring service reliability.
Responsibilities:
- Lead and govern the end-to-end incident management lifecycle, including detection, triage, escalation, and resolution
- Drive major incident management (MIM) processes and communications
- Improve MTTR (Mean Time to Resolution) through automation and process optimization
- Establish and maintain incident response playbooks and runbooks
- Maintain and improve intelligent heatmaps leveraging AI/ML to identify recurring technical themes and prioritize long-term remediation
- Implement trend analysis and proactive problem identification using observability data and AI
- Track and manage problem records to closure
- Govern change management processes, including leading the Change Advisory Board (CAB), to ensure safe, compliant, and low-risk deployments
- Define and enforce change policies, risk assessments, and approval workflows
- Drive continuous improvement in release and deployment practices
- Maintain a strong understanding of system architecture and monitoring strategies, identifying gaps and opportunities for improvement
- Partner with engineering teams to improve system resilience and performance
- Reduce alert fatigue by improving signal-to-noise ratio in monitoring systems
- Leverage AI/ML for anomaly detection, predictive alerting, and automated root cause analysis
- Implement AI-driven solutions to optimize incident response and operational workflows
- Analyze large-scale operational data to identify patterns and recommend improvements
Requirements:
- 5+ years of experience in IT Operations, SRE, or similar roles
- Strong expertise in Incident, Problem, and Change Management (ITIL or similar frameworks)
- Proven experience in governing and optimizing operational processes
- AI & Data Expertise: Strong knowledge of AI/ML concepts, including anomaly detection, predictive analytics, and data modeling
- AIOps Experience: Hands-on experience with AIOps platforms or building AI-driven operational solutions (event correlation, alert prioritization)
- ITIL certification (Foundation or higher)
- Proficiency with platforms such as Jira, ServiceNow (SNOW), FireHydrant, and Moogsoft
- Experience working in high-availability, large-scale environments