Filevine is a Legal AI company delivering Legal Operating Intelligence for the future of legal work. In this role, you will own the strategy and execution of site reliability and platform resilience, managing teams to ensure the platform is fast, stable, and available while driving improvements in uptime and incident prevention.

Responsibilities:

Directly manage the DevOps, SRE, and DBRE infrastructure teams and align their priorities under a unified reliability strategy
Set team objectives, drive execution, and ensure resources are focused on the highest-impact business and reliability investments
Conduct ongoing risk assessments of Filevine's platform to identify and prioritize the most fragile and business-critical areas
Use data from incident history, usage analytics, monitoring systems, and customer feedback to drive proactive hardening efforts and reduce unplanned downtime
Define and track key reliability indicators (uptime/availability, mean time to detect, mean time to resolve, incident frequency)
Own the reporting apparatus that makes platform health visible and actionable for leadership and product teams
Manage the process for updating the status page (status.filevine.com) during reliability events
Define clear criteria for posting incidents according to established communication protocols, and ensure customers and internal stakeholders receive timely, accurate updates
Serve as the bridge between SRE, Product, Engineering, and customer-facing teams (Support, Sales, Partners) to ensure reliability priorities reflect real customer and business impact
Translate reliability trends and infrastructure health into actionable insights for non-technical stakeholders
Evaluate, implement, and manage the reliability and observability tech stack
Drive decisions on monitoring, alerting, test environments, and infrastructure tooling to ensure the platform scales reliably
Establish reliability standards, runbooks, and operational patterns that empower engineering teams to contribute to platform resilience
Build documentation and training to make reliability ownership a shared responsibility across the organization

Requirements:

5+ years of experience in SRE, DevOps, platform engineering, or reliability-focused product/program management in SaaS
Prior hands-on experience as a software engineer or in a deeply technical role. Comfortable reading code, reviewing architecture decisions, and engaging in technical design discussions with engineering teams
Strong understanding of site reliability principles, cloud infrastructure, database reliability, container orchestration, and modern DevOps practices. Experience managing or closely partnering with SRE and DevOps teams
Strong analytical skills with the ability to use data sources (monitoring platforms, Pendo, Domo, Salesforce, incident logs) to prioritize reliability efforts by business impact
Ability to translate complex reliability and infrastructure data into clear narratives for leadership, product managers, and customer-facing teams. Experience leading incident reviews and high-visibility operational meetings is essential
Deep understanding of software development lifecycles, release protocols, and incident response processes
Ability to identify the highest-leverage reliability investments and implement processes that improve platform stability without slowing engineering velocity
B.S. or M.S. in computer science, software engineering, or a related technical field, or comparable certifications or equivalent direct work experience, with a demonstrated track record in software engineering and/or site reliability engineering

Sr Technical Product Manager
