Filevine is a Legal AI company delivering Legal Operating Intelligence for the future of legal work. In this role, you will own the strategy and execution of site reliability and platform resilience, managing teams to ensure the platform is fast, stable, and available while driving improvements in uptime and incident prevention.
Responsibilities:
- Directly manage the DevOps, SRE, and DBRE infrastructure teams and align their priorities under a unified reliability strategy
- Set team objectives, drive execution, and ensure resources are focused on the highest-impact business and reliability investments
- Conduct ongoing risk assessments of Filevine's platform to identify and prioritize the areas of greatest fragility and business importance
- Use data from incident history, usage analytics, monitoring systems, and customer feedback to drive proactive hardening efforts and reduce unplanned downtime
- Define and track key reliability indicators (uptime/availability, mean time to detect, mean time to resolve, incident frequency)
- Own the reporting apparatus that makes platform health visible and actionable for leadership and product teams
- Manage the process for updating the status page (status.filevine.com) during reliability events
- Define clear criteria for posting incidents according to established communication protocols, and ensure customers and internal stakeholders receive timely, accurate updates
- Serve as the bridge between SRE, Product, Engineering, and customer-facing teams (Support, Sales, Partners) to ensure reliability priorities reflect real customer and business impact
- Translate reliability trends and infrastructure health into actionable insights for non-technical stakeholders
- Evaluate, implement, and manage the reliability and observability tech stack
- Drive decisions on monitoring, alerting, test environments, and infrastructure tooling to ensure the platform scales reliably
- Establish reliability standards, runbooks, and operational patterns that empower engineering teams to contribute to platform resilience
- Build documentation and training to make reliability ownership a shared responsibility across the organization
Requirements:
- 5+ years of experience in SRE, DevOps, platform engineering, or reliability-focused product/program management in SaaS
- Prior hands-on experience as a software engineer or in a deeply technical role. Comfortable reading code, reviewing architecture decisions, and engaging in technical design discussions with engineering teams
- Strong understanding of site reliability principles, cloud infrastructure, database reliability, container orchestration, and modern DevOps practices. Experience managing or closely partnering with SRE and DevOps teams
- Strong analytical skills with the ability to use data sources (monitoring platforms, Pendo, Domo, Salesforce, incident logs) to prioritize reliability efforts by business impact
- Ability to translate complex reliability and infrastructure data into clear narratives for leadership, product managers, and customer-facing teams. Experience leading incident reviews and high-visibility operational meetings is essential
- Deep understanding of software development lifecycles, release protocols, and incident response processes
- Ability to identify the highest-leverage reliability investments and implement processes that improve platform stability without slowing engineering velocity
- B.S. or M.S. in computer science, software engineering, or a related technical field, or comparable certifications or equivalent direct work experience, with a demonstrated track record in software engineering and/or site reliability engineering