Evaluate applications, platforms, and vendors to assess resiliency, reliability, and operational risk.
Design and implement processes that enforce enterprise resiliency and reliability standards.
Lead blameless post‑incident reviews for high‑severity incidents or incidents spanning multiple complex product families.
Partner with product and platform teams to proactively identify and remediate reliability risks before they impact clients.
Develop, communicate, and evangelize new standards, tools, and frameworks across subdivisions to ensure consistent adoption.
Troubleshoot complex production issues and implement durable solutions that prevent recurrence.
Participate in a periodic on‑call rotation to support production stability.
Evaluate and onboard resiliency and reliability tooling.
Actively participate in reliability engineering and resilience communities of practice, contributing to shared learning and enterprise consistency.
Contribute to strategic initiatives that advance Vanguard’s operational maturity and resiliency posture.
Requirements
Experience with modern observability and monitoring tools, such as Splunk, Honeycomb, CloudWatch, Dynatrace, or AppDynamics.
Strong understanding of SLIs, SLOs, and SLAs, including dashboarding and reporting practices.
Experience with alert design, anomaly detection, predictive alerting, and synthetic monitoring using structured methodologies.
Experience with automation and resilience practices, such as Python scripting, RPA platforms (e.g., Blue Prism, UiPath), chaos engineering, and failure analysis techniques (e.g., FMEA).