Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
Set and enforce SLIs/SLOs/error budgets for critical user flows.
Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.