Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
Set and enforce SLIs, SLOs, and error budgets for critical user flows.
Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
Own poison-pill containment and workload isolation.
Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, and execution of corrective actions).
Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
Gate risky deploys and enforce reliability guardrails when production health is at risk.
Establish baseline reliability metrics and identify top platform risks.
Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).
Deliver high-impact hardening fixes across probes, startup paths, and queue safety.
Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

Senior Platform Engineer, Reliability

Key skills