Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. The company is hiring a Senior Platform & Reliability Engineer to own service reliability end to end: preventing incidents and leading recovery when production degrades.
Responsibilities:
- Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows
- Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access
- Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety
- Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation
- Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution)
- Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline
- Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk
Requirements:
- Experience setting and enforcing SLIs, SLOs, and error budgets for critical user flows
- Proven ability to drive failure isolation across API, workers, queues, and dependencies
- Expertise in Kubernetes runtime reliability: defining probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety
- Experience with BullMQ/Redis for queue and job safety, including poison pill containment and workload isolation
- Demonstrated ability to lead Sev1/Sev2 incident response end-to-end
- Strong skills in observability quality, on-call effectiveness, runbooks, and postmortem discipline
- Ability to gate risky deploys and enforce reliability guardrails
- Calm, structured incident commander under pressure
- Ability to think in failure modes and blast radius by default
- Pragmatic approach: stabilize production quickly, then implement durable fixes
- High ownership and strong written communication skills