Own reliability work end-to-end, from user-facing symptoms (crashes, latency, streaming failures) to root causes in services, infrastructure, or vendor dependencies.
Design and implement resilience patterns for upstream dependency failures (for example, model providers): fallbacks, routing strategies, and degraded-mode designs.
Build and maintain reliability guardrails that make teams faster and safer: deployment safety, rollbacks, operational playbooks, automated checks, and standards for production readiness.
Improve observability (metrics, logs, traces, and client telemetry) so engineers can quickly answer 'Is it up?' and 'What changed?'
Reduce operational toil through automation and better tooling.
Partner with product and infrastructure engineering teams as a drop-in reliability multiplier: embed on the highest-impact problems and drive them to a durable technical outcome.
Participate in an on-call rotation and help improve incident response practices over time (severity definitions, runbooks, retrospectives, and clear ownership of follow-up fixes).
You will own a small set of high-leverage reliability 'themes' at a time (for example, client crash rate, streaming reliability, or deploy safety) and drive each one end-to-end until its reliability measurably improves.
Requirements
Strong experience owning reliability for production systems, including both incident response and long-term engineering fixes.
Expert-level experience in at least one of: Go, Node/TypeScript, or Python.
Deep practical knowledge of cloud infrastructure (AWS) and modern deployment/orchestration patterns (Kubernetes and/or ECS).
Experience with observability systems and practices (metrics, logs, traces, and alerting).