Define and enforce SLOs, SLIs, and error budgets for Project Graph's HTTP APIs and async compute platform.
Build and maintain observability—metrics, logging, tracing, and alerting—so issues are caught and diagnosed quickly.
Lead incident response, run blameless postmortems, and drive the follow-up work that prevents recurrence.
Improve the reliability and scalability of an async job scheduling system built on top of Kubernetes and Postgres.
Maintain and improve CI/CD systems to keep delivery fast, safe, and reliable.
Own database data protection, backup, and resilience—including backup strategy, recovery testing, and disaster recovery planning.
Design and implement cloud infrastructure and automation to meet reliability, performance, and cost goals.
Reduce operational toil through tooling and automation, and partner with developers to build reliability in from the start.
Participate in an on-call rotation.
Requirements
Bachelor's degree in Computer Science, or equivalent experience.
5-10 years of experience in site reliability engineering, infrastructure, or backend software development with a strong operational focus.
Expertise with Kubernetes in production, including scaling, troubleshooting, and tuning.
Expertise with Docker and containerization.
Strong experience with Bash and CI/CD tools such as CircleCI.
Strong hands-on experience in at least one server-side language; we use Node.js/TypeScript.
Experience operating data stores such as Postgres or Redis in production; we run on Amazon Aurora (Postgres-compatible), so familiarity with Aurora or other managed database environments is a plus.
Experience with database backup, resilience, and disaster recovery—designing backup strategies, testing recovery, and meeting RPO/RTO targets.
Experience with Terraform and AWS.
Hands-on experience with observability tooling (metrics, logging, distributed tracing) and alerting.
Familiarity with HTTP API security.
A track record of incident response and a systematic, blameless approach to learning from failures.
An interest in and ability to learn new technologies.
Ability to tackle problems in a sustainable way, always striving to improve our processes and learn.
Excellent verbal and written communication skills; able to articulate complex ideas clearly and influence others through well-reasoned explanations.