Own reliability for Volcano end-to-end: Define and drive SLOs, error budgets, and incident response practices for all Volcano services
Architect the platform's infrastructure: Design and build the multi-region Kubernetes infrastructure
Build the GitOps and CI/CD backbone: Establish deployment automation, canary pipelines, and preview environment provisioning
Scale managed data services: Design, operate, and harden multi-tenant PostgreSQL clusters
Drive observability from day one: Instrument every Volcano service with meaningful SLIs
Lead cross-functional reliability work: Collaborate with the OCTO team, product engineering, and security to bake reliability into Volcano's architecture
Set SRE culture and standards: Mentor engineers on reliability principles; lead postmortems
Evaluate and adopt emerging technologies: Given Volcano's greenfield nature, evaluate edge runtimes, serverless compute, vector databases, and AI-native infrastructure components
Requirements
BS in Computer Science or equivalent
Substantial experience at the Staff or Principal IC level in SRE or Platform Engineering
Proven track record building SRE or platform engineering practices for developer-facing platforms or PaaS/SaaS products
Deep Kubernetes expertise: multi-tenant cluster design, networking (CNI, service mesh, ingress), autoscaling, and security hardening