Owning observability strategy: designing telemetry, dashboards, and alerts, defining SLO/SLI frameworks, and implementing improvements when targets are missed
Building production-grade automation and tooling that reduces operational toil, improves incident response, and sets patterns that other SREs adopt
Owning incident management for inference workloads: designing response frameworks, leading incident response during on-call rotations, and driving systemic improvements from post-mortems
Defining and implementing deployment safety practices, including progressive rollouts, canary analysis, and rollback automation, and establishing them as standards for the team
Partnering with product engineering teams to influence architecture decisions, ensure operational readiness, and represent the SRE perspective in design reviews
Mentoring senior and mid-level SREs through code reviews, design discussions, and hands-on problem-solving
Requirements
Have extensive experience in SRE, platform engineering, or infrastructure engineering, working with large-scale distributed systems
Have a track record of defining SLO/SLI frameworks, building observability platforms, and running incident management processes at scale
Demonstrate expertise in Kubernetes and containerization, including autoscaling, resource scheduling, and orchestration for compute-intensive workloads at scale
Build automation and tooling in Python or Go, with working expertise in CI/CD pipelines, deployment safety practices, and infrastructure-as-code
Lead technical initiatives across teams, mentor engineers, and resolve complex reliability challenges independently
Have experience with AI/ML infrastructure, model deployment, or GPU workloads
Demonstrate ownership of complex reliability issues, deliver solutions collaboratively, and strengthen the technical expertise of fellow SREs