Role Overview

Define and execute the long-term strategy for our Kubernetes platform across Google Kubernetes Engine, Amazon Elastic Kubernetes Service, RKE2, and on-premises environments, ensuring reliability, scalability, and operational consistency.
Drive architectural decisions across critical infrastructure, including cluster lifecycle management, networking, identity and access management, observability, autoscaling, capacity planning, and cost optimization.
Lead large-scale platform initiatives across multiple engineering teams, establishing technical direction, engineering standards, and measurable outcomes that improve platform reliability and developer experience.
Establish and evolve reliability practices by defining service level objectives, service level indicators, and error budget frameworks that align platform performance with business priorities.
Build automation-first infrastructure through Infrastructure as Code, GitOps workflows, self-healing systems, and internal platform tooling that improve engineering velocity and reduce operational overhead.
Champion the responsible adoption of AI-powered engineering capabilities that improve operational efficiency, accelerate incident response, and enhance developer productivity.
Lead critical platform incidents, drive post-incident improvements, and strengthen platform resilience through automation, capacity planning, and operational excellence.
Mentor senior engineers, influence technical strategy across the organization, and elevate engineering excellence through architecture reviews, coaching, and technical leadership.

Requirements

A bachelor's degree in computer science or a related technical field.
At least 8 years of experience designing, operating, and scaling distributed cloud and on-premises infrastructure, including at least 3 years at the Staff, Principal, or equivalent technical leadership level.
Proven experience leading large-scale infrastructure or platform initiatives that require cross-functional alignment and long-term technical ownership.
Deep expertise with Kubernetes, including cluster architecture, networking, storage, security, operators, lifecycle management, and large-scale production operations.
Extensive experience building and operating production infrastructure in AWS and Google Cloud Platform using Infrastructure as Code technologies such as Terraform, Pulumi, or similar tools.
Strong software development experience in Go, Python, or both, with expertise in GitOps, continuous integration and continuous delivery, observability, distributed systems, Linux, and reliability engineering principles.
Experience incorporating AI-powered tools into engineering workflows while applying sound judgment around reliability, security, and operational risk.
Exceptional communication and leadership skills with a proven ability to mentor engineers, influence technical strategy, and drive engineering excellence.
Preferred: experience in regulated industries or hybrid cloud environments, contributions to open-source projects, or cloud certifications.

Tech Stack

AWS
Cloud
Distributed Systems
Google Cloud Platform
Kubernetes
Linux
Python
Terraform
Go

Benefits

Bonuses
Equity
Benefits as applicable

Principal Site Reliability Engineer
