Role Overview
- Own the design, implementation, and operation of shared Resource Plane services used across the organization.
- Operate and evolve Kafka, RabbitMQ, Redis, and Elasticsearch platforms in production.
- Apply SRE principles to ensure availability, scalability, reliability, and safe upgrades of stateful systems.
- Build and maintain Infrastructure-as-Code and GitOps-based automation for provisioning and lifecycle management.
- Define and enforce supported usage patterns, guardrails, and golden paths for platform consumers.
- Integrate Resource Plane services into the Internal Developer Platform for self-service consumption.
- Participate in the on-call rotation (initially this may include a 24x7 weekly rotation; the target state is business-hours primary coverage).
- Lead incident response, root cause analysis, and reliability improvements for owned services.
- Collaborate closely with the Kubernetes Platform Team, IDP, and other teams as required.
- Act as a senior technical voice in architecture discussions and platform roadmap planning.
Requirements
- Bachelor’s degree in Computer Science, Software Engineering, or Information Technology.
- Senior-level experience operating production-grade, stateful distributed systems.
- Strong hands-on experience with Kafka, RabbitMQ, Redis, and Elasticsearch.
- Proven experience running infrastructure primarily in on-premises environments.
- Strong understanding of Linux systems, networking, and storage fundamentals.
- Deep experience with Infrastructure-as-Code (Terraform preferred).
- Experience with GitOps workflows and declarative infrastructure management.
- Solid grasp of reliability engineering concepts (SLOs, error budgets, alerting, capacity planning).
- Experience working with Kubernetes-adjacent platforms and services.
- Ability to operate independently and take ownership in a small team environment.
Preferred Qualifications
- Experience designing shared platform services consumed by multiple product teams.
- Familiarity with IDP concepts and developer self-service platforms.
- Experience migrating on-premises workloads toward cloud-native or hybrid models.
- Exposure to security, compliance, and governance requirements for shared infrastructure.
- Prior experience in staff- or principal-level technical leadership roles.
Tech Stack
- Cloud
- Distributed Systems
- Elasticsearch
- Kafka
- Kubernetes
- Linux
- RabbitMQ
- Redis
- Terraform
Benefits
We hire, promote, and compensate employees based on their ability to perform their job responsibilities without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or any other basis protected by applicable law (collectively referred to in our Code of Conduct as “Protected Classes”). We do not tolerate employment discrimination in the workplace and are committed to making reasonable accommodations for identified disabilities or other limitations as required by applicable laws. We are an equal opportunity employer, value diversity at our company, and do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.