Yugabyte is on a mission to become the default transactional database for enterprises building cloud-native applications. They are seeking a Staff Site Reliability Engineer focused on database availability and reliability to operate and automate the lifecycle of their Database as a Service (DBaaS). The role involves designing and building processes for managing databases and participating in incident management to ensure reliable service for customers.

Responsibilities:

Define and drive the technical vision, architecture, and strategy for YugabyteDB’s Database-as-a-Service (DBaaS)
Lead the design, development, testing, debugging, troubleshooting, and maintenance of components of the DBaaS cloud infrastructure
Manage operational priorities of the DBaaS infrastructure
Establish processes for handling and leading response to incidents on databases or infrastructure
Automate and manage regular maintenance operations such as upgrades
Design and build DBaaS processes for encryption, security key and password management, storage management, and related areas
Utilize SRE golden signals to analyze and optimize the DBaaS system's performance and reliability strategies

Requirements:

Strong software design and implementation skills in building infrastructure frameworks
15+ years of experience as an SRE and 5+ years of technical leadership experience
Experience in building and managing large-scale distributed systems
Experience building and operating data systems for production applications, including fault tolerant designs, software lifecycles, and automation of critical operations
Strong track record of incident response and management for a managed service that is mission-critical to its customers
Experience with relational database systems (PostgreSQL preferred)
Experience with Public cloud infrastructure (AWS, GCP, and/or Azure)
Experience with Containerization tooling, theory and design (Docker, Kubernetes)
Experience with Infrastructure as Code (Terraform preferred)
Experience with Configuration Management Tooling (Ansible preferred)
Experience with Automation Scripting (Python and Bash preferred)
Experience with Monitoring systems (Prometheus preferred)
Experience with Version control systems (Git preferred)
Experience with CI/CD systems (GitHub Actions preferred)
Solid understanding of Linux systems operations and troubleshooting
Willingness and ability to learn new languages and concepts
