Yugabyte is on a mission to become the default transactional database for enterprises building cloud-native applications. They are seeking a Staff Site Reliability Engineer to focus on database availability and reliability, operating and automating the life cycle of the YugabyteDB Database as a Service (DBaaS). The role involves designing and building infrastructure processes, managing operational priorities, and participating in incident response for the DBaaS infrastructure.
Responsibilities:
- Define and drive the technical vision, architecture, and strategy for YugabyteDB’s Database-as-a-Service (DBaaS)
- Lead the design, development, testing, debugging, troubleshooting, and maintenance of components of the DBaaS cloud infrastructure
- Manage operational priorities of the DBaaS infrastructure
- Establish processes for handling incidents on databases or infrastructure, and lead the response to them
- Automate and manage regular maintenance operations such as upgrades
- Design and build DBaaS processes for areas such as encryption, security key and password management, and storage management
- Use the SRE golden signals (latency, traffic, errors, and saturation) to analyze and optimize the performance and reliability of the DBaaS system
Requirements:
- Strong software design and implementation skills in building infrastructure frameworks
- 15+ years of experience as an SRE and 5+ years of technical leadership experience
- Experience in building and managing large-scale distributed systems
- Experience building and operating data systems for production applications, including fault-tolerant designs, software lifecycles, and automation of critical operations
- Strong track record of incident response and management for a managed service that is mission-critical for its customers
- Experience with Relational Database systems (PostgreSQL preferred)
- Experience with Public cloud infrastructure (AWS, GCP, and/or Azure)
- Experience with Containerization tooling, theory, and design (Docker, Kubernetes)
- Experience with Infrastructure as Code (Terraform preferred)
- Experience with Configuration Management Tooling (Ansible preferred)
- Experience with Automation Scripting (Python and Bash preferred)
- Experience with Monitoring systems (Prometheus preferred)
- Experience with Version control systems (Git preferred)
- Experience with CI/CD systems (GitHub Actions preferred)
- Solid understanding of Linux systems operations and troubleshooting
- Willingness and ability to learn new languages and concepts