Udemy is an AI-powered skills acceleration platform built to help people and teams grow. As a Principal Database Reliability Engineer, you will oversee the Datastore Infrastructure team, ensuring uptime, security, and performance of various database systems while collaborating with engineering and product teams to drive positive change.

Responsibilities:

Lead improvement projects for our datastores and platform teams to align with the company’s long-term objectives
Maintain infrastructure uptime, monitor performance, and ensure the infrastructure continues to scale as we grow
Develop immutable infrastructure patterns and automate infrastructure provisioning via code (Terraform, Python, Ansible, etc.)
Ensure adherence to PCI, ISO 27001, and SOC 2 compliance and security requirements, modifying CI/CD processes when necessary and upholding policies and standards
Advocate for and implement positive changes in tools and processes through healthy discussions
Participate in the on-call rotation, demonstrating a systematic approach to incident management
Participate in day-to-day activities, support requests, and project-related tasks for the team
Contribute to documentation, maintain ticketing queues, provide project support, troubleshoot, and offer after-hours assistance as required
Provide coaching and mentorship to new hires, fostering their technical growth and integration into the team, and maintain close communication with team members throughout their tenure

Requirements:

8-10 years of professional experience on a Cloud Engineering, SRE, or DBRE team, with infrastructure responsibilities for managing large production workloads
Proficiency in managing MySQL at scale (horizontal scaling, sharding, InnoDB optimizations, query optimization, HA/DR, monitoring, backup strategy, security, automation)
Strong understanding of running production workloads in Kubernetes
Proficiency with tools like Terraform, Ansible, and Git, and with Infrastructure as Code and automated provisioning
Strong experience in Kafka cluster management, topic configuration, performance tuning, and ensuring high availability and fault tolerance. Experience with MSK is a plus
Experience with message queues (MQ/SQS) and caching (Redis, Memcached) or similar products
Experience in Python
Knowledge of configuration management tools, monitoring systems (Datadog or similar) for database infrastructure, and scaling strategies for handling increased data volumes
Strong troubleshooting skills to diagnose complex database issues
Hands-on experience with AWS cloud infrastructure and a grasp of security best practices
Adaptability and comfort working in a fast-paced, hands-on environment
Experience with additional programming languages (Golang, Kotlin, Java)
Experience in implementing CDC pipelines for reliable data replication and synchronization
Experience running MySQL on Kubernetes with the Vitess Operator
Experience writing Helm charts for Kubernetes
Experience with tools like Argo CD and Argo Workflows, or similar alternatives
Knowledge of security standards, vulnerability patching, TLS/SSL, and related topics
Any additional experience or familiarity with related technologies would be advantageous