Udemy is an AI-powered skills acceleration platform built to help people and teams grow. The Principal Cloud Engineer - Datastores will oversee the Datastore Infrastructure team, ensuring uptime, security, compliance, and performance of databases and related systems, while driving innovation and collaboration across teams.
Responsibilities:
- Lead improvement projects for our datastores and platform teams to align with the company's long-term objectives
- Maintain infrastructure uptime, monitor performance, and ensure the infrastructure continues to scale as we grow
- Develop immutable infrastructure patterns and automate infrastructure provisioning via code (Terraform, Python, Ansible, etc.)
- Ensure adherence to PCI and ISO 27001 compliance as well as SOC 2 security requirements, modifying CI/CD processes when necessary, and upholding policies and standards
- Advocate for and implement positive changes in tools and processes through healthy discussions
- Participate in the on-call rotation, demonstrating a systematic approach to incident management
- Participate in day-to-day activities, support requests, and project-related tasks for the team
- Contribute to documentation, maintain ticketing queues, provide project support, troubleshoot, and offer after-hours assistance as required
- Provide coaching and mentorship to new hires, fostering their technical growth and integration into the team, and maintain close communication with team members throughout their tenure
Requirements:
- 8-10 years of professional experience on a Cloud Engineering team (or an SRE/DBRE team) with infrastructure responsibilities for managing large production workloads
- Proficiency in managing MySQL at scale (horizontal scaling, sharding, InnoDB optimization, query optimization, HA/DR, monitoring, backup strategy, security, automation)
- Strong understanding of running production workloads on Kubernetes
- Proficiency with tools like Terraform, Ansible, and Git, and with Infrastructure as Code and automated provisioning
- Strong experience in Kafka cluster management, topic configuration, performance tuning, and ensuring high availability and fault tolerance. Experience with MSK is a plus
- Experience with message queues (MQ/SQS) and caching (Redis, Memcached) or similar products
- Experience in Python
- Knowledge of configuration management tools, monitoring systems (Datadog or similar) for database infrastructure, and scaling strategies for handling growing data volumes
- Strong troubleshooting skills to diagnose complex database issues
- Hands-on experience with AWS cloud infrastructure and a grasp of security best practices
- Adaptability and comfort working in a fast-paced, hands-on environment
- Experience with additional programming languages (Golang, Kotlin, Java)
- Experience in implementing CDC pipelines for reliable data replication and synchronization
- Experience with the Vitess Operator for running MySQL on Kubernetes
- Experience writing Kubernetes Helm charts
- Experience with tools like ArgoCD/Argo Workflows or similar alternatives
- Knowledge of security standards, vulnerability patching, TLS/SSL, and related topics
- Any additional experience or familiarity with related technologies would be advantageous