Zoom is seeking a Senior Site Reliability Engineer to support its Kubernetes platforms and customer-facing data systems. The role focuses on improving system reliability, scalability, and operations across distributed infrastructure and data platforms, in collaboration with the Infrastructure, Data Platform, and Application Engineering teams.
Responsibilities:
- Support Kubernetes platforms and customer-facing data systems
- Improve system reliability, scalability, and day-to-day operations across distributed infrastructure and data platforms
- Partner with Infrastructure, Data Platform, and Application Engineering teams to reduce operational workload, improve incident response, and drive automation across multi-region environments
Requirements:
- Have 6+ years of experience in SRE, Platform Engineering, or Infrastructure roles
- Show hands-on experience with Kubernetes (K8s) in production environments
- Have experience in Linux systems, networking fundamentals, and distributed systems
- Show experience with monitoring and observability tools (Prometheus, Grafana, Datadog, PagerDuty, etc.)
- Demonstrate effective programming/scripting skills in Python, Go, or Shell
- Be able to build and operate CI/CD pipelines (GitHub Actions, Jenkins, ArgoCD, etc.) and support data platforms (Spark, Trino/Presto, Airflow, Kafka) in production
- Have hands-on experience with cloud platforms (AWS, GCP, Azure), along with incident management, troubleshooting, and root cause analysis (RCA) skills
- Have experience in data platform reliability and automation, and in AI-assisted operations (a bonus)