Google, a leading technology company, is seeking a Senior Software Engineer for its Vertex AI Platform SRE team. The role focuses on building and maintaining the infrastructure for third-party AI workloads on Google Cloud's Vertex AI, with an emphasis on reliability, performance, scalability, and efficiency.
Responsibilities:
- Manage and scale Google Kubernetes Engine (GKE) fleets (approximately 60,000 clusters and growing), including specialized mega-clusters designed for AI/ML models
- Optimize model serving for low latency and high throughput across all layers of the platform stack to maximize AI/ML inference performance
- Architect and maintain connectivity and interactions between the internal control planes and the external data planes hosted on Google Cloud Platform (GCP) in a hybrid environment
- Implement and configure essential cloud technologies, including Istio for service mesh management, advanced load balancing, Google Compute Engine (GCE), and various GCP networking and security components
Requirements:
- Bachelor's degree in Computer Science, a related field, or equivalent practical experience
- 5 years of experience with software development in one or more programming languages
- 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems
- 2 years of experience leading projects and providing technical leadership
- Experience with cloud compute platforms (e.g., Kubernetes, Google Cloud Functions)
- Experience in site reliability engineering, system design, and distributed computing
Preferred qualifications:
- Master's degree in Computer Science or Engineering
- Experience with Kubernetes, Google Cloud Platform (GCP), GKE Networking, and Istio
- Demonstrated passion for technology, with the technical depth to uncover the root causes of technical problems and guide others in solving them