SambaNova is building the future of AI computing, focusing on generative AI and high-performance computing. As a Senior Cloud Site Reliability Engineer, you will ensure the reliability and performance of the AI Inferencing Service, bridging the gap between software development and operations.
Responsibilities:
- Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions
- Participate in a balanced on-call rotation to provide 24/7 support for the service
- Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence
- Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization
- Proactively identify and eliminate performance bottlenecks
- Design and implement auto-scaling policies to handle variable inference loads cost-effectively
- Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable
- Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates
- Forecast infrastructure needs based on product roadmaps and usage trends
- Work with finance and engineering teams to manage cloud costs and optimize spending
- Define, measure, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments
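To make the SLO/SLI responsibility concrete, here is a minimal, hedged sketch of how an availability SLI and its remaining error budget might be computed. All names and numbers (the 99.9% target, the request counts) are illustrative assumptions, not SambaNova's actual targets or tooling.

```python
# Hypothetical sketch: an availability SLI and error-budget calculation
# for an inferencing endpoint. Targets and counts are illustrative only.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests

def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    The budget is (1 - slo_target); the spend is (1 - sli).
    Returns 0.0 once the budget is exhausted.
    """
    budget = 1.0 - slo_target
    spent = 1.0 - sli
    if budget == 0:
        return 0.0
    return max(0.0, 1.0 - spent / budget)

if __name__ == "__main__":
    # Assumed 99.9% availability SLO; 9,995 of 10,000 requests succeeded.
    sli = availability_sli(9_995, 10_000)
    remaining = error_budget_remaining(sli, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

In practice these numbers would come from monitoring queries (e.g., Prometheus counters) rather than literals, and the remaining budget would gate decisions such as whether to ship risky changes or prioritize reliability work.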
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- 5-8+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure)
- Strong programming/scripting skills in languages like Python, Go, or Java
- Proven experience with containerization and orchestration technologies (Docker, Kubernetes)
- Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog)
- Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation)
- Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD)
- Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems
- Experience in a hybrid environment bridging cloud and on-premise/data center infrastructure
- Direct experience supporting ML/AI inferencing services in production
- Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs, as a basis for mapping those workloads to SambaNova RDUs
- Knowledge of model serving frameworks such as vLLM, SGLang, or Ray
- Understanding of MLOps principles and practices
- Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached)
- Strong Linux/Unix system administration fundamentals