Baseten powers mission-critical inference for the world's most dynamic AI companies, enabling them to bring cutting-edge models into production. The role involves serving as the primary post-sales technical owner for strategic customers, ensuring smooth deployment and performance of ML workloads.

Responsibilities:

Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management
Debug infrastructure issues across Kubernetes (pods, controllers), networking, observability, and alerting systems
Lead incident response during outages or escalations, managing coordination between Product, FDE, Sales, and Engineering
Serve as the technical owner for top enterprise accounts with strict SLAs and high responsiveness expectations
Identify common failure modes and translate user feedback into roadmap signals, product improvements, internal runbooks, knowledge-base articles, and diagnostic best practices
Own project coordination end-to-end, including scoping, execution, communication, and stakeholder alignment across technical and non-technical teams, for work ranging from feature requests and new deployments to operational debugging

Requirements:

Deep Kubernetes troubleshooting expertise, including advanced resource debugging, pod/runtime analysis, and log-based diagnostics using observability tooling such as Grafana, Loki, and Prometheus
Strong infrastructure debugging ability across container orchestration, networking, and service dependencies, with hands-on experience supporting production-grade clusters
Experience managing high-severity incidents with major customers, including SLAs, post-incident reviews, and clear communication throughout escalations
Proven project management and organizational skills with an ownership mindset, including the ability to run multiple complex, multi-stakeholder initiatives in parallel, such as issue resolution, root-cause analysis, and feature delivery
Ability to translate recurring technical pain points into roadmap-level insights, documentation improvements, or product enhancements
Strong communication skills and executive presence during high-visibility situations, ensuring technical clarity and customer confidence
3+ years of experience in a fast-paced, high-growth, or customer-facing engineering environment
Familiarity with running high-performance AI models and workloads, including troubleshooting ML pipelines from preprocessing through inference and serving
Experience implementing or managing ticketing and incident-response systems such as Zendesk or Pylon
Familiarity with Helm, Flux, CI/CD tooling, or scripting automations to improve deployment, release, or operational workflows

Site Reliability Engineer
