Baseten powers mission-critical inference for the world's most dynamic AI companies, enabling them to bring cutting-edge models into production. The role involves serving as the primary post-sales technical owner for strategic customers, ensuring smooth deployment and performance of ML workloads.
Responsibilities:
- Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management
- Debug infrastructure issues across Kubernetes (pods, controllers), networking, observability, and alerting systems
- Lead incident response during outages or escalations, managing coordination between Product, FDE, Sales, and Engineering
- Serve as the technical owner for top enterprise accounts with strict SLAs and high responsiveness expectations
- Identify common failure modes and translate user feedback into roadmap signals, product improvements, and updates to internal runbooks, knowledge bases, and diagnostic best practices
- Own project coordination end-to-end, including scoping, execution, communication, and stakeholder alignment across technical and non-technical teams, for work ranging from feature requests and new deployments to operational debugging
Requirements:
- Deep Kubernetes troubleshooting expertise, including advanced resource debugging, pod/runtime analysis, and log-based diagnostics using observability tooling such as Grafana, Loki, and Prometheus
- Strong infrastructure debugging ability across container orchestration, networking, and service dependencies, with hands-on experience supporting production-grade clusters
- Experience managing high-severity incidents with major customers, including SLAs, post-incident reviews, and clear communication throughout escalations
- Proven project management and organizational skills with an ownership mindset, able to manage multiple complex, multi-stakeholder initiatives in parallel, including issue resolution, root-cause analysis, and feature delivery
- Ability to translate recurring technical pain points into roadmap-level insights, documentation improvements, or product enhancements
- Strong communication skills and executive presence during high-visibility situations, ensuring technical clarity and customer confidence
- 3+ years of experience in a fast-paced, high-growth, or customer-facing engineering environment
- Familiarity with running high-performance AI models and workloads, including troubleshooting ML pipelines from preprocessing through inference and serving
- Experience implementing or managing ticketing and incident-response systems such as Zendesk or Pylon
- Familiarity with Helm, Flux, CI/CD tooling, or scripting automations to improve deployment, release, or operational workflows