Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
Debug infrastructure issues across Kubernetes (pods, controllers), networking, observability, and alerting systems.
Lead incident response during outages or escalations, managing coordination between Product, FDE, Sales, and Engineering.
Serve as the technical owner for top enterprise accounts with strict SLAs and high responsiveness expectations.
Identify common failure modes and translate user feedback into roadmap signals, product improvements, internal runbooks, knowledge bases, and diagnostic best practices.
Own project coordination end-to-end: scoping, execution, communication, and stakeholder alignment across technical and non-technical teams, for work ranging from feature requests and new deployments to operational debugging.
Requirements
Deep Kubernetes troubleshooting expertise, including advanced resource debugging, pod/runtime analysis, and log-based diagnostics using observability tooling such as Grafana, Loki, and Prometheus.
Strong infrastructure debugging ability across container orchestration, networking, and service dependencies, with hands-on experience supporting production-grade clusters.
Experience managing high-severity incidents with major customers, including SLAs, post-incident reviews, and clear communication throughout escalations.
Proven project management and organizational skills with an ownership mindset, able to manage multiple complex, multi-stakeholder initiatives in parallel — including issue resolution, root-cause analysis, and feature delivery.
Ability to translate recurring technical pain points into roadmap-level insights, documentation improvements, or product enhancements.
Strong communication skills and executive presence during high-visibility situations, ensuring technical clarity and customer confidence.
3+ years of experience in a fast-paced, high-growth, or customer-facing engineering environment.
Tech Stack
Grafana
Kubernetes
Prometheus
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.