Samsara is the pioneer of the Connected Operations™ Cloud, which enables organizations to harness IoT data for actionable insights. The company is seeking a Staff Machine Learning Infrastructure Engineer to lead the design and evolution of its ML platform, work that affects safety and efficiency across many industries.
Responsibilities:
- Design, build, and operate Samsara’s end-to-end ML platform (training, experimentation, batch/online inference, edge) used by multiple Safety AI product teams
- Evolve shared training and experimentation infrastructure (orchestration, clusters, environments) and standardize tracking, evaluation, and regression testing for fast, safe iteration
- Partner with product and applied ML teams to ship ML-powered features (CV models, EcoDriving insights, LLM-based reporting) that improve safety, reliability, and cost efficiency
- Lead throughput and cost modeling for new ML features—from exploration to production-scale capacity planning—to inform roadmap and go/no-go decisions
- Drive experiment design and evaluation, defining success metrics, structuring A/B or offline tests, and turning results into product and technical decisions
- Design and operate scalable online and batch inference systems (Ray, Spark), including deployment patterns, observability, SLOs, and unified training-to-production workflows
- Partner with firmware and edge teams to package, validate, and deploy models to Samsara devices, and build feedback loops from edge to cloud for continuous improvement
- Own reliability, observability, and security for ML systems across cloud and edge, including on-call practices, incident response, and infrastructure hardening
- Own or co-own end-to-end technical delivery for high-priority or high-risk initiatives, from modeling and system design through production rollout
- Provide Staff+/Senior-Staff technical leadership on ML infrastructure architecture and strategy, influencing cross-team decisions and mentoring engineers and applied scientists
- Drive strong developer experience through documentation, office hours, and best practices, while contributing to and representing Samsara in open source communities (Ray, Spark, RayDP)
- Champion and role-model Samsara’s cultural principles: Focus on Customer Success, Build for the Long Term, Adopt a Growth Mindset, Be Inclusive, Win as a Team
Requirements:
- 10+ years of overall experience in machine learning engineering or related fields, with a strong track record of building and operating large-scale ML systems
- Strong experience with distributed computing frameworks such as Ray and/or Spark
- Hands-on experience with cloud infrastructure (AWS), containers/Kubernetes, and production observability tooling
- Proven experience building or supporting ML platforms (training, experimentation, or inference) used by multiple teams
- Solid understanding of ML fundamentals including evaluation, experiment design, and model iteration in production environments
- Experience shipping ML-powered features end-to-end, from design through production and iteration, with measurable impact on product or business metrics
- Background in computer vision and/or LLM-based systems in production environments
- Experience with edge or on-device ML and collaboration with firmware or embedded teams
- Familiarity with model lifecycle systems (model registry, deployment, monitoring, rollback, drift detection)
- Experience working in environments with strong security and compliance requirements
- Demonstrated ability to lead across teams and influence technical direction at Staff+ scope
- A strong sense of ownership and a desire for end-to-end autonomy—from platform design to real-world impact