Samsara is the pioneer of the Connected Operations™ Cloud, enabling organizations to harness IoT data for actionable insights. The company is seeking a Staff Machine Learning Infrastructure Engineer to lead the design and evolution of its ML platform, which improves safety and efficiency for customers across a range of industries.

Responsibilities:

Design, build, and operate Samsara’s end-to-end ML platform (training, experimentation, batch/online inference, edge) used by multiple Safety AI product teams
Evolve shared training and experimentation infrastructure (orchestration, clusters, environments) and standardize tracking, evaluation, and regression testing for fast, safe iteration
Partner with product and applied ML teams to ship ML-powered features (CV models, EcoDriving insights, LLM-based reporting) that improve safety, reliability, and cost efficiency
Lead throughput and cost modeling for new ML features—from exploration to production-scale capacity planning—to inform roadmap and go/no-go decisions
Drive experiment design and evaluation, defining success metrics, structuring A/B or offline tests, and turning results into product and technical decisions
Design and operate scalable online and batch inference systems (Ray, Spark), including deployment patterns, observability, SLOs, and unified training-to-production workflows
Partner with firmware and edge teams to package, validate, and deploy models to Samsara devices, and build feedback loops from edge to cloud for continuous improvement
Own reliability, observability, and security for ML systems across cloud and edge, including on-call practices, incident response, and infrastructure hardening
Own or co-own end-to-end technical delivery for high-priority or high-risk initiatives, from modeling and system design through production rollout
Provide Staff+/Senior-Staff technical leadership on ML infrastructure architecture and strategy, influencing cross-team decisions and mentoring engineers and applied scientists
Drive a strong developer experience through documentation, office hours, and best practices, while contributing to and representing Samsara in open source communities (Ray, Spark, RayDP)
Champion and role model Samsara’s cultural principles: Focus on Customer Success, Build for the Long Term, Adopt a Growth Mindset, Be Inclusive, Win as a Team

Requirements:

10+ years of overall experience in machine learning engineering or related fields, with a strong track record of building and operating large-scale ML systems
Strong experience with distributed computing frameworks such as Ray and/or Spark
Hands-on experience with cloud infrastructure (AWS), containers/Kubernetes, and production observability tooling
Proven experience building or supporting ML platforms (training, experimentation, or inference) used by multiple teams
Solid understanding of ML fundamentals including evaluation, experiment design, and model iteration in production environments
Experience shipping ML-powered features end-to-end, from design through production and iteration, with measurable impact on product or business metrics
Background in computer vision and/or LLM-based systems in production environments
Experience with edge or on-device ML and collaboration with firmware or embedded teams
Familiarity with model lifecycle systems (model registry, deployment, monitoring, rollback, drift detection)
Experience working in environments with strong security and compliance requirements
Demonstrated ability to lead across teams and influence technical direction at Staff+ scope
A strong sense of ownership and a desire for end-to-end autonomy—from platform design to real-world impact

Staff ML Engineer - ML Infrastructure