Hims & Hers is the leading health and wellness platform, on a mission to help the world feel great through the power of better health. They are seeking a Staff Machine Learning Systems Engineer to design, build, and operate the production infrastructure that powers AI across the company, focusing on critical systems that support AI teams in a regulated healthcare environment.

Responsibilities:

Own and scale the AI compute and deployment platform
Own and evolve our containerized deployment platform and related systems for AI workloads, including general process and job orchestration (e.g. Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production
Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably
Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release
Drive efficiency and cost management across compute, autoscaling, and inference infrastructure
Operate and scale inference infrastructure and a multi-provider LLM gateway (e.g. Bedrock, Vertex, and other providers), including credentials, rate limits, and failover
Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level
Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company
Own the LLM/AI observability and tracing stack — provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse) — so AI behavior is auditable and debuggable in production
Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders
Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability
Own and improve the monorepo build system and CI/CD pipelines for AI workloads — including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution
Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily
Identify and eliminate platform bottlenecks — reducing CI/CD cycle times, build latency, and deployment friction — to improve developer velocity across the Applied AI organization
Build IAM, OIDC, and secrets management as first-class infrastructure — scoped, least-privilege roles, write-only secret rotation, and cross-account access audits
Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy-first
Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant, auditable data access
Drive multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and observability evolution
Write and lead technical design documents and design reviews, define infrastructure standards and development-workflow conventions, and contribute to technical governance across AI engineering
Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, and bridge the gap between prototypes and production-grade systems

Requirements:

8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering — with at least 3 years focused on ML/AI systems in production
Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem — autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration
Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access
Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines
2+ years of experience operating LLM-based systems in production (LLMOps) — inference routing, serving, tracing, and the reliability patterns needed to run them at scale
Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines
Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling for fast-moving engineering teams
A systems-and-operations mindset: you think about failure modes, SLOs, observability, security, and long-term maintainability before shipping
Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives
Strong collaboration skills across engineering, ML, product, security, and clinical teams
A deep appreciation for safety, privacy, and security — ideally with experience in a regulated domain such as healthcare, fintech, or life sciences
Experience with AWS (EKS, Bedrock, S3, CloudFront, IAM) and multi-cloud (GCP/Vertex AI) inference routing
Experience with Databricks (MLflow, Unity Catalog, Spark, Delta) and data platform access governance
Experience provisioning LLM observability infrastructure (Langfuse, ClickHouse, OpenTelemetry/OTLP tracing, LogFire) and LLM behavior monitoring
Experience with Karpenter, cluster autoscaling, and cost optimization for ML compute
Experience with monorepo build systems (Pants, Bazel) and large-scale CI/CD
Experience building automated PR-review / convention-enforcement pipelines and developer-workflow standards
Familiarity with Vertex AI Agent Builder, Vertex AI Model Registry, or GCP managed AI/ML services, as a stretch or growth area
Contributions to open-source infrastructure, IaC modules, SDKs, or developer tooling projects

Staff Machine Learning Systems Engineer (MLOps)
