Hims & Hers is the leading health and wellness platform, on a mission to help the world feel great through the power of better health. It is seeking a Staff Machine Learning Systems Engineer to design, build, and operate the production infrastructure that powers AI across the company, with a focus on the critical systems that support AI teams in a regulated healthcare environment.
Responsibilities:
- Own and scale the AI compute and deployment platform
- Own and evolve our containerized application deployment platform and related systems for AI workloads, including general process and job orchestration (e.g. Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production
- Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably
- Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release
- Drive efficiency and cost management across compute, autoscaling, and inference infrastructure
- Operate and scale inference infrastructure and a multi-provider LLM gateway (e.g. Bedrock and Vertex), including credentials, rate limits, and failover
- Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level
- Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company
- Own the LLM/AI observability and tracing stack — provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse) — so AI behavior is auditable and debuggable in production
- Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders
- Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability
- Own and improve the monorepo build system and CI/CD pipelines for AI workloads — including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution
- Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily
- Identify and eliminate platform bottlenecks — reducing CI/CD cycle times, build latency, and deployment friction — to improve developer velocity across the Applied AI organization
- Build IAM, OIDC, and secrets management as first-class infrastructure — scoped, least-privilege roles, write-only secret rotation, and cross-account access audits
- Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy-first
- Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant, auditable data access
- Drive multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and observability evolution
- Write and lead technical design documents and design reviews, define infrastructure standards and development-workflow conventions, and contribute to technical governance across AI engineering
- Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, and bridge the gap between prototypes and production-grade systems
Requirements:
- 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering — with at least 3 years focused on ML/AI systems in production
- Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem — autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration
- Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access
- Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines
- 2+ years of experience operating LLM-based systems in production (LLMOps) — inference routing, serving, tracing, and the reliability patterns needed to run them at scale
- Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines
- Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling for fast-moving engineering teams
- A systems-and-operations mindset: you think about failure modes, SLOs, observability, security, and long-term maintainability before shipping
- Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives
- Strong collaboration skills across engineering, ML, product, security, and clinical teams
- A deep appreciation for safety, privacy, and security — ideally with experience in a regulated domain such as healthcare, fintech, or life sciences
- Experience with AWS (EKS, Bedrock, S3, CloudFront, IAM) and multi-cloud (GCP/Vertex AI) inference routing
- Experience with Databricks (MLflow, Unity Catalog, Spark, Delta) and data platform access governance
- Experience provisioning LLM observability infrastructure (Langfuse, ClickHouse, OpenTelemetry/OTLP tracing, LogFire) and LLM behavior monitoring
- Experience with Karpenter, cluster autoscaling, and cost optimization for ML compute
- Experience with monorepo build systems (Pants, Bazel) and large-scale CI/CD
- Experience building automated PR-review / convention-enforcement pipelines and developer-workflow standards
- Familiarity with Vertex AI Agent Builder, Vertex AI Model Registry, or other GCP managed AI/ML services (a stretch growth area)
- Contributions to open-source infrastructure, IaC modules, SDKs, or developer tooling projects