R1 RCM is building healthcare’s first Revenue Operating System, applying AI to hospital billing and reimbursement. This role owns the production runtime for Phare’s ML stack: deploying and scaling models while ensuring reliability and observability.
Responsibilities:
- You’ll own the production runtime for Phare’s ML stack: deploying, serving, and scaling models across inference endpoints and batch/streaming workflows
- You’ll build progressive delivery pipelines with automated rollouts and rollbacks, manage SLOs for latency and availability, and instrument end-to-end observability (metrics, logs, traces, drift, regression)
- You’ll harden the platform with Terraform, Kubernetes, and CI/CD, ensuring reproducible, auditable ML releases
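To make the progressive-delivery responsibility concrete, here is a minimal sketch of the promote/rollback gate such a pipeline evaluates: canary metrics are compared against the SLO thresholds for latency and availability. All names are hypothetical; a real pipeline would typically delegate this to a tool like Argo Rollouts or Flagger rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass
class SloThresholds:
    """SLO budget the canary must stay within."""
    max_p99_latency_ms: float
    max_error_rate: float


@dataclass
class CanaryMetrics:
    """Observed metrics for the canary over the evaluation window."""
    p99_latency_ms: float
    error_rate: float


def rollout_decision(metrics: CanaryMetrics, slo: SloThresholds) -> str:
    """Promote the canary only if every SLO holds; otherwise roll back."""
    if metrics.p99_latency_ms > slo.max_p99_latency_ms:
        return "rollback"
    if metrics.error_rate > slo.max_error_rate:
        return "rollback"
    return "promote"
```

In practice the same decision function would run at each step of a staged traffic shift (e.g. 5% → 25% → 100%), so a single SLO breach triggers an automated rollback before the release reaches full traffic.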
Requirements:
- At least 5 years of relevant industry experience in software engineering
- At least 2 years of direct MLOps experience
- Experience deploying and operating GPU-backed models in production, serving both API and batch/streaming inference
- Strong proficiency with Docker/Kubernetes, IaC (e.g., Terraform), and CI/CD for both services and model artifacts
- Experience maintaining environment parity, reproducible releases, and robust model/experiment versioning with data lineage
- Experience using progressive delivery with automated rollouts/rollbacks
- Experience building end-to-end observability (metrics, logs, traces, and model telemetry for drift/regression)
- Experience with actionable alerting, runbooks, and incident response
- Experience managing model registries and stage gates
- Experience designing scheduled or event-driven retraining when appropriate
- Experience enforcing RBAC, secrets management, encryption, and audit logs
- Experience in regulated environments (e.g., healthcare, finance)
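As an illustration of the drift/regression telemetry expected above, one common technique is a population stability index (PSI) check comparing a reference distribution of model scores against live traffic. A minimal, dependency-free sketch, assuming equal-width binning and conventional rule-of-thumb thresholds (PSI < 0.1 stable, > 0.25 significant drift), neither of which is specific to this role:

```python
import math


def population_stability_index(expected: list[float], actual: list[float],
                               bins: int = 10) -> float:
    """PSI between a reference and a live distribution of model scores.

    By convention, PSI < 0.1 is treated as stable and PSI > 0.25 as
    significant drift worth alerting on.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def histogram(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        total = len(values)
        # Smooth empty buckets so the log term stays finite.
        return [max(c / total, 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A scheduled job would compute this over each day’s scores and emit the value as model telemetry, so the same alerting stack that watches latency and error-rate SLOs can also page on distribution shift, and can feed the scheduled or event-driven retraining mentioned above.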