Monogram Health is a leading multispecialty provider of in-home, evidence-based care for complex patients with multiple chronic conditions. They are seeking a Staff Engineer in Machine Learning Operations to architect and scale machine learning infrastructure while mentoring teams and driving strategic decisions that impact patient outcomes.
Responsibilities:
- Architect and maintain enterprise-grade ML infrastructure, including model versioning, automated testing frameworks, containerization strategies, CI/CD pipelines, and comprehensive monitoring systems for model performance, data quality, and drift detection
- Drive MLOps strategy and standards across the organization. Mentor data scientists and engineers on production best practices, system design, and scalable architecture patterns
- Own the complete journey from model development through production deployment, including real-time and batch inference systems, A/B testing frameworks, and automated retraining pipelines
- Collaborate with clinical leaders, product teams, and data scientists to translate complex healthcare requirements into robust, scalable ML solutions. Present technical strategies to executive stakeholders
- Build fault-tolerant, compliant systems that meet healthcare security and privacy standards. Establish SLAs, incident response protocols, and disaster recovery procedures for mission-critical ML services
- Evaluate and integrate cutting-edge MLOps tools and practices. Design systems that scale with Monogram's growth while reducing operational overhead and improving model iteration velocity
Requirements:
- Bachelor's degree in computer science, engineering, or a related field required; master's degree preferred
- Minimum of ten (10) years in software engineering, including five (5) years focused on ML infrastructure, MLOps, or production ML systems and Python development, with strong software engineering fundamentals, and three (3) years architecting and deploying production ML systems on cloud platforms (Azure preferred)
- Proven track record of building and scaling ML platforms from the ground up
- Expert-level proficiency with MLOps tooling (MLflow, Kubeflow, SageMaker, Azure ML, etc.)
- Deep experience with containerization (Docker, Kubernetes), orchestration tools (Airflow, Prefect), and infrastructure-as-code (Terraform, ARM templates)
- Advanced knowledge of CI/CD systems, automated testing strategies, and GitOps workflows
- Data engineering skills: SQL, Spark/PySpark, Databricks, data pipeline optimization
- Expertise in model monitoring, observability, feature stores, and experiment tracking at scale
- Production experience with both batch and real-time inference architectures
- Demonstrated ability to influence technical direction and mentor senior engineers
- Proven communication skills, with the ability to distill complex technical concepts for diverse audiences
- Track record of driving consensus on architectural decisions across multiple stakeholders
- Systems-thinking skills with a focus on reliability, scalability, and maintainability
- Healthcare or regulated industry experience strongly preferred
- Understanding of healthcare data standards (FHIR, HL7, claims data) is a plus
- Understanding of security, compliance, and privacy requirements in healthcare (HIPAA) preferred
- Bias toward action, with a pragmatic approach to technical debt and iterative improvement, preferred