Monogram Health is a leading multispecialty provider of in-home, evidence-based care for complex patients. The Staff Engineer, Machine Learning Operations, will architect and scale machine learning infrastructure and deployment pipelines while mentoring engineering teams to drive strategic technical decisions that impact patient outcomes.
Responsibilities:
- Architect and maintain enterprise-grade ML infrastructure, including model versioning, automated testing frameworks, containerization strategies, CI/CD pipelines, and comprehensive monitoring systems for model performance, data quality, and drift detection
- Drive MLOps strategy and standards across the organization. Mentor data scientists and engineers on production best practices, system design, and scalable architecture patterns
- Own the complete journey from model development through production deployment, including real-time and batch inference systems, A/B testing frameworks, and automated retraining pipelines
- Collaborate with clinical leaders, product teams, and data scientists to translate complex healthcare requirements into robust, scalable ML solutions. Present technical strategies to executive stakeholders
- Build fault-tolerant, compliant systems that meet healthcare security and privacy standards. Establish SLAs, incident response protocols, and disaster recovery procedures for mission-critical ML services
- Evaluate and integrate cutting-edge MLOps tools and practices. Design systems that scale with Monogram's growth while reducing operational overhead and improving model iteration velocity
Requirements:
- 10+ years in software engineering, including 5+ years focused on ML infrastructure, MLOps, or production ML systems
- 5+ years of Python development with strong software engineering fundamentals
- 3+ years architecting and deploying production ML systems on cloud platforms (Azure preferred)
- Proven track record of building and scaling ML platforms from the ground up
- Expert-level proficiency with MLOps tooling (MLflow, Kubeflow, SageMaker, Azure ML, etc.)
- Deep experience with containerization (Docker, Kubernetes), orchestration tools (Airflow, Prefect), and infrastructure-as-code (Terraform, ARM templates)
- Advanced knowledge of CI/CD systems, automated testing strategies, and GitOps workflows
- Strong data engineering skills: SQL, Spark/PySpark, Databricks, data pipeline optimization
- Expertise in model monitoring, observability, feature stores, and experiment tracking at scale
- Production experience with both batch and real-time inference architectures
- Demonstrated ability to influence technical direction and mentor senior engineers
- Exceptional communication skills with the ability to distill complex technical concepts for diverse audiences
- Track record of driving consensus on architectural decisions across multiple stakeholders
- Bachelor's degree in Computer Science, Engineering, or a related field required
- Systems thinking with a focus on reliability, scalability, and maintainability
- Deep understanding of security, compliance, and privacy requirements in healthcare (HIPAA)
- Bias toward action with a pragmatic approach to technical debt and iterative improvement
- Healthcare or regulated industry experience strongly preferred
- Master's degree or equivalent practical experience preferred
- Understanding of healthcare data standards (FHIR, HL7, claims data) is a plus