Role Overview

Lead the operational architecture, deployment strategy, and reliability engineering for integrating AI into high-stakes Healthcare Information Systems (HIS)
Define the enterprise operational standards, govern the release processes, and build the resilient infrastructure required to maintain models in mission-critical clinical environments
Architect and govern the comprehensive release process, defining enterprise checklists, automated approval gates, release notes, and deployment readiness standards
Establish deployment execution standards for promoting AI models across all environments, and ensure customer deployments adhere to strict internal production discipline
Architect and oversee the enterprise model registry, ensuring seamless integration with CI/CD pipelines and full version control traceability
Define and enforce monitoring standards, establishing critical SLAs/SLOs, service health metrics, and comprehensive dashboards across the AI ecosystem
Architect automated checks for input/output data quality and model drift, ensuring proactive detection of system degradation
Establish and lead the production incident process, including rigorous triage workflows, severity escalation paths, postmortems, rollback mechanisms, and recovery infrastructure
Partner with Platform teams to provide essential ATO (Authority to Operate) and compliance support, ensuring complete deployment traceability and strict operational controls
Oversee comprehensive operational reporting, providing leadership with status updates across production systems, pre-prod testing, customer rollouts, and incident metrics
Foster a culture of production discipline, guiding junior engineers in maintaining operational runbooks and reliable deployment pipelines

Requirements

Bachelor's degree or higher in Computer Science, Software Engineering, or a related technical field
10+ years of experience in software engineering, with at least 6 years dedicated to deploying and maintaining large-scale ML systems in production
Expert-level experience with Cloud Providers (AWS/GCP/Azure) and orchestration tools (Kubernetes, Kubeflow, or Airflow)
Expert-level proficiency in Python and in Java, Go, or a similar language
Deep proficiency in backend frameworks, microservices, and system design patterns
Expert knowledge of monitoring stacks (Prometheus, Grafana, Datadog) and establishing enterprise SLAs/SLOs for AI services
Proven track record of designing automated deployment pipelines, managing complex rollback procedures, and enforcing model registry governance at scale

Tech Stack

Airflow
AWS
Azure
Cloud
Google Cloud Platform
Grafana
Java
Kubernetes
Microservices
Prometheus
Python
Go

Benefits

Medical
Dental & Vision
Health Savings Accounts
Health Care & Dependent Care Flexible Spending Accounts
Disability Benefits
Life Insurance
Voluntary Benefits
Paid Absences
Retirement Benefits

Principal MLOps Engineer
