CVS Health is a leading healthcare organization focused on building a world of health around every individual. The company is seeking an Executive Director, AI Ops Engineering, to lead a team responsible for the continuous operation, monitoring, and optimization of CVS's Enterprise AI environment, ensuring high availability and performance across the platform.
Responsibilities:
- Own the SRE vision, strategy, and long-range roadmap with availability (>99.99%), reliability, and scalability as the primary measures of success
- Lead, develop, and integrate all functional teams into a cohesive, always-on operations organization — setting clear ownership, accountability, and performance expectations for each team and each engineer
- Establish and enforce operational baselines across all platform components; ensure deviations are detected, escalated, and resolved within defined SLAs
- Drive end-to-end observability with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles
- Oversee change management, ensuring every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment
- Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time
- Build and sustain a high-performing 24/7 operations model with zero mandatory overtime, zero burnout-related attrition, and measurable team health and retention
- Empower the Security SRE Lead to implement and maintain a world-class security posture, minimizing risk and ensuring robust compliance with frameworks like HIPAA and NIST AI RMF
- Direct Innovation POD strategy to develop self-healing and autonomous capabilities that proactively prevent degradation before it impacts availability
- Lead GPU FinOps governance — utilization optimization, tenant quota enforcement, and cost reduction — in partnership with the Finance organization
- Manage vendor relationships and performance accountability
- Lead the structured transition of operational ownership from the incumbent managed services provider to CVS's internal SRE organization: govern phased handoffs, competency validation, and milestone sign-offs so that the transition causes minimal disruption to platform availability and business operations
- Establish and lead the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close
Requirements:
- 10+ years in SRE, platform operations, or DevOps engineering leadership with a demonstrated focus on availability and reliability outcomes
- 5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations — with measurable team health, retention, and performance outcomes
- Proven success establishing and enforcing operational baselines, SLO/SLI/error budget frameworks, and observability-driven continuous improvement in complex environments
- Deep expertise in Kubernetes/OpenShift, IaC, GPU computing, and AI/ML infrastructure
- Experience managing large-scale MSP transitions or platform operational handoffs while ensuring business continuity and minimizing disruption
- Demonstrated FinOps and GPU cost optimization experience in cloud or on-premises environments
- Security framework implementation and compliance program management in regulated industries (HIPAA, NIST AI RMF)
- Track record building sustainable 24/7 operations models with measurable retention and no burnout-related attrition
- Executive stakeholder communication, vendor negotiation, and budget ownership
- Background in innovation programs, POD structures, or centers of excellence
- Willingness to travel and work off-hours as required. Our 24/7 model is designed for sustainable, predictable coverage that eliminates mandatory overtime. As a leader, you will be an escalation point for critical incidents, but our goal is a resilient system and culture that protects our team's time.
- Bachelor's in Computer Science, Engineering, or related field
Preferred qualifications:
- NVIDIA AI Enterprise, Run:AI, or GPU orchestration platform experience
- Healthcare or regulated industry background
- Certifications: ITIL Expert, PMP, AWS/Azure/GCP, CISSP
- Familiarity with Cisco UCS, VAST storage, EVPN-VXLAN, and RDMA/RoCE protocols
- Chaos engineering and AI-driven operations experience
- Thought leadership: published work or speaking at industry conferences
- Master's degree