Nscale is a rapidly growing AI infrastructure organization seeking a Principal Program Manager for Support Strategy & Operations. This role is critical to designing and operationalizing a scalable Support Delivery Model, partnering with Engineering, Datacenter Operations, Customer Success, and Support Engineering to ensure operational excellence and reliable support for AI infrastructure platforms.
Responsibilities:
- Lead the design and implementation of a scalable AI Cloud Support Delivery Model aligned with ITIL service management principles
- Establish operational frameworks for 24x7 Incident Management, Problem Management, Change Management, and Customer Success Management
- Define global support workflows, escalation paths, service ownership models, and operational governance structures
- Develop documentation and operational playbooks for new support processes and delivery models
- Partner with Support Leadership to define organizational structure, role definitions, and operational responsibilities
- Lead complex cross-functional programs that improve operational efficiency, service reliability, and customer experience
- Develop and manage program roadmaps, milestones, deliverables, and budgets for strategic support initiatives
- Coordinate execution across multiple teams including Engineering, Datacenter Operations, Customer Success, and Support Engineering
- Identify risks, dependencies, and operational gaps, and proactively drive mitigation plans
- Track program health and progress using defined metrics and reporting frameworks
- Drive adoption of ITIL-aligned service management processes across the organization
- Establish and monitor key operational metrics including MTTR, SLA adherence, incident trends, problem resolution effectiveness, and change success rates
- Implement continuous improvement frameworks to mature operational processes and service reliability
- Facilitate post-incident reviews and drive root cause analysis improvements through Problem Management practices
- Act as a strategic liaison between Support, Engineering, Product, and Datacenter Operations to ensure alignment on operational priorities
- Partner with engineering teams to improve service observability, incident response automation, and operational readiness
- Work with Customer Success and account teams to ensure support delivery aligns with enterprise customer expectations and service commitments
- Develop executive-level dashboards and reporting mechanisms to communicate program progress, operational health, and service performance
- Provide leadership with insights and recommendations based on operational metrics and program outcomes
- Establish governance structures to ensure consistent execution of support initiatives and operational standards
- Translate strategic objectives for AI infrastructure support into actionable programs and operational frameworks
- Lead the planning and execution of large-scale initiatives that enable the organization to support rapid customer growth and platform expansion
- Drive alignment between technical teams and operational teams to ensure scalable service delivery
- Identify opportunities to improve operational efficiency through automation, tooling, and process optimization
- Champion a culture of operational excellence, accountability, and continuous improvement across the support organization
Requirements:
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field (Master's degree preferred)
- 7–10+ years of experience in program management, operations strategy, or service delivery within cloud infrastructure, AI infrastructure, or large-scale distributed systems environments
- Experience working within technical support organizations, cloud operations, or site reliability engineering environments
- Proven experience leading cross-functional programs involving engineering, infrastructure, and operations teams
- Experience implementing or operating within ITIL-based service management frameworks
- PMP, PgMP, or ITIL certification is a plus
- Strong understanding of cloud infrastructure operations, AI/ML infrastructure environments, or hyperscale platforms
- Familiarity with ITIL service management practices including Incident, Problem, and Change Management
- Experience working with operational tooling such as ServiceNow, Jira, observability platforms, or incident management systems
- Exceptional ability to manage complex programs across multiple teams and stakeholders
- Strong project planning skills including roadmap development, milestone tracking, and risk management
- Ability to balance strategic thinking with hands-on execution
- Excellent communication skills with the ability to influence both technical teams and executive leadership
- Strong stakeholder management and cross-functional collaboration skills
- Ability to translate technical operational challenges into business insights
- Strong analytical skills with the ability to interpret operational data and derive actionable insights
- Ability to identify systemic operational issues and implement long-term improvements
- Strategic mindset with the ability to build scalable operational frameworks