Zelis is modernizing the healthcare financial experience across payers, providers, and healthcare consumers. This role will lead the next phase of Zelis’ operational transformation by defining and driving the AIOps strategy, combining cloud operations, observability, automation, and AI solutions to enhance operational efficiency and reliability.
Responsibilities:
- Lead the AIOps strategy and architecture for the Zelis Price Business Unit as we modernize operations alongside our AWS migration and AI-native acceleration
- Define and implement intelligent operational patterns that improve incident detection, triage, remediation, root cause analysis, and operational decision-making
- Architect and scale observability capabilities across cloud and application environments, including metrics, logs, traces, dashboards, alerting, and service health visibility
- Drive operational excellence across AWS environments by establishing scalable patterns for monitoring, resilience, reliability, automation, and governance
- Design and implement AIOps capabilities using AI, agents, and agentic workflows to support incident response, anomaly detection, alert correlation, noise reduction, troubleshooting, and operational automation
- Lead development of agentic and multi-agent operational solutions that coordinate across monitoring, diagnostics, knowledge retrieval, remediation workflows, and operator assistance
- Build and mature ChatOps capabilities to improve collaboration, visibility, response speed, and workflow automation across engineering and operations teams
- Partner with SRE, Cloud Engineering, Infrastructure, Security, and Application teams to embed reliability engineering practices into operational processes and platform design
- Establish standards for operational telemetry, service-level objectives, alert quality, escalation workflows, incident readiness, and post-incident learning
- Drive adoption and optimization of observability tools such as New Relic, OpenSearch, and related monitoring, logging, and analytics platforms
- Identify opportunities to apply AI to reduce manual operational effort, shorten mean time to detect and resolve issues, and increase platform stability and operator productivity
- Ensure AIOps solutions are implemented with strong governance, security, auditability, and operational trustworthiness
- Create playbooks, standards, reusable patterns, and operating models that scale AIOps adoption across teams
- Mentor engineers and operators in modern operations practices spanning observability, automation, SRE, ChatOps, and AI-assisted operations
Requirements:
- Typically BS + 12 years or MS + 10 years (or equivalent), with a strong track record leading cloud operations, platform operations, SRE, observability, or AIOps initiatives across complex enterprise environments
- Strong hands-on experience designing and operating workloads on AWS, with expertise across compute, networking, storage, security, automation, and cloud operations patterns
- Deep experience with modern observability and monitoring platforms such as New Relic, OpenSearch, and related tools for metrics, logs, traces, dashboards, alerting, and operational analytics
- Proven experience applying AI to operations use cases such as event correlation, anomaly detection, alert reduction, root cause analysis, remediation support, and operational workflow automation
- Strong experience designing or implementing AI agents, agentic workflows, or multi-agent systems that improve operational processes and operator effectiveness
- Strong grounding in site reliability engineering principles, including service reliability, SLOs/SLIs, error budgets, automation, incident management, resilience, and continuous improvement
- Demonstrated success building or scaling ChatOps practices that improve collaboration, incident response, and operational execution through integrated messaging and workflow automation
- Strong knowledge of scripting, infrastructure automation, operational tooling, APIs, event-driven systems, and platform integration patterns
- Ability to translate operational pain points into scalable technical solutions that improve reliability, speed, and operational maturity
- Ability to influence technical teams and senior leaders, build alignment across functions, and communicate complex operational strategies clearly
- Experience implementing operational AI responsibly with appropriate controls for accuracy, security, compliance, explainability, and human oversight
- Experience leading AIOps or intelligent operations initiatives in a cloud-first or large-scale enterprise environment
- Experience supporting AWS migration programs and modern cloud operating models
- Familiarity with incident management tooling, runbook automation, knowledge systems, and operational workflow orchestration platforms
- Experience integrating AI agents with observability, ticketing, collaboration, or operational systems
- Experience in healthcare, regulated environments, or other domains requiring strong reliability and compliance practices
- Exposure to platform engineering, DevOps, and developer experience practices that intersect with operational excellence