Zelis is modernizing the healthcare financial experience across payers, providers, and healthcare consumers. This role will lead the next phase of Zelis’ operational transformation by defining and driving the AIOps strategy, combining cloud operations, observability, automation, and AI solutions to enhance operational efficiency and reliability.

Responsibilities:

Lead the AIOps strategy and architecture for the Zelis Price Business Unit as we modernize operations alongside our AWS migration and AI-native acceleration
Define and implement intelligent operational patterns that improve incident detection, triage, remediation, root cause analysis, and operational decision-making
Architect and scale observability capabilities across cloud and application environments, including metrics, logs, traces, dashboards, alerting, and service health visibility
Drive operational excellence across AWS environments by establishing scalable patterns for monitoring, resilience, reliability, automation, and governance
Design and implement AIOps capabilities using AI, agents, and agentic workflows to support incident response, anomaly detection, alert correlation, noise reduction, troubleshooting, and operational automation
Lead development of agentic and multi-agent operational solutions that coordinate across monitoring, diagnostics, knowledge retrieval, remediation workflows, and operator assistance
Build and mature ChatOps capabilities to improve collaboration, visibility, response speed, and workflow automation across engineering and operations teams
Partner with SRE, Cloud Engineering, Infrastructure, Security, and Application teams to embed reliability engineering practices into operational processes and platform design
Establish standards for operational telemetry, service-level objectives, alert quality, escalation workflows, incident readiness, and post-incident learning
Drive adoption and optimization of observability tools such as New Relic, OpenSearch, and related monitoring, logging, and analytics platforms
Identify opportunities to apply AI to reduce manual operational effort, improve mean time to detect and resolve issues, and increase platform stability and operator productivity
Ensure AIOps solutions are implemented with strong governance, security, auditability, and operational trustworthiness
Create playbooks, standards, reusable patterns, and operating models that scale AIOps adoption across teams
Mentor engineers and operators in modern operations practices spanning observability, automation, SRE, ChatOps, and AI-assisted operations

Requirements:

Typically BS + 12 years or MS + 10 years (or equivalent), with a strong track record leading cloud operations, platform operations, SRE, observability, or AIOps initiatives across complex enterprise environments
Strong hands-on experience designing and operating workloads on AWS, with expertise across compute, networking, storage, security, automation, and cloud operations patterns
Deep experience with modern observability and monitoring platforms such as New Relic, OpenSearch, and related tools for metrics, logs, traces, dashboards, alerting, and operational analytics
Proven experience applying AI to operations use cases such as event correlation, anomaly detection, alert reduction, root cause analysis, remediation support, and operational workflow automation
Strong experience designing or implementing AI agents, agentic workflows, or multi-agent systems that improve operational processes and operator effectiveness
Strong grounding in site reliability engineering principles, including service reliability, SLOs/SLIs, error budgets, automation, incident management, resilience, and continuous improvement
Demonstrated success building or scaling ChatOps practices that improve collaboration, incident response, and operational execution through integrated messaging and workflow automation
Strong knowledge of scripting, infrastructure automation, operational tooling, APIs, event-driven systems, and platform integration patterns
Ability to translate operational pain points into scalable technical solutions that improve reliability, speed, and operational maturity
Ability to influence technical teams and senior leaders, build alignment across functions, and communicate complex operational strategies clearly
Experience implementing operational AI responsibly with appropriate controls for accuracy, security, compliance, explainability, and human oversight
Experience leading AIOps or intelligent operations initiatives in a cloud-first or large-scale enterprise environment
Experience supporting AWS migration programs and modern cloud operating models
Familiarity with incident management tooling, runbook automation, knowledge systems, and operational workflow orchestration platforms
Experience integrating AI agents with observability, ticketing, collaboration, or operational systems
Experience in healthcare, regulated environments, or other domains requiring strong reliability and compliance practices
Exposure to platform engineering, DevOps, and developer experience practices that intersect with operational excellence

AIOps Lead, Software Engineering
