Architect and deliver end-to-end AIOps and observability solutions, covering data collection, ingestion, correlation, analytics, dashboards, and operational workflows
Design and implement integrations between monitoring/observability platforms and ITSM tools using APIs and service interfaces
Define event and alert management strategies, including de-duplication, noise reduction, anomaly detection, root-cause analysis, and actionable alerting
Design, operationalize, and govern self-healing and runbook automation workflows triggered by events and incidents within ITIL-aligned processes
Establish dashboards, KPIs, and SLA reporting frameworks; define measurement models to track operational efficiency, business outcomes, and ROI
Lead technical POVs, demos, and architecture reviews; conduct tool evaluations and defend solution designs with senior customer stakeholders
Guide onshore and offshore engineering teams through architecture standards, HLD/LLD creation, backlog prioritization, and delivery governance; support RFP solutioning with architecture, roadmap, and estimates
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related discipline (or equivalent practical experience)
12+ years of experience in IT operations, managed services, application support, or transformation roles with significant architecture responsibility
Proven hands-on experience implementing AIOps and observability solutions across at least two enterprise platforms (e.g., Splunk, Dynatrace, Datadog, AppDynamics, Elastic)
Strong expertise in ITSM and ITIL processes, with demonstrated experience integrating monitoring, event management, and automation into ITSM platforms
Solid background in automation and orchestration, including scripting proficiency (Python, shell, and/or PowerShell) for prototyping and integrations
Experience designing or enabling GenAI and agentic AI use cases in IT operations, such as assisted triage, knowledge grounding, and runbook co-pilots
Excellent communication and stakeholder management skills, with the ability to present and defend technical solutions to customer and executive audiences and to influence their decisions
Tech Stack
ITSM
Python
Splunk
Benefits
Paid time off based on employee grade (A-F), defined by policy:
Vacation: 12-25 days, depending on grade
Company-paid holidays
Personal days
Sick leave
Medical, dental, and vision coverage (or provincial healthcare coordination in Canada)
Retirement savings plans (e.g., 401(k) in the U.S., RRSP in Canada)
Life and disability insurance
Employee assistance programs
Other benefits as provided by local policy and eligibility