HealthEquity is dedicated to empowering healthcare consumers to save and improve lives. The Observability Engineer I will play a foundational role in ensuring the reliability, performance, and visibility of critical IT infrastructure and business systems by improving monitoring, alert accuracy, and incident response.

Responsibilities:

Enable faster incident response by improving monitoring coverage, alert accuracy, and root cause visibility
Help teams shift from reactive to proactive operations by applying telemetry data and AI-driven insights
Empower service owners with clear dashboards and actionable insights that guide performance improvements
Improve system resilience through continuous feedback and collaboration with other internal teams
Drive data-informed decisions by transforming raw logs and metrics into insights that support business outcomes
Support onboarding of new systems and applications into our observability platforms
Maintain and troubleshoot dashboards, alerts, and telemetry integrations
Collaborate closely with ITSM, application support, and infrastructure teams to improve incident detection and root cause analysis
Follow defined operational procedures and participate in change and incident management workflows
Document technical procedures, runbooks, and monitoring standards
Learn and apply observability best practices while developing skills in automation and data analytics

Requirements:

Foundational knowledge of IT infrastructure, applications, and networking concepts (e.g., servers, databases, APIs, web services, cloud platforms)
Curiosity and attention to detail when investigating alerts, logs, metrics, and performance trends
Basic experience or coursework with monitoring/logging tools (e.g., Dynatrace, Splunk, Prometheus, Grafana, ELK, or similar)
A solid working understanding of Dynatrace, LogicMonitor, Prometheus, OpenTelemetry, ThousandEyes, or similar tools
Understanding of performance counters and indicators for both systems and applications and how to interpret them
Familiarity with scripting or query languages (e.g., PowerShell, Python, SQL, or log query languages like DQL or SPL) is required
Interest in leveraging AI-powered observability features (e.g., anomaly detection, root cause analysis, predictive alerts) to improve reliability and reduce noise
Strong communication and collaboration skills, with the ability to work with cross-functional teams in IT, application support, architecture, and engineering
Willingness to learn cloud-native observability practices, ITIL workflows, and continuous improvement methodologies
Accountability and a service-oriented mindset are a must. We are a highly motivated, service-oriented team that cares about service availability, resilience, performance, reducing mean time to restore service, and helping teams understand the art of the possible with observability
