HealthEquity is dedicated to empowering healthcare consumers to save and improve lives. The Observability Engineer I will play a foundational role in ensuring the reliability, performance, and visibility of critical IT infrastructure and business systems by improving monitoring, alert accuracy, and incident response.
Responsibilities:
- Enable faster incident response by improving monitoring coverage, alert accuracy, and root cause visibility
- Help teams shift from reactive to proactive operations by applying telemetry data and AI-driven insights
- Empower service owners with clear dashboards and actionable insights that guide performance improvements
- Improve system resilience through continuous feedback and collaboration with other internal teams
- Drive data-informed decisions by transforming raw logs and metrics into insights that support business outcomes
- Support onboarding of new systems and applications into our observability platforms
- Maintain and troubleshoot dashboards, alerts, and telemetry integrations
- Collaborate closely with ITSM, application support, and infrastructure teams to improve incident detection and root cause analysis
- Follow defined operational procedures and participate in change and incident management workflows
- Document technical procedures, runbooks, and monitoring standards
- Learn and apply observability best practices while developing skills in automation and data analytics
Requirements:
- Foundational knowledge of IT infrastructure, applications, and networking concepts (e.g., servers, databases, APIs, web services, cloud platforms)
- Curiosity and attention to detail when investigating alerts, logs, metrics, and performance trends
- Basic experience or coursework with monitoring/logging tools (e.g., Dynatrace, Splunk, Prometheus, Grafana, ELK, or similar)
- A strong foundational understanding of Dynatrace, LogicMonitor, Prometheus, OpenTelemetry, ThousandEyes, or similar
- Understanding of performance counters and indicators for both systems and applications and how to interpret them
- Familiarity with scripting or query languages (e.g., PowerShell, Python, SQL, or log query languages like DQL or SPL) is required
- Interest in leveraging AI-powered observability features (e.g., anomaly detection, root cause analysis, predictive alerts) to improve reliability and reduce noise
- Strong communication and collaboration skills, with the ability to work with cross-functional teams in IT, application support, architecture, and engineering
- Willingness to learn cloud-native observability practices, ITIL workflows, and continuous improvement methodologies
- Accountability and a service-oriented mindset are a must. We are a highly motivated, service-oriented team. We care about service availability, resilience, performance, reducing mean time to restore service, and helping teams understand the art of the possible with observability