Sage Care is working to improve the reliability of its AI assistants and is seeking an AI Diagnostics & Observability Engineer. The role involves building diagnostic and observability infrastructure, implementing automated triage systems, and enabling rapid root-cause analysis (RCA) to improve clinical AI performance.

Responsibilities:

Build automated RCA pipelines to detect and classify failure modes:
Hallucinations
Misrouted intents
Leaked/invalid tool calls (Transfer, SayMessage, Hangup, NOOP)
Unrecoverable SOP loops
Broken state transitions
Telephony dropouts / DTMF issues
Implement event tracing infrastructure capturing every agentic decision across LLM, telephony, and SOP execution
Compare expected vs. actual SOP behavior using protocol-driven expectations or human-labeled ground truth
Automatically compute performance, safety, reliability, and coverage metrics
Build live and post-call dashboards that visualize:
Full call timeline
SOP/state machine traversal
Agent reasoning traces
Tool invocation history
Divergence from expected behavior
Design interactive visualizations: heatmaps, decision-path overlays, branching comparisons, and error hotspots
Build triage dashboards for engineering and operations teams to rapidly understand system health
Trace call-level events (dropouts, retries, audio playback issues)
Detect DTMF misfires and incorrect action routing
Analyze turn segmentation, word-error-rate drift, boosting performance, and latency
Visualize errors in context (audio, transcript, aligned timecodes)
Audit intent classification accuracy and subgraph routing
Trace reasoning sequences, missing tool calls, redundant tool calls, or invalid arguments
Validate tool call correctness (maps, SMS, search, internal SOP tools)
Architect a live SOP state-machine tracer with:
Real-time transcript overlays
Current state + next expected state
Deviation alerts
Build dashboards to monitor 10–15 concurrent calls, highlighting sessions with:
Loops
Latency spikes
Failed tool calls
Repeated incorrect decisions
Provide human specialists with escalation alerts and context
Build an intervention console for on-call specialists, enabling:
“Skip step”
“Say apology”
“Escalate to human”
“Send SMS”
“Repeat last message”
Override of SOP steps while maintaining auditability and continuity
Build clustering systems (via embeddings or metadata) to group systemic failure modes:
Intent misroutes under noisy audio
Repeated missing tool calls
Looped state machine traversal
Hallucinated follow-ups or invalid summaries
Generate recurring-failure reports to guide engineering improvements
Design and implement an automated triage and notification system that:
Detects failure category and severity in real time
Routes incidents to the correct module owners:
Telephony
Transcription
LLM orchestration
SOP/decision-tree team
Platform reliability
Sends structured payloads containing:
Trace graphs
Relevant logs
Transcript segments
SOP divergence snapshots
Suggested RCA labels
Notifications may integrate with:
PagerDuty
Slack (rich message blocks)
Jira auto-ticket creation
Internal incident pipelines
Extend pipelines to automatically generate human-readable failure summaries with:
Call-level trace graphs
Tool call sequences
Transcript context
Classified failure types
Suggested root causes
Store snapshots for operational handoff and debugging
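
As an illustration of the kind of work involved, the expected-vs-actual SOP comparison and loop detection mentioned above might be sketched roughly as follows. This is a minimal sketch, not Sage Care's implementation: the state names, the `Divergence` record, and the `max_visits` threshold are all hypothetical, and it assumes an SOP run can be represented as a flat list of state names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Divergence:
    step: int                 # index in the trace where the paths differ
    expected: Optional[str]   # state the protocol called for (None if past its end)
    actual: Optional[str]     # state the agent actually entered (None if the call ended early)


def first_divergence(expected: List[str], actual: List[str]) -> Optional[Divergence]:
    """Return the first point where the actual state path leaves the expected one."""
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp != act:
            return Divergence(i, exp, act)
    if len(expected) != len(actual):
        # One path is a strict prefix of the other: the call ended early or ran long.
        i = min(len(expected), len(actual))
        return Divergence(
            i,
            expected[i] if i < len(expected) else None,
            actual[i] if i < len(actual) else None,
        )
    return None


def has_loop(actual: List[str], max_visits: int = 3) -> bool:
    """Flag a trace in which any single state is visited more than max_visits times."""
    counts: Dict[str, int] = {}
    for state in actual:
        counts[state] = counts.get(state, 0) + 1
        if counts[state] > max_visits:
            return True
    return False


# Hypothetical SOP path and an observed call that skipped a step.
expected_path = ["greet", "verify_identity", "collect_symptoms", "schedule", "goodbye"]
actual_path = ["greet", "verify_identity", "schedule"]
divergence = first_divergence(expected_path, actual_path)
```

A real tracer would likely work over a branching state machine and timestamped events rather than a single linear path, but the same idea applies: locate the first deviation, then attach it to the transcript and tool-call context for triage.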