Sage Care is seeking an AI Diagnostics & Observability Engineer to improve the reliability of its AI assistants. The role involves building diagnostic and observability infrastructure, implementing automated triage systems, and enabling rapid root-cause analysis (RCA) to improve clinical AI performance.
Responsibilities:
- Build automated RCA pipelines to detect and classify failure modes:
  - Hallucinations
  - Misrouted intents
  - Leaked/invalid tool calls (Transfer, SayMessage, Hangup, NOOP)
  - Unrecoverable SOP loops
  - Broken state transitions
  - Telephony dropouts / DTMF issues
- Implement event tracing infrastructure capturing every agentic decision across LLM, telephony, and SOP execution
- Compare expected vs. actual SOP behavior using protocol-driven expectations or human-labeled ground truth
- Automatically compute performance, safety, reliability, and coverage metrics
- Build live and post-call dashboards that visualize:
  - Full call timeline
  - SOP/state machine traversal
  - Agent reasoning traces
  - Tool invocation history
  - Divergence from expected behavior
- Design interactive visualizations: heatmaps, decision-path overlays, branching comparisons, and error hotspots
- Build triage dashboards for engineering and operations teams to rapidly understand system health
- Trace call-level events (dropouts, retries, audio playback issues)
- Detect DTMF misfires and incorrect action routing
- Analyze turn segmentation, word-error-rate drift, boosting performance, and latency
- Visualize errors in context (audio, transcript, aligned timecodes)
- Audit intent classification accuracy and subgraph routing
- Trace reasoning sequences, missing tool calls, redundant tool calls, or invalid arguments
- Validate tool call correctness (maps, SMS, search, internal SOP tools)
- Architect a live SOP state-machine tracer with:
  - Real-time transcript overlays
  - Current state + next expected state
  - Deviation alerts
- Build dashboards to monitor 10–15 concurrent calls, highlighting sessions with:
  - Loops
  - Latency spikes
  - Failed tool calls
  - Repeated incorrect decisions
- Provide human specialists with escalation alerts and context
- Build an intervention console for on-call specialists, enabling:
  - “Skip step”
  - “Say apology”
  - “Escalate to human”
  - “Send SMS”
  - “Repeat last message”
  - Override of SOP steps while maintaining auditability and continuity
- Build clustering systems (via embeddings or metadata) to group systemic failure modes:
  - Intent misroutes under noisy audio
  - Repeated missing tool calls
  - Looped state machine traversal
  - Hallucinated follow-ups or invalid summaries
- Generate recurring-failure reports to guide engineering improvements
- Design and implement an automated triage and notification system that:
  - Detects failure category and severity in real time
  - Routes incidents to the correct module owners:
    - Telephony
    - Transcription
    - LLM orchestration
    - SOP/decision-tree team
    - Platform reliability
  - Sends structured payloads containing:
    - Trace graphs
    - Relevant logs
    - Transcript segments
    - SOP divergence snapshots
    - Suggested RCA labels
- Notifications may integrate with:
  - PagerDuty
  - Slack (rich message blocks)
  - Jira auto-ticket creation
  - Internal incident pipelines
- Extend pipelines to automatically generate human-readable failure summaries with:
  - Call-level trace graphs
  - Tool call sequences
  - Transcript context
  - Classified failure types
  - Suggested root causes
- Store snapshots for operational handoff and debugging