Envision Technology Solutions is seeking a Site Reliability Engineer (SRE) focused on incident automation. The role involves authoring troubleshooting guides, automating incident management workflows, and creating monitors to ensure service health and reliability.
Responsibilities:
- Writing structured troubleshooting guides (TSGs) that cover symptoms, diagnostic steps, KQL queries, interpretation of expected results, and mitigation actions
- Organizing TSGs into a logical folder hierarchy (by sub-service, monitor, failure category)
- Creating a Root.md entry point that maps incident signals to the right TSG
- Optionally creating TOC files to reduce token usage (~50% cost reduction)
- Writing clear system prompts that guide the AI agent's investigation and mitigation workflow
- Defining tool usage patterns, decision logic, and structured output format
- Crafting investigation flows that handle cross-service dependency chains
- Iterating prompts based on agent output quality during testing
- Building automation workflows triggered by IcM incidents
- Configuring incident routing, auto-triage, and escalation rules
- Integrating the DRI Agent into the IcM incident lifecycle (auto-invoke on incident creation)
- Understanding severity levels, queue paths, and ownership models across sub-service teams
- Creating and tuning monitors/alerts that detect service health issues
- Mapping monitors to TSGs so the agent knows which guide to follow per alert
- Understanding Geneva/MDM metrics for health signal definition
- Writing and validating KQL queries against service telemetry
- Understanding Kusto cluster/database structure and table-to-service mapping
- Applying time-based filtering, summarization, aggregation, and joins in queries
- Writing parameterized queries (e.g., with placeholders for timestamps, cluster names, and tenant IDs)
- Interpreting query results to validate whether a mitigation was successful
Requirements:
- Troubleshooting Guide (TSG) Authoring
- Prompt Engineering
- IcM Automation
- Monitor Creation
- KQL (Kusto Query Language)