Feature: Runbooks for Incident Resolution (AI + SRE) #1
Open
Description
Summary
Add the ability to define and store runbooks for common incidents so both AI agents and SREs can reference structured resolution steps during incident response.
Problem
When incidents occur, resolution steps often live in scattered docs, Slack threads, or individual memory.
This leads to:
- Slower resolution times
- Inconsistent handling
- Knowledge silos
- Limited AI-assisted troubleshooting
We need a structured, queryable way to store and retrieve incident runbooks.
Proposed Solution
Introduce Runbooks as a first-class entity:
- Create / edit runbooks
- Tag by service, severity, category
- Structured steps (checklist format)
- Attach logs, queries, dashboards, or links
- Support markdown
Each runbook should include:
- Title
- Description
- Affected services
- Trigger conditions
- Step-by-step resolution instructions
- Escalation notes
- Post-incident checklist
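To make the field list above concrete, here is a minimal sketch of how a runbook might be modeled. It is only an illustration under assumptions: the class names `Step` and `Runbook`, the `severity` and `category` tag fields, the `to_markdown` helper, and the two sample steps are all hypothetical and not part of any existing implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One checklist item in a runbook (hypothetical shape)."""
    instruction: str
    links: List[str] = field(default_factory=list)  # logs, queries, dashboards
    done: bool = False


@dataclass
class Runbook:
    """Fields mirror the list in this issue; names are assumptions."""
    title: str
    description: str
    affected_services: List[str]
    trigger_conditions: List[str]
    steps: List[Step]
    escalation_notes: str = ""
    post_incident_checklist: List[str] = field(default_factory=list)
    severity: str = ""   # tag; the scale itself is not defined in this issue
    category: str = ""   # tag, free-form in this sketch

    def to_markdown(self) -> str:
        """Render the resolution steps as a markdown checklist."""
        lines = [f"# {self.title}", "", self.description, ""]
        for step in self.steps:
            box = "x" if step.done else " "
            lines.append(f"- [{box}] {step.instruction}")
        return "\n".join(lines)


# Sample data: the step texts are invented for illustration.
runbook = Runbook(
    title="High 5xx Errors – API Service",
    description="Error rate on api-service is elevated.",
    affected_services=["api-service"],
    trigger_conditions=["5xx error rate spike"],
    steps=[Step("Check recent deploys"), Step("Inspect error logs")],
)
print(runbook.to_markdown())
```

Keeping steps as structured objects rather than free text is one way to support both the checklist format and later features such as execution logs, though other layouts would work as well.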
AI Integration
Runbooks should be:
- Searchable via semantic search
- Automatically suggested during incidents
- Usable by AI agents to execute or recommend steps
- Context-aware based on error signals
Example:
If the error rate spikes on api-service, suggest the “High 5xx Errors – API Service” runbook.
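The example above could be sketched roughly as follows. This is a deliberately simple keyword-overlap match standing in for real semantic search, which would more likely use embeddings. The names `RunbookRef`, `ErrorSignal`, and `suggest`, the `keywords` field, and the second catalog entry are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunbookRef:
    title: str
    services: List[str]
    keywords: List[str]  # hypothetical stand-in for trigger conditions


@dataclass
class ErrorSignal:
    service: str
    summary: str  # e.g. the alert text


def suggest(signal: ErrorSignal, runbooks: List[RunbookRef]) -> List[str]:
    """Return titles of runbooks for the signal's service, best match first."""
    words = set(signal.summary.lower().split())
    scored = []
    for rb in runbooks:
        # Only consider runbooks tagged with the affected service.
        if signal.service not in rb.services:
            continue
        score = sum(1 for kw in rb.keywords if kw.lower() in words)
        if score > 0:
            scored.append((score, rb.title))
    # Highest keyword overlap first; ties keep catalog order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


catalog = [
    RunbookRef("High 5xx Errors – API Service", ["api-service"],
               ["5xx", "error", "spike"]),
    # Invented entry, included only to show service filtering.
    RunbookRef("Slow Queries – Database", ["db"], ["latency", "slow"]),
]
signal = ErrorSignal("api-service", "5xx error rate spike detected")
print(suggest(signal, catalog))
```

Filtering by service before scoring keeps suggestions context-aware even with a crude matcher; swapping the scoring line for a vector similarity would not change the surrounding shape.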
Benefits
- Faster MTTR
- Consistent resolution
- Easier onboarding of new SREs
- Enables AI-assisted incident response
- Institutional knowledge capture
Future Extensions
- Link runbooks to specific alert rules
- Auto-trigger runbooks
- Execution logs tied to incidents
- Feedback loop to improve runbooks over time
Would love community feedback on:
- How you currently manage runbooks
- What fields are essential
- Whether AI-assisted execution would be useful
Open to refining the scope before implementation.