Agent in a Box

Autonomous Engineering Post-Mortem & RCA Agent

Problem Statement

In high-velocity engineering organizations, incident post-mortems (Post-Incident Reviews) are often the first thing to be sacrificed during "crunch time." When a production outage occurs, the immediate focus is on mitigation. However, the critical phase of Root Cause Analysis (RCA) and preventative action mapping is frequently delayed, leading to "knowledge rot."

Startups face a specific "incidents-to-insights" gap. Senior engineers spend 4-6 hours per incident manually stitching together timelines from PagerDuty, GitHub commits, and Datadog traces. For a mid-stage startup, this manual toil translates to roughly $15k-$20k per month in lost engineering productivity. Without a standardized RCA process, teams often treat symptoms rather than systemic causes, producing "re-incidents": recurring failures that erode customer trust. There is a clear need for an AI agent that can autonomously ingest the chaotic trail of an incident and produce a high-fidelity, peer-review-ready post-mortem.

What the Agent Does/Doesn't Do

Does:

  • Aggregates data from monitoring tools, version control, and chat logs immediately after an incident is marked "resolved."
  • Constructs a chronological "Sequence of Events" (SoE) with millisecond precision.
  • Identifies the "Triggering Commit" by cross-referencing deployment timestamps with error spikes.
  • Drafts a Five-Whys analysis based on technical telemetry.
  • Suggests specific Jira tickets for long-term remediation.
  • Complements the Autonomous On-Call Incident Remediation Agent by handling the documentation phase.

Doesn't:

  • Perform real-time incident mitigation or automated rollbacks (this is for the remediation agent).
  • Communicate directly with external customers or status pages.
  • Assign blame to specific individuals; it focuses on systemic failure points.

Workflow

  1. Trigger: The agent polls PagerDuty/Opsgenie for incidents transitioned to "Resolved."
  2. Context Harvesting: Input: Incident ID. Output: A JSON bundle containing Slack thread history, Datadog/NewRelic dashboard snapshots, and GitHub deployment logs.
  3. Timeline Synthesis: Input: Context Bundle. Output: A markdown table of events, filtered for relevance (e.g., highlighting 5xx spikes).
  4. RCA Logic Engine: Input: Synthesized Timeline + Code Diffs. Output: A "Five Whys" draft identifying the architectural root cause.
  5. Action Item Generation: Input: RCA Draft. Output: 3-5 prioritized GitHub Issues or Jira Tickets.
  6. Review Loop: The agent posts the draft to a dedicated #incident-reports Slack channel for human approval.
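The six steps above can be sketched as a linear pipeline over a shared state object, which is the same shape LangGraph's `StateGraph` formalizes. This is a hedged stub, not working integration code: `IncidentState` and every step function are hypothetical placeholders to be replaced with the real PagerDuty, Datadog, LLM, and GitHub calls.

```python
from dataclasses import dataclass, field

@dataclass
class IncidentState:
    """Shared state threaded through each workflow step (hypothetical)."""
    incident_id: str
    context: dict = field(default_factory=dict)
    timeline: str = ""
    rca_draft: str = ""
    action_items: list = field(default_factory=list)

def harvest_context(state: IncidentState) -> IncidentState:
    # Step 2: pull Slack threads, dashboard snapshots, and deploy logs.
    state.context = {"slack": [], "dashboards": [], "deploys": []}
    return state

def synthesize_timeline(state: IncidentState) -> IncidentState:
    # Step 3: reduce the context bundle to a relevance-filtered event table.
    state.timeline = "| time | event |\n|---|---|"
    return state

def run_rca(state: IncidentState) -> IncidentState:
    # Step 4: LLM pass producing the Five-Whys draft.
    state.rca_draft = f"Five Whys for {state.incident_id}"
    return state

def generate_actions(state: IncidentState) -> IncidentState:
    # Step 5: 3-5 prioritized remediation tickets.
    state.action_items = ["Add connection-pool alerting"]
    return state

PIPELINE = [harvest_context, synthesize_timeline, run_rca, generate_actions]

def run_pipeline(incident_id: str) -> IncidentState:
    state = IncidentState(incident_id=incident_id)
    for step in PIPELINE:
        state = step(state)
    return state
```

In production, the trigger (step 1) calls `run_pipeline` with the resolved incident's ID, and the review loop (step 6) posts `state.rca_draft` to Slack.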

Tool Stack

  • LangGraph - Orchestration for stateful multi-step reasoning.
  • Claude 3.5 Sonnet - LLM chosen for long-context log analysis.
  • PagerDuty - Incident management and log entry source.
  • Slack - Communication hub for review loops and context harvesting.
  • Datadog [Unverified] - Monitoring and dashboard snapshots.
  • Opsgenie [Unverified] - Alternative incident triggering.
  • New Relic [Unverified] - Telemetry data source.
  • GitHub [Unverified] - Version control and issue tracking.
  • Jira [Unverified] - Remediation ticket management.
  • Pinecone [Unverified] - Vector database for historical RCA lookup.
  • Vercel Functions / AWS Lambda [Unverified] - Serverless deployment.

Quick Integration

Claude 3.5 Sonnet RCA Analysis

import anthropic

client = anthropic.Anthropic(
    api_key="your_api_key_here",  # or omit to read ANTHROPIC_API_KEY from the environment
)

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system="You are an expert Site Reliability Engineer. Your task is to analyze incident logs and generate a Root Cause Analysis (RCA) report.",
    messages=[
        {
            "role": "user",
            "content": "Analyze this incident data and identify the root cause: [2024-05-20 14:02] DB_CONNECTION_ERROR in microservice-a. [2024-05-20 14:01] Deployment of 'v2.4.1' completed. [2024-05-20 14:00] Config change: max_connections reduced from 100 to 10."
        }
    ]
)

print(message.content[0].text)

Source: Anthropic Docs

PagerDuty Timeline Extraction

import requests

# Configuration
API_KEY = 'YOUR_PAGERDUTY_API_KEY'
INCIDENT_ID = 'YOUR_INCIDENT_ID' # e.g., 'Q0123456789ABC'

headers = {
    'Accept': 'application/vnd.pagerduty+json;version=2',
    'Authorization': f'Token token={API_KEY}',
    'Content-Type': 'application/json'
}

def get_incident_log_entries(incident_id):
    """
    Fetches the timeline of events (log entries) for a specific incident.
    Note: the endpoint is paginated; for long incidents, follow the
    `offset`/`more` fields in the response to fetch every page.
    """
    url = f'https://api.pagerduty.com/incidents/{incident_id}/log_entries'
    
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
        log_entries = response.json().get('log_entries', [])
        for entry in log_entries:
            print(f"[{entry['created_at']}] {entry['type']}: {entry.get('summary', 'No summary')}")
    else:
        print(f"Error: {response.status_code} - {response.text}")

if __name__ == '__main__':
    get_incident_log_entries(INCIDENT_ID)

Source: PagerDuty API Reference
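The log entries fetched above feed directly into step 3 (Timeline Synthesis). A minimal sketch of that reduction, assuming the `log_entries` list shape returned by the endpoint; `highlight_types` is an illustrative filter of our own, not a PagerDuty concept:

```python
def entries_to_markdown(log_entries, highlight_types=("trigger_log_entry",)):
    """Render PagerDuty log entries as a 'Sequence of Events' markdown table.

    Entries are sorted chronologically; types listed in `highlight_types`
    are bolded so reviewers can spot trigger events at a glance.
    """
    rows = ["| Time (UTC) | Event | Summary |", "|---|---|---|"]
    for entry in sorted(log_entries, key=lambda e: e["created_at"]):
        marker = "**" if entry["type"] in highlight_types else ""
        rows.append(
            f"| {entry['created_at']} | {marker}{entry['type']}{marker} "
            f"| {entry.get('summary', '')} |"
        )
    return "\n".join(rows)
```

A real deployment would also merge in GitHub deployment timestamps and monitoring events before sorting, so the triggering commit lines up against the first error spike.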

Slack Notification for Review

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token="xoxb-your-bot-token-here")

try:
    response = client.chat_postMessage(
        channel="#incident-reports",
        text="🚨 New RCA Draft Ready for Review",
        blocks=[
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "*RCA Draft: Incident #12345*\n\n*Root Cause:* DB Connection Pool Exhaustion\n*Trigger:* Config change v2.4.1"
                }
            }
        ]
    )
    print(f"Posted RCA draft for review (ts={response['ts']})")
except SlackApiError as e:
    print(f"Error: {e.response['error']}")

Source: Slack API Methods
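Step 5's remediation tickets can be filed through the GitHub Issues REST API (`POST /repos/{owner}/{repo}/issues`). The sketch below separates payload construction from the network call so drafts can be human-reviewed before filing; `title`, `body`, and `labels` are real fields of that endpoint, but `build_issue_payload` and the label convention are illustrative choices, not part of any SDK.

```python
import requests

def build_issue_payload(action_item, incident_id, priority="P2"):
    """Shape one RCA action item as a GitHub issue-creation payload."""
    return {
        "title": f"[RCA {incident_id}] {action_item}",
        "body": f"Remediation item generated from the post-mortem of incident {incident_id}.",
        "labels": ["rca-followup", priority.lower()],
    }

def file_issue(owner, repo, token, payload):
    # One network call per ticket; the token needs repo/issues write access.
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]
```

For Jira-based teams, the same payload builder can target Jira's `POST /rest/api/3/issue` endpoint instead, with fields renamed accordingly.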

Prompt Skeletons

1. Timeline Reconstruction Prompt

(Existing prompt content remains here)