March 13, 2026 · 8 min read · Claw Mart Team

AI Agent for BetterUptime: Automate Uptime Monitoring, Incident Alerts, and Status Page Management


If you're running BetterUptime, you already know the drill: monitors check your endpoints, alerts fire when something breaks, your on-call engineer gets paged, and hopefully someone remembers to update the status page before customers start tweeting at you.

It works. Until it doesn't.

The moment you have more than a handful of monitors, the cracks show up fast. Five monitors trip at the same time because a single upstream dependency went down, and suddenly your on-call person is drowning in notifications instead of diagnosing the actual problem. An incident gets created, but nobody bothers to pull the relevant deploy logs or check if someone just merged a sketchy PR. The postmortem template sits there, half-filled, for three weeks.

BetterUptime is genuinely good at what it does: clean UX, solid Slack integration, status pages that don't look like they were built in 2008. But its built-in automations are basically just routing rules. "If severity equals SEV1, notify these people." That's it. No intelligence, no context, no memory of that time the exact same thing happened six weeks ago and it turned out to be a DNS TTL issue.

This is where a custom AI agent comes in. Not BetterUptime's own features, but a purpose-built agent that sits on top of BetterUptime's API, consumes its webhooks, and turns it from a notification system into something that actually thinks about your incidents.

Here's how to build one with OpenClaw.


Why BetterUptime's Native Automations Hit a Wall

Let's be specific about what BetterUptime can and can't do on its own, because the gap is the whole point.

What it handles natively:

  • If a monitor goes down → page the on-call rotation
  • Auto-acknowledge and auto-resolve based on recovery
  • Simple notification routing by severity
  • Webhook firing on state changes

What it cannot do:

  • Correlate five simultaneous alerts into one root cause
  • Pull context from external systems (your APM, your deploy pipeline, your database dashboards)
  • Generate a human-readable status page update that doesn't sound like a robot wrote it
  • Remember that this exact failure pattern happened before and what fixed it
  • Run diagnostic queries or commands based on the specific error
  • Draft a postmortem from the incident timeline
  • Answer natural language questions like "what's been flaky this week?"

BetterUptime's API is actually quite capable: full CRUD on monitors, incidents, heartbeats, on-call schedules, status pages. The tooling is there. What's missing is the brain that connects it all together.


The Architecture: OpenClaw + BetterUptime

Here's the integration pattern that actually works in production:

OpenClaw acts as the AI orchestration layer. It receives webhook events from BetterUptime, reasons about them using LLM capabilities with tool use, queries external systems for context, and takes actions back through BetterUptime's API.

The flow looks like this:

BetterUptime webhook fires
    → OpenClaw agent receives event
    → Agent evaluates: Is this a new issue or part of an existing incident?
    → Agent enriches: Pull recent deploys, relevant logs, similar past incidents
    → Agent acts: Update incident, post to Slack, draft status page update
    → Agent learns: Store resolution pattern for future reference

This isn't theoretical. Let me walk through the specific workflows.


Workflow 1: Intelligent Alert Correlation and Deduplication

This is the single highest-value thing an AI agent adds to BetterUptime. Full stop.

The problem: Your API monitor, your health check monitor, and your SSL monitor all fire within 90 seconds because your load balancer hiccuped. BetterUptime creates three separate incidents. Your on-call person gets three pages. They waste ten minutes figuring out it's all the same thing.

The OpenClaw solution:

Configure your BetterUptime webhooks to send all monitor state changes to your OpenClaw agent endpoint. The agent maintains a short-term memory window (say, 5 minutes) and correlates incoming alerts.

# OpenClaw agent tool definition for alert correlation
{
    "tool": "correlate_alerts",
    "description": "Check if incoming alert is related to existing active incidents",
    "parameters": {
        "monitor_id": "string",
        "monitor_url": "string",
        "error_type": "string",
        "timestamp": "datetime",
        "region": "string"
    },
    "logic": """
        1. Fetch all active incidents from BetterUptime API (GET /api/v2/incidents?status=started)
        2. Compare the incoming alert's target URL, error type, and timing
        3. If likely related (same domain, overlapping infrastructure, within correlation window):
           - Add a timeline comment to the existing incident via API
           - Suppress duplicate paging
        4. If genuinely new:
           - Allow normal incident creation
           - Begin enrichment workflow
    """
}

The agent uses LLM reasoning to determine relatedness: not just exact URL matching, but understanding that api.yourapp.com/v2/users and api.yourapp.com/v2/orders failing simultaneously probably share a root cause.

This alone cuts alert noise by 40-60% for most teams. That's not a made-up number; it's what teams consistently report when they add correlation to any monitoring tool.
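To make the correlation window concrete, here's a minimal Python sketch of the grouping logic, under stated assumptions: `Alert` and `CorrelationWindow` are illustrative names, not OpenClaw APIs, and the "same group?" test is a simple domain-plus-timing heuristic standing in for the LLM relatedness call.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from urllib.parse import urlparse

WINDOW = timedelta(minutes=5)  # the short-term correlation window from the text

@dataclass
class Alert:
    monitor_id: str
    monitor_url: str
    error_type: str
    timestamp: datetime

@dataclass
class CorrelationWindow:
    """Groups alerts that share a domain and arrive within WINDOW of each other."""
    groups: list = field(default_factory=list)

    def ingest(self, alert: Alert) -> bool:
        """Return True if the alert folds into an existing group (suppress the
        duplicate page), False if it's genuinely new (allow incident creation)."""
        domain = urlparse(alert.monitor_url).netloc
        for group in self.groups:
            if group["domain"] == domain and alert.timestamp - group["last_seen"] <= WINDOW:
                group["alerts"].append(alert)
                group["last_seen"] = alert.timestamp
                return True
        self.groups.append({"domain": domain, "last_seen": alert.timestamp, "alerts": [alert]})
        return False
```

In production, the branch inside the loop is where the LLM call goes: instead of a domain match, the agent compares the incoming alert against each active incident fetched from the BetterUptime API.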


Workflow 2: Automatic Context Enrichment

When an incident does get created, the OpenClaw agent immediately starts pulling context that the on-call engineer would otherwise spend 10-15 minutes gathering manually.

Tools you give the agent:

tools = [
    {
        "name": "get_recent_deploys",
        "description": "Fetch deployments from the last 4 hours from GitHub/CI",
        "endpoint": "GET /your-ci/deployments?since={4_hours_ago}"
    },
    {
        "name": "get_betteruptime_monitor_details",
        "description": "Get full monitor config and recent check history",
        "endpoint": "GET /api/v2/monitors/{monitor_id}"
    },
    {
        "name": "query_apm_errors",
        "description": "Pull recent error spikes from your APM tool",
        "endpoint": "GET /your-apm/errors?service={service}&timeframe=1h"
    },
    {
        "name": "check_database_metrics",
        "description": "Get current DB connection pool, query latency, etc.",
        "endpoint": "GET /your-metrics/database/summary"
    },
    {
        "name": "search_past_incidents",
        "description": "Search BetterUptime incident history for similar patterns",
        "endpoint": "GET /api/v2/incidents?q={search_terms}"
    },
    {
        "name": "post_incident_timeline_comment",
        "description": "Add context as a timeline entry on the BetterUptime incident",
        "endpoint": "POST /api/v2/incidents/{incident_id}/timeline"
    }
]

The agent runs these in parallel, synthesizes the results, and posts a summary directly to the BetterUptime incident timeline:

🤖 Agent Context Summary:
- 2 deploys in last 2 hours: PR #1847 (auth service refactor) merged by @sarah at 14:32 UTC
- APM showing 340% spike in 500 errors on /api/v2/auth/* endpoints starting at 14:35 UTC
- Database connections normal, query latency normal
- Similar incident on March 12: auth service deploy caused token validation failures (resolved by rollback)
- Suggested action: Investigate PR #1847, consider rollback of auth service

By the time the on-call engineer opens the incident, the context is already there. They're not starting from zero. They're starting from "it's probably this deploy, here's the evidence."


Workflow 3: Smart Status Page Updates

This one sounds minor but it's a huge time sink. When something breaks, the last thing your on-call engineer wants to do is write a customer-facing status update. So they either don't do it, or they write something unhelpful like "We are investigating an issue."

The OpenClaw agent can draft status page updates using BetterUptime's status page API:

{
    "tool": "draft_status_update",
    "description": "Generate a clear, customer-facing status update based on incident context",
    "parameters": {
        "incident_summary": "string",
        "affected_services": "list",
        "severity": "string",
        "current_status": "investigating|identified|monitoring|resolved"
    },
    "output": "POST /api/v2/status-pages/{page_id}/status-updates"
}

The agent generates something like:

Authentication Service – Degraded Performance

We're currently experiencing issues with our authentication service that may affect login and session management. Our team has identified a recent change as the likely cause and is working on a fix. API endpoints not requiring authentication are unaffected. We'll provide an update within 30 minutes.

That's draft output the on-call engineer reviews and approves with one click. Not perfect every time, but a hell of a lot better than nothing or "investigating an issue" for four hours.
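A small sketch of how the agent might assemble that request before the approval step. The endpoint path is taken from the tool definition above, and the body field names are assumptions; verify both against BetterUptime's status page API docs.

```python
def build_status_update(incident_summary: str, affected_services: list,
                        current_status: str) -> dict:
    """Build the request for a status page update.
    Body field names are illustrative, not confirmed API fields."""
    allowed = {"investigating", "identified", "monitoring", "resolved"}
    if current_status not in allowed:
        raise ValueError(f"status must be one of {sorted(allowed)}")
    return {
        "method": "POST",
        "path": "/api/v2/status-pages/{page_id}/status-updates",
        "body": {
            "message": incident_summary,
            "affected_resources": affected_services,
            "status": current_status,
        },
    }
```

Validating the status against the four allowed states up front means a hallucinated status from the LLM fails loudly before anything customer-facing goes out.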


Workflow 4: Automated Postmortem Drafting

BetterUptime has postmortem templates. They're fine. The problem isn't the template; it's that nobody fills them out, because it's tedious, the incident is already resolved, and there are seventeen other things to do.

After an incident resolves, the OpenClaw agent can:

  1. Pull the full incident timeline from BetterUptime's API
  2. Gather all the enrichment data it collected during the incident
  3. Compile the resolution steps from timeline comments
  4. Draft a structured postmortem

# Agent generates postmortem from incident data
{
    "tool": "draft_postmortem",
    "inputs": [
        "GET /api/v2/incidents/{id}/timeline",
        "GET /api/v2/incidents/{id}",
        "agent_memory.get_enrichment_data(incident_id)"
    ],
    "output_format": {
        "summary": "One paragraph, what happened",
        "timeline": "Chronological events with timestamps",
        "root_cause": "Analysis based on gathered evidence",
        "impact": "Duration, affected users/services, error rates",
        "resolution": "What fixed it",
        "action_items": "Suggested preventive measures",
        "lessons_learned": "Patterns the agent noticed"
    }
}

The postmortem isn't final; it still needs human review. But going from "blank template that nobody fills out" to "90% complete draft that needs a few edits" is the difference between postmortems actually happening and not.
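The mechanical part of that draft (timeline and impact math, before any LLM pass over root cause and lessons learned) can be sketched like this. The incident and timeline-entry shapes here are hypothetical, not BetterUptime's actual response schema.

```python
from datetime import datetime

def draft_postmortem(incident: dict, timeline: list) -> dict:
    """Fold an incident record plus its timeline entries into the postmortem
    skeleton. Root cause and lessons learned are left for the LLM pass;
    this only does the deterministic assembly."""
    started = datetime.fromisoformat(incident["started_at"])
    resolved = datetime.fromisoformat(incident["resolved_at"])
    minutes = int((resolved - started).total_seconds() // 60)
    return {
        "summary": incident["name"],
        "timeline": [f'{e["at"]} - {e["text"]}' for e in timeline],
        "impact": f"Duration: {minutes} minutes",
        "resolution": timeline[-1]["text"] if timeline else "",
    }
```

Keeping the arithmetic and chronology out of the LLM's hands means the draft's timestamps and duration are always exact, and only the narrative sections need review.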


Workflow 5: Natural Language Querying

This one is more quality-of-life, but it compounds over time. Instead of clicking through BetterUptime's UI or writing API queries, your team can ask the OpenClaw agent questions directly in Slack:

  • "What monitors have been flaky this week?"
  • "Show me all incidents related to the payments service in the last 30 days"
  • "Are there any monitors in a degraded state right now?"
  • "What's our uptime been for the main API this month?"

The agent translates these into BetterUptime API calls, processes the results, and responds conversationally. It's not a groundbreaking capability on its own, but it makes BetterUptime's data accessible to people who aren't going to log into the dashboard.
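A toy sketch of that translation step. In a real agent the LLM picks the API call given the tool schemas; the keyword routing below is just a stand-in to show the shape of the output, and the SLA endpoint path is an assumption to check against BetterUptime's docs.

```python
def route_question(question: str) -> dict:
    """Map a natural-language question to a BetterUptime API call.
    Keyword matching is a placeholder for LLM tool selection."""
    q = question.lower()
    if "flaky" in q:
        return {"endpoint": "GET /api/v2/monitors", "post_process": "rank by recent failures"}
    if "incident" in q:
        return {"endpoint": "GET /api/v2/incidents", "post_process": "filter by service and date range"}
    if "uptime" in q:
        # Endpoint path is assumed, not confirmed against the API reference.
        return {"endpoint": "GET /api/v2/monitors/{id}/sla", "post_process": "summarize availability"}
    return {"endpoint": None, "post_process": "ask for clarification"}
```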


Implementation: Getting Started with OpenClaw

Here's the practical path to getting this running:

Step 1: Set up your BetterUptime webhook. In BetterUptime, go to Integrations → Webhooks. Point it at your OpenClaw agent's ingestion endpoint. Select the events you want: monitor state changes, incident updates, heartbeat failures.

Step 2: Configure your OpenClaw agent with BetterUptime API tools. Give the agent tools that map to BetterUptime's REST API endpoints. The API uses bearer token auth and is well-documented. Key endpoints:

  • GET /api/v2/monitors - list and filter monitors
  • GET/POST /api/v2/incidents - manage incidents
  • POST /api/v2/incidents/{id}/timeline - add timeline entries
  • GET/POST /api/v2/status-pages/{id}/status-updates - manage status page updates
  • GET /api/v2/on-call-calendars - check who's on call

Step 3: Add your external context tools. This is where the real power comes from. Connect your agent to GitHub (recent PRs/deploys), your APM (Datadog, Honeycomb, New Relic), your infrastructure metrics, and anything else your team checks during an incident.

Step 4: Define your agent's decision logic. This is the OpenClaw-specific part: you define how the agent should reason about incoming events. Start simple: correlate alerts and enrich incidents. Add complexity as you build confidence.

Step 5: Set up Slack integration for approvals. For anything that takes public-facing action (status page updates, incident severity changes), require human approval via Slack. The agent drafts, a human approves. You can relax this over time as trust builds.
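The approval gate in Step 5 can be sketched as a simple guard, with the Slack round-trip stubbed out as an `approved` flag; the action types and function names are illustrative.

```python
# Anything customer-visible waits for a human; internal actions run immediately.
PUBLIC_ACTIONS = {"status_page_update", "severity_change"}

def execute_action(action: dict, approved: bool = False) -> str:
    """Gate public-facing actions behind human approval.
    A real version posts an approve/reject message to Slack and executes
    on callback; here that round-trip is the `approved` flag."""
    if action["type"] in PUBLIC_ACTIONS and not approved:
        return "queued_for_approval"
    return "executed"
```

Relaxing the gate over time is then just shrinking the `PUBLIC_ACTIONS` set as trust in the agent builds.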


What This Doesn't Replace

Let me be clear about scope. This agent doesn't replace:

  • BetterUptime itself - you still need the monitoring infrastructure, the status pages, the on-call scheduling. The agent makes all of that smarter.
  • Human judgment on complex incidents - the agent handles the grunt work (correlation, context gathering, drafting). Humans make the hard calls.
  • Proper monitoring coverage - if you don't have the right monitors set up, no amount of AI fixes that.

What it does replace is the manual toil that makes incident response slow and postmortems nonexistent. It's the difference between your on-call engineer spending 15 minutes gathering context and 5 minutes fixing the problem, versus spending 5 minutes reading the context the agent already gathered and 5 minutes fixing the problem.


The Compounding Effect

The part that's hard to appreciate until you've seen it: these agents get more valuable over time.

Every incident the agent processes goes into its memory. "Last time this monitor failed with this error pattern, it was caused by X and resolved by Y." After six months, the agent has seen hundreds of your incidents. Its root cause suggestions get better. Its correlation logic gets tighter. Its postmortem drafts get more specific to your infrastructure.

BetterUptime gives you the monitoring and the API. OpenClaw gives you the intelligence layer that turns reactive incident response into something that actually learns from your operational history.


Next Steps

If you're already running BetterUptime and want to stop treating incidents like isolated events that your team forgets about two weeks later, this is the path.

Start with alert correlation and context enrichment; those two workflows alone will save your on-call team hours per week and meaningfully reduce alert fatigue.

If you want help scoping this out for your specific stack, check out Clawsourcing: we'll work with you to design and build the OpenClaw agent that fits your BetterUptime setup, your external tools, and your team's actual incident response workflows. No generic templates. The agent your team needs, built for how you actually operate.
