Debugging Webhook Failures in Production OpenClaw Agents

If you've been running OpenClaw agents in production for more than about a week, you've already hit this. Your agent completes a perfectly good task (maybe it scraped pricing data, processed an inbound support ticket, or finished a multi-step research workflow) and then the webhook that's supposed to notify your downstream service just... doesn't fire. Or it fires twice. Or it fires but the receiving end chokes on a 4MB JSON blob of the agent's internal monologue.
Webhook failures in AI agent systems are a different animal than webhook failures in traditional software. In a normal app, a failed webhook is annoying but predictable. You know exactly when it should fire, what the payload looks like, and you can replay it easily. With OpenClaw agents, the webhook call might happen at a non-deterministic point in the execution graph, the payload size varies wildly depending on what the agent decided to do, and a failure can silently corrupt the rest of the workflow because the agent assumes the notification went through.
I've spent the last few months debugging these exact issues, both in my own OpenClaw deployments and while helping others in the community sort through theirs. Here's everything I've learned, organized by the actual failure modes you'll encounter, with concrete fixes for each.
The Seven Failure Modes (And Why They're Worse With Agents)
Before jumping into solutions, let's be honest about what's actually going wrong. These are ranked roughly by how often they'll bite you.
1. Transient Delivery Failures With No Retry Logic
This is the big one. Your OpenClaw agent finishes a step, calls your webhook endpoint, gets a 502 or a connection timeout, and just... moves on. The agent doesn't know the notification failed. Your downstream service doesn't know the step completed. Everything looks fine in the agent logs because the agent did its job; it just couldn't tell anyone about it.
The core problem: most people wire up webhook calls as simple HTTP requests inside their OpenClaw tool definitions, with zero retry logic. In traditional software you'd never ship a webhook sender without retries. But when you're building agent tools, you're focused on the AI logic, not the plumbing.
Here's what a naive implementation looks like:
```python
import requests
from openclaw import Tool

class NotifyWebhook(Tool):
    name = "notify_completion"
    description = "Sends task results to the webhook endpoint"

    def run(self, payload: dict) -> str:
        response = requests.post(
            "https://your-service.com/webhook",
            json=payload,
            timeout=10,
        )
        return f"Webhook sent: {response.status_code}"
```
This will work 95% of the time, and the other 5% will cost you hours of debugging. Here's the fix:
```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openclaw import Tool

class NotifyWebhook(Tool):
    name = "notify_completion"
    description = "Sends task results to the webhook endpoint"

    @retry(
        stop=stop_after_attempt(5),
        wait=wait_exponential(multiplier=1, min=2, max=30),
        retry=retry_if_exception_type((
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            requests.exceptions.HTTPError,
        )),
    )
    def _send_with_retry(self, url: str, payload: dict) -> requests.Response:
        response = requests.post(url, json=payload, timeout=15)
        # Retry only on 5xx; a 4xx means retrying won't help
        if response.status_code >= 500:
            raise requests.exceptions.HTTPError(
                f"Server error: {response.status_code}"
            )
        return response

    def run(self, payload: dict) -> str:
        try:
            response = self._send_with_retry(
                "https://your-service.com/webhook",
                payload,
            )
            return f"Webhook delivered: {response.status_code}"
        except Exception as e:
            # Critical: tell the agent the delivery failed
            return (
                f"WEBHOOK FAILED after 5 attempts: {e}. "
                "Task completed but notification was not delivered."
            )
```
Two things to notice. First, exponential backoff. Don't hammer a failing endpoint. Second, and this is the part most people miss: the return message tells the agent the webhook failed. In an OpenClaw agent graph, the tool's return value feeds back into the agent's context. If you silently swallow the error, the agent proceeds with incorrect assumptions. If you clearly state the failure, the agent (or the next node in your graph) can take corrective action.
2. Poor Observability
You know something went wrong because your Slack channel didn't get the notification or your database doesn't have the expected record. But where did it fail? Was it the agent? The HTTP call? The receiving service? The DNS lookup?
OpenClaw's tracing capabilities are solid, but you need to actually instrument your webhook calls to see them in traces. Here's a pattern I use on every project:
```python
import time
import uuid
import logging

from openclaw import Tool
from openclaw.tracing import span

logger = logging.getLogger("webhook")

class ObservableWebhook(Tool):
    name = "notify_completion"
    description = "Sends task results via webhook with full observability"

    def run(self, payload: dict) -> str:
        webhook_id = str(uuid.uuid4())
        url = "https://your-service.com/webhook"
        with span(
            name="webhook_delivery",
            metadata={
                "webhook_id": webhook_id,
                "url": url,
                "payload_size_bytes": len(str(payload)),
                "event_type": payload.get("event_type", "unknown"),
            },
        ) as s:
            start = time.time()
            try:
                # _send_with_retry is the retry-wrapped sender from the previous example
                response = self._send_with_retry(url, {
                    **payload,
                    "webhook_id": webhook_id,
                    "idempotency_key": webhook_id,  # see section 3 for a deterministic key
                })
                duration = time.time() - start
                s.set_attribute("http.status_code", response.status_code)
                s.set_attribute("duration_ms", round(duration * 1000))
                s.set_attribute("success", True)
                logger.info(
                    f"Webhook {webhook_id} delivered in {duration:.2f}s "
                    f"status={response.status_code}"
                )
                return f"Webhook delivered: {webhook_id}"
            except Exception as e:
                duration = time.time() - start
                s.set_attribute("success", False)
                s.set_attribute("error", str(e))
                s.set_attribute("duration_ms", round(duration * 1000))
                logger.error(
                    f"Webhook {webhook_id} FAILED after {duration:.2f}s: {e}"
                )
                return f"WEBHOOK FAILED: {webhook_id} - {e}"
```
Now every single webhook attempt shows up in your OpenClaw traces with timing, status codes, payload sizes, and error details. When something goes wrong at 3 AM, you open the trace, click on the webhook_delivery span, and immediately see what happened. No grepping through CloudWatch logs.
If you want to go further, pipe these spans into an OpenTelemetry collector and set up alerts on webhook failure rates. A simple threshold ("alert if webhook success rate drops below 95% over 5 minutes") will save you countless hours.
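If you don't have a metrics stack wired up yet, even an in-process sliding window gets you most of the way. A minimal sketch (the WebhookHealthMonitor class and the 95%/5-minute defaults are illustrative, not an OpenClaw API):

```python
import time
from collections import deque
from typing import Optional

class WebhookHealthMonitor:
    """Tracks webhook outcomes over a sliding time window and flags
    when the success rate drops below a threshold."""

    def __init__(self, window_seconds: int = 300, alert_threshold: float = 0.95):
        self.window_seconds = window_seconds
        self.alert_threshold = alert_threshold
        self.outcomes: deque = deque()  # (timestamp, success) pairs

    def record(self, success: bool, now: Optional[float] = None) -> None:
        self.outcomes.append((now if now is not None else time.time(), success))

    def success_rate(self, now: Optional[float] = None) -> float:
        now = now if now is not None else time.time()
        # Evict outcomes that have aged out of the window
        while self.outcomes and self.outcomes[0][0] < now - self.window_seconds:
            self.outcomes.popleft()
        if not self.outcomes:
            return 1.0  # no data yet: treat as healthy
        return sum(1 for _, ok in self.outcomes if ok) / len(self.outcomes)

    def should_alert(self, now: Optional[float] = None) -> bool:
        return self.success_rate(now) < self.alert_threshold
```

Call record(True) and record(False) from the webhook tool's success and failure paths, and poll should_alert() from wherever you page yourself.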
3. Idempotency and Duplicate Deliveries
Here's a scenario that will definitely happen to you: your OpenClaw agent has a retry-capable webhook tool (good!), the first attempt gets a timeout, it retries, but the first request actually did go through; the response just took too long. Now your downstream service processed the same event twice.
With traditional webhooks, this is annoying. With AI agents, it's dangerous. If the webhook triggers a payment, an email, a database write, or another agent, duplicates create real problems.
The fix is idempotency keys. Every webhook call should include a unique key that the receiving service uses to deduplicate.
```python
import hashlib
import json

def generate_idempotency_key(agent_run_id: str, step_name: str, payload: dict) -> str:
    """
    Generate a deterministic idempotency key based on the agent run,
    the current step, and the payload content.
    """
    content = f"{agent_run_id}:{step_name}:{json.dumps(payload, sort_keys=True)}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]
```
Then on the receiving end:
```python
import redis
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
r = redis.Redis()

@app.post("/webhook")
async def receive_webhook(request: Request):
    body = await request.json()
    idempotency_key = body.get("idempotency_key")
    if not idempotency_key:
        raise HTTPException(400, "Missing idempotency_key")

    # Check if we've already processed this event
    # (for strict guarantees under concurrency, use SET NX so the
    # check and the mark happen as one atomic operation)
    if r.exists(f"webhook:processed:{idempotency_key}"):
        return {"status": "already_processed", "idempotency_key": idempotency_key}

    # Process the webhook (process_event is your application logic)
    process_event(body)

    # Mark as processed with a 48-hour TTL
    r.setex(f"webhook:processed:{idempotency_key}", 172800, "1")
    return {"status": "processed", "idempotency_key": idempotency_key}
```
The key insight: the idempotency key should be deterministic based on the agent run and step, not random. If the agent retries the same logical operation, it should generate the same key. If it's doing a genuinely new operation, it should generate a different one.
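To make the determinism property concrete, here's the helper exercised directly (a self-contained copy of the function above, with illustrative run IDs and payloads):

```python
import hashlib
import json

def generate_idempotency_key(agent_run_id: str, step_name: str, payload: dict) -> str:
    content = f"{agent_run_id}:{step_name}:{json.dumps(payload, sort_keys=True)}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]

# Same logical operation -> same key, even across retries
key_a = generate_idempotency_key("run-42", "notify", {"event": "done", "id": 7})
key_b = generate_idempotency_key("run-42", "notify", {"id": 7, "event": "done"})
assert key_a == key_b  # sort_keys=True makes dict ordering irrelevant

# A different step in the same run produces a different key
key_c = generate_idempotency_key("run-42", "verify", {"event": "done", "id": 7})
assert key_a != key_c
```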
4. Signature Verification Failures
If your OpenClaw agent sends webhooks to external services (or you receive webhooks from external sources that trigger agent runs), you need HMAC signature verification. This is where a surprising number of people get stuck, usually because of encoding issues.
```python
import hmac
import hashlib
import json

WEBHOOK_SECRET = "your-secret-key"  # Store in env vars, obviously

def sign_payload(payload: dict) -> str:
    """Generate HMAC-SHA256 signature for webhook payload."""
    body = json.dumps(payload, sort_keys=True, separators=(',', ':'))
    # Transmit exactly these bytes (e.g. requests.post(url, data=body, ...))
    # so the receiver's raw body matches what was signed
    return hmac.new(
        WEBHOOK_SECRET.encode('utf-8'),
        body.encode('utf-8'),
        hashlib.sha256,
    ).hexdigest()

def verify_signature(payload_bytes: bytes, signature: str) -> bool:
    """Verify incoming webhook signature."""
    expected = hmac.new(
        WEBHOOK_SECRET.encode('utf-8'),
        payload_bytes,
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(expected, signature)
```
The gotcha that trips everyone up: you must sign the raw bytes of the request body, not a re-serialized version. JSON serialization is not deterministic: different libraries may order keys differently or handle whitespace differently. On the receiving end, verify against request.body() (raw bytes), not json.dumps(await request.json()).
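Here's a self-contained demonstration of why this bites: the signature verifies against the exact transmitted bytes but fails against an innocent-looking re-serialization (the compact raw body and secret are illustrative):

```python
import hmac
import hashlib
import json

WEBHOOK_SECRET = b"your-secret-key"  # from env vars in practice

def verify_signature(payload_bytes: bytes, signature: str) -> bool:
    expected = hmac.new(WEBHOOK_SECRET, payload_bytes, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

# The sender signs the exact bytes it transmits (compact separators here)
raw_body = b'{"event_type":"refund.processed","amount":42}'
signature = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()

# Verifying against the raw bytes works
assert verify_signature(raw_body, signature)

# Parsing and re-serializing produces different bytes (default json.dumps
# inserts spaces), so verification fails even though the data is "the same"
reserialized = json.dumps(json.loads(raw_body)).encode()
assert reserialized != raw_body
assert not verify_signature(reserialized, signature)
```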
5. State Loss on Failure
This is the failure mode that's unique to agent systems. Traditional webhook failures are stateless: you retry and move on. But with OpenClaw agents, a failed webhook can mean the agent's state diverges from reality.
Example: your agent processes a customer refund, sends a webhook to your payment service, the webhook fails, and the agent marks the refund as "completed" in its internal state. Now the agent tells the customer their refund was processed, but it wasn't.
OpenClaw's checkpoint system is your best friend here. Use it to create savepoints before critical webhook calls, and implement a verification step after:
```python
from openclaw import Agent, Checkpoint

class RefundAgent(Agent):
    def process_refund(self, refund_data: dict):
        # Step 1: Process the refund logic
        result = self.tools.calculate_refund(refund_data)

        # Step 2: Checkpoint BEFORE the webhook
        checkpoint = Checkpoint.save(
            agent_run_id=self.run_id,
            step="pre_webhook_notification",
            state={
                "refund_data": refund_data,
                "result": result,
                "webhook_sent": False,
                "webhook_confirmed": False,
            },
        )

        # Step 3: Send webhook with confirmation requirement
        webhook_result = self.tools.notify_completion({
            "event_type": "refund.processed",
            "refund_id": refund_data["id"],
            "amount": result["amount"],
            "requires_confirmation": True,
        })

        # Step 4: Check if webhook was actually delivered
        if "FAILED" in webhook_result:
            # Keep the checkpoint state and flag for manual review
            checkpoint.update(state={
                **checkpoint.state,
                "webhook_sent": False,
                "needs_manual_review": True,
                "failure_reason": webhook_result,
            })
            return "Refund processed but notification failed. Flagged for review."

        # Step 5: Update checkpoint to reflect success
        checkpoint.update(state={
            **checkpoint.state,
            "webhook_sent": True,
            "webhook_confirmed": True,
        })
        return "Refund processed and confirmed."
```
The pattern here is checkpoint → attempt → verify → update. Never let the agent assume success without confirmation.
6. Large Payloads and Timeouts
OpenClaw agents can be verbose. If your agent did a 15-step research task, the accumulated context, tool outputs, and reasoning traces can easily produce a payload that's several megabytes. Most webhook receivers have payload limits (often 256KB or 1MB), and even if they accept it, parsing a 5MB JSON blob takes time, causing timeouts.
The fix is simple: send minimal webhook payloads with a reference ID, and let the receiver pull the full data if needed.
```python
import time

from openclaw import Tool

WEBHOOK_URL = "https://your-service.com/webhook"

class SlimWebhook(Tool):
    name = "notify_completion"
    description = "Sends a slim notification with a reference to the full results"

    def run(self, agent_run_id: str, event_type: str, summary: str) -> str:
        # Store full results in your database/object store
        results_url = self._store_full_results(agent_run_id)

        # Send minimal payload (generate_idempotency_key from the earlier example)
        payload = {
            "event_type": event_type,
            "agent_run_id": agent_run_id,
            "summary": summary[:500],  # Hard cap on summary length
            "results_url": results_url,
            "timestamp": time.time(),
            "idempotency_key": generate_idempotency_key(
                agent_run_id, "completion", {"event_type": event_type}
            ),
        }

        # This payload will always be under 1KB
        self._send_with_retry(WEBHOOK_URL, payload)
        return f"Notification sent. Full results at: {results_url}"
```
This also has a nice side effect: your webhook logs are actually readable. Instead of scrolling through pages of agent reasoning to find the relevant data, you see clean, structured events.
7. Scaling and Rate Limits
When you're running 10 OpenClaw agents, direct webhook calls work fine. When you're running 500 agents concurrently, you overwhelm downstream services. Most webhook receivers aren't designed for burst traffic from AI workloads.
The solution is to decouple the agent from the webhook delivery using a queue:
```python
import json
import time
import uuid

import redis
from openclaw import Tool

class QueuedWebhook(Tool):
    name = "notify_completion"
    description = "Queues a webhook notification for reliable delivery"

    def __init__(self):
        self.redis = redis.Redis()

    def run(self, payload: dict) -> str:
        event = {
            "payload": payload,
            "url": "https://your-service.com/webhook",
            "max_retries": 5,
            "created_at": time.time(),
            "idempotency_key": payload.get("idempotency_key", str(uuid.uuid4())),
        }
        self.redis.lpush("webhook:outbox", json.dumps(event))
        return "Notification queued for delivery."
```
Then run a separate worker that drains the queue with rate limiting:
```python
import time
import json
import logging

import redis
import requests

r = redis.Redis()
logger = logging.getLogger("webhook_worker")

RATE_LIMIT = 50  # max webhooks per second

def process_webhook_queue():
    while True:
        event_json = r.rpop("webhook:outbox")
        if not event_json:
            time.sleep(0.1)
            continue
        event = json.loads(event_json)
        try:
            response = requests.post(
                event["url"],
                json=event["payload"],
                timeout=15,
            )
            if response.status_code >= 500:
                handle_failure(event)
            else:
                logger.info("Delivered webhook %s", event["idempotency_key"])
        except Exception as e:
            handle_failure(event, error=str(e))
        time.sleep(1 / RATE_LIMIT)

def handle_failure(event, error=None):
    retries = event.get("retry_count", 0)
    if retries < event["max_retries"]:
        event["retry_count"] = retries + 1
        # Re-queue with delay (exponential backoff via sorted set).
        # A companion loop must move due entries from webhook:retry
        # back onto webhook:outbox once their retry time passes.
        retry_at = time.time() + (2 ** retries)
        r.zadd("webhook:retry", {json.dumps(event): retry_at})
    else:
        # Dead letter queue
        r.lpush("webhook:dead_letter", json.dumps({
            **event,
            "final_error": error,
            "failed_at": time.time(),
        }))
```
The dead letter queue is critical. When a webhook has failed all retries, you don't want to lose it forever. Put it somewhere you can inspect and replay manually.
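A small replay helper makes that manual inspect-and-replay loop concrete. This is a sketch with injected dependencies (client is any Redis-like connection, send(url, payload) is your delivery function returning True on success), so you can point it at staging before touching production:

```python
import json

def replay_dead_letters(client, send, max_events: int = 10) -> dict:
    """Pop up to max_events from the dead letter queue and re-send them.
    Events that still fail are pushed back for later inspection."""
    replayed = 0
    failed = []
    for _ in range(max_events):
        event_json = client.rpop("webhook:dead_letter")
        if event_json is None:
            break
        event = json.loads(event_json)
        if send(event["url"], event["payload"]):
            replayed += 1
        else:
            failed.append(event_json)
    # Re-queue failures only after the pass, so we don't pop them again
    for event_json in failed:
        client.lpush("webhook:dead_letter", event_json)
    return {"replayed": replayed, "still_dead": len(failed)}
```

Run it only after you've fixed the root cause; replaying into a still-broken receiver just burns your retry budget again.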
Putting It All Together: A Production-Ready Pattern
Here's the full pattern I recommend for any serious OpenClaw deployment:
- Agent tools queue events instead of making direct HTTP calls
- A delivery worker drains the queue with retries, rate limiting, and dead letter handling
- Every event has an idempotency key derived from the agent run ID and step
- Payloads are slim: reference ID + event type + short summary
- Webhook calls are traced with OpenTelemetry spans or OpenClaw's native tracing
- The agent gets feedback about delivery success/failure through the tool's return value
- Checkpoints bracket critical webhook calls so you can recover from failures
This isn't over-engineering. This is the minimum viable setup for production agent systems that other services depend on. I've seen teams skip these steps and spend weeks debugging phantom failures that would have been obvious with proper observability, or lose real revenue because of duplicate webhook deliveries.
Getting Started Without the Boilerplate
If you're just getting into OpenClaw and don't want to build all of this infrastructure from scratch on day one, Felix's OpenClaw Starter Pack is genuinely the fastest way to get running. It includes pre-configured tool templates, sensible defaults for retry logic, and the kind of project structure that makes adding proper webhook handling straightforward rather than an afterthought. I recommend it to anyone who asks me "where do I start with OpenClaw" because it encodes a lot of the patterns that took the rest of us months of trial and error to figure out.
What To Do Next
If you're currently running OpenClaw agents in production with direct webhook calls and no retry logic: fix that today. It's the single highest-impact improvement you can make. Wrap your HTTP calls with tenacity, add idempotency keys, and make sure your agent tools return clear failure messages.
If you're just starting: set up the queued delivery pattern from the beginning. It's barely more work than the naive approach, and it'll save you from a painful migration later.
If you're at scale: instrument everything with tracing, set up alerting on webhook failure rates, and implement the checkpoint pattern around any webhook that triggers side effects you can't easily undo.
The agents themselves are the exciting part. The webhook plumbing is not glamorous. But the unglamorous plumbing is exactly what separates a cool demo from a system you can actually trust with real work.