How OpenClaw Heartbeats Keep Your Agents Alive 24/7

Let me be real: if you've ever built an autonomous agent that ran beautifully for three hours and then silently died at 2am with zero recoverable state, you already understand why heartbeats matter. You just might not know what to call the fix.
OpenClaw's heartbeat system is that fix. And once you understand how it works, you'll never go back to the fragile, pray-it-doesn't-crash agent loops that most people are still running.
I'm going to walk you through exactly what OpenClaw heartbeats are, why they exist, how to configure them properly, and the specific settings that'll keep your agents alive and recoverable 24/7. No hand-waving. Actual implementation details.
The Problem Nobody Talks About Until It Burns Them
Here's the scenario that plays out constantly: You spin up an agent. It's doing research, calling tools, accumulating context, making decisions. It's working. You go to bed. You wake up and check on it. The process died six hours ago on step 142 of a 200-step workflow. There's no checkpoint. There's no recoverable state. The six hours of work it completed? Gone. You're starting over.
This happens because most agent architectures use a naive loop. It's essentially:
while not done:
    think()
    act()
    observe()
That's it. That's the entire reliability model. A while loop. If the process crashes, if the API rate-limits you, if the hosting platform recycles your container, if memory usage creeps up and triggers an OOM kill — everything accumulated inside that loop vanishes.
People try to patch this with try/except blocks, manual checkpointing, cron-job restarts. It's all duct tape. The fundamental architecture is wrong.
OpenClaw takes a different approach. Instead of treating the agent loop as a continuous process that hopefully doesn't die, it treats every single iteration as a discrete, durable heartbeat — a transactional unit of work that gets persisted before the next one begins.
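To make the difference concrete, here's a minimal sketch of that idea — my own illustration of a checkpoint-before-continue loop, not OpenClaw's actual implementation. It persists state to a local JSON file (a stand-in for a real persistence layer) after every iteration, so a restarted process resumes from the last completed heartbeat:

```python
import json
import os

STATE_FILE = "agent_state.json"  # stand-in for a real persistence layer

def load_state():
    # Resume from the last completed heartbeat, or start fresh.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"heartbeat": 0, "findings": []}

def persist(state):
    # Write to a temp file and rename, so a crash mid-write
    # can never corrupt the last good checkpoint.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)

def run(step, max_heartbeats=5):
    state = load_state()
    while state["heartbeat"] < max_heartbeats:
        step(state)                # think / act / observe
        state["heartbeat"] += 1
        persist(state)             # durable before the next iteration begins
    return state
```

Kill this process at any point and rerun it: it picks up at the last persisted heartbeat instead of heartbeat zero. That's the whole trick, and everything below is OpenClaw's production-grade version of it.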
What an OpenClaw Heartbeat Actually Is
A heartbeat in OpenClaw is one complete cycle of your agent's think-act-observe loop, wrapped in a transactional envelope that guarantees three things:
- The full state is persisted before the next heartbeat begins.
- The heartbeat emits a structured event so you always know what happened.
- If anything fails mid-heartbeat, the system can recover to the last successfully completed one.
Think of it like a save point in a video game. Every heartbeat is an automatic save. Crash on heartbeat 143? Cool, you resume from heartbeat 142 with full context, full tool history, full accumulated knowledge. Nothing lost.
Here's what a basic OpenClaw agent with heartbeats looks like:
# agent.openclaw.yaml
agent:
  name: competitive-intel-researcher
  heartbeat:
    enabled: true
    persistence: postgres
    interval: adaptive
    max_consecutive_failures: 3
    on_failure: pause_and_notify
  state:
    store: postgres
    compression: summarize_after_50
    max_context_tokens: 12000
  skills:
    - web_search
    - document_analysis
    - report_generation
  notifications:
    slack_webhook: ${SLACK_WEBHOOK_URL}
    notify_on:
      - heartbeat_failure
      - agent_paused
      - agent_completed
Let's break down what's happening here.
The Configuration That Actually Matters
interval: adaptive
This is the setting most people get wrong on their first try, so let's start here.
A fixed heartbeat interval is a trap. If you set it to 2 seconds, you're burning through API calls and racking up LLM costs for no reason during long tool executions. If you set it to 60 seconds, your agent feels unresponsive and you lose granularity on failures.
OpenClaw's adaptive mode solves this cleanly. It adjusts the heartbeat frequency based on what the agent is actually doing:
- During active LLM reasoning: heartbeats fire at the natural cadence of completions (effectively after each LLM response).
- During tool execution: the heartbeat pauses and waits for the tool to return, then fires immediately after.
- During idle/waiting states: heartbeats slow down to a configurable minimum (default: every 30 seconds) just to confirm the agent is still alive.
You can also set explicit bounds:
heartbeat:
  interval: adaptive
  min_interval_ms: 1000
  max_interval_ms: 60000
  tool_wait_timeout_ms: 300000  # 5 min max for any single tool call
That tool_wait_timeout_ms is critical. Without it, a hung API call (looking at you, every web scraper ever) will block your heartbeat indefinitely, and your monitoring will think the agent is still alive when it's actually stuck. With this setting, if a tool call exceeds 5 minutes, the heartbeat system kills it, logs the failure, and lets the agent decide what to do next.
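As a rough mental model of how an adaptive policy like this might behave — this is my own hypothetical sketch, not OpenClaw's scheduler — the next interval depends on what phase the agent is in, clamped to the configured bounds:

```python
def next_interval_ms(phase, idle_streak=0, min_ms=1000, max_ms=60000):
    """Hypothetical adaptive-interval policy mirroring the behavior
    described above: phase is 'reasoning', 'tool_call', or 'idle'."""
    if phase == "reasoning":
        return min_ms        # fire at the natural cadence of completions
    if phase == "tool_call":
        return 0             # fire immediately once the tool returns
    # Idle: back off exponentially, capped at the configured maximum.
    return min(min_ms * (2 ** idle_streak), max_ms)
```

The point of the clamp is the same as the YAML bounds: short intervals when there's real activity to record, long ones when the agent is just proving it's alive.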
persistence: postgres
Every completed heartbeat writes a full state snapshot to your persistence layer. OpenClaw supports Postgres (recommended for production), SQLite (fine for local development), and Redis (for high-frequency, lower-durability use cases).
The Postgres implementation uses JSONB columns with automatic indexing, so you can actually query your agent's historical states:
-- Find the exact heartbeat where something went wrong
SELECT heartbeat_id, created_at, state_summary, tool_calls
FROM agent_heartbeats
WHERE agent_id = 'competitive-intel-researcher'
AND heartbeat_id BETWEEN 140 AND 145
ORDER BY heartbeat_id;
This alone is a game-changer for debugging. Instead of scrolling through a wall of logs that say "heartbeat 47: thinking..." you get structured, queryable records of every decision your agent made.
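For local development, the same pattern works against SQLite with a plain JSON text column in place of JSONB. Here's a small sketch — table and function names are my own illustration, not OpenClaw's schema — of recording snapshots and finding the latest one to resume from:

```python
import json
import sqlite3

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS agent_heartbeats (
            agent_id     TEXT,
            heartbeat_id INTEGER,
            status       TEXT,
            state        TEXT,   -- JSON snapshot (JSONB in Postgres)
            PRIMARY KEY (agent_id, heartbeat_id)
        )""")
    return db

def record(db, agent_id, heartbeat_id, status, state):
    # One row per completed heartbeat: the durable save point.
    db.execute("INSERT INTO agent_heartbeats VALUES (?, ?, ?, ?)",
               (agent_id, heartbeat_id, status, json.dumps(state)))
    db.commit()

def latest(db, agent_id):
    # The resume point after a crash: highest completed heartbeat.
    row = db.execute(
        "SELECT heartbeat_id, state FROM agent_heartbeats "
        "WHERE agent_id = ? ORDER BY heartbeat_id DESC LIMIT 1",
        (agent_id,)).fetchone()
    return (row[0], json.loads(row[1])) if row else (None, None)
```

Swap the connection for Postgres and the `state` column for JSONB and you get the queryability shown in the SQL above essentially for free.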
compression: summarize_after_50
Here's a subtlety that trips people up: if you're persisting full state on every heartbeat, and your agent runs for 500+ heartbeats, you're going to blow past context window limits and storage is going to balloon.
The summarize_after_50 setting tells OpenClaw to automatically compress older heartbeat states using an LLM-generated summary. The 50 most recent heartbeats retain full fidelity. Everything older gets summarized into a condensed representation that preserves key decisions, findings, and tool outputs without the token bloat.
You can tune this aggressively or conservatively:
state:
  compression: summarize_after_20   # aggressive, saves tokens
  summary_model: fast               # uses a smaller/cheaper model for summaries
  preserve_tool_outputs: true       # always keep raw tool outputs even in summaries
That preserve_tool_outputs: true flag is one I'd recommend always keeping on. Summaries can lose nuance, but raw data from tool calls (search results, API responses, file contents) should be preserved verbatim whenever possible.
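The compression logic itself is easy to picture. Here's a hedged sketch of the shape of it — my own illustration, with a placeholder in place of the real LLM summarizer — that keeps the most recent N heartbeats at full fidelity, collapses everything older into one summary record, and preserves raw tool outputs verbatim:

```python
def compress_history(heartbeats, keep_full=50, summarize=None):
    """heartbeats: list of dicts, oldest first. Returns a new history
    where everything older than the last `keep_full` entries is
    collapsed into a single summary record."""
    if len(heartbeats) <= keep_full:
        return heartbeats
    old, recent = heartbeats[:-keep_full], heartbeats[-keep_full:]
    # Placeholder for the LLM-generated summary (summary_model: fast).
    summarize = summarize or (lambda hbs: f"{len(hbs)} earlier heartbeats compressed")
    summary = {
        "type": "summary",
        "covers": (old[0]["id"], old[-1]["id"]),
        "text": summarize(old),
        # preserve_tool_outputs: keep raw tool data verbatim, never summarized.
        "tool_outputs": [o for hb in old for o in hb.get("tool_outputs", [])],
    }
    return [summary] + recent
```

A real implementation would re-summarize incrementally rather than from scratch, but the invariant is the same: recent history stays exact, old history stays cheap.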
on_failure: pause_and_notify
This is where OpenClaw earns its keep for production deployments. When consecutive heartbeat failures hit your max_consecutive_failures threshold, the system doesn't just crash. It:
- Persists the current state (including the error context).
- Pauses the agent.
- Sends a notification through your configured channel (Slack, email, webhook).
- Waits for either a manual resume command or an automatic retry after a configurable backoff period.
Here's what a more sophisticated failure handling config looks like:
heartbeat:
  max_consecutive_failures: 3
  on_failure: pause_and_notify
  retry:
    strategy: exponential_backoff
    initial_delay_ms: 5000
    max_delay_ms: 300000
    max_retries: 5
  circuit_breaker:
    enabled: true
    error_threshold: 0.5        # if 50%+ of last 10 heartbeats failed
    cooldown_period_ms: 600000  # wait 10 minutes before retrying
The circuit breaker is important. If your LLM provider is having an outage, you don't want your agent hammering a broken API every 5 seconds, burning through your error budget and potentially getting your API key throttled. The circuit breaker detects the pattern and backs off intelligently.
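Both mechanisms in that config are standard patterns. A minimal sketch of each — my own illustration of the semantics, not OpenClaw's source — looks like this:

```python
def backoff_delay_ms(attempt, initial_ms=5000, max_ms=300000):
    # Exponential backoff: 5s, 10s, 20s, ... capped at 5 minutes,
    # matching initial_delay_ms / max_delay_ms above.
    return min(initial_ms * (2 ** attempt), max_ms)

def breaker_open(last_statuses, error_threshold=0.5, window=10):
    # Trip the breaker when error_threshold (50%) or more of the
    # last `window` heartbeats failed; while open, stop retrying
    # until the cooldown period elapses.
    recent = last_statuses[-window:]
    if not recent:
        return False
    failures = sum(1 for s in recent if s == "failed")
    return failures / len(recent) >= error_threshold
```

The division of labor: backoff protects you from transient failures (rate limits, blips), while the breaker protects you from sustained ones (a provider outage), where retrying at any interval is just wasted budget.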
Setting Up Observability
A heartbeat system without observability is like a smoke detector without a battery. OpenClaw emits OpenTelemetry-compatible traces for every heartbeat, which means you can pipe them into whatever monitoring stack you're already using — Grafana, Datadog, Honeycomb, whatever.
observability:
  opentelemetry:
    enabled: true
    endpoint: ${OTEL_ENDPOINT}
    service_name: openclaw-agents
  structured_logs:
    enabled: true
    level: info
    format: json
    include_state_diff: true  # log what changed between heartbeats
The include_state_diff flag is incredibly useful for debugging. Instead of logging the entire state on every heartbeat (which gets massive), it logs only what changed. So you can see exactly: "Heartbeat 89: added 3 new search results, updated research_summary, incremented tool_call_count from 12 to 13."
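Computing that diff is cheap. For flat state, it's a sketch like this (my own illustration of the idea, not OpenClaw's diffing code, which presumably also recurses into nested structures):

```python
def state_diff(prev, curr):
    """Return only the keys that were added, removed, or changed
    between two heartbeat state snapshots."""
    diff = {}
    for key in curr:
        if key not in prev:
            diff[key] = {"added": curr[key]}
        elif prev[key] != curr[key]:
            diff[key] = {"from": prev[key], "to": curr[key]}
    for key in prev:
        if key not in curr:
            diff[key] = {"removed": prev[key]}
    return diff
```

Logging this instead of the full snapshot keeps log volume proportional to what the agent actually did, not to how much state it has accumulated.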
For a quick health dashboard, you can query heartbeat metrics directly:
-- Agent uptime and health over the last 24 hours
SELECT
  date_trunc('hour', created_at) AS hour,
  COUNT(*) AS total_heartbeats,
  COUNT(*) FILTER (WHERE status = 'success') AS successful,
  COUNT(*) FILTER (WHERE status = 'failed') AS failed,
  AVG(duration_ms) AS avg_duration_ms
FROM agent_heartbeats
WHERE agent_id = 'competitive-intel-researcher'
  AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1;
Human-in-the-Loop at Heartbeat Boundaries
One of the cleanest patterns in OpenClaw is using heartbeat boundaries as natural interruption points for human review. Instead of trying to interrupt an agent mid-thought (which is messy and error-prone), you define specific heartbeat conditions that trigger a pause for human input:
heartbeat:
  human_review:
    enabled: true
    trigger_conditions:
      - condition: cost_exceeds
        threshold_usd: 5.00
        action: pause_for_approval
      - condition: confidence_below
        threshold: 0.6
        action: pause_for_guidance
      - condition: every_n_heartbeats
        n: 100
        action: pause_for_checkpoint_review
This means your agent will automatically pause and ask for human input when it's spent more than $5, when it's uncertain about a decision, or every 100 heartbeats as a routine check-in. The state is fully persisted at the pause point, so the human can take hours to respond without any risk of state loss.
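Evaluating those triggers at a heartbeat boundary is a simple first-match check. Here's a hedged sketch (my own illustration of the semantics; the metric names are assumptions, not OpenClaw's API):

```python
def check_triggers(metrics, triggers):
    """Return the action of the first trigger whose condition fires,
    or None if the agent should keep running.
    metrics: {'cost_usd': ..., 'confidence': ..., 'heartbeat': ...}"""
    for t in triggers:
        cond = t["condition"]
        if cond == "cost_exceeds" and metrics["cost_usd"] > t["threshold_usd"]:
            return t["action"]
        if cond == "confidence_below" and metrics["confidence"] < t["threshold"]:
            return t["action"]
        if cond == "every_n_heartbeats" and metrics["heartbeat"] % t["n"] == 0:
            return t["action"]
    return None
```

Because the check runs between heartbeats, the state at the pause point is always a fully persisted checkpoint — which is exactly why this is cleaner than interrupting an agent mid-thought.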
A Real-World Example: The Multi-Day Research Agent
Let me walk through a concrete scenario. You're running a competitive intelligence agent that needs to:
- Monitor 15 competitor websites daily.
- Analyze new product launches, pricing changes, and content updates.
- Compile a weekly summary report.
This agent runs continuously. Without heartbeats, you're praying that a Python process stays alive for 7 days straight. With OpenClaw heartbeats, here's what actually happens:
- Heartbeat 1–50: Agent crawls competitor sites, extracts data, stores findings.
- Heartbeat 51: API rate limit hit on web search tool. Heartbeat fails.
- Heartbeat 52: Automatic retry with backoff. Succeeds. Continues.
- Heartbeat 53–200: More research, analysis, intermediate findings persisted.
- Heartbeat 201: Hosting platform recycles the container. Process dies.
- Agent restarts automatically (via your process manager, container orchestrator, etc.).
- Heartbeat 202: Resumes from persisted state at heartbeat 200. Two heartbeats of work lost, max. Not 200.
- Heartbeat 300–312: 3am. Agent hits an unexpected edge case. Three consecutive failures trigger pause_and_notify.
- You wake up at 7am, see the Slack notification, review the state, tweak the agent's instructions, hit resume.
- Heartbeat 313: Continues from exactly where it paused.
- Heartbeat 500: Weekly report generated and delivered.
Zero manual babysitting required. Zero catastrophic state loss. Full auditability of every step.
Getting Started Without the Configuration Pain
I've laid out a lot of configuration above, and honestly, getting all of this right from scratch takes some trial and error. The persistence setup, the adaptive intervals, the compression settings, the observability pipeline — each piece is individually straightforward but getting them all tuned and working together takes time.
If you don't want to set all this up manually, Felix's OpenClaw Starter Pack on Claw Mart is genuinely the fastest way to get running. It's $29 and includes pre-configured skills with heartbeat settings already tuned for the most common agent patterns — long-running research, monitoring, multi-step workflows. The persistence layer comes pre-wired for both Postgres and SQLite, the observability config is set up with sensible defaults, and the failure handling is already configured with the circuit breaker and notification patterns I described above. I burned a solid weekend getting my first heartbeat config production-ready. This pack would have saved me that weekend.
What To Do Next
Here's my recommended path:
- Start with a simple agent — one skill, one tool, heartbeats enabled with SQLite persistence. Get comfortable with the heartbeat lifecycle (start → persist → resume → complete).
- Add failure handling — intentionally break things. Kill the process mid-heartbeat. Simulate API failures. Watch the recovery work. Build trust in the system.
- Move to Postgres for production — SQLite is great for development but you want real concurrency and queryability for production agents.
- Set up observability early — don't wait until something breaks at 3am. Wire up structured logs and OpenTelemetry from the start. Future you will be grateful.
- Tune compression settings based on your specific use case. Research agents that accumulate lots of data need aggressive summarization. Simple monitoring agents might never need it.
- Add human-in-the-loop triggers once your agent is stable. Start with cost-based triggers (pause if spend exceeds X) and expand from there.
The fundamental insight behind OpenClaw heartbeats is simple: autonomous agents are not short-lived scripts. They're long-running, failure-prone processes operating in an unreliable world. Treating them like durable workflows instead of fragile loops is the difference between agents that demo well and agents that actually run in production.
Build accordingly.