# OpenClaw Agent Keeps Restarting: What to Do

Look, if you're reading this, your OpenClaw agent is probably stuck in a restart loop right now. Maybe it crashed at step 37 of a 50-step research task. Maybe it's burning through API credits like a bonfire while accomplishing absolutely nothing. Maybe you walked away for lunch, came back, and realized it had been restarting itself for 45 minutes straight.
I've been there. Most people building with agents have been there. And the frustrating part isn't the crash itself — it's that most frameworks treat this as your problem to solve, when it should have been handled at the infrastructure level from day one.
Let's fix it. No hand-waving, no "it depends" — just the actual steps to diagnose why your OpenClaw agent keeps restarting and what to do about it.
## Why Agents Restart (The Real Reasons, Not the Obvious Ones)
Before you start randomly tweaking config files, you need to understand the five actual causes of agent restart loops. They're not all the same problem, and they don't all have the same fix.
### 1. Unhandled API Failures
This is the most common one. Your agent makes an LLM call, the API returns a 429 (rate limit) or 500 (server error), and instead of retrying gracefully, the whole process crashes. OpenClaw restarts the agent because that's what it's supposed to do when a process dies. But if the underlying issue isn't resolved, you get a loop.
### 2. State Loss Between Restarts
Your agent was on step 23. It crashes. It restarts. But it doesn't know it was on step 23, so it starts from step 1. It hits the same conditions that caused the crash in the first place. Restart. Repeat. This is the classic "no durable state" problem that plagues almost every agent framework.
### 3. Context Window Overflow
Long-running agents accumulate conversation history. At some point, the context exceeds the model's window. The call fails, the agent restarts, it rebuilds context, hits the limit again. Restart loop.
### 4. Infinite Decision Loops
Sometimes the agent isn't crashing at all — it's deciding to restart itself. It gets stuck in a reasoning loop ("I should search for X... but first I need Y... but to get Y I need X..."), hits a timeout or step limit, and the orchestrator restarts it. The agent then makes the exact same decisions.
### 5. Resource Exhaustion
Memory leaks, runaway file handles, disk space from accumulated logs — the process gets killed by the OS or container orchestrator, not by any logic error. OpenClaw dutifully restarts it, the leak starts again, and you've got a loop.
Here's how to figure out which one is hitting you.
## Step 1: Check the Logs (Actually Check Them)
I know, obvious. But most people skim the logs looking for the word "error" and miss the actual signal. Here's what to actually look for:
```shell
openclaw logs --agent-id YOUR_AGENT_ID --tail 200 --level debug
```
You're looking for three things:
- The last successful step before the restart. This tells you where the failure happens. If it's always the same step, you have a deterministic bug. If it's random, you probably have a transient infrastructure issue.
- The restart reason. OpenClaw tags each restart with a reason code. Look for PROCESS_EXIT, HEALTH_CHECK_TIMEOUT, OOM_KILLED, or STEP_LIMIT_EXCEEDED. Each one points to a different root cause.
- The time between restarts. If it's getting shorter each time, you likely have a state accumulation problem. If it's consistent, it's probably hitting the same wall each time.
```shell
openclaw logs --agent-id YOUR_AGENT_ID --filter "restart_reason" --format json | jq '.[] | {timestamp, reason, last_step}'
```
This gives you a clean timeline. Nine times out of ten, the pattern becomes obvious once you see it laid out.
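If you'd rather script that triage than eyeball it, the same grouping takes a few lines of plain Python. This is a framework-agnostic sketch: the field names (timestamp, reason, last_step) mirror the jq output above, but your actual log schema may differ.

```python
import json
from collections import Counter
from datetime import datetime

def summarize_restarts(log_lines):
    """Group restart events by reason and measure the gaps between them."""
    events = [json.loads(line) for line in log_lines]
    reasons = Counter(e["reason"] for e in events)
    times = [datetime.fromisoformat(e["timestamp"]) for e in events]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return reasons, gaps

# Three crashes, same reason, same step: a deterministic bug, not flaky infra.
logs = [
    '{"timestamp": "2025-01-01T12:00:00", "reason": "PROCESS_EXIT", "last_step": 23}',
    '{"timestamp": "2025-01-01T12:01:30", "reason": "PROCESS_EXIT", "last_step": 23}',
    '{"timestamp": "2025-01-01T12:02:10", "reason": "PROCESS_EXIT", "last_step": 23}',
]
reasons, gaps = summarize_restarts(logs)
print(reasons)  # one dominant reason points at one root cause
print(gaps)     # shrinking gaps suggest state accumulation
```

One dominant reason code means one root cause; shrinking gaps between restarts point at state accumulation.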
## Step 2: Enable Checkpointing (If You Haven't Already)
This is the single biggest thing you can do to prevent restart-related pain, and it's wild how many people skip it during initial setup.
OpenClaw has built-in checkpointing as a first-class feature — it's not some bolted-on afterthought. But it's not enabled by default in the most minimal configurations because it adds some overhead that not every use case needs.
Here's how to turn it on:
```yaml
# openclaw.config.yaml
agent:
  name: "research-agent"
  checkpoint:
    enabled: true
    backend: "sqlite"       # Options: sqlite, postgres, redis
    interval: "every_step"  # Options: every_step, every_n_steps, on_tool_call
    retention: 50           # Keep last 50 checkpoints
  recovery:
    strategy: "resume_from_last"  # Options: resume_from_last, resume_from_stable, restart_clean
    max_retries: 3
    backoff: "exponential"
```
The key settings:
- backend: SQLite is fine for local development and single-agent setups. If you're running multiple agents or need durability across server restarts, use Postgres or Redis.
- interval: every_step gives you the finest granularity but adds latency. For most agents, on_tool_call is the sweet spot — it checkpoints before any external interaction (API calls, file writes, etc.), which is where failures almost always happen.
- recovery.strategy: resume_from_last picks up exactly where the agent left off. resume_from_stable rolls back to the last checkpoint where all tool calls succeeded (useful when the tool itself is the problem). restart_clean is the default when checkpointing is off — full restart from scratch.
Once this is enabled, your restart loop should at minimum stop being expensive, because the agent won't repeat work it's already done.
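To see why checkpointing changes the economics of a restart, here's a minimal sketch of the underlying idea in plain Python with SQLite. This is not OpenClaw's internal schema, just an illustration of durable step state: save after each step, and on restart resume from the latest row instead of step 1.

```python
import json
import sqlite3

class CheckpointStore:
    """Durable step state in SQLite. A sketch only, not OpenClaw's
    actual checkpoint implementation."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints (step INTEGER PRIMARY KEY, state TEXT)"
        )

    def save(self, step, state):
        # Called after each completed step (or before each tool call).
        self.db.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (step, json.dumps(state)),
        )
        self.db.commit()

    def latest(self):
        # On restart: resume from here instead of step 1.
        row = self.db.execute(
            "SELECT step, state FROM checkpoints ORDER BY step DESC LIMIT 1"
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (0, {})

store = CheckpointStore()
store.save(22, {"notes": "sources gathered"})
store.save(23, {"notes": "draft started"})
step, state = store.latest()  # the crash at step 23 no longer costs steps 1-22
```

Swap the SQLite connection for Postgres or Redis and you have the durability story the backend setting is describing.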
## Step 3: Add Proper Error Boundaries
Checkpointing handles recovery after a crash. But the better move is to stop the crash from happening in the first place.
OpenClaw lets you define error boundaries around specific tools and steps. Think of these like try/catch blocks, but for agent behavior:
```python
from openclaw import Agent, ErrorBoundary, RetryPolicy

agent = Agent("research-agent")

# Define retry policy for flaky APIs
api_retry = RetryPolicy(
    max_attempts=3,
    backoff="exponential",
    initial_delay=2.0,
    max_delay=60.0,
    retryable_errors=[429, 500, 502, 503, 504]
)

@agent.tool("web_search", error_boundary=ErrorBoundary(
    retry_policy=api_retry,
    fallback="skip_and_note",  # Options: skip_and_note, use_cached, raise, ask_human
    timeout=30.0
))
def web_search(query: str) -> str:
    # Your search implementation
    results = search_api.query(query)
    return results.to_text()

@agent.tool("file_write", error_boundary=ErrorBoundary(
    retry_policy=RetryPolicy(max_attempts=2),
    fallback="raise",  # File writes should fail loudly
    timeout=10.0
))
def file_write(path: str, content: str) -> str:
    with open(path, 'w') as f:
        f.write(content)
    return f"Written to {path}"
```
The fallback option is crucial:
- skip_and_note: The agent skips the failed tool call and gets a note in its context saying "web_search failed after 3 retries, proceeding without this result." This is usually the right choice for non-critical information gathering.
- use_cached: Returns the last successful result from this tool with similar inputs. Great for search and lookup tools.
- raise: Propagates the error up. Use this for tools where failure means the whole task is invalid (like writing final output).
- ask_human: Pauses the agent and waits for human input. Best for production agents where you have someone monitoring.
The skip_and_note pattern alone eliminates probably 60% of restart loops I've seen, because most crashes come from a search API timing out or a rate limit hit — and the agent usually doesn't need that one specific result to continue.
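The same idea works outside any framework. Here's a hedged sketch of an error boundary as a plain Python decorator: retry with exponential backoff, then return a fallback note instead of letting the exception kill the process. The names (with_boundary, flaky_search) are illustrative, not OpenClaw APIs.

```python
import functools
import time

def with_boundary(max_attempts=3, base_delay=0.01, fallback=None):
    """Retry with exponential backoff; on final failure, return a
    fallback note instead of crashing the process."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        if fallback is not None:
                            return fallback  # the "note" the agent sees
                        raise
                    time.sleep(delay)
                    delay *= 2  # exponential backoff
        return wrapper
    return decorate

calls = {"n": 0}

@with_boundary(max_attempts=3, fallback="web_search failed after 3 retries; proceeding without it")
def flaky_search(query):
    calls["n"] += 1
    raise TimeoutError("rate limited")  # simulate a permanently flaky API

result = flaky_search("openclaw restart loop")
```

The process survives, the agent gets a note it can reason about, and the restart loop never starts.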
## Step 4: Handle Context Window Overflow
If your agent runs for more than about 20-30 steps, you'll eventually hit context limits. Here's the configuration to handle this:
```yaml
agent:
  context:
    max_tokens: 120000            # Stay under model limit
    overflow_strategy: "summarize_and_compress"
    compression_threshold: 0.8    # Compress when 80% full
    preserve_recent_steps: 10     # Always keep the last 10 steps in full
    preserve_system_prompt: true
```
What this does: when the context hits 80% of the max, OpenClaw automatically summarizes older steps into a compressed format while keeping recent steps intact. The agent gets a summary like "Steps 1-15: Researched topic X, found sources A, B, C, extracted key data points D, E, F" instead of the full verbose history of every single thought and action.
You can also handle this programmatically:
```python
from openclaw import Agent, ContextManager

agent = Agent("long-running-agent")

@agent.on("context_threshold")
def handle_context_overflow(ctx: ContextManager):
    # Custom compression logic
    summary = ctx.summarize_steps(
        start=0,
        end=ctx.current_step - 10,
        strategy="key_decisions_and_outputs"
    )
    ctx.replace_steps(start=0, end=ctx.current_step - 10, replacement=summary)
    ctx.add_note("Context was compressed. Full history available in checkpoints.")
```
This prevents the silent context overflow crash that leads to mysterious restart loops.
## Step 5: Detect and Break Decision Loops
This is the trickiest one because the agent isn't failing — it's just not progressing. OpenClaw has a loop detection mechanism you should configure:
```yaml
agent:
  loop_detection:
    enabled: true
    max_similar_steps: 3        # Flag after 3 similar consecutive steps
    similarity_threshold: 0.85  # How similar steps need to be (semantic)
    action_on_loop: "inject_guidance"
    guidance_on_loop: |
      You appear to be repeating similar actions without making progress.
      Please try a different approach or summarize what you've accomplished
      so far and identify what's specifically blocking you.
```
When the agent makes three consecutive steps that are semantically similar (same tool calls with similar arguments, same reasoning patterns), OpenClaw injects the guidance message into the context. This is surprisingly effective — it's like tapping someone on the shoulder who's been staring at the same spreadsheet for an hour.
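To make the mechanism concrete, here's a rough stand-in for loop detection using plain string similarity from the standard library. OpenClaw's detector is described as semantic; difflib's ratio is a much cruder proxy, but it shows the shape of the check: compare the last N step descriptions pairwise against a threshold.

```python
from difflib import SequenceMatcher

def detect_loop(recent_steps, max_similar=3, threshold=0.85):
    """Flag a loop when the last N step descriptions are pairwise
    near-identical. Lexical similarity is a crude stand-in for a
    semantic comparison."""
    if len(recent_steps) < max_similar:
        return False
    window = recent_steps[-max_similar:]
    return all(
        SequenceMatcher(None, a, b).ratio() >= threshold
        for a, b in zip(window, window[1:])
    )

# Three near-identical searches: the agent is spinning, not progressing.
stuck = [
    "web_search('pricing data for vendor X')",
    "web_search('pricing data for vendor X 2024')",
    "web_search('pricing data for vendor X latest')",
]
varied = ["web_search('vendor X')", "read_file('notes.md')", "write_summary('report')"]
```

Here detect_loop(stuck) fires while detect_loop(varied) does not, which is exactly the signal the guidance injection keys off.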
For more aggressive loop breaking:
```python
@agent.on("loop_detected")
def handle_loop(loop_info):
    if loop_info.iteration_count > 5:
        # Nuclear option: force a different approach
        agent.inject_context(
            f"SYSTEM: Previous approach failed after {loop_info.iteration_count} attempts. "
            f"The repeated action was: {loop_info.repeated_action}. "
            f"You MUST try a completely different strategy."
        )
        agent.clear_tool_cache()  # Force fresh results
```
## Step 6: Set Up Proper Monitoring
Once you've fixed the immediate restart issue, set up monitoring so you catch problems before they become loops:
```python
from openclaw import Agent, Monitor

agent = Agent("production-agent")
monitor = Monitor(agent)

# Alert if agent restarts more than twice in 10 minutes
monitor.add_alert(
    condition="restart_count > 2",
    window="10m",
    action="pause_and_notify",
    notify={"webhook": "https://your-slack-webhook.com/..."}
)

# Track cost per run
monitor.track("token_usage", aggregate="per_run")
monitor.track("tool_calls", aggregate="per_step")
monitor.track("restart_count", aggregate="per_hour")
```
You can also view this in the OpenClaw dashboard:
```shell
openclaw dashboard --agent-id YOUR_AGENT_ID
```
This gives you a real-time view of step progression, token usage, restart events, and checkpoint status. It's the difference between guessing what's wrong and actually seeing it.
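The restart alert boils down to counting events in a sliding time window. Here's a small, framework-free sketch of that logic; the notification side (webhook, pausing the agent) is left out, and record() just returns whether the alert condition fired.

```python
from collections import deque

class RestartAlert:
    """Sliding-window restart counter: fires when more than max_restarts
    events land inside window_seconds."""

    def __init__(self, max_restarts=2, window_seconds=600):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.events = deque()

    def record(self, timestamp):
        self.events.append(timestamp)
        # Drop events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.max_restarts  # True -> pause_and_notify

alert = RestartAlert(max_restarts=2, window_seconds=600)
fired = [alert.record(t) for t in (0, 120, 250)]  # three restarts in ~4 minutes
```

The third restart inside the window trips the alert; two restarts spread over hours never would.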
## The Full Production-Ready Config
Here's the complete configuration that incorporates everything above. This is what I use as a starting point for any agent I expect to run for more than a few minutes:
```yaml
# openclaw.config.yaml - Production template
agent:
  name: "my-agent"

  checkpoint:
    enabled: true
    backend: "postgres"
    interval: "on_tool_call"
    retention: 100

  recovery:
    strategy: "resume_from_stable"
    max_retries: 3
    backoff: "exponential"
    max_backoff: 120

  context:
    max_tokens: 120000
    overflow_strategy: "summarize_and_compress"
    compression_threshold: 0.75
    preserve_recent_steps: 15
    preserve_system_prompt: true

  loop_detection:
    enabled: true
    max_similar_steps: 3
    similarity_threshold: 0.85
    action_on_loop: "inject_guidance"

  timeouts:
    step_timeout: 120
    total_timeout: 3600
    tool_timeout: 30

  resource_limits:
    max_memory_mb: 2048
    max_steps: 200
    max_tokens_per_run: 500000
```
This config alone prevents the vast majority of restart loops. Copy it, adjust the numbers for your use case, and save yourself hours of debugging.
## Getting Started Without the Pain
If all of this feels like a lot of configuration to get right on your first try — it is. There's a real learning curve to setting up durable, production-ready agents, and the difference between a config that works and one that subtly doesn't can be a single misplaced setting.
That's why I'd seriously recommend checking out Felix's OpenClaw Starter Pack. It comes with pre-configured templates that have sensible defaults for checkpointing, error boundaries, and recovery — basically all the stuff I walked through above, but already wired together and tested. Instead of spending your first weekend debugging restart loops, you can start with a setup that actually works and customize from there. It's the kind of thing I wish existed when I was first getting started.
## What To Do Right Now
If your agent is currently stuck in a restart loop:
- Run openclaw logs with the debug flag and identify the restart reason.
- Enable checkpointing with resume_from_stable recovery. This alone might fix it.
- Add error boundaries to your flakiest tools (usually web search and external APIs).
- Check your context window — if the agent runs for 20+ steps, you probably need compression.
- Enable loop detection so the agent can self-correct when it gets stuck.
If your agent is working fine but you want to prevent future restart issues:
- Use the production config template above as your baseline.
- Set up monitoring with restart alerts so you know about problems before your users do.
- Test failure scenarios — kill the process mid-run and verify it resumes correctly.
The whole point of OpenClaw's approach to durability is that you shouldn't have to babysit your agents. Set up the guardrails once, configure recovery properly, and let the framework handle the chaos of real-world execution. Your agents will still encounter errors — every agent does. The difference is whether those errors mean "lost an hour of work and $15 in API credits" or "recovered in 3 seconds and kept going."
Build the resilient version. Your future self (and your API bill) will thank you.