March 20, 2026 · 8 min read · Claw Mart Team

OpenClaw Rate Limit Errors: How to Stay Under the Cap


If you've been building agents on OpenClaw for more than about forty-eight hours, you've already met the error. You know the one:

openclawError: Rate limit reached for model "ocl-4-agent" on requests per minute (RPM). Limit: 10, Used: 10, Requested: 1. Please try again in 6.023s.

Or its even more annoying cousin:

openclawError: Rate limit reached on tokens per minute (TPM). Limit: 60000, Used: 58921, Requested: 4302.

Your agent was humming along — it had pulled data from three sources, synthesized a summary, started drafting the output — and then it just died. Fifteen steps of accumulated context, tool calls, and intermediate reasoning, gone. You stare at the terminal. You mutter something unprintable. You hit the up arrow and run it again, knowing it'll probably die at step 18 this time instead of step 15.

This is the single most common frustration I see from people building on OpenClaw, and honestly, it's one of the top three reasons people prematurely conclude that AI agents "aren't production ready." They are. You just need to understand the rate limit system and engineer around it properly. That's what this entire post is about.

Why OpenClaw Rate Limits Exist (And Why They're Stricter Than You Think)

OpenClaw, like any platform routing requests to large language models, operates under capacity constraints. The limits exist to prevent any single user from monopolizing inference resources and to maintain quality of service across the platform.

Here's what trips people up: there are actually two separate limits running simultaneously.

  1. RPM (Requests Per Minute) — the raw number of API calls you can make in a 60-second window.
  2. TPM (Tokens Per Minute) — the total number of tokens (input + output) processed in a 60-second window.

You can hit either one independently. A lot of developers only think about RPM and get blindsided by TPM, especially when their agents are passing long conversation histories or large tool outputs back into the context window.

On OpenClaw's free tier, the limits are genuinely tight — sometimes as low as 3–10 RPM for the more powerful models. Even on paid tiers, the limits are lower than most people expect, especially if you're running multi-step agents that make parallel tool calls.

Here's the math that kills most agent workflows:

  • Your agent has a supervisor node and three worker nodes.
  • Each step involves 2–4 LLM calls (reasoning + tool selection + tool execution + synthesis).
  • A 20-step research task = 40–80 LLM calls.
  • At 10 RPM, that task takes a minimum of 4–8 minutes if you perfectly space every call.
  • But you're not perfectly spacing them. Your agent fires calls in bursts. So you hit the limit at step 6, the whole thing crashes, and you start over.
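As a quick sanity check, that arithmetic is easy to verify (a throwaway sketch; the numbers mirror the example above):

```python
# Best-case runtime for a multi-step agent under an RPM cap,
# assuming every call is perfectly spaced (the optimistic case).

def min_runtime_minutes(steps, calls_per_step, rpm):
    total_calls = steps * calls_per_step
    return total_calls / rpm

# A 20-step task at 2-4 LLM calls per step, capped at 10 RPM:
print(min_runtime_minutes(20, 2, 10))  # 4.0 (minutes, 40 calls)
print(min_runtime_minutes(20, 4, 10))  # 8.0 (minutes, 80 calls)
```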

This is the rate limit death spiral, and getting out of it requires a combination of backoff strategies, concurrency control, proactive throttling, and state persistence.

Let's go through each one.

Strategy 1: Exponential Backoff with Jitter

This is the absolute bare minimum. If you're not doing this, stop reading and implement it right now.

The idea is simple: when you get a 429 error, wait before retrying — and wait longer each subsequent time, with a random element so you don't create synchronized retry storms.

The tenacity library is the gold standard for this in Python:

from tenacity import retry, wait_exponential_jitter, stop_after_attempt, retry_if_exception_type
from openclaw import OpenClawError

@retry(
    wait=wait_exponential_jitter(initial=2, max=120, jitter=5),  # 2s, 4s, 8s... + jitter, capped at 120s
    stop=stop_after_attempt(8),
    # Note: this retries on any OpenClawError. In production you may want to
    # match only rate-limit (429) errors so genuine failures surface immediately.
    retry=retry_if_exception_type(OpenClawError),
)
def call_openclaw_model(client, messages, model="ocl-4-agent"):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    return response

What this does:

  • First retry waits ~2 seconds (plus up to 5 seconds of random jitter).
  • Second retry waits ~4 seconds + jitter.
  • Third retry waits ~8 seconds + jitter.
  • Keeps going up to a max of 120 seconds between attempts.
  • Gives up after 8 total attempts.

The jitter is critical. Without it, if you have multiple agent processes running, they'll all retry at the exact same time and immediately hit the limit again. Jitter spreads them out.
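You can see the effect with a toy schedule generator (a sketch of the same backoff math, not tenacity itself):

```python
import random

def backoff_delays(attempts, initial=2.0, max_wait=120.0, jitter=5.0, seed=None):
    """Exponential backoff schedule: initial * 2**n, capped at max_wait,
    plus up to `jitter` seconds of random noise per attempt."""
    rng = random.Random(seed)
    return [min(initial * (2 ** n), max_wait) + rng.uniform(0, jitter)
            for n in range(attempts)]

# Without jitter, two processes retry in lockstep and collide every time:
print(backoff_delays(4, jitter=0) == backoff_delays(4, jitter=0))  # True

# With jitter, their schedules diverge and the retries spread out:
print(backoff_delays(4, seed=1) == backoff_delays(4, seed=2))  # False
```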

This alone will fix maybe 40% of your rate limit crashes. But it's not enough for serious agent work because it only handles the retry — it doesn't prevent the problem in the first place.

Strategy 2: Concurrency Control (The One Most People Skip)

Here's where the real leverage is. Instead of letting your agent fire off LLM calls as fast as it wants and then dealing with the 429s reactively, you proactively limit how many calls can be in flight at once.

If your RPM limit is 10, you should never have more than a few concurrent requests:

import asyncio
from openclaw import AsyncOpenClaw

# Semaphore limits concurrent requests
MAX_CONCURRENT_REQUESTS = 3
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

# Simple delay between calls to spread them out
MIN_DELAY_BETWEEN_CALLS = 7  # seconds (gives ~8-9 RPM max)

last_call_time = 0
call_lock = asyncio.Lock()

async def throttled_openclaw_call(client, messages, model="ocl-4-agent"):
    global last_call_time
    
    async with semaphore:
        async with call_lock:
            now = asyncio.get_running_loop().time()
            wait_time = MIN_DELAY_BETWEEN_CALLS - (now - last_call_time)
            if wait_time > 0:
                await asyncio.sleep(wait_time)
            last_call_time = asyncio.get_running_loop().time()
        
        response = await client.chat.completions.create(
            model=model,
            messages=messages,
        )
        return response

This approach is like a governor on an engine. Your agent might want to go faster, but the throttle keeps it at a sustainable speed. You'll never see a 429 error from RPM again if you tune MIN_DELAY_BETWEEN_CALLS correctly.

For more sophisticated control, the aiolimiter library gives you a proper token bucket implementation:

from aiolimiter import AsyncLimiter

# Allow 8 requests per 60 seconds (leaving 2 RPM headroom)
rate_limiter = AsyncLimiter(8, 60)

async def rate_limited_call(client, messages, model="ocl-4-agent"):
    async with rate_limiter:
        response = await client.chat.completions.create(
            model=model,
            messages=messages,
        )
        return response

Clean, simple, and it works. The AsyncLimiter automatically handles the timing for you.
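If you're curious what a limiter like that does under the hood (or want to avoid the dependency), here's a minimal token-bucket sketch. It's an illustration of the idea, not aiolimiter's actual implementation:

```python
import asyncio
import time

class TokenBucket:
    """Minimal token bucket: allows `rate` acquisitions per `period` seconds."""

    def __init__(self, rate, period):
        self.capacity = rate
        self.tokens = float(rate)
        self.refill_per_sec = rate / period
        self.last = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self):
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill tokens based on elapsed time, up to capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.refill_per_sec)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough for one token to accumulate.
                await asyncio.sleep((1 - self.tokens) / self.refill_per_sec)

async def demo():
    bucket = TokenBucket(rate=3, period=1)  # tight window so the demo runs fast
    start = time.monotonic()
    for _ in range(6):
        await bucket.acquire()
    return time.monotonic() - start

elapsed = asyncio.run(demo())
print(f"6 calls took {elapsed:.2f}s")  # first 3 admitted instantly, the rest wait
```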

Strategy 3: Token Budget Tracking (For TPM Limits)

RPM is the limit that's easy to understand and easy to control. TPM is the sneaky one.

Your agent's context window grows with every step. By step 15, you might be sending 8,000 tokens of conversation history with every request. If those requests are also generating 2,000 tokens of output each, you're burning 10,000 tokens per call. At 60,000 TPM, you can only make 6 calls per minute before you hit the token limit — even if your RPM limit says 10.

The fix is to track token usage proactively:

import asyncio
import time

import tiktoken

class TokenBudgetTracker:
    def __init__(self, tpm_limit=60000, window_seconds=60):
        self.tpm_limit = tpm_limit
        self.window_seconds = window_seconds
        self.usage_log = []  # list of (timestamp, token_count)

    def _clean_old_entries(self):
        cutoff = time.time() - self.window_seconds
        self.usage_log = [(t, c) for t, c in self.usage_log if t > cutoff]

    def current_usage(self):
        self._clean_old_entries()
        return sum(c for _, c in self.usage_log)

    def can_send(self, estimated_tokens):
        # Keep a 15% buffer under the hard limit.
        return self.current_usage() + estimated_tokens < self.tpm_limit * 0.85

    def record_usage(self, tokens_used):
        self.usage_log.append((time.time(), tokens_used))

    async def wait_for_capacity(self, estimated_tokens):
        while not self.can_send(estimated_tokens):
            await asyncio.sleep(2)

def estimate_tokens(messages, model="ocl-4-agent"):
    """Rough estimate of input tokens."""
    enc = tiktoken.encoding_for_model("gpt-4")  # Use closest available encoding
    total = 0
    for msg in messages:
        total += len(enc.encode(msg.get("content", ""))) + 4  # overhead per message
    total += 2  # reply priming
    return total + 500  # buffer for expected output

Now before every LLM call, you check:

tracker = TokenBudgetTracker(tpm_limit=60000)

async def smart_openclaw_call(client, messages, model="ocl-4-agent"):
    est_tokens = estimate_tokens(messages, model)
    await tracker.wait_for_capacity(est_tokens)
    
    async with rate_limiter:
        response = await client.chat.completions.create(
            model=model,
            messages=messages,
        )
        actual_tokens = response.usage.total_tokens
        tracker.record_usage(actual_tokens)
        return response

This is the difference between "my agent works in testing" and "my agent works in production." Token budgeting is boring, unsexy infrastructure work, and it's absolutely essential.

Strategy 4: State Persistence (So Crashes Don't Kill You)

Even with all the throttling and backoff in the world, things will occasionally fail. Network hiccups, platform outages, unexpected error codes, your laptop going to sleep during a long run — the list is long.

The answer is to save agent state after every successful step so you can resume from where you left off:

import json
import os

class AgentCheckpointer:
    def __init__(self, checkpoint_dir="./checkpoints"):
        os.makedirs(checkpoint_dir, exist_ok=True)
        self.checkpoint_dir = checkpoint_dir
    
    def save(self, run_id, step, state):
        path = os.path.join(self.checkpoint_dir, f"{run_id}_step_{step}.json")
        with open(path, "w") as f:
            json.dump({
                "run_id": run_id,
                "step": step,
                "messages": state.get("messages", []),
                "tool_results": state.get("tool_results", []),
                "metadata": state.get("metadata", {}),
            }, f, indent=2)
        return path
    
    def load_latest(self, run_id):
        files = [f for f in os.listdir(self.checkpoint_dir) if f.startswith(run_id)]
        if not files:
            return None
        latest = sorted(files, key=lambda f: int(f.split("_step_")[1].split(".")[0]))[-1]
        with open(os.path.join(self.checkpoint_dir, latest)) as f:
            return json.load(f)

In your agent loop:

checkpointer = AgentCheckpointer()

async def run_agent(run_id, task):
    # Try to resume from checkpoint
    existing = checkpointer.load_latest(run_id)
    if existing:
        state = existing
        start_step = existing["step"] + 1
        print(f"Resuming from step {start_step}")
    else:
        state = {"messages": [{"role": "user", "content": task}], "tool_results": [], "metadata": {}}
        start_step = 0
    
    for step in range(start_step, 50):  # max 50 steps
        response = await smart_openclaw_call(client, state["messages"])
        
        # Process response, run tools, update state...
        state["messages"].append({"role": "assistant", "content": response.choices[0].message.content})
        
        # Save checkpoint after every successful step
        checkpointer.save(run_id, step, state)
        
        if is_task_complete(response):
            break
    
    return state

Now when your agent hits a rate limit that exceeds your retry budget, or the process crashes for any reason, you restart and it picks up right where it left off. No lost work. No re-running the first 14 steps to get back to step 15.

Strategy 5: Model Fallback

Sometimes the smartest thing to do when you're rate-limited on your primary model is to fall back to a lighter one. OpenClaw's model tiers have different rate limits, and the smaller models typically have much more generous allowances:

MODEL_FALLBACK_CHAIN = [
    "ocl-4-agent",      # Primary: most capable
    "ocl-4-mini",       # Fallback 1: faster, higher limits
    "ocl-3-turbo",      # Fallback 2: even higher limits
]

async def resilient_openclaw_call(client, messages):
    for model in MODEL_FALLBACK_CHAIN:
        try:
            return await smart_openclaw_call(client, messages, model=model)
        except OpenClawError as e:
            if "rate_limit" in str(e).lower() and model != MODEL_FALLBACK_CHAIN[-1]:
                print(f"Rate limited on {model}, falling back...")
                continue
            raise

This is especially useful for agent steps that don't require maximum reasoning capability — things like summarization, formatting, or simple tool-call routing can often be handled by a lighter model perfectly well.
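You can also route by task type up front instead of waiting for a 429. The task labels below are hypothetical; the point is that cheap steps never touch the expensive model:

```python
# Hypothetical routing table (task categories are illustrative).
TASK_MODEL_MAP = {
    "reasoning":     "ocl-4-agent",   # needs maximum capability
    "summarization": "ocl-4-mini",    # a lighter model handles this fine
    "formatting":    "ocl-3-turbo",
    "tool_routing":  "ocl-4-mini",
}

def pick_model(task_type):
    """Choose the lightest adequate model; unknown tasks get the primary."""
    return TASK_MODEL_MAP.get(task_type, "ocl-4-agent")

print(pick_model("summarization"))  # ocl-4-mini
print(pick_model("planning"))       # ocl-4-agent
```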

Putting It All Together

Here's the thing about all of this: it's a lot of code. Every piece is straightforward on its own, but wiring together exponential backoff, concurrency control, token budgeting, state persistence, and model fallback into a coherent system takes time. I know because I spent a weekend doing it, and then another weekend debugging edge cases I hadn't considered.

If you don't want to set all of this up manually, Felix's OpenClaw Starter Pack on Claw Mart includes a pre-built version of this entire rate limiting stack as one of its bundled skills. For $29, you get pre-configured retry logic, throttling, checkpointing, and fallback chains that work out of the box with OpenClaw's agent framework. I genuinely wish it had existed when I started — it would have saved me that entire debugging weekend and then some. It's the fastest way to go from "my agent crashes every run" to "my agent runs reliably in production," and the fact that someone packaged it up with sensible defaults for OpenClaw's specific rate limit tiers is exactly the kind of thing the ecosystem needs.

The Cheat Sheet

For those who just want the quick reference, here's what to implement and in what order:

Level 1 (Do this today):

  • Add tenacity retry with exponential backoff + jitter to every LLM call.
  • Set stop_after_attempt(8) and max=120 seconds.

Level 2 (Do this before going to production):

  • Add concurrency control with asyncio.Semaphore or aiolimiter.
  • Set your rate limiter to 80–85% of your actual RPM limit to leave headroom.

Level 3 (Do this for any agent that runs more than 5 steps):

  • Implement state persistence / checkpointing after every step.
  • Add resume-from-checkpoint logic to your agent loop.

Level 4 (Do this for production systems):

  • Add token budget tracking to stay under TPM limits.
  • Implement model fallback chains.
  • Add monitoring/logging so you can see when and why rate limits are being hit.
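For that last bullet, even a tiny counter goes a long way. Here's a sketch (the error-message strings are placeholders, not OpenClaw's actual wording):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rate_limits")

class RateLimitStats:
    """Records every 429 so you can see when and why you're being throttled."""

    def __init__(self):
        self.hits = []  # list of (timestamp, model, message)

    def record(self, model, message):
        self.hits.append((time.time(), model, message))
        log.warning("429 on %s (%d total): %s", model, len(self.hits), message)

    def hits_in_window(self, seconds=300):
        """Count rate-limit hits in the last `seconds` seconds."""
        cutoff = time.time() - seconds
        return sum(1 for t, _, _ in self.hits if t > cutoff)

stats = RateLimitStats()
stats.record("ocl-4-agent", "RPM limit reached")   # placeholder message
stats.record("ocl-4-agent", "TPM limit reached")   # placeholder message
print(stats.hits_in_window())  # 2
```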

Level 5 (Or just do this):

  • Skip the manual wiring: grab Felix's OpenClaw Starter Pack on Claw Mart, which bundles all of the above (retry logic, throttling, checkpointing, and fallback chains) pre-configured.

What Comes Next

Rate limits are annoying, but they're a solved problem if you're willing to engineer around them. The patterns above aren't theoretical — they're what people running production OpenClaw agents actually use.

The larger lesson here is one that applies to all agent development: agents fail. Constantly. The difference between a demo and a product is how gracefully you handle the failures. Rate limits are just the most common and most predictable failure mode. Once you've built robust handling for them, you'll find yourself naturally applying the same patterns (retry, checkpoint, fallback) to tool failures, context window overflow, and hallucination recovery.

Start with the tenacity wrapper. Add the semaphore. Implement checkpointing. Track your tokens. And if you want to skip the yak-shaving and get straight to building the interesting parts of your agent, the starter pack exists for exactly that reason.

Go build something that doesn't crash at step 18.
