Reducing Token Usage to Prevent Context Overflow

If you've spent more than a weekend building agents on OpenClaw, you've hit the wall. You know exactly which wall I'm talking about: your agent hums along beautifully for eight, maybe twelve turns, and then it just... falls apart. It starts repeating actions it already took. It forgets the original goal you gave it. Or, worst case, you get the dreaded "context length exceeded" error and the whole run dies mid-task.
This isn't a bug. It's physics. Every LLM has a finite context window, and agents are ravenous consumers of tokens. Every system prompt, every user message, every tool call, every observation your agent receives: all of it piles up inside that window until there's no room left. And the failure mode isn't graceful. It's sudden, expensive, and maddeningly hard to debug.
I've burned more hours (and more API credits) on this problem than I'd like to admit. But after a lot of trial and error, I've landed on a set of patterns inside OpenClaw that actually work, not just in demos but in production agents that run for dozens of turns without collapsing. This post is everything I've learned about reducing token usage and preventing context overflow, laid out so you can implement it this week.
Why Context Overflow Happens (And Why It's Worse Than You Think)
Before we fix anything, let's be honest about the mechanics. A typical OpenClaw agent turn involves:
- System prompt: 500–2,000 tokens (sometimes more if you're doing few-shot examples)
- User message: 50–500 tokens
- Chat history: grows linearly with every turn
- Tool calls: each function call definition adds tokens
- Tool observations: this is the silent killer; a single API response or document retrieval can dump 2,000–10,000 tokens into context per call
Do the math on a 15-turn agent run with three tool calls per turn, each returning a moderate observation. You're easily looking at 60,000–80,000 tokens of accumulated context. Even with a 128k window, you're burning through half of it. And here's the part people miss: quality degrades long before you hit the hard limit.
This is the "lost in the middle" problem that's well-documented in the research literature. LLMs pay the most attention to the beginning and end of their context window. Everything in the middle (which is where your critical early instructions and important mid-run observations live) gets progressively ignored. So your agent doesn't just run out of room. It gets dumber as the context fills up, even when it's technically within the token limit.
That's why the fix isn't "just use a model with a bigger window." The fix is using fewer tokens more intelligently.
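The accumulation math above is worth sanity-checking yourself. Here's a back-of-envelope sketch; the per-call figures are illustrative averages I've chosen for the example, not measurements:

```python
# Back-of-envelope context growth for a 15-turn run
turns = 15
tool_calls_per_turn = 3
avg_observation_tokens = 1_500  # a "moderate" observation (illustrative)
per_turn_overhead = 300         # user message + assistant reply (illustrative)

observation_tokens = turns * tool_calls_per_turn * avg_observation_tokens
overhead_tokens = turns * per_turn_overhead
total = observation_tokens + overhead_tokens
print(total)  # 72,000 tokens of accumulated context
```

Notice that observations dominate: they account for over 90% of the total, which is why they're the first thing to attack.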
The OpenClaw Toolkit for Context Management
OpenClaw gives you several levers for managing context. Most people only use one (or none). The trick is combining them. Here's the full stack, ordered from simplest to most powerful.
1. Trim Your Tool Observations Ruthlessly
This is the single highest-ROI change you can make, and it takes five minutes.
By default, most people let their tools return whatever they return β full JSON payloads, complete documents, raw API responses. This is insane. Your agent doesn't need the full response. It needs the relevant parts of the response.
In your OpenClaw tool definitions, add a post-processing step that extracts only what matters:
# BAD: returning the full API response
@tool
def search_inventory(query: str) -> str:
    response = api.search(query)
    return json.dumps(response)  # Could be 5,000+ tokens

# GOOD: returning only what the agent needs
@tool
def search_inventory(query: str) -> str:
    response = api.search(query)
    results = response.get("items", [])[:5]  # Top 5 only
    trimmed = [
        {"name": item["name"], "price": item["price"], "in_stock": item["available"]}
        for item in results
    ]
    return json.dumps(trimmed)  # ~200 tokens
That's a 25x reduction in token usage from a single tool call. Over a 15-turn run with multiple tool invocations, this saves you tens of thousands of tokens.
Rule of thumb: No tool observation should ever exceed 500 tokens. If it does, you're returning too much. Summarize, truncate, or restructure.
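A cheap way to enforce that rule across every tool is a wrapper that caps observations before they ever reach the model. This is a sketch using the rough ~4 characters-per-token heuristic for English text; the function name and threshold are mine, not an OpenClaw API:

```python
def cap_observation(text: str, max_tokens: int = 500) -> str:
    """Hard-cap a tool observation at an approximate token budget.

    Uses the rough ~4 characters/token heuristic; swap in a real
    tokenizer (e.g. tiktoken) if you need exact counts.
    """
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[...truncated]"
```

Run every tool's return value through this (or something like it) so an unexpectedly huge payload can never flood the context, even when your per-tool trimming misses a case.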
2. Implement Sliding Window History
The naive approach is keeping every message in history forever. The smarter approach is a sliding window that keeps the last N turns and discards the rest.
In OpenClaw, you can implement this in your agent's message management:
class ManagedConversation:
    def __init__(self, max_turns=10):
        self.max_turns = max_turns
        self.system_prompt = None
        self.messages = []

    def set_system_prompt(self, prompt: str):
        self.system_prompt = {"role": "system", "content": prompt}

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Keep only the last N turns (each turn = user + assistant)
        if len(self.messages) > self.max_turns * 2:
            self.messages = self.messages[-(self.max_turns * 2):]

    def get_context(self):
        # System prompt always stays (never gets windowed out)
        context = [self.system_prompt] if self.system_prompt else []
        context.extend(self.messages)
        return context
This is a huge improvement over keeping everything, but it has an obvious flaw: you lose information from early turns. Which brings us to the real solution.
3. Summarize-and-Compress: The Pattern That Actually Works
The most reliable pattern I've found on OpenClaw combines a sliding window with periodic summarization. Instead of just dropping old messages, you summarize them into a compressed "memory" block that stays in context.
Here's the full implementation:
class SmartMemoryManager:
    def __init__(self,
                 max_recent_turns=6,
                 summarize_every=4,
                 max_summary_tokens=300):
        self.system_prompt = None
        self.running_summary = ""
        self.recent_messages = []
        self.max_recent_turns = max_recent_turns
        self.summarize_every = summarize_every
        self.max_summary_tokens = max_summary_tokens
        self.turn_count = 0

    def add_turn(self, user_msg: str, assistant_msg: str,
                 tool_results: list = None):
        self.recent_messages.append({"role": "user", "content": user_msg})
        if tool_results:
            for result in tool_results:
                self.recent_messages.append({
                    "role": "tool",
                    "content": self._truncate_observation(result)
                })
        self.recent_messages.append({
            "role": "assistant",
            "content": assistant_msg
        })
        self.turn_count += 1
        # Trigger summarization when we accumulate enough turns
        if self.turn_count % self.summarize_every == 0:
            self._summarize_old_messages()

    def _truncate_observation(self, observation: str, max_chars=800) -> str:
        """Hard truncation as a safety net."""
        if len(observation) > max_chars:
            return observation[:max_chars] + "\n[...truncated]"
        return observation

    def _summarize_old_messages(self):
        """Compress older messages into the running summary."""
        if len(self.recent_messages) <= self.max_recent_turns * 2:
            return
        # Messages to summarize (everything beyond the recent window)
        cutoff = -(self.max_recent_turns * 2)
        old_messages = self.recent_messages[:cutoff]
        self.recent_messages = self.recent_messages[cutoff:]
        # Build the summarization prompt
        history_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in old_messages
        )
        summary_prompt = f"""Summarize the following conversation history
into a concise paragraph. Focus on: decisions made, information gathered,
current progress toward the goal, and any constraints discovered.

Previous summary: {self.running_summary or 'None yet.'}

New messages to incorporate:
{history_text}

Write a concise updated summary (max {self.max_summary_tokens} tokens):"""
        # Use OpenClaw to generate the summary
        self.running_summary = openclaw.complete(
            prompt=summary_prompt,
            max_tokens=self.max_summary_tokens,
            temperature=0.2  # Low temp for factual summarization
        )

    def get_context(self):
        context = []
        if self.system_prompt:
            context.append({"role": "system", "content": self.system_prompt})
        # Inject the running summary as context
        if self.running_summary:
            context.append({
                "role": "system",
                "content": f"[Conversation Summary]: {self.running_summary}"
            })
        context.extend(self.recent_messages)
        return context
This is the pattern. Your agent always has:
- The system prompt (pinned, never summarized)
- A compressed summary of everything that happened before the recent window
- The last N turns in full detail
The result? Relatively flat token usage regardless of how many turns the agent runs. I've used this to run 50+ turn agents on OpenClaw that stay sharp the entire time.
4. Design Smaller, Focused Agents
This is an architectural change, not a code trick, but it's the most important thing on this list.
One giant agent that does everything (researches, plans, executes, validates, reports) will always blow up its context window. Instead, build a pipeline of smaller agents, each with a focused job and a clean context:
# Instead of one mega-agent:
# research_and_plan_and_execute_and_validate_agent()

# Build a pipeline:
def run_pipeline(task: str):
    # Agent 1: Research (starts with clean context)
    research = openclaw.run_agent(
        system_prompt="You are a research agent. Find relevant information.",
        task=task,
        tools=[search_tool, fetch_tool],
        max_turns=8
    )
    # Extract only the findings (not the full conversation)
    findings = research.final_output  # Concise result

    # Agent 2: Plan (starts with clean context + research summary)
    plan = openclaw.run_agent(
        system_prompt="You are a planning agent. Create an action plan.",
        task=f"Based on this research: {findings}\n\nCreate a plan for: {task}",
        tools=[],
        max_turns=4
    )

    # Agent 3: Execute (clean context + plan only)
    result = openclaw.run_agent(
        system_prompt="You are an execution agent. Follow the plan precisely.",
        task=f"Execute this plan: {plan.final_output}",
        tools=[execute_tool, validate_tool],
        max_turns=10
    )
    return result
Each agent starts fresh. No accumulated garbage. No "lost in the middle" degradation. The only information that passes between agents is a concise summary of what the previous one found. This is how you build agents that actually work at scale on OpenClaw.
5. Token Budgeting: Know Your Numbers
You can't manage what you don't measure. Build token counting into your agent loop:
import tiktoken

class TokenBudget:
    def __init__(self, model="gpt-4", max_context=128000, reserve=4096):
        self.encoder = tiktoken.encoding_for_model(model)
        self.max_usable = max_context - reserve  # Reserve room for the response

    def count_tokens(self, messages: list) -> int:
        total = 0
        for msg in messages:
            total += len(self.encoder.encode(msg["content"]))
            total += 4  # Per-message overhead tokens
        return total

    def check_budget(self, messages: list) -> dict:
        used = self.count_tokens(messages)
        remaining = self.max_usable - used
        utilization = used / self.max_usable
        return {
            "used": used,
            "remaining": remaining,
            "utilization": f"{utilization:.1%}",
            "should_summarize": utilization > 0.6,  # Trigger at 60%
            "critical": utilization > 0.85
        }
Wire this into your agent loop. When utilization crosses 60%, trigger summarization. When it crosses 85%, force-trim tool observations and non-essential history. Never let it hit the wall.
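One way that wiring can look is a single pass over the message list before each model call. This is a sketch, not an OpenClaw API: `count_tokens` and `summarize_fn` are assumed callables you'd supply (e.g. the TokenBudget encoder and a cheap summarization call), and the 60% threshold matches the budget check above:

```python
def enforce_budget(messages, count_tokens, summarize_fn,
                   max_usable=124_000, keep_recent=4):
    """One pass of the 60% rule before each model call.

    Assumed callables (not OpenClaw APIs):
      count_tokens(text) -> int
      summarize_fn(list_of_messages) -> str  (compressed summary)
    """
    used = sum(count_tokens(m["content"]) for m in messages)
    if used / max_usable <= 0.6:
        return messages  # plenty of headroom, send as-is
    # Over 60%: compress everything between the system prompt
    # and the most recent messages into a single summary entry.
    head = messages[:1]                 # system prompt stays pinned
    middle = messages[1:-keep_recent]   # candidates for compression
    tail = messages[-keep_recent:]      # recent turns stay verbatim
    if not middle:
        return messages
    summary = {"role": "system",
               "content": f"[Summary]: {summarize_fn(middle)}"}
    return head + [summary] + tail
```

Because it returns a fresh list rather than mutating in place, you can log both versions and watch exactly how much each compression pass saves.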
The "Lost in the Middle" Fix
Even with all the above, you should structure your context to fight attention degradation. The rule is simple: put the most important information at the very beginning and the very end of your context.
In practice, this means:
- Beginning: System prompt + goal + constraints + running summary
- Middle: Historical messages (least critical, most likely to be partially ignored)
- End: Most recent messages + current task
Your get_context() method should enforce this ordering. The conversation summary goes right after the system prompt (beginning), and recent messages are always last (end). The middle, where attention is weakest, holds only messages you could afford to lose.
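Concretely, the assembly can be as simple as making the ordering explicit in one function. This is a sketch; the message-dict shape mirrors the classes earlier in the post, and the function name is mine:

```python
def ordered_context(system_prompt, summary, old_history, recent, current_task):
    """Put high-value content at the edges; expendable history in the middle."""
    context = [{"role": "system", "content": system_prompt}]  # beginning
    if summary:
        context.append({"role": "system",
                        "content": f"[Conversation Summary]: {summary}"})
    context.extend(old_history)   # middle: cheapest to lose
    context.extend(recent)        # end: full recent detail
    context.append({"role": "user", "content": current_task})  # very end
    return context
```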
Getting Started Without Reinventing the Wheel
If this all feels like a lot to implement from scratch, you're right, and that's exactly why I'd point you toward Felix's OpenClaw Starter Pack. It bundles pre-built patterns for agent development on OpenClaw, including memory management scaffolding, so you're not writing your own SmartMemoryManager from scratch on day one. It's the fastest way to go from "my agent dies after 10 turns" to "my agent runs reliably in production." If you're serious about building on OpenClaw and you don't want to spend your first two weeks debugging context overflow, just start there.
The Checklist
Before you close this tab, here's the implementation order I'd recommend. Each step compounds on the last:
- Truncate tool observations (30 minutes, immediate impact)
- Add token counting to your agent loop (1 hour, gives you visibility)
- Implement sliding window + summarization (2–3 hours, solves 80% of overflow issues)
- Refactor monolithic agents into focused pipelines (half a day, solves the remaining 20%)
- Structure context to fight "lost in the middle" (1 hour, improves quality even when tokens aren't an issue)
Context overflow isn't a fundamental limitation. It's an engineering problem with known solutions. The agents that work in production aren't the ones with the biggest context windows; they're the ones that use their context windows most intelligently. Start building that way on OpenClaw and you'll stop fighting the wall entirely.