March 21, 2026 · 8 min read · Claw Mart Team

Optimizing LLM Costs When Running Multiple OpenClaw Agents

If you've ever woken up to a surprise LLM bill that made you question every life decision that led to this moment, congratulations — you're running AI agents. Welcome to the club.

Here's the thing nobody tells you when you spin up your first OpenClaw agent: one agent is manageable. Two agents are fine. But the moment you're running five, ten, or twenty agents handling different workflows — claw game inventory management, customer interactions, pricing optimization, whatever — your LLM costs stop being a line item and start being a lifestyle.

I've spent the last several months building and optimizing multi-agent setups on OpenClaw, and I've made basically every expensive mistake you can make. Agents looping infinitely at 2 AM. GPT-4-class models being used to do what amounts to a string comparison. Context windows stuffed with repeated information that nobody — human or machine — needed to see twice.

The good news: most of these problems are fixable, and the fixes aren't that complicated. You just need to know where the money is actually going and apply a handful of patterns that the community has converged on. Let me walk you through exactly what I do.

First, Understand Where Your Money Actually Goes

Before you optimize anything, you need visibility. This is the number one mistake I see people make — they try to cut costs without knowing what's expensive.

In a typical multi-agent OpenClaw setup, your costs break down across a few dimensions:

  • Input tokens (what you send to the model): system prompts, conversation history, tool descriptions, retrieved context
  • Output tokens (what the model generates): responses, reasoning chains, tool calls, structured data
  • Model tier: the difference between a frontier model and a mini model can be 10-30x per token
  • Call volume: how many LLM calls each agent makes per task

Most people assume output tokens are the problem. Usually, it's input tokens — specifically, bloated context that gets re-sent on every single call in an agent loop. An agent that takes 8 steps to complete a task is sending your system prompt, tool definitions, and full conversation history 8 times. That adds up fast.
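To see why, here's the rough arithmetic (the token counts are illustrative, not measured — adjust them for your own prompts):

```python
# Rough illustration of how re-sent context compounds over an agent loop.
# Token counts are made-up but typical; plug in your own numbers.

SYSTEM_PROMPT_TOKENS = 1200   # system prompt + tool definitions, sent every call
TURN_TOKENS = 400             # each prior step adds ~400 tokens of history
STEPS = 8

total_input_tokens = 0
for step in range(STEPS):
    # Every step re-sends the system prompt plus all accumulated history
    total_input_tokens += SYSTEM_PROMPT_TOKENS + step * TURN_TOKENS

print(total_input_tokens)  # 20800 input tokens for a single 8-step task
# At $0.005 per 1K input tokens, that's about $0.10 per task -- before output.
```

Twenty thousand input tokens for one task, and most of it is the same text sent eight times.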

Here's a quick way to audit this in your OpenClaw setup. Add logging to track token usage per agent step:

from datetime import datetime

class CostTracker:
    def __init__(self, budget_limit=5.0):
        self.total_cost = 0.0
        self.budget_limit = budget_limit
        self.log = []

    def track_call(self, agent_name, step_name, model, input_tokens, output_tokens):
        # Approximate costs per 1K tokens (adjust for your models)
        pricing = {
            "gpt-4o": {"input": 0.005, "output": 0.015},
            "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
            "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
            "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
        }

        rates = pricing.get(model, {"input": 0.005, "output": 0.015})
        call_cost = (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])
        self.total_cost += call_cost

        self.log.append({
            "timestamp": datetime.now().isoformat(),
            "agent": agent_name,
            "step": step_name,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": round(call_cost, 6),
            "cumulative": round(self.total_cost, 4)
        })

        if self.total_cost >= self.budget_limit:
            raise BudgetExceededError(
                f"Budget limit ${self.budget_limit} exceeded. "
                f"Current spend: ${self.total_cost:.4f}"
            )

        return call_cost

class BudgetExceededError(Exception):
    pass

Wire this into every LLM call your agents make. Within a day, you'll know exactly which agent, which step, and which model is eating your budget. I guarantee you'll be surprised by at least one thing you find.
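Wiring it in can be as thin as a wrapper around your call site. A sketch — `tracked_llm_call` and `fake_llm_call` are names I'm making up here; `tracker` is any object with the `track_call` signature above, and the response shape is an assumption about your client:

```python
# A thin wrapper so no LLM call can bypass tracking. `tracker` is any object
# with a CostTracker-style track_call method; `llm_call` is your real client
# function. Both are passed in, so the sketch stays generic.

def tracked_llm_call(tracker, llm_call, agent_name, step_name, model, messages):
    response = llm_call(model=model, messages=messages)
    tracker.track_call(
        agent_name=agent_name,
        step_name=step_name,
        model=model,
        input_tokens=response["usage"]["input_tokens"],
        output_tokens=response["usage"]["output_tokens"],
    )
    return response

# Stand-in client for demonstration only; swap in your real API call.
def fake_llm_call(model, messages):
    return {"text": "ok", "usage": {"input_tokens": 120, "output_tokens": 30}}
```

Route every agent's calls through one function like this and there's no way for a step to sneak past your accounting.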

The Tiered Model Strategy (This Is the Big One)

This single pattern will probably cut your costs by 50-70%. I'm not exaggerating.

The idea is simple: stop using the same model for everything. Most agent steps don't need frontier-level intelligence. They need to follow instructions, parse some data, or make a straightforward decision. That's mini-model territory.

Here's how I structure it for OpenClaw agents:

Tier 1 — Cheap and fast (80% of calls): Use gpt-4o-mini or claude-3-haiku for:

  • Routing and classification ("Is this a pricing question or an inventory question?")
  • Data extraction and parsing
  • Simple tool selection
  • Format conversion
  • Validation checks

Tier 2 — Mid-range (15% of calls): Use gpt-4o or claude-3-5-sonnet for:

  • Complex reasoning that requires nuance
  • Multi-step planning
  • Final response synthesis when quality matters
  • Edge cases that cheaper models consistently get wrong

Tier 3 — Heavy artillery (5% of calls): Use claude-3-opus or equivalent only for:

  • Tasks where you've verified cheaper models fail
  • High-stakes outputs (financial calculations, legal-adjacent text)
  • Complex multi-document analysis

Here's a practical router implementation:

class ModelRouter:
    def __init__(self):
        self.routing_rules = {
            "classify": "gpt-4o-mini",
            "extract": "gpt-4o-mini",
            "validate": "gpt-4o-mini",
            "route": "gpt-4o-mini",
            "plan": "gpt-4o",
            "synthesize": "gpt-4o",
            "reason": "gpt-4o",
            "complex_analysis": "claude-3-5-sonnet",
        }

    def get_model(self, step_type, fallback="gpt-4o-mini"):
        return self.routing_rules.get(step_type, fallback)

    def call_with_routing(self, step_type, messages, cost_tracker, agent_name):
        model = self.get_model(step_type)

        # Make your LLM call here with the selected model
        response = call_llm(model=model, messages=messages)

        cost_tracker.track_call(
            agent_name=agent_name,
            step_name=step_type,
            model=model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens
        )

        return response

The key insight: you're not degrading quality. You're matching capability to requirement. A mini model classifying "is this message about inventory or pricing?" is going to get that right 99% of the time. Paying 30x more for a frontier model to do the same classification is just lighting money on fire.

Caching: The Second Biggest Win

If your agents repeatedly handle similar queries or process similar data patterns, you're probably paying for the same computation over and over. There are two caching strategies worth implementing:

Exact-Match Caching

For deterministic operations — looking up the same product info, running the same tool with the same parameters, answering the same FAQ — use a simple hash-based cache:

import hashlib
import json

class ExactCache:
    def __init__(self):
        self.cache = {}

    def _hash_key(self, model, messages):
        content = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, model, messages):
        key = self._hash_key(model, messages)
        return self.cache.get(key)

    def set(self, model, messages, response):
        key = self._hash_key(model, messages)
        self.cache[key] = response

    def cached_call(self, model, messages, cost_tracker, agent_name, step_name):
        cached = self.get(model, messages)
        if cached is not None:
            # Log the hit as a zero-cost call
            cost_tracker.track_call(agent_name, step_name + "_cached", model, 0, 0)
            return cached

        response = call_llm(model=model, messages=messages)
        self.set(model, messages, response)
        cost_tracker.track_call(
            agent_name, step_name, model,
            response.usage.input_tokens, response.usage.output_tokens
        )
        return response

For production, swap the in-memory dict for Redis. But even this simple version will catch more repeat calls than you'd expect.

Semantic Caching

This is for "close enough" matches. Customer asks "how much does the blue claw machine cost?" and ten minutes later another asks "what's the price of the blue claw machine?" — same question, different words. Semantic caching uses embeddings to catch these:

import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.entries = []  # List of (embedding, response) tuples
        self.threshold = similarity_threshold

    def _get_embedding(self, text):
        # Use a cheap embedding model
        response = call_embedding(model="text-embedding-3-small", input=text)
        return np.array(response.data[0].embedding)

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def find_similar(self, query):
        query_embedding = self._get_embedding(query)
        best_score = 0
        best_response = None

        for stored_embedding, stored_response in self.entries:
            score = self._cosine_similarity(query_embedding, stored_embedding)
            if score > best_score:
                best_score = score
                best_response = stored_response

        if best_score >= self.threshold:
            return best_response
        return None

    def store(self, query, response):
        embedding = self._get_embedding(query)
        self.entries.append((embedding, response))

Set that similarity threshold conservatively at first (0.95+). You can lower it as you verify quality isn't degrading.
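To build intuition for where that threshold sits, here's a toy comparison with hand-made three-dimensional vectors — real embeddings have hundreds of dimensions, but the math is identical:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins: two near-identical "paraphrase" vectors and one unrelated one.
blue_price_1 = np.array([0.9, 0.1, 0.05])
blue_price_2 = np.array([0.88, 0.12, 0.06])   # slight rewording, same question
refund_query = np.array([0.05, 0.1, 0.9])     # different topic entirely

THRESHOLD = 0.95
print(cosine_similarity(blue_price_1, blue_price_2) >= THRESHOLD)  # True: cache hit
print(cosine_similarity(blue_price_1, refund_query) >= THRESHOLD)  # False: cache miss
```

Paraphrases land well above 0.95; unrelated queries land far below it. The danger zone is the middle — related-but-different questions — which is exactly why you start conservative.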

Kill the Runaway Agents

This is the one that causes real financial pain. An agent hits an error, retries, hits the error again, retries with more context, and suddenly it's been looping for 45 minutes burning through tokens at 3 AM.

Non-negotiable safeguards:

class AgentGuardrails:
    def __init__(self, max_steps=15, max_cost=2.0, max_time_seconds=120):
        self.max_steps = max_steps
        self.max_cost = max_cost
        self.max_time = max_time_seconds

    def check(self, current_step, current_cost, elapsed_seconds):
        if current_step >= self.max_steps:
            return "STOP", f"Max steps ({self.max_steps}) reached"
        if current_cost >= self.max_cost:
            return "STOP", f"Budget limit (${self.max_cost}) reached"
        if elapsed_seconds >= self.max_time:
            return "STOP", f"Time limit ({self.max_time}s) reached"
        return "CONTINUE", None

Put this check before every LLM call. No exceptions. I don't care if you think your agent logic is bulletproof — it's not. I've seen perfectly reasonable-looking agent loops turn into $200 mistakes because of a malformed API response that the agent kept trying to "fix" by asking the LLM to reason about it.

Also: implement graceful degradation instead of hard stops where possible. When an agent approaches its budget limit, switch it to a cheaper model rather than killing it mid-task. A slightly worse answer is almost always better than no answer.
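The degradation step can live right next to the guardrail check. A minimal sketch — the model names and the 80% soft threshold are illustrative defaults, not magic numbers:

```python
# Soft-degrade before hard-stopping: once spend crosses a fraction of the
# budget, drop to a cheaper model; only stop at the hard limit itself.
# The model names and the 0.8 soft fraction are assumptions to tune.

def pick_model_for_budget(preferred_model, current_cost, max_cost,
                          cheap_model="gpt-4o-mini", soft_fraction=0.8):
    if current_cost >= max_cost:
        return None  # hard stop: caller should wrap up the run gracefully
    if current_cost >= soft_fraction * max_cost:
        return cheap_model  # degrade instead of dying mid-task
    return preferred_model

print(pick_model_for_budget("gpt-4o", current_cost=0.50, max_cost=2.0))  # gpt-4o
print(pick_model_for_budget("gpt-4o", current_cost=1.70, max_cost=2.0))  # gpt-4o-mini
print(pick_model_for_budget("gpt-4o", current_cost=2.10, max_cost=2.0))  # None
```

Call this where you'd otherwise pick the model directly, and the budget limit becomes a gradual downshift instead of a cliff.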

Trim Your Context Windows

Every token you send that the model doesn't need to see is money wasted. Here are the specific things I trim:

1. Summarize conversation history instead of sending it raw.

After 5-6 turns in an agent loop, summarize the prior turns into a compact paragraph and replace the full history. This alone can cut input tokens by 40-60% on longer agent runs.

def compress_history(messages, summarizer_model="gpt-4o-mini"):
    if len(messages) <= 6:
        return messages

    # Keep system prompt and last 2 exchanges
    system = [m for m in messages if m["role"] == "system"]
    recent = messages[-4:]  # Last 2 user/assistant pairs
    middle = messages[len(system):-4]
    if not middle:
        return messages

    # Summarize the middle with a cheap model
    summary_prompt = "Summarize this conversation concisely, preserving key decisions and data:\n\n"
    for m in middle:
        summary_prompt += f"{m['role']}: {m['content']}\n"

    summary = call_llm(
        model=summarizer_model,
        messages=[{"role": "user", "content": summary_prompt}]
    )

    return system + [{"role": "system", "content": f"Prior context summary: {summary.content}"}] + recent

2. Use structured outputs aggressively.

When an agent step needs to return data, force JSON output. Models generating free-form text will ramble. Models generating structured JSON stay concise. This reduces output tokens significantly.

3. Keep tool descriptions minimal.

If your agent has access to 10 tools, those tool descriptions are sent with every call. Make them as short as possible while remaining unambiguous. Cut examples from tool descriptions unless the model consistently misuses the tool without them.
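Point 2 above can be enforced mechanically: append a schema instruction to the prompt and validate what comes back. A minimal sketch — the helper names are mine, not an OpenClaw API:

```python
import json

def json_instruction(required_keys):
    # Appended to the prompt so the model returns compact JSON, not prose.
    keys = ", ".join(f'"{k}"' for k in required_keys)
    return (f"Respond with a single JSON object containing exactly these "
            f"keys: {keys}. No explanation, no markdown.")

def parse_structured(raw_text, required_keys):
    # Validate the model's reply; return None so the caller can retry
    # (ideally on a cheap model) instead of silently accepting junk.
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not all(k in data for k in required_keys):
        return None
    return data

print(parse_structured('{"sku": "CLAW-BLUE-01", "price": 1499}', ["sku", "price"]))
```

If your provider supports a native JSON or structured-output mode, prefer that over prompt instructions — but keep the validation step either way.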

Putting It All Together for Multi-Agent Setups

When you're running multiple OpenClaw agents, each handling different parts of your workflow, the combined pattern looks like this:

  1. Global cost tracker shared across all agents with per-agent and total budget limits
  2. Model router that assigns cheap models to simple steps, expensive models to hard steps
  3. Shared cache layer so Agent B doesn't re-compute what Agent A already figured out
  4. Per-agent guardrails with step limits, cost limits, and time limits
  5. Context compression that kicks in automatically as conversations grow

The multiplier effect of these patterns is real. In isolation, each one saves you 15-30%. Combined, I've consistently seen 60-80% cost reductions on multi-agent workloads without meaningful quality degradation.
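A compressed sketch of how those five pieces compose in one loop — every name here is illustrative (not an OpenClaw API), the routing table and limits are toy values, and `llm_call` stands in for a tracked client like the ones above:

```python
# How the pieces compose in one agent loop: guardrail check first, then
# cheap-by-default routing, then a shared cache, then the actual call.
# All names and values are illustrative stand-ins for the fuller versions above.

def run_agent(steps, routing, cache, max_steps, max_cost, llm_call):
    """steps: list of (step_type, prompt). routing: step_type -> model name.
    cache is a shared dict so other agents' results are reused for free."""
    cost, results = 0.0, []
    for i, (step_type, prompt) in enumerate(steps):
        if i >= max_steps or cost >= max_cost:       # per-agent guardrails
            return results, "stopped"
        model = routing.get(step_type, "gpt-4o-mini")  # cheap fallback model
        key = (model, prompt)
        if key in cache:                              # shared cache hit
            results.append(cache[key])
            continue
        response, call_cost = llm_call(model, prompt)
        cost += call_cost
        cache[key] = response
        results.append(response)
    return results, "done"
```

The shared `cache` dict is the piece people skip: pass the same one to every agent and Agent B's identical lookup costs nothing.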

Getting Started Without the Pain

If all of this sounds like a lot of infrastructure to build before you even get to the actual agent logic you care about — you're right. That's the honest truth about running agents in production. The LLM calls are the easy part. The cost management, observability, and guardrails are where the real engineering work lives.

This is exactly why I recommend starting with Felix's OpenClaw Starter Pack. It gives you a pre-configured OpenClaw setup with sensible defaults already baked in — the kind of defaults that took me weeks of expensive trial and error to arrive at. Instead of building all this plumbing from scratch, you get a working foundation and can focus on customizing the agent logic for your specific use case. If I were starting over today, it's what I'd grab first.

The Mindset Shift

The fundamental shift you need to make when running multiple agents is this: treat LLM calls like database queries, not like magic.

Nobody would write a web app that runs an unindexed full-table scan on every request and then complain that their database bill is too high. But that's exactly what most people do with LLM calls — send maximum context to the most expensive model with no caching, no routing, and no limits, then act surprised when the bill arrives.

Be intentional. Cache what you can. Route cheap tasks to cheap models. Set hard limits. Compress your context. Monitor everything.

The agents themselves are incredible — OpenClaw makes building sophisticated multi-agent workflows genuinely accessible. But "accessible" and "cheap by default" are different things. The platform gives you the power. Managing the costs is on you.

Next Steps

Here's what I'd do this week:

  1. Add basic cost tracking to every LLM call in your current setup. Even just logging to a file. You can't optimize what you can't measure.
  2. Identify your top 3 most expensive agent steps and ask yourself: does this actually need a frontier model?
  3. Implement a hard budget limit per agent run. Pick a number that feels generous. You can tighten it later.
  4. Set up exact-match caching on your most common tool calls. This is the lowest-effort, highest-impact change.
  5. Grab the OpenClaw Starter Pack if you want a head start on the infrastructure side.

The difference between an agent setup that costs $50/day and one that costs $8/day doing the same work usually isn't one big thing. It's twenty small decisions compounding. Start making those decisions deliberately, and your future self (and your credit card) will thank you.
