March 20, 2026 · 9 min read · Claw Mart Team

Changing LLM Models in OpenClaw: Claude, GPT-4o or Local?

Look, I'm going to save you the thirty minutes of frustration I went through the first time I tried to swap models in OpenClaw. Because the documentation kind of assumes you already know what you're doing, and if you're reading this, you probably don't. That's fine. Neither did I.

Here's the situation: you've got an OpenClaw agent running. Maybe it's doing research, maybe it's handling customer support tickets, maybe it's doing something weird and creative that only makes sense to you. It works. But it's running on GPT-4o, and you want to try Claude. Or you're burning cash on API calls and want to test a local model. Or you heard Anthropic's latest Claude model is better at reasoning and you want to see for yourself.

Whatever the reason, changing the underlying LLM in OpenClaw is one of those things that should be simple but has about four gotchas that will eat your afternoon if you don't know about them upfront.

Let me walk you through all of it.

Why You'd Want to Switch Models in the First Place

Before we get into the how, let's talk about the why — because the reason you're switching actually determines how you should approach it.

Cost. This is the big one. GPT-4o is excellent but expensive at scale. If your agent is making dozens of tool calls per run, those tokens add up fast. I've seen people in the OpenClaw community burning $30-50/day on agents that could run for $3-5/day on a different model or a local setup. Claude 3.5 Sonnet, for example, often gives comparable quality at a lower per-token cost depending on your use case. And local models via Ollama? Essentially free after the initial hardware investment.

Quality for your specific task. Different models are better at different things. Claude tends to be stronger at following complex multi-step instructions and producing well-structured output. GPT-4o is generally better at creative tasks and has a wider knowledge base. Local models like Llama 3.1 or Mistral are surprisingly good for focused, narrow tasks where you don't need the full power of a frontier model. The point is: the best model for your agent depends entirely on what your agent does.

Reliability and fallbacks. This is the one most people don't think about until they get bitten. OpenAI has outages. Anthropic has outages. If your agent is running in production and the API goes down, you need a fallback. OpenClaw makes this possible, but only if you've set up multiple model configurations ahead of time.

Privacy and data control. If you're processing sensitive information, running a local model means your data never leaves your machine. For some use cases, this isn't optional — it's a requirement.

The Basics: How OpenClaw Handles Models

OpenClaw uses a model configuration layer that sits between your agent logic and the actual LLM provider. This is one of the things that makes it genuinely useful compared to writing raw API calls — your skills, tools, and agent graphs don't need to know or care which model is running underneath.

The core concept is the model config, which lives in your agent's configuration. Here's what a basic one looks like:

model:
  provider: openai
  model_name: gpt-4o
  temperature: 0.7
  max_tokens: 4096
  api_key: ${OPENAI_API_KEY}

Simple enough. This tells OpenClaw to use GPT-4o via the OpenAI API. The ${OPENAI_API_KEY} syntax pulls from your environment variables, which is the right way to handle secrets — never hardcode API keys.
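If you haven't set those variables up yet, the usual approach is to export them in your shell profile so every session picks them up (the values below are placeholders, not real keys):

```shell
# Add these to ~/.bashrc or ~/.zshrc so the keys survive new shell sessions.
# The values are placeholders -- substitute your actual keys.
export OPENAI_API_KEY="sk-your-key-here"
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
```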

Now here's where people get tripped up: changing the model isn't just changing the model_name field. Different providers have different parameter names, different tool-calling formats, and different quirks. OpenClaw abstracts most of this away, but you need to update the config correctly.

Switching to Claude

Let's say you want to move from GPT-4o to Claude 3.5 Sonnet. Here's the updated config:

model:
  provider: anthropic
  model_name: claude-3-5-sonnet-20241022
  temperature: 0.7
  max_tokens: 4096
  api_key: ${ANTHROPIC_API_KEY}

Three things changed: the provider, the model_name, and the api_key reference. That's the easy part.

Here's gotcha number one: Claude handles system prompts differently than OpenAI. OpenAI lets you pass a system message as part of the message array. Anthropic's API has a separate system parameter. OpenClaw handles this translation for you if you're using the standard skill format. But if you've written custom message construction in any of your skills, you'll need to check that your system prompts are being passed correctly.

Here's what proper skill-level system prompt handling looks like:

from openclaw.skills import Skill

class ResearchSkill(Skill):
    system_prompt = """You are a research assistant. 
    Find relevant information and summarize it concisely.
    Always cite your sources."""
    
    def run(self, query: str):
        response = self.llm.invoke(
            messages=[{"role": "user", "content": query}],
            system=self.system_prompt
        )
        return response.content

When you define the system prompt using the system parameter in the invoke call (or set it as a class attribute that OpenClaw picks up), the framework handles the provider-specific formatting. If you're doing something like manually constructing a messages list with {"role": "system", "content": "..."}, that'll work fine with OpenAI but might behave unpredictably with Anthropic. Fix it before you switch.
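If you want to sanity-check your own message construction, the translation happening under the hood is roughly this. This is a simplified standalone sketch — `normalize_for_anthropic` is my name for it, not a framework API:

```python
def normalize_for_anthropic(messages):
    """Split an OpenAI-style message list into Anthropic's shape:
    a separate system string plus the remaining chat messages."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), chat

system, chat = normalize_for_anthropic([
    {"role": "system", "content": "You are a research assistant."},
    {"role": "user", "content": "Summarize this paper."},
])
# system is now a plain string; chat contains only the user/assistant turns
```

If your skills build message lists manually, running them through a shim like this before the provider call is the safe pattern.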

Gotcha number two: tool calling format differences. If your agent uses tools (and if it doesn't, why are you using an agent framework?), the way tools are defined and called varies between providers. OpenAI uses function calling with a specific JSON schema. Anthropic uses a similar but not identical format. Again, OpenClaw's tool abstraction handles this — but only if you're using OpenClaw's @tool decorator or tool registration system:

from openclaw.tools import tool

@tool
def search_web(query: str, num_results: int = 5) -> str:
    """Search the web for information.
    
    Args:
        query: The search query
        num_results: Number of results to return
    """
    # Your search implementation
    results = perform_search(query, num_results)
    return format_results(results)

The docstring and type hints are critical here. OpenClaw uses them to generate the tool schema for whichever provider you're using. If your tool definitions are sloppy — missing type hints, vague docstrings — the model will struggle to use them correctly, and this problem gets worse when you switch providers because each model interprets ambiguous tool descriptions differently.
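To see why the hints matter, here's a rough sketch of the kind of schema generation a framework does from a function signature. This is simplified and illustrative — OpenClaw's actual internals will differ, and real implementations also parse the docstring's Args section for per-parameter descriptions:

```python
import inspect

# Map Python annotations to JSON-schema type names (simplified).
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(fn):
    """Derive a JSON-schema-style tool description from a signature."""
    sig = inspect.signature(fn)
    props, required = {}, []
    for name, param in sig.parameters.items():
        props[name] = {"type": PY_TO_JSON.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the model must supply it
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip().split("\n")[0],
        "parameters": {"type": "object", "properties": props, "required": required},
    }

def search_web(query: str, num_results: int = 5) -> str:
    """Search the web for information."""

schema = tool_schema(search_web)
```

Strip the type hint from `num_results` and the generated schema degrades to `"string"` — which is exactly the kind of ambiguity different models resolve differently.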

Switching to a Local Model

This is where things get interesting. Running a local model with OpenClaw means you get zero API costs and full data privacy, but the setup requires a few more steps.

The most common approach is using Ollama, which gives you a local API that's compatible with the OpenAI format. First, install Ollama and pull a model:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama3.1:8b

# Or for a more capable model if you have the VRAM
ollama pull llama3.1:70b

Then update your OpenClaw config:

model:
  provider: ollama
  model_name: llama3.1:8b
  base_url: http://localhost:11434
  temperature: 0.7
  max_tokens: 4096

No API key needed — it's running on your machine.

Here's gotcha number three, and this is the big one for local models: most local models are significantly worse at tool calling than GPT-4o or Claude. If your agent relies heavily on structured tool use — calling functions with specific parameters, interpreting the results, deciding what to call next — you'll likely see a quality drop with smaller local models.

The workaround is to use local models for the parts of your agent that don't require tool calling (summarization, drafting, simple Q&A) and keep a cloud model for the orchestration and tool-calling nodes. OpenClaw supports this with per-node model configuration in your agent graph:

from openclaw.graph import AgentGraph, Node

graph = AgentGraph()

# Orchestrator uses Claude for reliable tool calling
graph.add_node(
    Node(
        name="orchestrator",
        model_config={
            "provider": "anthropic",
            "model_name": "claude-3-5-sonnet-20241022"
        },
        skills=["research", "analysis"]
    )
)

# Writer uses local model to save costs
graph.add_node(
    Node(
        name="writer",
        model_config={
            "provider": "ollama",
            "model_name": "llama3.1:8b"
        },
        skills=["drafting"]
    )
)

graph.add_edge("orchestrator", "writer")

This hybrid approach is honestly the best of both worlds. Your expensive, decision-heavy work runs on a smart cloud model. Your cheap, high-volume text generation runs locally. I've seen people cut their costs by 60-70% this way without meaningful quality loss.
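Back-of-envelope math for that claim, using an invented but plausible workload and illustrative per-token prices (not current list prices — check your provider's pricing page):

```python
# Hypothetical daily workload: orchestration (1M in / 200K out tokens)
# must stay on a cloud model; drafting (2M in / 400K out) can run locally.
# Prices are illustrative placeholders in $/token.
CLOUD_IN, CLOUD_OUT = 3.00 / 1_000_000, 15.00 / 1_000_000

all_cloud = (3_000_000 * CLOUD_IN) + (600_000 * CLOUD_OUT)   # everything on cloud
hybrid = (1_000_000 * CLOUD_IN) + (200_000 * CLOUD_OUT)      # drafting moved local, ~$0

savings = 1 - hybrid / all_cloud
print(f"all-cloud ${all_cloud:.2f}/day, hybrid ${hybrid:.2f}/day, {savings:.0%} saved")
```

The exact percentage depends entirely on how much of your token volume is drafting versus orchestration, which is why your mileage will vary around that 60-70% figure.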

Setting Up Model Fallbacks

This is the production-readiness move that separates hobby projects from real systems. OpenClaw lets you define fallback chains so that if your primary model's API is down or returns an error, the agent automatically tries the next option:

model:
  primary:
    provider: anthropic
    model_name: claude-3-5-sonnet-20241022
    api_key: ${ANTHROPIC_API_KEY}
  fallback:
    - provider: openai
      model_name: gpt-4o
      api_key: ${OPENAI_API_KEY}
    - provider: ollama
      model_name: llama3.1:8b
      base_url: http://localhost:11434

With this setup, your agent tries Claude first, falls back to GPT-4o if Anthropic is down, and falls back to a local model as a last resort. The agent keeps running no matter what. Your users never see an error.
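Conceptually, a fallback chain is just a loop over configured providers. Here's a simplified standalone sketch of the pattern — not OpenClaw's actual internals, just the shape of the logic:

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the chain errored."""

def invoke_with_fallback(providers, prompt):
    """Try each (name, callable) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append((name, exc))  # record and try the next provider
    raise AllProvidersFailed(errors)

# Stub providers for illustration: the first simulates an outage.
def flaky(prompt):
    raise TimeoutError("provider down")

def healthy(prompt):
    return f"echo: {prompt}"

used, result = invoke_with_fallback(
    [("anthropic", flaky), ("openai", healthy)], "hello"
)
# used == "openai", result == "echo: hello"
```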

Gotcha number four: test your fallbacks regularly. It's not enough to set them up and forget about them. Models update, API formats change, and what worked last month might not work today. Set up a simple test that runs your agent against each configured model on a schedule. OpenClaw's built-in health check can do this:

from openclaw.health import ModelHealthCheck

checker = ModelHealthCheck(config_path="./agent_config.yaml")
results = checker.run_all()

for model, status in results.items():
    print(f"{model}: {'✅' if status.healthy else '❌'} ({status.latency_ms}ms)")

The Structured Output Problem

One more thing that bites people when switching models: structured output. If your skills expect JSON responses — and they probably should for anything that feeds into another skill or system — different models have different levels of reliability with JSON formatting.

GPT-4o with response_format: { type: "json_object" } is very reliable. Claude is also good but occasionally wraps JSON in markdown code blocks. Local models are... unpredictable.
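The markdown-wrapping issue in particular is easy to handle defensively. A standalone helper like this (my own, not part of any framework) recovers JSON whether or not the model fenced it:

```python
import json
import re

def parse_loose_json(text):
    """Parse JSON even if the model wrapped it in ```json ... ``` fences."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)  # keep only the fenced payload
    return json.loads(text)

parse_loose_json('```json\n{"summary": "ok", "confidence": 0.9}\n```')
```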

OpenClaw's output parser handles this, but you need to use it:

from openclaw.parsers import JSONParser

class AnalysisSkill(Skill):
    output_parser = JSONParser(
        schema={
            "summary": str,
            "confidence": float,
            "sources": list
        },
        retry_on_fail=True,
        max_retries=3
    )
    
    def run(self, data: str):
        response = self.llm.invoke(
            messages=[{"role": "user", "content": f"Analyze this: {data}"}]
        )
        return self.output_parser.parse(response.content)

The retry_on_fail=True is key. If the model returns malformed JSON, the parser will automatically re-prompt with an error message asking for corrected output. This makes your agent dramatically more robust across model changes.

The Fastest Way to Get All of This Right

I've laid out a lot of configuration, gotchas, and code. If you're the type who likes to build everything from scratch, you now have everything you need to go do that.

But honestly? If you just want this to work — model switching, fallbacks, structured output parsing, hybrid local/cloud setups — already configured and tested, Felix's OpenClaw Starter Pack on Claw Mart is worth the $29. It includes pre-configured skills with all the model-agnostic patterns I described above, working fallback chains, and a hybrid model setup out of the box. I spent probably eight hours getting my first multi-model agent working correctly. Felix's pack would have saved me at least six of those hours. It's not magic — it's just someone who already solved these problems packaging the solution up nicely.

If you don't want to spend the money, totally fine. Everything in this post works. But if your time is worth more than $5/hour, the math favors the starter pack.

My Recommendation

Here's what I'd actually do if I were starting a new OpenClaw agent today:

  1. Start with Claude 3.5 Sonnet as your primary model. Best balance of quality, tool-calling reliability, and cost right now.
  2. Set up GPT-4o as your first fallback. Different failure modes than Claude, so they complement each other well.
  3. Add a local Ollama model as your last-resort fallback and for any high-volume, low-complexity nodes.
  4. Use OpenClaw's output parsers everywhere. Never trust raw model output to be in the format you expect.
  5. Test across all your configured models before deploying. An agent that works on Claude but breaks on GPT-4o isn't production-ready.

The beauty of OpenClaw's architecture is that once you set this up correctly, you can swap models with a config change instead of a code rewrite. New model drops from Anthropic? Change one line. Want to test Google's Gemini? Add a provider config. Your agent logic stays the same.

That's the whole point of using a framework instead of raw API calls. Make the framework do the boring, error-prone work so you can focus on building the thing that actually matters — the agent itself.

Now go change some models.
