March 21, 2026 · 9 min read · Claw Mart Team

Debugging SKILL.md Files That Won't Load

Let's cut straight to it: your skills.md file isn't broken. Your setup is.

I've seen this exact scenario play out dozens of times. You spend an afternoon writing clean, detailed skill definitions in markdown. You drop the file into your agent config. You run it. And then... nothing. The agent ignores your carefully documented web_search skill and instead writes a three-paragraph internal monologue about how it should search the web. Or it hallucinates a tool that doesn't exist. Or it picks the right skill but passes completely wrong arguments, crashes, and then spirals into an unrecoverable loop.

If you're here, you've probably already rage-closed four browser tabs and considered switching back to hardcoded if-else chains. Don't. The problem is almost certainly one of a handful of specific, fixable issues — and OpenClaw was built from the ground up to prevent most of them from happening in the first place.

Let me walk through exactly what's going wrong and how to fix it.

The Core Problem: Skills.md as a Prompt Blob

The traditional approach — pioneered by Auto-GPT and adopted by basically every agent framework since — is to take your entire skills.md file and dump it into the system prompt. Every skill, every parameter, every description, all at once, every single turn.

This is the root of almost every skills.md debugging nightmare.

Here's why it falls apart:

Context overload. If you have more than about five skills, you're burning a massive chunk of your context window on tool definitions the agent probably doesn't need right now. A 4,000-token skills file means 4,000 fewer tokens for actual reasoning, conversation history, and task state. The model's attention gets diluted across all those skill definitions, and selection accuracy tanks.

No relevance signal. When you dump 15 skills into a prompt, the model has zero guidance about which ones are relevant right now. It sees web_search next to run_python next to send_email next to query_database and has to figure out on its own which one matters for the current step. GPT-4 can sort of manage this. A local 13B model? Coin flip on a good day.

Format brittleness. The model reads your markdown, generates a response that's supposed to match your expected format, and then a regex or parser tries to extract the tool call. Any slight deviation — an extra space, a missing bracket, a slightly different parameter name — and parsing fails. Now your agent is stuck.
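To make the brittleness concrete, here's a minimal sketch of the kind of regex-based extraction most blob-style setups rely on. The `CALL:` marker and format are hypothetical, but the failure mode is exactly the one described: a one-character deviation from the expected format and the parse silently fails.

```python
import json
import re

# A typical blob-era parser: pull the first JSON object after a "CALL:" marker.
CALL_RE = re.compile(r"CALL:\s*(\{.*\})", re.DOTALL)

def parse_tool_call(model_output: str):
    match = CALL_RE.search(model_output)
    if match is None:
        return None  # no marker matched; the agent is now stuck
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed JSON; also stuck

# Exact expected format: works.
ok = parse_tool_call('CALL: {"skill": "web_search", "query": "openclaw"}')

# One tiny deviation (the model wrote "Call:" instead of "CALL:"): parse fails.
broken = parse_tool_call('Call: {"skill": "web_search", "query": "openclaw"}')
```

The model did everything right in the second case except capitalization, and the parser returns nothing.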

Zero error recovery. When a skill call fails (and they will — rate limits, bad arguments, network errors, whatever), most setups just append "Error: [something]" to the conversation history and hope the model figures it out. It usually doesn't.

If any of this sounds familiar, keep reading. This is exactly the problem set OpenClaw was designed to solve.

Step 1: Check Your Skills.md Structure

Before we get into framework-level fixes, let's make sure your file isn't actively sabotaging you. Open your skills.md and check for these common issues.

Bad: Vague, narrative-style descriptions.

## Web Search
This skill allows the agent to search the internet for information. 
It can be useful when the agent needs to find things online. 
Pass it a query and it will return results.

The model reads this and thinks, "Cool, I know about searching the web." Then it writes a paragraph about searching the web instead of actually calling the tool. There's nothing structured here for the framework to parse, and the description is so vague it doesn't help the model distinguish when to use this versus just reasoning about the topic internally.

Good: Structured, explicit, parseable.

## web_search

**Description:** Executes a live web search and returns the top 5 results with titles, URLs, and snippets.

**When to use:** When the task requires current information not available in the conversation history, OR when you need to verify a factual claim.

**When NOT to use:** When the information has already been retrieved in a previous step, or when the task is purely computational/analytical.

**Parameters:**
- `query` (string, required): The search query. Be specific. Use quotes for exact phrases.
- `num_results` (integer, optional, default=5): Number of results to return. Max 10.

**Returns:** JSON array of objects with `title`, `url`, `snippet` fields.

**Example call:**
```json
{
  "skill": "web_search",
  "parameters": {
    "query": "OpenClaw agent framework latest release",
    "num_results": 3
  }
}
```

**Failure modes:**

- **Rate limited:** Wait 5 seconds and retry with the same query.
- **No results:** Rephrase the query with broader terms.

That's a *massive* difference. The structured version gives OpenClaw's parser everything it needs: a machine-readable name, typed parameters, explicit usage guidance, an example of the exact output format, and documented failure modes.

OpenClaw specifically parses these fields into its internal skill registry. If your `skills.md` is missing the structured fields — especially `Parameters` and `Example call` — that's likely why things aren't loading correctly.

Step 2: Let OpenClaw Do What It Was Built to Do

Here's where OpenClaw fundamentally diverges from the "dump everything into the prompt" approach.

**Dynamic Skill Retrieval**

When you load a properly structured `skills.md` into OpenClaw, it doesn't just paste it into the system prompt. It converts your skill definitions into embeddings and stores them in a vector index. Then, at each reasoning step, it retrieves only the 3-5 most relevant skills based on the current task context.

This means if your agent is in the middle of analyzing a CSV file, it sees `pandas_query`, `plot_chart`, and `statistical_test` — not `send_email` and `browse_webpage`. The context stays clean, and selection accuracy goes way up.
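The retrieval step boils down to cosine similarity between a task embedding and each skill's description embedding. Here's a minimal sketch of that selection logic — the 3-d vectors are toy stand-ins for real embeddings, and the function names are illustrative, not OpenClaw internals:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d "embeddings" standing in for real skill-description embeddings.
skill_vectors = {
    "pandas_query": [0.9, 0.1, 0.0],
    "plot_chart":   [0.7, 0.3, 0.1],
    "send_email":   [0.0, 0.1, 0.9],
}

def retrieve_skills(task_vector, k=2):
    # Score every skill against the current task, keep only the top k.
    scored = sorted(
        ((cosine(task_vector, v), name) for name, v in skill_vectors.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]

# A CSV-analysis-flavored task vector surfaces the data skills, not send_email.
top = retrieve_skills([0.8, 0.2, 0.0])
```

The prompt then only carries the retrieved subset, which is where the context savings come from.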

People who switch from the blob approach to OpenClaw's retrieval approach consistently report going from ~35% correct skill selection to 80%+ on complex multi-step tasks. That's not marketing fluff — that's the difference between an agent that actually works and one that wastes your API credits on hallucinated tool calls.

**Structured Skill Registry**

Even though you write `skills.md` in human-readable markdown (which is great for version control and collaboration), OpenClaw internally parses it into a typed registry. Each skill becomes a first-class object with:

- A unique identifier
- A description embedding for retrieval
- A JSON schema for parameter validation
- Documented failure modes for error recovery
- Usage heuristics (the "when to use / when not to use" fields)
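As a rough mental model of that registry, here's a sketch using a dataclass. The field names and `register` helper are hypothetical — OpenClaw's actual internals will differ — but the shape matches the list above, including the unique-identifier constraint:

```python
from dataclasses import dataclass, field

# Hypothetical registry entry; field names are illustrative only.
@dataclass
class SkillEntry:
    name: str                      # unique identifier
    description: str               # embedded for retrieval
    parameters: dict               # JSON schema for parameter validation
    failure_modes: dict = field(default_factory=dict)
    when_to_use: str = ""
    when_not_to_use: str = ""

registry: dict[str, SkillEntry] = {}

def register(skill: SkillEntry):
    # Duplicate names are rejected up front, mirroring OpenClaw's rule.
    if skill.name in registry:
        raise ValueError(f"duplicate skill name: {skill.name}")
    registry[skill.name] = skill

register(SkillEntry(
    name="web_search",
    description="Executes a live web search and returns the top results.",
    parameters={"type": "object", "required": ["query"],
                "properties": {"query": {"type": "string"}}},
    failure_modes={"rate_limit": "Wait 5 seconds and retry."},
))
```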

If your `skills.md` is failing to load, run this diagnostic:

```bash
openclaw skills validate --file ./skills.md --verbose
```

This will tell you exactly which skills parsed correctly and which ones are malformed. Common issues I see:

- Missing parameter types (OpenClaw needs `string`, `integer`, `boolean`, `array`, or `object`)
- Example call JSON that doesn't validate against the declared parameters
- Duplicate skill names (OpenClaw requires unique identifiers)
- Markdown heading levels that don't match OpenClaw's expected hierarchy (skills should be `##`, sub-sections should be `###` or bold text)
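If you want a feel for what these checks do before reaching for the CLI, here's a hedged sketch of pre-flight validation catching two of the issues above. The input structure and function are hypothetical, and the real validator does far more:

```python
ALLOWED_TYPES = {"string", "integer", "boolean", "array", "object"}

# Illustrative pre-flight checks; not the actual `openclaw skills validate` logic.
def validate_skills(skills: list[dict]) -> list[str]:
    errors = []
    seen = set()
    for skill in skills:
        name = skill.get("name", "<unnamed>")
        if name in seen:
            errors.append(f"{name}: duplicate skill name")
        seen.add(name)
        params = skill.get("parameters", {})
        for pname, spec in params.items():
            if spec.get("type") not in ALLOWED_TYPES:
                errors.append(f"{name}.{pname}: missing or invalid type")
        # Every parameter in the example call must be declared.
        example = skill.get("example_call", {})
        for arg in example.get("parameters", {}):
            if arg not in params:
                errors.append(f"{name}: example uses undeclared parameter {arg!r}")
    return errors

errors = validate_skills([
    {"name": "web_search",
     "parameters": {"query": {"type": "string"}},
     "example_call": {"skill": "web_search",
                      "parameters": {"query": "test", "num_results": 3}}},
])
# num_results appears in the example call but was never declared.
```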

Step 3: Fix the Format Enforcement Problem

The second biggest killer after selection failure is format brittleness. Your agent picks the right skill but outputs something your parser can't read.

OpenClaw handles this with a hybrid execution layer. Here's how it works in practice:

If you're using an API that supports native function calling (OpenAI, Anthropic), OpenClaw automatically converts your skill registry into the provider's function calling schema. Your skills.md definitions get translated into proper tool definitions that the API enforces at the output level. No parsing needed — the API guarantees valid JSON.

```yaml
# openclaw.config.yaml
execution:
  mode: auto  # Uses function calling when available, falls back to constrained generation
  format_enforcement: strict
  max_retries_on_format_error: 2
```

If you're running a local model (which is where most skills.md pain happens), OpenClaw uses constrained generation via grammar-based decoding. It essentially forces the model's output to conform to a valid JSON schema that matches your skill definitions. The model literally cannot output malformed tool calls because the generation is constrained at the token level.

This alone fixes probably 40% of the "my skills.md doesn't work" complaints I see.
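For intuition, here's a minimal sketch of the fallback behavior when neither native function calling nor grammar constraints are available: validate the raw output, and re-ask a bounded number of times, mirroring the `max_retries_on_format_error` setting. The function and prompt text are illustrative assumptions, not OpenClaw's actual code:

```python
import json

# Hypothetical format-enforcement fallback with bounded retries.
def call_with_format_retries(generate, declared_params, max_retries=2):
    feedback = ""
    for _ in range(max_retries + 1):
        raw = generate(feedback)
        try:
            call = json.loads(raw)
            params = call["parameters"]
        except (json.JSONDecodeError, KeyError, TypeError):
            feedback = "Your last output was not valid JSON. Try again."
            continue
        unknown = set(params) - set(declared_params)
        if unknown:
            feedback = f"Unknown parameters: {sorted(unknown)}. Try again."
            continue
        return call  # parsed and validated
    return None  # retries exhausted; hand off to error recovery

# Fake model that emits garbage once, then a valid call.
outputs = iter(['not json at all',
                '{"skill": "web_search", "parameters": {"query": "openclaw"}}'])
result = call_with_format_retries(lambda _: next(outputs),
                                  {"query", "num_results"})
```

Constrained generation makes this loop unnecessary by construction, but the retry path is what catches the stragglers on API backends without schema enforcement.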

Step 4: Implement Proper Error Recovery

This is the one everyone skips and then wonders why their agent derails after the first hiccup.

OpenClaw's clawback mechanism works like this: when a skill execution fails, instead of just appending "Error occurred" to the conversation and praying, it runs a specific recovery step.

  1. It checks the error against the skill's documented failure_modes (which is why you need to write those in your skills.md)
  2. If there's a documented recovery strategy (like "retry after 5 seconds" for rate limits), it executes that automatically
  3. If the error is a parameter problem, it re-retrieves the skill definition and asks the model to correct the specific parameter
  4. If the skill is fundamentally unavailable, it retrieves alternative skills and presents them as options
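The recovery flow above can be sketched as a small loop. This is a hedged illustration, not OpenClaw's clawback implementation: `failure_modes` stands in for the dict you document per skill, and `execute` for real skill execution:

```python
# Illustrative recovery loop; names and structure are assumptions.
def run_with_recovery(execute, failure_modes, max_retries=3):
    for _ in range(max_retries):
        try:
            return {"ok": True, "result": execute()}
        except Exception as err:
            kind = getattr(err, "kind", "unknown")
            strategy = failure_modes.get(kind)
            if strategy == "retry":
                continue  # real code would honor the documented delay
            # Non-recoverable or undocumented: surface a structured error
            # so the planner can correct parameters or pick another skill.
            return {"ok": False, "error": kind, "strategy": strategy}
    return {"ok": False, "error": "retries_exhausted"}

class RateLimit(Exception):
    kind = "rate_limit"

# A skill that rate-limits twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimit()
    return "data"

out = run_with_recovery(flaky, {"rate_limit": "retry", "auth_error": None})
```

The documented `rate_limit` strategy turns two failures into transparent retries instead of a derailed conversation.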

Here's what the failure_modes section should look like for robust error recovery:

**Failure modes:**
- `rate_limit`: Retry after 5 seconds. Max 3 retries.
- `invalid_path`: Check path format. Use forward slashes. Ensure directory exists.
- `timeout`: Reduce scope of request (fewer results, smaller file). Retry once.
- `auth_error`: Non-recoverable. Skip this skill and notify user.

Without documented failure modes, OpenClaw still tries generic recovery, but it's working blind. With them, recovery success rates jump dramatically.

Step 5: Debug with Selection Transparency

One of the things that makes the standard skills.md approach so maddening is that it's a complete black box. The model picked the wrong skill? Cool. Why? No idea. Good luck.

OpenClaw exposes selection scoring at every step. When you run with debug logging enabled:

```bash
openclaw run --task "Analyze Q3 sales data" --debug-skills
```

You'll see output like this at each reasoning step:

```
[Step 3] Skill Relevance Scores:
  pandas_query:     0.92
  plot_chart:       0.71
  statistical_test: 0.68
  web_search:       0.14
  send_email:       0.08
Selected: pandas_query (score: 0.92, threshold: 0.40)
```

Now you can actually see why the agent chose what it chose. If `web_search` is scoring high when it shouldn't be, you know your skill description is too broad or your "when NOT to use" section needs work. If the right skill is scoring below the threshold, you know the description doesn't match the task context well enough.

This is the single most underrated feature for debugging. Use it.

The Real Talk on Local Models

I'm not going to pretend everything is sunshine with a 7B or 13B model. Even with OpenClaw's retrieval and constrained generation, smaller models struggle with complex multi-step tool use. That's just the state of things.

But OpenClaw makes them usable where they were previously useless. A 34B model that was getting ~20% correct skill selection with a traditional skills.md dump can hit 65-70% with OpenClaw's retrieval + constrained generation approach. Not perfect, but the difference between "literally unusable" and "gets stuff done with occasional corrections."

If you're running local: keep your skill count under 10 if possible, make your descriptions extremely explicit, include 2-3 examples per skill instead of one, and use `format_enforcement: strict` in your config. Every token of clarity you add to the skill definition pays double dividends with smaller models.

The Fastest Way to Get This Right

Look, you can absolutely set all of this up yourself. The OpenClaw docs walk through every config option, and the validation tool catches most structural issues in your skills.md.

But if you want to skip the trial-and-error phase entirely, Felix's OpenClaw Starter Pack on Claw Mart is genuinely the move. It's $29 and includes pre-configured skills for the most common agent patterns — research, data analysis, file operations, API integrations — all structured exactly the way OpenClaw's parser expects. The failure modes are already documented, the examples are already formatted, and the config file has sane defaults for both API and local model setups.

I've recommended it to three people who were stuck in debugging hell, and all three had a working agent within an hour of unpacking it. Not because there's any magic in it — it's just a well-structured set of skills.md files and configs — but because it eliminates the formatting guesswork that eats most people's time. You can always customize the skills later once you see the patterns that actually work.

What to Do Next

Here's your debugging checklist, in order:

  1. Run `openclaw skills validate` on your `skills.md`. Fix any structural errors it flags.
  2. Add explicit **When to use** and **When NOT to use** sections to every skill. This is the highest-leverage improvement you can make.
  3. Add typed parameters with JSON schema if you haven't already. OpenClaw can't enforce what it can't parse.
  4. Document failure modes for every skill that does I/O (API calls, file system, web requests).
  5. Enable `--debug-skills` mode and run your task. Look at the selection scores. Adjust descriptions based on what you see.
  6. Set `format_enforcement: strict` if you're on a local model.

The skills.md pattern isn't inherently broken — it's just that most implementations treat it as a dumb text blob instead of a structured, retrievable, observable skill registry. OpenClaw fixes the architecture. Your job is to give it clean inputs.

Stop debugging your prompt. Start debugging your structure.
