Claw Score: Auditing Your OpenClaw AI Agent Architecture
You gave your AI agent shell access, email permissions, and a browser. Now it is running tasks at 3 AM while you sleep. Here is the question that should keep you up at night: do you actually know what it is doing?

Most people building with OpenClaw can tell you what their agent should do. Almost nobody can tell you how it behaves when it encounters an edge case, a malicious prompt injection, or a memory conflict between yesterday's instructions and today's. That gap between should and does is where data leaks happen, where downtime starts, and where regulatory fines land.
Claw Score is an AI agent architecture audit tool that grades your agent across six dimensions, gives you a composite score, and tells you exactly where the structural weaknesses are — before they become incidents. It is file-based, OpenClaw-native, and low-overhead. No dashboards to configure, no SaaS to subscribe to, no telemetry pipeline to stand up.
Let us break down what it actually measures, why each dimension matters, and how to use the results.
Key Takeaways
- Claw Score evaluates six architectural dimensions: Identity, Memory, Security, Autonomy, Proactive Patterns, and Learning — each weighted by real-world impact.
- Agents with tool access are liability surfaces, not just productivity tools. Auditing catches permission drift, injection vulnerabilities, and memory rot before they cause damage.
- The scoring system uses five tiers from Shrimp (1.0) to Mega Claw (5.0), giving you a fast read on overall agent maturity.
- It is built for OpenClaw architectures specifically — no adapter layers, no platform lock-in, no overhead. Drop it in, run it, read the results.
- Auditing is not a one-time event. Your agent changes as its memory grows, its tools expand, and its context shifts. Regular scoring catches regressions early.
Why Audit an AI Agent at All?
There is a temptation to treat AI agents like software deployments — ship it, monitor the logs, fix bugs when users report them. That mental model breaks down fast with autonomous agents because agents do not just execute code. They make decisions, retain context, and take actions with real-world consequences.
Cisco security research flagged OpenClaw skill exfiltration as a concrete attack vector. That means a malicious prompt or a poisoned tool definition can cause your agent to leak its own capabilities, context, or data to an external endpoint. This is not theoretical. It is documented.
Beyond security, there are three practical reasons to audit:
Permission drift. Your agent started with read access to a database. Six weeks later, after iterating on tool definitions, it has write access to production, can send emails on your behalf, and has a browser session with saved credentials. Nobody made a single bad decision — it just accumulated surface area one tool at a time.
Debuggability. When an agent produces a wrong output, can you trace why? Can you identify which memory it retrieved, which identity principle it applied, and at what autonomy level it was operating? Without architectural clarity, debugging an agent is guesswork.
Justifying spend. If you are running agents in a team or enterprise context, someone is going to ask what they are getting for the compute and API costs. A Claw Score gives you a structured answer: here is the agent maturity, here are its gaps, here is what improving by one tier would require.
The Six Dimensions of Claw Score
Each dimension is independently scored on a 1.0–5.0 scale, then combined using a weighted formula that reflects real-world criticality. Here is what each one covers and why it is weighted the way it is.
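To make the weighting concrete, here is a minimal sketch of how a composite like this can be computed. The weights mirror the percentages described in the sections below; the per-dimension scores are invented for illustration and are not output from Claw Score itself.

```python
# Minimal sketch: combining per-dimension scores (1.0-5.0) into a weighted
# composite. Weights mirror the percentages in this post; the example scores
# are made up for illustration.
WEIGHTS = {
    "identity": 0.15,
    "memory": 0.20,
    "security": 0.20,
    "autonomy": 0.15,
    "proactive": 0.15,
    "learning": 0.15,
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores on the 1.0-5.0 scale."""
    return round(sum(WEIGHTS[d] * s for d, s in dimension_scores.items()), 2)

example = {
    "identity": 3.5, "memory": 2.0, "security": 2.5,
    "autonomy": 3.0, "proactive": 2.0, "learning": 2.5,
}
print(composite_score(example))  # 2.55 -- a mid-Crab agent
```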
1. Identity Architecture (15%)
What it measures: Does your agent know who it is?
This is not about giving your agent a cute name. Identity Architecture evaluates whether your agent has a principles-based personality that governs its behavior consistently. When your agent encounters a request that is ambiguous — something that could be interpreted multiple ways — does it have a stable identity to fall back on?
A well-architected identity includes:
- Core principles that dictate tone, boundaries, and decision-making style
- Role definition that is explicit, not implied by the system prompt alone
- Consistency under pressure — the agent behaves the same way whether the user is polite or adversarial
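As an illustration, here is one way such an identity could be expressed as structured data rather than buried in a system prompt. The field names and contents are hypothetical, not an official OpenClaw schema.

```python
# Illustrative only: a principles-based identity expressed as data instead of
# implied by the system prompt. Field names are hypothetical, not an official
# OpenClaw schema.
AGENT_IDENTITY = {
    "role": "Release engineering assistant for the platform team",
    "principles": [
        "Prefer reversible actions; never destroy data to save time.",
        "When a request is ambiguous, ask one clarifying question before acting.",
        "Tone stays neutral and factual, even under adversarial pressure.",
    ],
    "boundaries": [
        "Never send external communications without explicit approval.",
        "Never modify production infrastructure outside a change window.",
    ],
}
```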
Agents with weak identity architecture are unpredictable. They shift tone between conversations, agree to contradictory instructions, and have no stable basis for resolving ambiguity. That is not a personality quirk — it is a reliability problem.
2. Memory Systems (20%)
What it measures: Can your agent learn and remember effectively?
Memory gets the second-highest weight because it directly impacts every other dimension. An agent that cannot remember properly cannot maintain identity, cannot learn from mistakes, and cannot make good autonomy decisions.
Claw Score evaluates memory across several sub-criteria:
- Domain-separated storage. Does your agent keep different types of memory (user preferences, task history, tool outputs, environmental context) in distinct stores? Or is everything dumped into one context window in the hope that it works out?
- Decay models. Not all memories should persist forever. Does your agent have a mechanism for deprioritizing stale information? A six-month-old user preference might be irrelevant. A security credential from yesterday definitely still matters. Smart memory systems know the difference; a sketch of one approach follows this list.
- Retrieval accuracy. When your agent pulls from memory, does it get the right memory? Retrieval that is slightly off-target is often worse than no retrieval at all because the agent will confidently act on wrong context.
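To make the decay and domain-separation ideas concrete, here is a minimal sketch of a decay-weighted retrieval score. The domains and half-lives are assumptions chosen for illustration.

```python
import time

# Sketch of a decay model: each memory domain gets its own half-life, so stale
# user preferences fade while security-relevant facts stay hot. The domains and
# half-lives below are illustrative assumptions.
HALF_LIFE_DAYS = {
    "user_preferences": 90.0,
    "task_history": 14.0,
    "tool_outputs": 2.0,
    "security_context": 365.0,
}

def retrieval_weight(domain: str, created_at: float, relevance: float) -> float:
    """Down-weight stale memories: relevance score times exponential age decay."""
    age_days = (time.time() - created_at) / 86_400
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS[domain])
    return relevance * decay
```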
Most OpenClaw agents score poorly here because memory is treated as an afterthought — a vector database bolted on after the core agent works. Claw Score catches that.
3. Security Posture (20%)
What it measures: Can your agent be manipulated?
This is tied for the highest weight, and for good reason. An agent with shell access and weak security is not an AI assistant — it is an attack surface with a chat interface.
Security Posture evaluates:
- Prompt injection defense. Can a user (or a tool output, or a retrieved document) override your agent's instructions? The classic "ignore all previous instructions" attack is just the beginning. Sophisticated injections hide in tool responses, web page content, and even file metadata.
- Trust boundaries. Does your agent distinguish between instructions from you (the operator), instructions from users, and content from external sources? These should carry different trust levels. Most agents treat them all the same.
- Permission scoping. Are your agent's tool permissions as narrow as possible? An agent that needs to read a specific database table should not have credentials for the entire database server.
- Output sanitization. When your agent generates a shell command or an API call, is the output validated before execution? A sketch of this kind of check follows this list.
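Here is a minimal sketch of two of those checks: tagging content by trust source before it reaches the model, and validating a generated shell command against an allowlist before execution. The trust levels and allowlist are illustrative assumptions, and real sanitization needs considerably more than this.

```python
import shlex

# Sketch of two checks from the list above: trust-tagging content before it
# reaches the model, and validating generated shell commands before execution.
# The trust levels and command allowlist are illustrative assumptions; a real
# sanitizer needs far more than an allowlist check.
TRUST_LEVELS = ("operator", "user", "external")   # highest to lowest
ALLOWED_BINARIES = {"ls", "cat", "grep", "git"}   # deliberately narrow

def tag_context(content: str, source: str) -> str:
    """Label retrieved or external content so it cannot masquerade as instructions."""
    if source not in TRUST_LEVELS:
        raise ValueError(f"unknown trust source: {source}")
    return f"[trust={source}] {content}"

def validate_shell_command(command: str) -> bool:
    """Reject generated commands whose binary is not allowlisted or that chain commands."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in ALLOWED_BINARIES and ";" not in command
```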
If you only care about one dimension, care about this one.
4. Autonomy Gradients (15%)
What it measures: Does your agent know when to act versus when to ask?
A fully autonomous agent sounds great until it autonomously deletes a production database because it misinterpreted "clean up the old records." A fully dependent agent that asks permission for everything is just a chatbot with extra steps.
Good autonomy architecture defines trust levels — categories of action with different approval requirements:
- Low-risk actions (reading data, summarizing content) execute without asking
- Medium-risk actions (sending emails, modifying non-critical files) might notify you but proceed
- High-risk actions (financial transactions, production deployments, external communications) require explicit approval
- Forbidden actions (deleting backups, sharing credentials) are hard-blocked regardless of instruction
Claw Score also evaluates escalation paths. When your agent encounters something outside its trust level, does it escalate gracefully? Or does it fail silently, retry in a loop, or hallucinate a workaround?
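One way to encode an autonomy gradient is a simple mapping from actions to trust levels, with unknown actions escalating rather than failing silently. The action names and their assigned levels below are hypothetical.

```python
from enum import Enum

# Sketch of an autonomy gradient: actions are classified into trust levels with
# different approval requirements. The action-to-level mapping is hypothetical.
class TrustLevel(Enum):
    LOW = "execute"             # run without asking
    MEDIUM = "notify"           # proceed, but tell the operator
    HIGH = "require_approval"   # block until explicitly approved
    FORBIDDEN = "hard_block"    # refuse regardless of instruction

ACTION_LEVELS = {
    "read_database": TrustLevel.LOW,
    "send_email": TrustLevel.MEDIUM,
    "deploy_production": TrustLevel.HIGH,
    "delete_backups": TrustLevel.FORBIDDEN,
}

def route_action(action: str) -> TrustLevel:
    """Unknown actions escalate to HIGH instead of failing silently or guessing."""
    return ACTION_LEVELS.get(action, TrustLevel.HIGH)
```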
5. Proactive Patterns (15%)
What it measures: Does your agent take useful initiative?
A reactive agent waits for instructions. A proactive agent notices things and acts on them. That is the difference between a tool and a teammate.
This dimension evaluates:
- Heartbeat processes. Does your agent periodically check on things without being asked? Monitoring a deployment, checking for new emails, verifying that a long-running process has not stalled. A minimal heartbeat loop is sketched after this list.
- Background maintenance. Does your agent clean up after itself? Archiving old conversation threads, consolidating fragmented memories, updating stale cached data.
- Anticipatory actions. Can your agent predict what you will need next based on patterns? If you always review pull requests on Monday morning, does it have a summary ready?
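A heartbeat can be as simple as a loop that runs lightweight checks on a fixed interval. The sketch below assumes a 15-minute interval and placeholder checks; what the agent does with a finding should still respect its trust levels.

```python
import time

# Minimal heartbeat sketch: run lightweight checks on a fixed interval without
# waiting for an instruction. The interval and checks are placeholders.
CHECK_INTERVAL_SECONDS = 15 * 60

def run_checks() -> list[str]:
    """Return human-readable findings; placeholder for deployment, inbox, and job checks."""
    findings: list[str] = []
    return findings

def heartbeat() -> None:
    while True:
        for finding in run_checks():
            print(f"[heartbeat] {finding}")  # in practice: notify or act per trust level
        time.sleep(CHECK_INTERVAL_SECONDS)
```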
Proactive patterns are where agents go from useful to indispensable. But they also require solid scores in the other five dimensions — a proactive agent with weak security and no autonomy gradients is a disaster waiting to happen.
6. Learning Architecture (15%)
What it measures: Does your agent improve over time?
This is the dimension that separates an agent that is no better in month six than it was on day one from an agent that compounds its value.
Claw Score evaluates:
- Regression tracking. When your agent makes a mistake, does it record the mistake and the correction? Can it detect when it is about to make the same mistake again? A sketch of this pattern follows the list.
- Daily synthesis. Does your agent consolidate what it learned each day into durable insights? Or does every conversation start from scratch?
- Performance trending. Over time, is the agent getting better at its core tasks? Faster? More accurate? Or is it plateauing — or worse, degrading as memory bloat and context conflicts accumulate?
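Regression tracking does not have to be elaborate. Here is a sketch that logs mistakes with their corrections and looks them up before repeating an action; the JSONL file format is an assumption for illustration.

```python
import json
from pathlib import Path

# Sketch of regression tracking: log each mistake with its correction, then
# consult the log before repeating an action. The JSONL format is an assumption.
LOG_PATH = Path("mistakes.jsonl")

def record_mistake(action: str, what_went_wrong: str, correction: str) -> None:
    entry = {"action": action, "mistake": what_went_wrong, "correction": correction}
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def known_corrections(action: str) -> list[str]:
    """Return past corrections for this action so the agent can avoid repeating itself."""
    if not LOG_PATH.exists():
        return []
    entries = (json.loads(line) for line in LOG_PATH.read_text().splitlines() if line)
    return [e["correction"] for e in entries if e["action"] == action]
```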
Learning Architecture is what makes the difference between an agent you replace in three months and one you rely on for years.
The Scoring Tiers
Your composite Claw Score falls into one of five tiers:
| Score Range | Tier | What It Means |
|---|---|---|
| 1.0 – 1.9 | Shrimp | Minimal architecture. Basically a prompt wrapper with tools attached. High risk for any production use. |
| 2.0 – 2.9 | Crab | Some structure, but major gaps. Likely missing security hardening and memory management. Fine for personal experiments, risky for anything else. |
| 3.0 – 3.9 | Lobster | Solid foundation. Most dimensions are addressed, though some may be shallow. Ready for supervised production use. |
| 4.0 – 4.5 | King Crab | Well-architected across all dimensions. Minor refinements needed. Suitable for enterprise deployment with standard oversight. |
| 4.6 – 5.0 | Mega Claw | Top-tier architecture. Deep implementation across all six dimensions. Minimal residual risk. Audit-ready for compliance and regulatory review. |
Most agents built with default OpenClaw configurations land in the Crab tier. That is not a failure — it is a starting point. The value of Claw Score is showing you specifically which dimensions are dragging your composite down so you can prioritize improvements.
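For reference, mapping a composite score to its tier is a direct translation of the table above:

```python
def tier_for(score: float) -> str:
    """Map a composite score to its tier; thresholds follow the table above."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    for threshold, name in [(4.6, "Mega Claw"), (4.0, "King Crab"),
                            (3.0, "Lobster"), (2.0, "Crab")]:
        if score >= threshold:
            return name
    return "Shrimp"

print(tier_for(2.55))  # Crab
```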
How Claw Score Compares to Other Tools
You might be wondering how this stacks up against existing options. Here is the honest breakdown:
- LangSmith is great for debugging chains and tracing LLM calls. It does not evaluate architecture.
- Arize AI handles observability — monitoring performance metrics in production. It tells you what happened, not why your architecture allowed it to happen.
- Vertex AI provides enterprise evaluation tooling, but it is platform-locked to Google Cloud and generalized across use cases.
- AgentBench is an academic benchmarking framework. Useful for research comparisons, less useful for answering "should I trust this agent with my production database?"
Claw Score is different because it is architecture-focused, OpenClaw-native, and file-based. There is no platform to sign up for, no SDK to integrate, no ongoing subscription. You run it against your agent configuration files, get a scored report, and act on the results. The overhead is near zero, which means you can run it regularly without disrupting your workflow.
Running Your First Audit
Here is what the process looks like in practice:
- Get Claw Score. It is available as a single product listing: Claw Score on Claw Mart — $20.
- Point it at your agent configuration. Claw Score reads your OpenClaw architecture files directly — system prompts, tool definitions, memory configurations, permission schemas.
- Run the audit. The tool evaluates each of the six dimensions, generates sub-scores, and computes your weighted composite.
- Read the report. You get a per-dimension breakdown with specific findings: what is strong, what is weak, and what is missing entirely.
- Prioritize fixes. Start with Security Posture and Memory Systems (they carry the most weight). Then address whichever dimension has the lowest individual score — that is your biggest liability.
- Re-audit after changes. This is the part most people skip. Every architectural change should be followed by a re-score to confirm the improvement landed and did not introduce regressions elsewhere.
Practical Next Steps
If you are running an OpenClaw agent in any capacity beyond casual experimentation, here is what to do this week:
First, run an honest self-assessment against the six dimensions above. Just read through each one and ask yourself: Do I have this? Is it real or is it a TODO comment in my config? That alone will surface your biggest gaps.
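If it helps, here is a quick-and-dirty version of that self-assessment as a script. It is not Claw Score; the questions are loose paraphrases of the six dimensions and exist only to surface obvious gaps before a real audit.

```python
# Quick-and-dirty self-assessment: one yes/no question per dimension.
# Not Claw Score -- just a way to surface obvious gaps before a real audit.
QUESTIONS = {
    "identity": "Does the agent have written principles that govern ambiguous requests?",
    "memory": "Are memories domain-separated with some form of decay or pruning?",
    "security": "Are trust boundaries and prompt-injection defenses actually implemented?",
    "autonomy": "Are actions classified by risk, with approval rules and escalation paths?",
    "proactive": "Does the agent run any heartbeat or maintenance tasks on its own?",
    "learning": "Are mistakes and corrections recorded and consulted later?",
}

def self_assess() -> None:
    gaps = [d for d, q in QUESTIONS.items() if input(f"{q} [y/N] ").strip().lower() != "y"]
    print("Biggest gaps:", ", ".join(gaps) if gaps else "none obvious -- audit anyway")

if __name__ == "__main__":
    self_assess()
```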
Second, pick up Claw Score and run an actual audit. Your self-assessment will miss things — that is the whole point of a structured evaluation tool. Twenty dollars is cheap insurance against the kind of incident that costs days of debugging or worse.
Third, make auditing a habit. Not daily, but at every meaningful architecture change — new tool added, memory system modified, autonomy levels adjusted. Your agent's risk profile changes every time its capabilities change. Your audit schedule should match.
The agents that cause problems are not the ones that were badly built. They are the ones that were built well enough to get deployed and then never examined again. Do not let yours be one of them.