Code Review Bot: Let OpenClaw Review Your Pull Requests

Let me be honest with you: most code reviews are a waste of everyone's time.
Not because they're unimportant — they're critical. But because 70% of what happens in a typical pull request review is stuff a machine should be catching. Formatting inconsistencies. Missed null checks. That one junior dev who keeps importing the wrong utility library. The forgotten error handler on an async call that's going to blow up in production at 2 AM on a Saturday.
You know the drill. Senior engineers spend hours every week pointing out the same patterns, leaving the same comments, and slowly dying inside while the actually interesting architectural questions get glossed over because everyone's exhausted from reviewing boilerplate issues.
OpenClaw fixes this. Not in a "we slapped GPT-4 on a diff and called it AI" way — in a genuinely useful, multi-agent, context-aware way that treats code review as the serious engineering problem it actually is.
I've been running it on my team's repos for the past few months, and I'm going to walk you through exactly how to set it up, what to expect, and where it shines (and where it doesn't).
What OpenClaw Actually Does Differently
Before we get into setup, you need to understand why OpenClaw isn't just another linter wrapper. Because that's the first question everyone asks, and it's fair — most "AI code review" tools are shallow. They catch what ESLint or Pylint already catches and add a nice ChatGPT-sounding explanation on top. Cute, not useful.
OpenClaw uses a multi-agent architecture with three distinct agents working on every review:
- Analyzer Agent — Explores the code changes, maps dependencies, runs relevant tests and linters, and builds a comprehensive understanding of what the PR actually does across the codebase.
- Critic Agent — Stress-tests the Analyzer's findings. Think of it as the adversarial layer that asks "are you sure?" before anything gets posted as a comment. This dramatically reduces false positives.
- Policy Agent — Enforces your specific rules. Your CONTRIBUTING.md, your architecture decision records, your forbidden patterns, your naming conventions. The stuff no generic tool knows about.
This three-layer approach is why OpenClaw catches things that simpler tools miss. It's not just looking at the diff in isolation — it's understanding how the change fits into your broader codebase, testing its own conclusions, and checking everything against your team's actual standards.
Each finding comes with a confidence score and an evidence chain. So when OpenClaw flags something, you can immediately see why it flagged it and how confident it is. This is huge for building trust. You learn pretty quickly which confidence levels you can auto-approve and which need human eyes.
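To make that concrete, here's roughly the shape a single finding takes. This is an illustrative sketch only; the field names are made up to show the confidence-plus-evidence structure, not OpenClaw's actual output schema.

```yaml
# Hypothetical finding shape (illustrative field names, not the real schema)
finding:
  rule: "missing-error-handling"
  confidence: 0.84
  location: "src/jobs/sync.ts:112"
  message: "Promise returned by pushMetrics() is neither awaited nor caught"
  evidence:
    - "Analyzer: call site discards the returned Promise"
    - "Analyzer: pushMetrics() can reject on network failure"
    - "Critic: confirmed no surrounding try/catch or .catch() in the call chain"
```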
Getting Started: The Fast Path
If you want to skip the "figuring everything out from scratch" phase, I'd genuinely recommend grabbing Felix's OpenClaw Starter Pack. Felix put together pre-configured templates, policy files, and a docker-compose setup that handles the most annoying parts of initial configuration. I burned a full afternoon wrestling with GitHub App permissions and webhook routing before someone pointed me to this pack, and I wish I'd started there. It's especially useful if you're running a monorepo or need multi-language support out of the box.
But let's walk through the core setup either way so you understand what's happening under the hood.
Installation and Configuration
First, you'll need OpenClaw installed. The CLI is the fastest way to get moving:
```bash
# Install OpenClaw CLI
pip install openclaw

# Initialize a new project configuration
openclaw init --repo .

# This creates .openclaw/config.yaml in your repo root
```
That init command scaffolds out a configuration directory. Here's what a typical config.yaml looks like after you've customized it:
```yaml
# .openclaw/config.yaml
project:
  name: "my-saas-app"
  languages: ["typescript", "python"]
  framework_hints: ["nextjs", "fastapi"]

review:
  mode: "review-only"          # won't try to edit code, just comments
  max_loc_per_review: 1500
  confidence_threshold: 0.7    # only post findings above this score

agents:
  analyzer:
    tools:
      - linter
      - test_runner
      - dependency_checker
    context_depth: 3           # how many levels of imports to trace
  critic:
    challenge_threshold: 0.6   # challenge findings below this confidence
    max_iterations: 3
  policy:
    sources:
      - ".openclaw/policies/"
      - "CONTRIBUTING.md"
      - "docs/architecture/"

model:
  provider: "anthropic"        # or "ollama" for local
  name: "claude-sonnet-4-20250514"
  temperature: 0.1             # keep it low for consistency

indexing:
  strategy: "hierarchical"
  chunk_size: 1500
  overlap: 200
```
A few things to note here:
`review-only` mode is your friend. When you're starting out, do not let OpenClaw suggest or make code changes. Let it comment only. This builds team trust and lets you calibrate before you give it more power. Most teams I've talked to in the community prefer this mode permanently, and honestly, I agree. The value is in the review, not in auto-fixing.

`confidence_threshold` at 0.7 is a good starting point. Too low and you'll get noise. Too high and you'll miss useful findings. I started at 0.5 and got annoyed, bumped to 0.8 and missed things, and settled at 0.7 for our codebase. Your mileage may vary — adjust after a week of data.

`context_depth: 3` means it traces three levels of imports. If your PR modifies a function, OpenClaw will look at what calls that function, what calls those functions, and one more level out. This is how it catches "this change is fine in isolation but breaks the caller's assumption" bugs. Increase this for deeply nested architectures, decrease it if you're paying per token and want to manage costs.
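To see why tracing callers matters, here's a minimal, self-contained sketch of the kind of bug a diff-only reviewer misses. It's plain Python with invented function names, with three "modules" flattened into one file; the changed function looks harmless in isolation, and the breakage is two call levels away.

```python
# utils.py (level 0) — the function the PR touches.
# Before the PR it raised ValueError on bad input; the "fix" returns None.
def parse_price(raw: str):
    try:
        return float(raw.strip("$"))
    except ValueError:
        return None  # looks harmless in the diff

# cart.py (level 1) — a direct caller that assumes it always gets a float.
def line_total(raw_price: str, qty: int) -> float:
    return parse_price(raw_price) * qty  # TypeError when parse_price returns None

# checkout.py (level 2) — two imports away from the diff.
def order_total(items) -> float:
    return sum(line_total(price, qty) for price, qty in items)

print(order_total([("$5", 2)]))  # 10.0 — fine for clean input
try:
    order_total([("$5", 2), ("oops", 1)])
except TypeError as exc:
    print("breaks two levels up:", exc)
```

With `context_depth: 1` a reviewer would see `line_total`; it takes depth 2 or more to notice that `order_total`'s sum also dies on the first malformed price.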
Setting Up the GitHub App
This is where most people get stuck, and it's where Felix's Starter Pack saves the most time. But here's the manual process:
```bash
# Generate the GitHub App manifest
openclaw github create-app \
  --org your-org-name \
  --webhook-url https://your-server.com/openclaw/webhook \
  --permissions pull_requests:write,contents:read,checks:write

# This outputs an app ID and private key
# Store them securely
export OPENCLAW_GITHUB_APP_ID=12345
export OPENCLAW_GITHUB_PRIVATE_KEY_PATH=/path/to/key.pem
```
Then you'll need a webhook listener running somewhere. The simplest approach for small teams:
```yaml
# docker-compose.yaml
version: '3.8'
services:
  openclaw-reviewer:
    image: openclaw/reviewer:latest
    ports:
      - "3000:3000"
    environment:
      - OPENCLAW_GITHUB_APP_ID=${OPENCLAW_GITHUB_APP_ID}
      - OPENCLAW_GITHUB_PRIVATE_KEY_PATH=/keys/github-app.pem
      - OPENCLAW_MODEL_PROVIDER=anthropic
      - OPENCLAW_API_KEY=${ANTHROPIC_API_KEY}
    volumes:
      - ./keys:/keys:ro
      - ./.openclaw:/app/.openclaw:ro
      - openclaw-index:/app/index

  openclaw-indexer:
    image: openclaw/indexer:latest
    environment:
      - OPENCLAW_REPO_PATH=/repo
    volumes:
      - ./:/repo:ro
      - openclaw-index:/app/index

volumes:
  openclaw-index:
```
Spin it up with `docker-compose up -d`, install the GitHub App on your repo, and you're live. Every new PR will trigger a webhook, OpenClaw will analyze the changes, and you'll see review comments appear directly on the pull request within a few minutes.
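One detail worth getting right if you put your own listener or proxy in front of the reviewer container: verify webhook signatures before processing anything. GitHub signs every delivery with your webhook secret in the `X-Hub-Signature-256` header (an HMAC-SHA256 of the raw request body), and the check is a few lines of standard-library Python:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Validate GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest does a constant-time comparison, avoiding timing attacks
    return hmac.compare_digest(expected, signature_header)

secret = b"webhook-secret"
body = b'{"action": "opened"}'
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, good))               # True
print(verify_signature(secret, body, "sha256=deadbeef"))  # False
```

Drop anything that fails the check before it ever reaches the review pipeline; an open webhook endpoint that triggers LLM runs is a cost and abuse vector.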
Writing Custom Policies (This Is Where the Magic Happens)
The generic review capabilities are solid, but the real power of OpenClaw is the Policy Agent. This is where you encode your team's tribal knowledge — the stuff that lives in senior engineers' heads and nowhere else.
Create policy files in .openclaw/policies/:
```yaml
# .openclaw/policies/api-standards.yaml
name: "API Design Standards"
scope: "src/api/**"
rules:
  - id: "api-001"
    description: "All API endpoints must use our standardized error response format"
    pattern: "When reviewing API route handlers, verify they use ApiError class from '@/lib/errors' rather than raw Response objects for error cases"
    severity: "high"
  - id: "api-002"
    description: "No direct database queries in route handlers"
    pattern: "Route handlers should call service layer functions, never import from '@/db' or use prisma directly"
    severity: "high"
  - id: "api-003"
    description: "Rate limiting required on all public endpoints"
    pattern: "Public API endpoints (not under /internal/) must include rateLimit middleware"
    severity: "medium"
```
```yaml
# .openclaw/policies/security.yaml
name: "Security Requirements"
scope: "**"
rules:
  - id: "sec-001"
    description: "No secrets in code"
    pattern: "Flag any hardcoded API keys, tokens, passwords, or connection strings. Check for common patterns like 'sk-', 'ghp_', 'AKIA', base64-encoded credentials"
    severity: "critical"
  - id: "sec-002"
    description: "SQL injection prevention"
    pattern: "Any raw SQL queries must use parameterized queries. Flag string concatenation or template literals in SQL"
    severity: "critical"
  - id: "sec-003"
    description: "User input sanitization"
    pattern: "Data from request body, query params, or headers must be validated with zod schemas before use"
    severity: "high"
```
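To make sec-002 concrete, here's the difference in plain Python with sqlite3: the concatenated query is exactly what the rule should flag, and the parameterized version is what should pass review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # attacker-controlled value

# What sec-002 flags: user input concatenated straight into SQL.
unsafe = "SELECT role FROM users WHERE name = '" + user_input + "'"
# conn.execute(unsafe) matches every row, because '1'='1' is always true.

# What passes review: a parameterized query; the driver handles escaping.
rows = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] — the injected string matches no real user
```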
These aren't regex patterns — they're natural language instructions that the Policy Agent interprets with full context awareness. It understands what "service layer" means in your codebase because it has the index. It can tell the difference between a public and internal endpoint because it reads your routing configuration.
This is the feature that makes senior engineers' eyes light up. You're essentially codifying review knowledge that previously existed only in people's brains.
Running It Locally (Air-Gapped / No Cloud)
If you can't send code to external APIs — and a lot of teams can't — OpenClaw has first-class support for local models via Ollama:
```yaml
# .openclaw/config.yaml (local model section)
model:
  provider: "ollama"
  name: "qwen2.5-coder:72b"
  base_url: "http://localhost:11434"
  temperature: 0.1

  # Optional: use different models for different agents
  agent_overrides:
    analyzer:
      name: "qwen2.5-coder:72b"      # heavy lifting
    critic:
      name: "deepseek-coder-v2:34b"  # faster, still good
    policy:
      name: "qwen2.5-coder:72b"      # needs to understand natural language rules
```
Fair warning: you need serious hardware for the 72B models. We're talking 48GB+ VRAM minimum, ideally 80GB. The 34B models run on a single A6000 or even a well-configured consumer GPU. If you're resource-constrained, the community reports that DeepSeek-Coder-V2 at 34B gives surprisingly good results for the Critic agent role specifically.
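Those VRAM numbers line up with a simple back-of-envelope estimate. This is my own arithmetic, assuming 4-bit quantized weights and decimal gigabytes; KV cache and runtime overhead come on top, which is why the practical floor sits above the weight size.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight-only memory estimate: params * bits / 8, in decimal GB.
    KV cache, activations, and runtime overhead are extra."""
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

# 72B at 4-bit: ~36 GB of weights alone, hence 48 GB VRAM as a realistic floor.
print(round(weight_memory_gb(72, 4)))  # 36
# 34B at 4-bit: ~17 GB, which is why it fits a single 48 GB A6000 comfortably.
print(round(weight_memory_gb(34, 4)))  # 17
```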
The local setup is also where Felix's OpenClaw Starter Pack really shines — it includes pre-tuned model configurations and optimized Ollama settings that took the community weeks of experimentation to figure out. The difference between a naive local setup and a well-configured one is dramatic in terms of both speed and review quality.
What to Expect: Honest Results
After running OpenClaw on ~200 PRs across two codebases, here's my honest assessment:
What it's great at:
- Catching missed error handling (especially in async code)
- Enforcing consistent patterns across the codebase
- Spotting dependency issues and breaking changes
- Security-relevant findings (SQL injection, XSS vectors, auth gaps)
- Finding dead code introduced by refactors
- Enforcing custom policies reliably
What it's decent at:
- Performance implications of changes
- Test coverage gaps (it's better when it can actually run tests)
- API design consistency
What it still struggles with:
- Very large refactors (>2000 LOC diffs). It loses coherence.
- Subtle business logic bugs that require deep domain knowledge
- Novel architectural decisions where there's no precedent in the codebase
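The async error-handling point at the top of that list deserves a concrete picture. Sketched in plain Python, this is the fire-and-forget pattern human reviewers skim past and a reviewer agent is well positioned to flag:

```python
import asyncio

async def flaky_upload():
    raise RuntimeError("network down")

async def handler_missing_handling():
    # The classic miss: create_task with no await and no done-callback.
    # The handler "succeeds" while the failure is only logged later, if at all.
    asyncio.create_task(flaky_upload())
    return "ok"

async def handler_with_handling():
    # The fix a reviewer should suggest: await the task (or attach a callback)
    # so the failure reaches the caller.
    task = asyncio.create_task(flaky_upload())
    try:
        await task
    except RuntimeError:
        return "upload failed, caller informed"

print(asyncio.run(handler_missing_handling()))  # "ok", error silently dropped
print(asyncio.run(handler_with_handling()))     # "upload failed, caller informed"
```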
The numbers from our team: Review time dropped by about 40%. Not because engineers stopped reviewing — they still approve everything — but because OpenClaw handles the mechanical layer and engineers can focus on the architectural and design questions. The number of "oops, missed that" bugs that made it to staging dropped by roughly 60%.
The Right Mental Model
The people who get the most out of OpenClaw treat it as a tireless, thorough junior reviewer who has read every file in your codebase and never forgets your coding standards. It's not replacing your senior engineers. It's giving them their time back so they can focus on the hard problems that actually require human judgment.
It won't catch everything. It will occasionally flag something that's fine. But it catches enough real issues consistently enough that going back to purely human reviews feels reckless, like removing your test suite because "we have good engineers."
Next Steps
- Start small. Pick one active repo, not your most critical one. Run OpenClaw in review-only mode for two weeks.
- Grab the starter pack. Seriously, Felix's OpenClaw Starter Pack will save you hours on initial setup and policy configuration. It includes working examples for common stacks that you can modify rather than writing from scratch.
- Write three custom policies. Think about the three most common review comments your senior engineers leave. Encode those as policies. This is where the ROI compounds.
- Calibrate your confidence threshold. After the first week, look at which findings were useful and which were noise. Adjust accordingly.
- Index your docs. Feed it your architecture decision records, your CONTRIBUTING.md, your style guides. The more context OpenClaw has about how your team works, the better it gets.
- Share the results. After two weeks, pull the numbers. How many findings were actionable? How much time did reviewers save? Let the data make the case for wider adoption.
Code review shouldn't be a bottleneck. It shouldn't be the thing that makes your best engineers consider management. OpenClaw doesn't eliminate the need for human judgment — it eliminates the drudgery so your humans can actually exercise that judgment where it matters.
Stop burning senior engineering time on problems a well-configured AI agent can handle. Set up OpenClaw, teach it your standards, and let it do the work.