Claw Mart
February 20, 2026 · 9 min read · Claw Mart Team

AI Lead Gen Agents with Real-Time Web Scraping

Automate hyper-targeted lead scraping from LinkedIn, forums, and directories. Personalize cold outreach to book 10x more calls.

Most "lead generation strategies" are just glorified manual labor wearing a trench coat.

You know the drill. Spend three hours on LinkedIn. Copy-paste names into a spreadsheet. Google their company. Find an email. Write a generic "I'd love to pick your brain" message. Send it into the void. Repeat until your soul leaves your body or you actually book a meeting — whichever comes first.

Here's the thing: this entire workflow — the searching, the scraping, the enriching, the writing — can be automated with AI agents that run in real time. Not "automation" in the Zapier-moves-a-row-to-another-spreadsheet sense. I mean actual autonomous agents that search the web, scrape company data, find decision-makers, write personalized outreach based on what they found, and queue everything up for you to review and send.

The result? You go from manually generating maybe 10-15 leads per day to having a system that produces 50-100 hyper-targeted, personalized leads while you're doing literally anything else. And the outreach actually converts because it's based on real, scraped context — not some mail-merge template that screams "I know nothing about you."

Let me show you exactly how to build this.

Why Traditional Lead Gen Is Broken

The fundamental problem isn't that lead gen is hard. It's that the valuable parts (identifying good prospects, understanding their pain points, crafting relevant messages) are buried under hours of repetitive mechanical work.

Think about what actually happens when a good salesperson generates leads manually:

  1. Search — They Google something like "Series A fintech startups NYC 2024" or browse LinkedIn with filters
  2. Identify — They scan results, pick companies that fit their ICP
  3. Research — They visit the company website, read the about page, check recent news, find the right contact
  4. Enrich — They track down an email address, maybe a phone number
  5. Personalize — They write a message that references something specific about the company
  6. Send — They fire it off and move to the next one

Steps 1 through 4 are pure mechanical work. A machine can do them better, faster, and without getting distracted by Twitter. Step 5 is where the magic happens — but even that can be dramatically improved when an LLM has rich, scraped context to work with instead of a human trying to remember what they read three tabs ago.

The unlock is building a system of specialized AI agents, each handling one part of this pipeline, passing structured data to the next agent in line.

The Architecture: A Crew of Specialized Agents

The approach I'll walk through uses a multi-agent system. Instead of one monolithic script, you build a team of agents that collaborate:

[Researcher Agent] → Searches the web for leads matching your criteria
        ↓
[Scraper Agent] → Visits each result, extracts emails, company info, recent news
        ↓
[Personalizer Agent] → Crafts custom outreach using the scraped context
        ↓
[Outreach Agent] → Formats, reviews, and queues messages for sending

Each agent has a specific role, specific tools, and a specific output format. The Researcher doesn't try to scrape. The Scraper doesn't try to write emails. Division of labor — the same principle that makes real sales teams work.
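What does "structured data" mean concretely here? One hypothetical shape (this record type is mine, not something CrewAI mandates) is a single lead object that each agent enriches as it moves down the pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class Lead:
    """One lead as it moves through the pipeline; each agent fills in more fields."""
    company: str                                       # set by the Researcher
    url: str                                           # set by the Researcher
    reason: str = ""                                   # why it matched the ICP
    emails: list[str] = field(default_factory=list)    # set by the Scraper
    about: str = ""                                    # set by the Scraper
    recent_news: list[str] = field(default_factory=list)
    draft_email: str = ""                              # set by the Personalizer
    confidence: int = 0                                # set by the Outreach Manager (1-10)

# The Researcher emits a skeleton; downstream agents enrich it
lead = Lead(company="Acme Fintech", url="https://acmefintech.com")
lead.emails.append("sarah@acmefintech.com")
```

Thinking in terms of a record like this keeps each agent's expected output unambiguous, even though in practice the agents exchange the data as text or JSON.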

You can build and orchestrate this entire system on OpenClaw. OpenClaw is built for exactly this kind of multi-agent workflow — you define your agents, give them tools, chain their tasks together, and let them run. It handles the orchestration, the LLM calls, the tool integration, and the execution pipeline so you're not duct-taping together five different frameworks.

Setting Up Your Tools

Before building agents, you need the tools they'll use. Two are essential:

Web Search (Serper)

The SerperDevTool that ships with crewai-tools lets your agents Google things programmatically (it wraps Serper's Google Search API) and get structured results back. This is how your Researcher agent finds leads without you manually typing queries. It installs as part of crewai-tools in the next step, so there's no separate package to add.

You'll need an API key from serper.dev (the free tier comes with a starter allowance of search credits, which is plenty for testing). Set it as an environment variable:

export SERPER_API_KEY=your_key_here

Website Scraping (Playwright + BeautifulSoup)

Once your Researcher finds URLs, your Scraper agent needs to actually visit those sites and extract data. Playwright handles JavaScript-heavy pages (which is most of the modern web), and BeautifulSoup parses the HTML.

pip install beautifulsoup4 playwright requests crewai crewai-tools
playwright install

The Custom Scraping Tool

Here's the scraping tool your agent will use. It visits a URL, waits for the page to fully load, then extracts key data points:

from crewai_tools import SerperDevTool
from crewai.tools import tool
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import re

# Built-in Serper wrapper for CrewAI (reads SERPER_API_KEY from the environment)
serper = SerperDevTool()

@tool("Website Scraper")
def scrape_website(url: str) -> str:
    """Scrape a company website for contact info, about page content, and recent news."""
    # Playwright's sync API keeps the tool simple — CrewAI invokes tools synchronously
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=30_000)

        content = page.content()
        soup = BeautifulSoup(content, 'html.parser')

        # Extract what matters
        title = soup.title.string.strip() if soup.title and soup.title.string else ""
        emails = re.findall(
            r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
            content
        )

        # Try to find about/description content
        about = soup.find('div', class_='about') or soup.find('meta', attrs={'name': 'description'})
        about_text = about.get('content', about.get_text()) if about else 'N/A'

        # Look for recent blog posts or news headlines
        news_items = soup.find_all(['article', 'h2', 'h3'], limit=5)
        news = [item.get_text(strip=True) for item in news_items if item.get_text(strip=True)]

        browser.close()

        return (
            f"Title: {title}\n"
            f"Emails: {', '.join(emails[:3]) if emails else 'None found'}\n"
            f"About: {str(about_text)[:500]}\n"
            f"Recent News/Content: {'; '.join(news[:3]) if news else 'None found'}"
        )

This is intentionally simple. In production, you'd add error handling, timeouts, proxy rotation, and probably hit an enrichment API like Hunter.io for verified emails. But this gets you 80% of the way there.
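As a hedged illustration of that hardening, a thin retry wrapper with exponential backoff might look like this (the function name and sentinel string are mine, not part of any framework):

```python
import time

def scrape_with_retries(url: str, scrape_fn, retries: int = 3, backoff: float = 2.0) -> str:
    """Call a scraper function with simple retries and exponential backoff.

    scrape_fn is any callable like scrape_website; after repeated failures we
    return a sentinel string instead of crashing the whole crew run.
    """
    for attempt in range(retries):
        try:
            return scrape_fn(url)
        except Exception as exc:  # network errors, timeouts, blocked pages...
            if attempt == retries - 1:
                return f"SCRAPE_FAILED: {url} ({exc})"
            time.sleep(backoff ** attempt)  # waits 1s, 2s, 4s, ...
    return f"SCRAPE_FAILED: {url}"
```

Returning a sentinel instead of raising matters in a multi-agent pipeline: one dead website shouldn't kill the other 49 leads in the batch.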

Building the Agents

Now the fun part. Each agent gets a role, a goal, a backstory (which shapes how the LLM approaches the task), and the tools it can use.

When you're building on OpenClaw, you define these agents within the platform and it handles the underlying orchestration. Here's what the agent definitions look like:

from crewai import Agent, Task, Crew

# Agent 1: The Researcher
researcher = Agent(
    role='Lead Researcher',
    goal='Find high-potential {industry} companies in {location} that match our ICP',
    backstory=(
        'You are an expert B2B researcher who knows exactly how to craft '
        'Google queries to find companies that are actively growing, recently '
        'funded, or showing buying signals. You focus on quality over quantity.'
    ),
    tools=[serper],
    llm='gpt-4o-mini',
    verbose=True
)

# Agent 2: The Scraper
scraper = Agent(
    role='Intelligence Gatherer',
    goal='Extract actionable contact info and company insights from lead websites',
    backstory=(
        'You are a meticulous data collector who visits company websites and '
        'pulls out the specific details that make outreach personal: founding '
        'story, recent milestones, key team members, contact emails, and any '
        'pain points visible from their public content.'
    ),
    tools=[scrape_website],
    llm='gpt-4o-mini',
    verbose=True
)

# Agent 3: The Personalizer
personalizer = Agent(
    role='Outreach Copywriter',
    goal='Write cold emails that feel warm using real scraped company data',
    backstory=(
        'You write outreach that gets replies. Your secret: radical specificity. '
        'You never use generic templates. Every email references something real '
        'about the company — their recent news, their tech stack, their growth '
        'stage. You keep it to 3-4 sentences max. No fluff.'
    ),
    llm='gpt-4o-mini',
    verbose=True
)

# Agent 4: The Outreach Manager
outreach_manager = Agent(
    role='Outreach QA Specialist',
    goal='Review, format, and prepare all personalized messages for sending',
    backstory=(
        'You are the final checkpoint before outreach goes live. You verify '
        'email formatting, check for compliance issues, ensure personalization '
        'is accurate, and flag anything that needs human review. You format '
        'output as a clean, structured list ready for import into a CRM.'
    ),
    llm='gpt-4o-mini',
    verbose=True
)

Notice the Personalizer has no tools. It doesn't need them. Its job is pure language — taking the structured data from the Scraper and turning it into compelling copy. This is where LLMs genuinely shine.

Wiring Up the Pipeline

Tasks define what each agent actually does, and context parameters create the data flow between them:

def build_lead_gen_crew(industry: str, location: str, num_leads: int = 5):
    
    research_task = Task(
        description=(
            f'Search Google for the top {num_leads} {industry} companies in '
            f'{location}. Focus on companies that are growing: recently funded, '
            f'hiring, or launching new products. Return a list with company name, '
            f'website URL, and a one-line description of why they are a good lead.'
        ),
        agent=researcher,
        expected_output='Numbered list of leads with company name, URL, and reasoning'
    )
    
    scrape_task = Task(
        description=(
            'Visit each company URL from the research results. For each company, '
            'extract: (1) contact email addresses, (2) company description/mission, '
            '(3) any recent news, blog posts, or milestones, (4) key team members '
            'if visible. Return structured data for each lead.'
        ),
        agent=scraper,
        context=[research_task],
        expected_output='JSON-formatted lead profiles with scraped data'
    )
    
    personalize_task = Task(
        description=(
            'For each lead, write a personalized cold email (3-4 sentences max). '
            'The email MUST reference something specific from the scraped data — '
            'their recent news, their mission, a specific challenge in their '
            'industry. Use this framework:\n'
            '- Line 1: Specific observation about them (proves you did research)\n'
            '- Line 2: Bridge to how you can help\n'
            '- Line 3: Clear, low-friction CTA (e.g., "Worth a 15-min chat?")\n'
            'No generic phrases like "I hope this finds you well." Ever.'
        ),
        agent=personalizer,
        context=[scrape_task],
        expected_output='List of personalized emails, one per lead'
    )
    
    outreach_task = Task(
        description=(
            'Review all personalized emails for: (1) accuracy of scraped references, '
            '(2) professional tone, (3) GDPR/CAN-SPAM compliance basics, '
            '(4) proper formatting. Output a final structured list with: '
            'company name, contact email, subject line, email body, and a '
            'confidence score (1-10) for each lead.'
        ),
        agent=outreach_manager,
        context=[personalize_task],
        expected_output='Final outreach list with confidence scores, ready to send'
    )
    
    crew = Crew(
        agents=[researcher, scraper, personalizer, outreach_manager],
        tasks=[research_task, scrape_task, personalize_task, outreach_task],
        verbose=True
    )
    
    return crew

Running It

crew = build_lead_gen_crew(
    industry="fintech",
    location="New York",
    num_leads=5
)

# Pass inputs so CrewAI can interpolate the {industry} and {location}
# placeholders used in the Researcher's goal
result = crew.kickoff(inputs={"industry": "fintech", "location": "New York"})
print(result)

When this runs, you'll see each agent working in sequence. The Researcher fires off Serper queries, finds five fintech companies in New York, and passes the results along. The Scraper visits each website and pulls emails and company context. The Personalizer writes emails that reference real, specific things about each company. The Outreach Manager does a final QC pass.

Here's what a sample output looks like:

Lead #1: Acme Fintech (acmefintech.com)
  Contact: sarah@acmefintech.com
  Subject: Your Series A and what comes next
  Body: "Hi Sarah — saw Acme just closed a $10M Series A 
  (congrats). Most fintech teams at your stage hit a wall 
  scaling outbound without burning through SDR budget. We 
  helped [similar company] 3x their pipeline in 60 days 
  with AI-driven prospecting. Worth a 15-min look?"
  Confidence: 8/10

Lead #2: ...

That email references their actual funding round, scraped in real time from their website or news coverage. It's not a template. It's not a mail merge. It's a message that reads like a human spent 10 minutes researching the company — because an AI agent actually did.

Making Personalization Actually Work

The difference between 2% and 20% reply rates on cold outreach comes down to one thing: does the recipient believe you actually know who they are?

Here are the personalization techniques that matter most, all of which your scraping agent can feed into the Personalizer:

Recent funding or milestones — "Congrats on the Series B" is the easiest warm opener because it's specific and positive. Your Scraper agent pulls this from news results or the company's blog.

Tech stack signals — If you scrape their job postings or use BuiltWith data, you can reference their actual tools: "Noticed you're running on Stripe Connect — we integrate natively."

Content they've published — If their CEO wrote a blog post last week, reference it. Nothing says "I did my homework" like citing their own words.

Hiring signals — If they're hiring 5 SDRs, they're clearly trying to scale outbound. That's a pain point you can address directly.

You can add a Lead Qualifier agent between the Scraper and Personalizer that scores each lead on these signals:

qualifier = Agent(
    role='Lead Qualifier',
    goal='Score each lead 1-10 based on buying signals and ICP fit',
    backstory=(
        'You evaluate leads based on: funding stage, growth signals '
        '(hiring, new products), tech stack fit, and company size. '
        'Only pass through leads scoring 7+.'
    ),
    llm='gpt-4o-mini'
)

This prevents your system from wasting personalization effort on low-quality leads. Filter early, personalize deeply.
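To slot the qualifier into the crew, give it a task between scraping and personalization. Here's a sketch layered on the definitions above — the task wording and wiring are mine:

```python
from crewai import Task, Crew

qualify_task = Task(
    description=(
        'Score each scraped lead 1-10 on funding stage, hiring activity, '
        'tech stack fit, and company size. Drop any lead scoring below 7, '
        'and justify each score in one line.'
    ),
    agent=qualifier,
    context=[scrape_task],  # reads the Scraper's structured output
    expected_output='Filtered list of leads with scores and one-line justifications'
)

# Re-point the Personalizer at qualified leads and add the new pieces to the crew
personalize_task.context = [qualify_task]

crew = Crew(
    agents=[researcher, scraper, qualifier, personalizer, outreach_manager],
    tasks=[research_task, scrape_task, qualify_task, personalize_task, outreach_task],
    verbose=True
)
```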

Scaling This to Production

Once you've validated the pipeline works with 5 leads, here's how to scale it:

Email enrichment APIs — The regex-based email scraping in the code above is a starting point. For production, integrate Hunter.io ($49/mo) or Apollo.io ($99/mo) to get verified email addresses with deliverability scores. You can add these as additional tools on OpenClaw.

Proxy rotation — If you're scraping at volume, you'll hit rate limits. Services like BrightData or ScrapingBee handle IP rotation so your Scraper agent doesn't get blocked.

Scheduling and triggers — Set your crew to run daily, weekly, or on triggers. New companies matching your ICP? Run the pipeline automatically. OpenClaw handles the orchestration layer, so you can schedule runs and pipe outputs into your CRM or outreach tool.

CRM integration — The Outreach Manager agent can format its output as CSV for HubSpot import, or you can wire it directly to SendGrid, Instantly, or Lemlist via API.
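For the CSV route, the final formatting step can be a few lines of plain Python. The field names below are illustrative, matching the Outreach Manager's output format described earlier, not a fixed schema:

```python
import csv
import io

def leads_to_csv(leads: list[dict]) -> str:
    """Render the Outreach Manager's final list as CSV text for CRM import."""
    fieldnames = ["company", "email", "subject", "body", "confidence"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for lead in leads:
        # Missing fields become empty cells rather than raising KeyError
        writer.writerow({k: lead.get(k, "") for k in fieldnames})
    return buf.getvalue()

csv_text = leads_to_csv([{
    "company": "Acme Fintech",
    "email": "sarah@acmefintech.com",
    "subject": "Your Series A and what comes next",
    "body": "Hi Sarah -- saw Acme just closed a Series A. Worth a 15-min chat?",
    "confidence": 8,
}])
```

The csv module handles quoting automatically, so email bodies with commas or newlines import cleanly into HubSpot.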

Cost math — This is surprisingly cheap to run:

  • Serper: roughly $0.001 per search
  • LLM tokens (GPT-4o-mini): ~$0.01-0.03 per lead for all four agents
  • Total: roughly $0.02-0.05 per fully personalized lead

Compare that to $15-25/hour for a human SDR doing the same work at maybe 10-15 leads per hour. The math is absurd.
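A quick sanity check on that claim, using the upper-bound agent costs and the midpoints of the human ranges above:

```python
# Agent pipeline: one search plus LLM calls across four agents (upper bound)
agent_cost_per_lead = 0.001 + 0.03   # search + token cost, in dollars

# Human SDR: $20/hr producing ~12 leads/hr (midpoints of the ranges above)
human_cost_per_lead = 20 / 12

# Even at the agents' worst-case cost, the pipeline is ~50x cheaper per lead
ratio = human_cost_per_lead / agent_cost_per_lead
print(f"agent: ${agent_cost_per_lead:.3f}  human: ${human_cost_per_lead:.2f}  ratio: {ratio:.0f}x")
```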

The Enrichment Stack

Your agents are only as good as the data they can access. Here's what I'd recommend layering in:

Tool         What It Does                 Cost
Serper       Web search results           Free starter credits
Hunter.io    Verified email finder        $49/mo
Clearbit     Company firmographics        Free tier available
Apollo.io    Contact database + emails    $99/mo
BuiltWith    Tech stack detection         Free tier available
Playwright   Dynamic page scraping        Free (open source)

Each of these can be wrapped as a CrewAI tool and made available to your agents on OpenClaw. The Scraper agent gets smarter with every tool you add.
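As one example, here's a sketch of a Hunter.io lookup you could wrap with the same @tool decorator used for the scraper. The endpoint and response shape (data.emails[].value) are taken from Hunter's public docs; verify against the current API reference before relying on it. The `fetch` parameter is my own addition to make the function testable:

```python
HUNTER_URL = "https://api.hunter.io/v2/domain-search"  # see Hunter's API docs

def find_verified_emails(domain: str, api_key: str, fetch=None) -> list[str]:
    """Query Hunter's domain-search endpoint and return email addresses.

    `fetch` is any requests.get-compatible callable; injectable for testing.
    """
    if fetch is None:
        import requests  # any HTTP client works; requests shown for brevity
        fetch = requests.get
    resp = fetch(HUNTER_URL, params={"domain": domain, "api_key": api_key})
    resp.raise_for_status()
    payload = resp.json()
    return [e["value"] for e in payload.get("data", {}).get("emails", [])]
```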

Legal and Ethical Guardrails

I'm not going to pretend this section is fun, but it'll save you from getting sued or blacklisted:

  • Respect robots.txt — If a site says don't scrape, don't scrape. Serper handles Google's TOS for you, but for individual sites, check first.
  • GDPR/CAN-SPAM compliance — You need a legitimate business interest for B2B outreach in most jurisdictions. Include an unsubscribe mechanism. Don't scrape and email personal (non-business) addresses.
  • Rate limiting — Don't hammer sites with 1,000 requests per minute. Be a good citizen. Add delays between scrapes.
  • Human review — The Outreach Manager agent flags messages for review, but you should still eyeball the first few batches before going full autopilot.
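The robots.txt check and the delay between scrapes are both a few lines of stdlib Python. A sketch (in the pipeline you'd fetch https://<host>/robots.txt once per domain and cache the parsed result; the user-agent string here is hypothetical):

```python
import time
import urllib.robotparser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "LeadGenBot") -> bool:
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

def polite_delay(seconds: float = 3.0) -> None:
    """Pause between scrapes so the Scraper agent is a good citizen."""
    time.sleep(seconds)
```

Call allowed_by_robots before every scrape and polite_delay after, and you've covered the first and third bullets above with almost no extra code.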

This isn't about spam. The entire point of this system is that it enables better, more personalized outreach — the kind recipients actually appreciate because it's clearly relevant to them.

What to Build on OpenClaw

Everything I've described — the multi-agent pipeline, the tool integrations, the task orchestration — this is exactly what OpenClaw is designed for. You're not cobbling together random scripts and praying they work. You're building a production-grade agent system with proper orchestration, monitoring, and iteration capabilities.

Here's what I'd recommend exploring on the Claw Mart marketplace:

  • Web scraping agent templates — Pre-built agents with Serper and Playwright integrations
  • Lead enrichment workflows — Agents that chain multiple data sources for comprehensive lead profiles
  • Outreach personalization crews — Multi-agent systems specifically designed for cold email and LinkedIn outreach
  • CRM connector agents — Agents that format and push leads directly into HubSpot, Salesforce, or Pipedrive

The marketplace has ready-to-deploy components that you can customize for your specific ICP and outreach style, rather than building everything from scratch.

Next Steps

Here's what I'd do this week:

  1. Set up your tools — Get SerpAPI and OpenClaw accounts. Install the dependencies. 15 minutes.
  2. Define your ICP — Be specific. "Series A fintech companies in NYC with 20-50 employees" is infinitely better than "fintech companies."
  3. Build a 5-lead test — Run the pipeline against 5 leads. Read every output. Tweak agent prompts based on what you see.
  4. Add enrichment — Wire in Hunter.io or Apollo for verified emails. Your reply rates will jump immediately.
  5. Scale to 25-50 leads/day — Once the quality is dialed in, crank the volume. Set up daily runs.
  6. Track and iterate — Monitor open rates and reply rates. Feed winning messages back into your Personalizer agent's prompts as few-shot examples.
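Step 6 can be as simple as appending winning emails to the Personalizer's backstory. A hypothetical helper (the template wording is mine, not CrewAI's):

```python
BASE_BACKSTORY = (
    'You write outreach that gets replies. Your secret: radical specificity. '
    'Keep every email to 3-4 sentences.'
)

def with_few_shot_examples(backstory: str, winning_emails: list[str]) -> str:
    """Append top-performing emails so the LLM imitates what already converts."""
    if not winning_emails:
        return backstory
    examples = "\n\n".join(
        f"Example {i} (got a reply):\n{email}"
        for i, email in enumerate(winning_emails, start=1)
    )
    return f"{backstory}\n\nEmails that booked calls:\n\n{examples}"

# Then: personalizer = Agent(..., backstory=with_few_shot_examples(BASE_BACKSTORY, winners))
```

Each week, feed the two or three best-performing emails back in; the few-shot examples do more to shape output quality than any amount of abstract instruction.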

Expect 10-20% reply rates with strong personalization on cold outreach. I've seen teams hit 25-30% when the ICP targeting is tight and the scraped context is genuinely relevant.

The companies that figure out AI-powered lead gen in the next 12 months are going to have an almost unfair advantage. The data is out there. The tools exist. The cost is negligible. The only question is whether you build the system or keep copy-pasting from LinkedIn like it's 2019.

Start building on OpenClaw. Start small. Iterate fast. Let the agents do the work.
