Raw web pages are full of noise that degrades RAG quality. Here is how to configure Crawl4AI to extract the content that actually matters.
## The Noise Problem
A raw web page contains hundreds of elements that have nothing to do with the content you want: navigation menus, cookie banners, ads, footers, social share buttons, and sidebar widgets, plus markup artifacts like CSS class names. When you feed this raw HTML into an LLM or a RAG pipeline, the noise consumes tokens, dilutes semantic meaning, and causes the model to retrieve or generate from irrelevant content.
Crawl4AI converts web pages to LLM-ready markdown, but the default conversion still includes more noise than most RAG pipelines need. This article shows you how to configure it for clean, high-signal output.
## Installation and Basic Setup

```bash
pip install crawl4ai

# First-time setup (downloads Playwright browsers)
crawl4ai-setup
```

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def basic_crawl():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/getting-started",
            config=CrawlerRunConfig(
                # Basic markdown conversion
            ),
        )
        print(result.markdown)  # raw markdown -- still noisy

asyncio.run(basic_crawl())
```

## Step 1: Use the Fit Markdown Extractor
Crawl4AI's fit_markdown mode applies heuristics to remove navigation, headers, footers, and other non-content elements. It dramatically reduces noise for documentation-style pages.
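To build intuition for what the pruning threshold controls, here is a toy, stdlib-only sketch of density-based pruning. This is not Crawl4AI's actual implementation; it just illustrates the idea that markup-heavy blocks (nav bars, button clusters) have low text density and get dropped, while prose-heavy blocks survive.

```python
# Toy illustration of density-based pruning -- NOT Crawl4AI's real algorithm.
import re

def text_density(html_block: str) -> float:
    """Ratio of visible text length to total markup length."""
    text = re.sub(r"<[^>]+>", "", html_block)
    return len(text.strip()) / max(len(html_block), 1)

def prune_blocks(blocks: list[str], threshold: float = 0.48) -> list[str]:
    """Keep only blocks whose text density clears the threshold."""
    return [b for b in blocks if text_density(b) >= threshold]
```

A `<nav>` full of links scores well below 0.48 and is pruned; a paragraph of prose scores near 1.0 and is kept. Crawl4AI's `PruningContentFilter` uses richer heuristics, but the same knob applies: raising `threshold` prunes more aggressively.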
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold=0.48,        # higher = more aggressive pruning
            threshold_type="fixed",
            min_word_threshold=0,
        ),
        options={
            "ignore_links": True,  # remove hyperlinks (keep text only)
            "skip_internal_links": True,
        },
    ),
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.example.com/api", config=config)
        # result.markdown_v2.fit_markdown holds the cleaned content
        clean_content = result.markdown_v2.fit_markdown
        print(f"Raw length: {len(result.markdown)}")
        print(f"Clean length: {len(clean_content)}")  # typically 40-70% smaller

asyncio.run(main())
```

## Step 2: Use CSS Selectors to Target Specific Content
For sites with consistent structure (documentation, blogs), target the main content element directly with a CSS selector. This is the most reliable way to get exactly the content you want and nothing else.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    # Only extract content from the main article element
    css_selector="article.main-content",
    # Or for documentation sites:
    # css_selector="div[role='main']",
    # css_selector=".markdown-body",         # GitHub-style
    # css_selector="#content",               # many docs sites
    # css_selector=".documentation-content",
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/concepts/agents",
            config=config,
        )
        print(result.markdown)  # only the targeted element, no nav/footer

asyncio.run(main())
```

Open the target page in Chrome DevTools, right-click the main content area, and inspect the element. Find the most specific selector or ID that wraps all the content you want. This takes two minutes and gives you dramatically cleaner output than any heuristic approach.

## Step 3: Extract Structured Data with LLM Extraction
For pages where you want structured fields rather than prose (product pages, job listings, API references), Crawl4AI's LLM extraction strategy converts page content directly into a Pydantic schema.
```python
import asyncio
import json
from typing import List, Optional

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: Optional[str]
    description: str
    features: List[str]
    availability: Optional[str]

config = CrawlerRunConfig(
    extraction_strategy=LLMExtractionStrategy(
        provider="anthropic/claude-haiku-4-5-20251001",  # cheap model for extraction
        api_token="your-anthropic-key",
        schema=ProductInfo.model_json_schema(),
        extraction_type="schema",
        instruction=(
            "Extract the product information from this page. "
            "Only include information explicitly stated on the page."
        ),
    ),
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://shop.example.com/product/123", config=config)
        product = json.loads(result.extracted_content)
        print(product)  # structured ProductInfo fields

asyncio.run(main())
```

## Step 4: Batch Crawling for Documentation Sites
Most RAG pipelines need to index an entire documentation site, not just one page. Crawl4AI supports crawling multiple URLs in parallel.
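One common way to build the URL list is to parse the site's `sitemap.xml`. The helper below is a plain-stdlib sketch, not part of Crawl4AI; fetch the sitemap however you like (`urllib`, `httpx`) and pass its text in.

```python
# Hypothetical helper (not part of Crawl4AI): extract URLs from a sitemap.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Filter the returned list (e.g. keep only paths under `/docs/`) before feeding it to the crawler.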
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Build a list of URLs to crawl (from sitemap, link discovery, or manual list)
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api/agents",
    "https://docs.example.com/api/tools",
    "https://docs.example.com/guides/deployment",
]

config = CrawlerRunConfig(
    css_selector="article",
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48),
    ),
)

async def crawl_docs():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=config,
            max_concurrent=5,  # parallel crawls -- respect rate limits
        )
    documents = []
    for result in results:
        if result.success:
            documents.append({
                "url": result.url,
                "content": result.markdown_v2.fit_markdown,
                "title": result.metadata.get("title", ""),
            })
        else:
            print(f"Failed: {result.url} -- {result.error_message}")
    return documents

docs = asyncio.run(crawl_docs())
print(f"Crawled {len(docs)} pages successfully")
```

## Quick Reference
- Use `PruningContentFilter` with `threshold=0.48` for general noise removal
- Use `css_selector` to target the main content element for consistent results
- Use `LLMExtractionStrategy` + a Pydantic schema for structured data extraction
- Use `arun_many` with `max_concurrent` for batch crawling -- stay under 10 concurrent to avoid rate limits
- `result.markdown_v2.fit_markdown` gives the cleaned content; `result.markdown` gives the raw markdown
- Set `ignore_links=True` unless you specifically need to index hyperlinks
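These settings compose. As a sketch (the same classes as in Steps 1-2; the selector and threshold are placeholders to adjust for your site), a typical high-signal documentation config combines selector targeting with pruning:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# The selector narrows the DOM first; the filter prunes what remains.
docs_config = CrawlerRunConfig(
    css_selector="article",                      # main content element (site-specific)
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48),
        options={"ignore_links": True},          # text only, no hyperlinks
    ),
)
```

Pass `docs_config` to `arun` or `arun_many` exactly as in the examples above.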