Raw web pages are full of noise that degrades RAG quality. Here is how to configure Crawl4AI to extract the content that actually matters.
## The Noise Problem
A raw web page contains hundreds of elements that have nothing to do with the content you want: navigation menus, cookie banners, ads, footers, social share buttons, and sidebar widgets, plus markup artifacts like CSS class names. When you feed this raw HTML into an LLM or a RAG pipeline, the noise consumes tokens, dilutes semantic meaning, and causes the model to retrieve or generate from irrelevant content.
Crawl4AI converts web pages to LLM-ready markdown, but the default conversion still includes more noise than most RAG pipelines need. This article shows you how to configure it for clean, high-signal output.
## Installation and Basic Setup

```bash
pip install crawl4ai

# First-time setup (downloads Playwright browsers)
crawl4ai-setup
```

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def basic_crawl():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/getting-started",
            config=CrawlerRunConfig(
                # Basic markdown conversion
            ),
        )
        print(result.markdown)  # raw markdown -- still noisy

asyncio.run(basic_crawl())
```

## Step 1: Use the Fit Markdown Extractor
Crawl4AI's fit_markdown mode applies heuristics to remove navigation, headers, footers, and other non-content elements. It dramatically reduces noise for documentation-style pages.
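To build intuition for what the pruning threshold controls, here is a toy, stdlib-only sketch of density-based pruning. This is not Crawl4AI's actual implementation; it just illustrates the idea that markup-heavy blocks (nav bars, button clusters) have low text density and get dropped, while prose-heavy blocks survive.

```python
# Toy illustration of density-based pruning -- NOT Crawl4AI's real algorithm.
import re

def text_density(html_block: str) -> float:
    """Ratio of visible text length to total markup length."""
    text = re.sub(r"<[^>]+>", "", html_block)
    return len(text.strip()) / max(len(html_block), 1)

def prune_blocks(blocks: list[str], threshold: float = 0.48) -> list[str]:
    """Keep only blocks whose text density clears the threshold."""
    return [b for b in blocks if text_density(b) >= threshold]
```

A `<nav>` full of links scores well below 0.48 and is pruned; a paragraph of prose scores near 1.0 and is kept. Crawl4AI's `PruningContentFilter` uses richer heuristics, but the same knob applies: raising `threshold` prunes more aggressively.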
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold=0.48,        # higher = more aggressive pruning
            threshold_type="fixed",
            min_word_threshold=0,
        ),
        options={
            "ignore_links": True,  # remove hyperlinks (keep text only)
            "skip_internal_links": True,
        },
    ),
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.example.com/api", config=config)
        # result.markdown_v2.fit_markdown holds the cleaned content
        clean_content = result.markdown_v2.fit_markdown
        print(f"Raw length: {len(result.markdown)}")
        print(f"Clean length: {len(clean_content)}")  # typically 40-70% smaller

asyncio.run(main())
```

## Step 2: Use CSS Selectors to Target Specific Content
For sites with consistent structure (documentation, blogs), target the main content element directly with a CSS selector. This is the most reliable way to get exactly the content you want and nothing else.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    # Only extract content from the main article element
    css_selector="article.main-content",
    # Or for documentation sites:
    # css_selector="div[role='main']",
    # css_selector=".markdown-body",         # GitHub-style
    # css_selector="#content",               # many docs sites
    # css_selector=".documentation-content",
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/concepts/agents",
            config=config,
        )
        print(result.markdown)  # only the targeted element, no nav/footer

asyncio.run(main())
```

Open the target page in Chrome DevTools, right-click the main content area, and inspect the element. Find the most specific selector or ID that wraps all the content you want. This takes two minutes and gives you dramatically cleaner output than any heuristic approach.

## Step 3: Extract Structured Data with LLM Extraction
For pages where you want structured fields rather than prose (product pages, job listings, API references), Crawl4AI's LLM extraction strategy converts page content directly into a Pydantic schema.
```python
import asyncio
import json
from typing import List, Optional

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: Optional[str]
    description: str
    features: List[str]
    availability: Optional[str]

config = CrawlerRunConfig(
    extraction_strategy=LLMExtractionStrategy(
        provider="anthropic/claude-haiku-4-5-20251001",  # cheap model for extraction
        api_token="your-anthropic-key",
        schema=ProductInfo.model_json_schema(),
        extraction_type="schema",
        instruction=(
            "Extract the product information from this page. "
            "Only include information explicitly stated on the page."
        ),
    ),
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://shop.example.com/product/123", config=config)
        product = json.loads(result.extracted_content)
        print(product)  # structured ProductInfo fields

asyncio.run(main())
```

## Step 4: Batch Crawling for Documentation Sites
Most RAG pipelines need to index an entire documentation site, not just one page. Crawl4AI supports crawling multiple URLs in parallel.
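One common way to build the URL list is to parse the site's `sitemap.xml`. The helper below is a plain-stdlib sketch, not part of Crawl4AI; fetch the sitemap however you like (`urllib`, `httpx`) and pass its text in.

```python
# Hypothetical helper (not part of Crawl4AI): extract URLs from a sitemap.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Filter the returned list (e.g. keep only paths under `/docs/`) before feeding it to the crawler.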
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Build a list of URLs to crawl (from sitemap, link discovery, or manual list)
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api/agents",
    "https://docs.example.com/api/tools",
    "https://docs.example.com/guides/deployment",
]

config = CrawlerRunConfig(
    css_selector="article",
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48),
    ),
)

async def crawl_docs():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=config,
            max_concurrent=5,  # parallel crawls -- respect rate limits
        )
    documents = []
    for result in results:
        if result.success:
            documents.append({
                "url": result.url,
                "content": result.markdown_v2.fit_markdown,
                "title": result.metadata.get("title", ""),
            })
        else:
            print(f"Failed: {result.url} -- {result.error_message}")
    return documents

docs = asyncio.run(crawl_docs())
print(f"Crawled {len(docs)} pages successfully")
```

## Quick Reference
- Use `PruningContentFilter` with `threshold=0.48` for general noise removal
- Use `css_selector` to target the main content element for consistent results
- Use `LLMExtractionStrategy` + a Pydantic schema for structured data extraction
- Use `arun_many` with `max_concurrent` for batch crawling -- stay under 10 concurrent to avoid rate limits
- `result.markdown_v2.fit_markdown` gives the cleaned content; `result.markdown` gives the raw markdown
- Set `ignore_links=True` unless you specifically need to index hyperlinks
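These settings compose. As a sketch (the same classes as in Steps 1-2; the selector and threshold are placeholders to adjust for your site), a typical high-signal documentation config combines selector targeting with pruning:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# The selector narrows the DOM first; the filter prunes what remains.
docs_config = CrawlerRunConfig(
    css_selector="article",                      # main content element (site-specific)
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48),
        options={"ignore_links": True},          # text only, no hyperlinks
    ),
)
```

Pass `docs_config` to `arun` or `arun_many` exactly as in the examples above.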