Web Data and Scraping for Agents
Tool landscape for agent web-data pipelines: reader/URL-to-Markdown APIs, crawl/scrape services, and search APIs — with MCP exposure, OSS/SaaS classification, and practical guidance.
Agents cannot use raw HTML efficiently: it bloats the context window, embeds navigation noise, and costs 3–10x more tokens than clean Markdown of the same content. A web-data layer transforms live web content into agent-consumable form. Three distinct jobs require different tools.
Three jobs — pick the right tool for each
| Job | What you need | Tools |
|---|---|---|
| 1. Read a known URL → clean text | URL-to-Markdown conversion | Jina Reader, Firecrawl /scrape, trafilatura, Mozilla Readability |
| 2. Crawl a site / scrape at scale | Multi-URL crawl + JS rendering + anti-bot | Firecrawl /crawl, Apify, Crawlee, Browserbase, Bright Data, ScrapingBee |
| 3. Search the web | Search query → ranked URLs + snippets or answers | Tavily, Exa, Brave Search API, Serper, Perplexity Sonar API, built-in provider tools |
Job 1: Reader / URL-to-Markdown
Jina Reader (SaaS + OSS) — prefix any URL with https://r.jina.ai/ and receive clean Markdown optimized for LLMs. No key required for basic usage; optional API key for higher rate limits. The extraction model is ReaderLM-v2 (1.5B). Supports PDF and MS Office documents via direct POST. Free tier available; OSS branch at github.com/jina-ai/reader. Exposes an MCP server via the Jina AI MCP (smithery.ai registry).
Firecrawl /scrape (SaaS + AGPL self-host) — one-URL scrape endpoint returning Markdown, HTML, or structured JSON. Handles JS-rendered pages, proxy rotation, and CAPTCHA. Free tier (500 credits/month); paid from $19/month. GitHub: github.com/mendableai/firecrawl. Exposes an official MCP server.
Self-hosted options (OSS):
- trafilatura (Apache 2.0, Python) — extracts main text and metadata from HTML with high accuracy; outputs TXT, Markdown, CSV, JSON, or TEI-XML. Command-line and library. Used by HuggingFace, IBM, and Microsoft Research. Docs: trafilatura.readthedocs.io.
- Mozilla Readability (Apache 2.0, JavaScript) — the parser behind Firefox Reader Mode; strips nav/ads and returns article DOM. GitHub: github.com/mozilla/readability. Pair with Playwright or Puppeteer for JS-rendered pages.
When to self-host vs use a service: self-hosted options are free and private but require infrastructure and cannot solve CAPTCHAs. Services handle anti-bot at scale out of the box.
Job 2: Crawl / scrape at scale
Firecrawl /crawl (SaaS + AGPL self-host) — crawls an entire site and returns all pages as Markdown. Same service as the /scrape endpoint; the /crawl endpoint accepts a root URL and traverses all sub-URLs. Handles JS rendering, rate limiting, and proxy rotation automatically.
Apify (SaaS) — managed cloud platform with 30,000+ community-built Actors (preconfigured scrapers for common targets) plus a proxy network and storage layer. Actors run serverlessly; pricing is pay-per-compute-unit. Homepage: apify.com.
Crawlee (Apache 2.0, OSS by Apify) — open-source TypeScript/JavaScript (and Python) web-scraping library. Supports Cheerio, JSDOM, Playwright, and Puppeteer crawlers with auto proxy rotation, fingerprinting, and autoscaling. Can run locally or deploy to Apify. Python port stable since September 2025. GitHub: github.com/apify/crawlee.
Browserbase (SaaS) — managed cloud headless browsers (Playwright/Puppeteer API) optimized for AI agents. Handles CAPTCHA, stealth, and session recording. Priced per session. Homepage: browserbase.com.
Bright Data (SaaS) — enterprise proxy + scraping stack. Web MCP server (free tier: 5,000 requests/month) exposes Web Unlocker, SERP API, and Scraping Browser directly to MCP-compatible agents. Homepage: brightdata.com.
ScrapingBee (SaaS) — headless browser scraping API; handles JS rendering and proxy rotation. Acquired by Oxylabs in 2025; operates as an independent brand. Homepage: scrapingbee.com.
Job 3: Search APIs
Tavily (SaaS) — agent-native search API: Search, Extract, Map, and Crawl endpoints. Returns structured results optimized for RAG. Sub-200ms p50 latency; 100M+ monthly requests. Joined Nebius (AI infrastructure) in February 2026. MCP server available. Docs: docs.tavily.com.
Exa (SaaS) — formerly Metaphor; neural/embedding-based search designed for AI agents. Retrieves pages by semantic meaning, not keyword matching. Raised $85M at $700M valuation (September 2025). APIs: Search, Contents, Answer, Find Similar, Websets. Contents (up to 10 results) included free with each Search call as of March 2026. Docs: exa.ai/docs.
Brave Search API (SaaS) — REST API over Brave's own independent web index (30B+ pages). Does not license from Google or Bing. SOC 2 Type II attested (October 2025). Supplies real-time search data to several major LLMs. Docs: brave.com/search/api.
Serper (SaaS) — fast Google SERP API. Returns real-time Google results (web, news, images, maps) in JSON. ~2.87s latency; $0.30–$1.00/1k queries at scale. 2,500 free queries/month. MCP server available. Homepage: serper.dev.
Perplexity Sonar API (SaaS) — LLM-generated answers with inline web citations. Four model tiers: Sonar, Sonar Pro, Sonar Reasoning, and Deep Research. $14–$22 per 1,000 Pro Search queries. Docs: docs.perplexity.ai.
Built-in provider search tools — all three major providers expose native web-search tools that run server-side (no extra API key needed):
- Anthropic Claude —
web_search_20260209server tool in the Messages API; supports domain filtering,max_usescap, and dynamic result filtering via code execution. $10 per 1,000 searches plus token costs. Docs: platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool. - OpenAI Responses API —
{"type": "web_search"}built-in tool; supportsexternal_web_access, filters, andreturn_token_budgetcontrols. Docs: platform.openai.com/docs/guides/tools-web-search. - Google Gemini —
google_searchgrounding tool; can be combined with custom function calling in a single API call. Docs: ai.google.dev/gemini-api/docs/google-search.
The agent angle
Several services expose MCP servers (Jina, Firecrawl, Bright Data, Tavily, Serper), letting any MCP-compatible agent call web-data tools without custom integration. Check each provider's MCP docs or the registry at registry.modelcontextprotocol.io.
Clean Markdown is the standard interchange between web-data tools and agent context. Prefer it over raw HTML to minimize token cost.
When your agent is the crawler, respect robots.txt and AI-crawler policies: see /resources/ai-crawler-policy.
Treat all fetched web content as untrusted — prompt injection is a real attack surface. See /resources/agentic-security-checklist, sections 1 and 4.
Web data that feeds a retrieval system connects to the RAG layer: see /resources/rag-retrieval-for-agents.
Practical guidance
- Prefer reader/Markdown over raw HTML — token cost difference is often 5–10x.
- Cache aggressively — a 1-hour TTL covers most agent use cases and cuts cost and latency substantially.
- JS-rendered vs static — static pages work with lightweight extractors (trafilatura, Readability). JS-heavy sites require a headless browser (Playwright, Browserbase, Bright Data Scraping Browser).
- Rate-limit and identify your crawler honestly — set a recognizable
User-Agentwith a contact URL; back off on 429; honorCrawl-delayin robots.txt. - For scale — managed services (Firecrawl, Apify, Bright Data) handle proxy rotation and anti-bot. Self-hosted stacks (Crawlee + Playwright) give more control at higher ops cost.
Verified sources
- Jina Reader API: https://jina.ai/reader/
- Jina Reader GitHub (jina-ai/reader): https://github.com/jina-ai/reader
- Firecrawl homepage: https://www.firecrawl.dev/
- Firecrawl GitHub (mendableai/firecrawl, AGPL-3.0): https://github.com/mendableai/firecrawl
- Crawlee GitHub (apify/crawlee, Apache 2.0): https://github.com/apify/crawlee
- Crawlee Python GitHub: https://github.com/apify/crawlee-python
- Apify platform: https://apify.com/
- Trafilatura docs: https://trafilatura.readthedocs.io/
- Mozilla Readability GitHub: https://github.com/mozilla/readability
- Bright Data Web MCP blog: https://brightdata.com/blog/ai/web-scraping-with-mcp
- Tavily docs: https://docs.tavily.com/
- Exa Search API docs: https://exa.ai/docs/reference/search-api-guide
- Brave Search API: https://brave.com/search/api/
- Brave Search API growth announcement: https://brave.com/blog/search-api-growth/
- Perplexity Sonar API docs: https://docs.perplexity.ai/
- Anthropic web_search tool docs: https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool
- OpenAI web search (Responses API): https://platform.openai.com/docs/guides/tools-web-search
- Google Gemini grounding with Search: https://ai.google.dev/gemini-api/docs/google-search