Web Data and Scraping for Agents

Reference · updated 2026-06-16 · Markdown variant

Tool landscape for agent web-data pipelines: reader/URL-to-Markdown APIs, crawl/scrape services, and search APIs — with MCP exposure, OSS/SaaS classification, and practical guidance.

Agents cannot use raw HTML efficiently: it bloats the context window, embeds navigation noise, and costs 3–10x more tokens than clean Markdown of the same content. A web-data layer transforms live web content into agent-consumable form. Three distinct jobs require different tools.

Three jobs — pick the right tool for each

Job	What you need	Tools
1. Read a known URL → clean text	URL-to-Markdown conversion	Jina Reader, Firecrawl /scrape, trafilatura, Mozilla Readability
2. Crawl a site / scrape at scale	Multi-URL crawl + JS rendering + anti-bot	Firecrawl /crawl, Apify, Crawlee, Browserbase, Bright Data, ScrapingBee
3. Search the web	Search query → ranked URLs + snippets or answers	Tavily, Exa, Brave Search API, Serper, Perplexity Sonar API, built-in provider tools

Job 1: Reader / URL-to-Markdown

Jina Reader (SaaS + OSS) — prefix any URL with https://r.jina.ai/ and receive clean Markdown optimized for LLMs. No key required for basic usage; optional API key for higher rate limits. The extraction model is ReaderLM-v2 (1.5B). Supports PDF and MS Office documents via direct POST. Free tier available; OSS branch at github.com/jina-ai/reader. Exposes an MCP server via the Jina AI MCP (smithery.ai registry).

Firecrawl /scrape (SaaS + AGPL self-host) — one-URL scrape endpoint returning Markdown, HTML, or structured JSON. Handles JS-rendered pages, proxy rotation, and CAPTCHA. Free tier (500 credits/month); paid from $19/month. GitHub: github.com/mendableai/firecrawl. Exposes an official MCP server.

Self-hosted options (OSS):

trafilatura (Apache 2.0, Python) — extracts main text and metadata from HTML with high accuracy; outputs TXT, Markdown, CSV, JSON, or TEI-XML. Command-line and library. Used by HuggingFace, IBM, and Microsoft Research. Docs: trafilatura.readthedocs.io.
Mozilla Readability (Apache 2.0, JavaScript) — the parser behind Firefox Reader Mode; strips nav/ads and returns article DOM. GitHub: github.com/mozilla/readability. Pair with Playwright or Puppeteer for JS-rendered pages.

When to self-host vs use a service: self-hosted options are free and private but require infrastructure and cannot solve CAPTCHAs. Services handle anti-bot at scale out of the box.

Job 2: Crawl / scrape at scale

Firecrawl /crawl (SaaS + AGPL self-host) — crawls an entire site and returns all pages as Markdown. Same service as the /scrape endpoint; the /crawl endpoint accepts a root URL and traverses all sub-URLs. Handles JS rendering, rate limiting, and proxy rotation automatically.

Apify (SaaS) — managed cloud platform with 30,000+ community-built Actors (preconfigured scrapers for common targets) plus a proxy network and storage layer. Actors run serverlessly; pricing is pay-per-compute-unit. Homepage: apify.com.

Crawlee (Apache 2.0, OSS by Apify) — open-source TypeScript/JavaScript (and Python) web-scraping library. Supports Cheerio, JSDOM, Playwright, and Puppeteer crawlers with auto proxy rotation, fingerprinting, and autoscaling. Can run locally or deploy to Apify. Python port stable since September 2025. GitHub: github.com/apify/crawlee.

Browserbase (SaaS) — managed cloud headless browsers (Playwright/Puppeteer API) optimized for AI agents. Handles CAPTCHA, stealth, and session recording. Priced per session. Homepage: browserbase.com.

Bright Data (SaaS) — enterprise proxy + scraping stack. Web MCP server (free tier: 5,000 requests/month) exposes Web Unlocker, SERP API, and Scraping Browser directly to MCP-compatible agents. Homepage: brightdata.com.

ScrapingBee (SaaS) — headless browser scraping API; handles JS rendering and proxy rotation. Acquired by Oxylabs in 2025; operates as an independent brand. Homepage: scrapingbee.com.

Job 3: Search APIs

Tavily (SaaS) — agent-native search API: Search, Extract, Map, and Crawl endpoints. Returns structured results optimized for RAG. Sub-200ms p50 latency; 100M+ monthly requests. Joined Nebius (AI infrastructure) in February 2026. MCP server available. Docs: docs.tavily.com.

Exa (SaaS) — formerly Metaphor; neural/embedding-based search designed for AI agents. Retrieves pages by semantic meaning, not keyword matching. Raised $85M at $700M valuation (September 2025). APIs: Search, Contents, Answer, Find Similar, Websets. Contents (up to 10 results) included free with each Search call as of March 2026. Docs: exa.ai/docs.

Brave Search API (SaaS) — REST API over Brave's own independent web index (30B+ pages). Does not license from Google or Bing. SOC 2 Type II attested (October 2025). Supplies real-time search data to several major LLMs. Docs: brave.com/search/api.

Serper (SaaS) — fast Google SERP API. Returns real-time Google results (web, news, images, maps) in JSON. ~2.87s latency; $0.30–$1.00/1k queries at scale. 2,500 free queries/month. MCP server available. Homepage: serper.dev.

Perplexity Sonar API (SaaS) — LLM-generated answers with inline web citations. Four model tiers: Sonar, Sonar Pro, Sonar Reasoning, and Deep Research. $14–$22 per 1,000 Pro Search queries. Docs: docs.perplexity.ai.

Built-in provider search tools — all three major providers expose native web-search tools that run server-side (no extra API key needed):

Anthropic Claude — web_search_20260209 server tool in the Messages API; supports domain filtering, max_uses cap, and dynamic result filtering via code execution. $10 per 1,000 searches plus token costs. Docs: platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool.
OpenAI Responses API — {"type": "web_search"} built-in tool; supports external_web_access, filters, and return_token_budget controls. Docs: platform.openai.com/docs/guides/tools-web-search.
Google Gemini — google_search grounding tool; can be combined with custom function calling in a single API call. Docs: ai.google.dev/gemini-api/docs/google-search.

The agent angle

Several services expose MCP servers (Jina, Firecrawl, Bright Data, Tavily, Serper), letting any MCP-compatible agent call web-data tools without custom integration. Check each provider's MCP docs or the registry at registry.modelcontextprotocol.io.

Clean Markdown is the standard interchange between web-data tools and agent context. Prefer it over raw HTML to minimize token cost.

When your agent is the crawler, respect robots.txt and AI-crawler policies: see /resources/ai-crawler-policy.

Treat all fetched web content as untrusted — prompt injection is a real attack surface. See /resources/agentic-security-checklist, sections 1 and 4.

Web data that feeds a retrieval system connects to the RAG layer: see /resources/rag-retrieval-for-agents.

Practical guidance

Prefer reader/Markdown over raw HTML — token cost difference is often 5–10x.
Cache aggressively — a 1-hour TTL covers most agent use cases and cuts cost and latency substantially.
JS-rendered vs static — static pages work with lightweight extractors (trafilatura, Readability). JS-heavy sites require a headless browser (Playwright, Browserbase, Bright Data Scraping Browser).
Rate-limit and identify your crawler honestly — set a recognizable User-Agent with a contact URL; back off on 429; honor Crawl-delay in robots.txt.
For scale — managed services (Firecrawl, Apify, Bright Data) handle proxy rotation and anti-bot. Self-hosted stacks (Crawlee + Playwright) give more control at higher ops cost.

Verified sources

Jina Reader API: https://jina.ai/reader/
Jina Reader GitHub (jina-ai/reader): https://github.com/jina-ai/reader
Firecrawl homepage: https://www.firecrawl.dev/
Firecrawl GitHub (mendableai/firecrawl, AGPL-3.0): https://github.com/mendableai/firecrawl
Crawlee GitHub (apify/crawlee, Apache 2.0): https://github.com/apify/crawlee
Crawlee Python GitHub: https://github.com/apify/crawlee-python
Apify platform: https://apify.com/
Trafilatura docs: https://trafilatura.readthedocs.io/
Mozilla Readability GitHub: https://github.com/mozilla/readability
Bright Data Web MCP blog: https://brightdata.com/blog/ai/web-scraping-with-mcp
Tavily docs: https://docs.tavily.com/
Exa Search API docs: https://exa.ai/docs/reference/search-api-guide
Brave Search API: https://brave.com/search/api/
Brave Search API growth announcement: https://brave.com/blog/search-api-growth/
Perplexity Sonar API docs: https://docs.perplexity.ai/
Anthropic web_search tool docs: https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool
OpenAI web search (Responses API): https://platform.openai.com/docs/guides/tools-web-search
Google Gemini grounding with Search: https://ai.google.dev/gemini-api/docs/google-search

#web-scraping #crawling #search-api #markdown #rag #agents #mcp #tools

Category: Reference