# Web Data and Scraping for Agents

> Tool landscape for agent web-data pipelines: reader/URL-to-Markdown APIs, crawl/scrape services, and search APIs — with MCP exposure, OSS/SaaS classification, and practical guidance.

Category: Reference · Updated: 2026-06-16 · Tags: web-scraping, crawling, search-api, markdown, rag, agents, mcp, tools
Canonical: https://changegamer.ai/resources/web-data-for-agents

Agents cannot use raw HTML efficiently: it bloats the context window, embeds navigation noise, and costs 3–10x more tokens than clean Markdown of the same content. A web-data layer transforms live web content into agent-consumable form. Three distinct jobs require different tools.

## Three jobs — pick the right tool for each

| Job | What you need | Tools |
|---|---|---|
| 1. Read a known URL → clean text | URL-to-Markdown conversion | Jina Reader, Firecrawl /scrape, trafilatura, Mozilla Readability |
| 2. Crawl a site / scrape at scale | Multi-URL crawl + JS rendering + anti-bot | Firecrawl /crawl, Apify, Crawlee, Browserbase, Bright Data, ScrapingBee |
| 3. Search the web | Search query → ranked URLs + snippets or answers | Tavily, Exa, Brave Search API, Serper, Perplexity Sonar API, built-in provider tools |

## Job 1: Reader / URL-to-Markdown

**Jina Reader** (SaaS + OSS) — prefix any URL with `https://r.jina.ai/` and receive clean Markdown optimized for LLMs. No key required for basic usage; optional API key for higher rate limits. The extraction model is ReaderLM-v2 (1.5B). Supports PDF and MS Office documents via direct POST. Free tier available; OSS branch at github.com/jina-ai/reader. Exposes an MCP server via the Jina AI MCP (smithery.ai registry).

**Firecrawl /scrape** (SaaS + AGPL self-host) — one-URL scrape endpoint returning Markdown, HTML, or structured JSON. Handles JS-rendered pages, proxy rotation, and CAPTCHA. Free tier (500 credits/month); paid from $19/month. GitHub: github.com/mendableai/firecrawl. Exposes an official MCP server.

**Self-hosted options (OSS):**

- *trafilatura* (Apache 2.0, Python) — extracts main text and metadata from HTML with high accuracy; outputs TXT, Markdown, CSV, JSON, or TEI-XML. Command-line and library. Used by HuggingFace, IBM, and Microsoft Research. Docs: trafilatura.readthedocs.io.
- *Mozilla Readability* (Apache 2.0, JavaScript) — the parser behind Firefox Reader Mode; strips nav/ads and returns article DOM. GitHub: github.com/mozilla/readability. Pair with Playwright or Puppeteer for JS-rendered pages.

**When to self-host vs use a service:** self-hosted options are free and private but require infrastructure and cannot solve CAPTCHAs. Services handle anti-bot at scale out of the box.

## Job 2: Crawl / scrape at scale

**Firecrawl /crawl** (SaaS + AGPL self-host) — crawls an entire site and returns all pages as Markdown. Same service as the /scrape endpoint; the /crawl endpoint accepts a root URL and traverses all sub-URLs. Handles JS rendering, rate limiting, and proxy rotation automatically.

**Apify** (SaaS) — managed cloud platform with 30,000+ community-built Actors (preconfigured scrapers for common targets) plus a proxy network and storage layer. Actors run serverlessly; pricing is pay-per-compute-unit. Homepage: apify.com.

**Crawlee** (Apache 2.0, OSS by Apify) — open-source TypeScript/JavaScript (and Python) web-scraping library. Supports Cheerio, JSDOM, Playwright, and Puppeteer crawlers with auto proxy rotation, fingerprinting, and autoscaling. Can run locally or deploy to Apify. Python port stable since September 2025. GitHub: github.com/apify/crawlee.

**Browserbase** (SaaS) — managed cloud headless browsers (Playwright/Puppeteer API) optimized for AI agents. Handles CAPTCHA, stealth, and session recording. Priced per session. Homepage: browserbase.com.

**Bright Data** (SaaS) — enterprise proxy + scraping stack. Web MCP server (free tier: 5,000 requests/month) exposes Web Unlocker, SERP API, and Scraping Browser directly to MCP-compatible agents. Homepage: brightdata.com.

**ScrapingBee** (SaaS) — headless browser scraping API; handles JS rendering and proxy rotation. Acquired by Oxylabs in 2025; operates as an independent brand. Homepage: scrapingbee.com.

## Job 3: Search APIs

**Tavily** (SaaS) — agent-native search API: Search, Extract, Map, and Crawl endpoints. Returns structured results optimized for RAG. Sub-200ms p50 latency; 100M+ monthly requests. Joined Nebius (AI infrastructure) in February 2026. MCP server available. Docs: docs.tavily.com.

**Exa** (SaaS) — formerly Metaphor; neural/embedding-based search designed for AI agents. Retrieves pages by semantic meaning, not keyword matching. Raised $85M at $700M valuation (September 2025). APIs: Search, Contents, Answer, Find Similar, Websets. Contents (up to 10 results) included free with each Search call as of March 2026. Docs: exa.ai/docs.

**Brave Search API** (SaaS) — REST API over Brave's own independent web index (30B+ pages). Does not license from Google or Bing. SOC 2 Type II attested (October 2025). Supplies real-time search data to several major LLMs. Docs: brave.com/search/api.

**Serper** (SaaS) — fast Google SERP API. Returns real-time Google results (web, news, images, maps) in JSON. ~2.87s latency; $0.30–$1.00/1k queries at scale. 2,500 free queries/month. MCP server available. Homepage: serper.dev.

**Perplexity Sonar API** (SaaS) — LLM-generated answers with inline web citations. Four model tiers: Sonar, Sonar Pro, Sonar Reasoning, and Deep Research. $14–$22 per 1,000 Pro Search queries. Docs: docs.perplexity.ai.

**Built-in provider search tools** — all three major providers expose native web-search tools that run server-side (no extra API key needed):

- *Anthropic Claude* — `web_search_20260209` server tool in the Messages API; supports domain filtering, `max_uses` cap, and dynamic result filtering via code execution. $10 per 1,000 searches plus token costs. Docs: platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool.
- *OpenAI Responses API* — `{"type": "web_search"}` built-in tool; supports `external_web_access`, filters, and `return_token_budget` controls. Docs: platform.openai.com/docs/guides/tools-web-search.
- *Google Gemini* — `google_search` grounding tool; can be combined with custom function calling in a single API call. Docs: ai.google.dev/gemini-api/docs/google-search.

## The agent angle

Several services expose **MCP servers** (Jina, Firecrawl, Bright Data, Tavily, Serper), letting any MCP-compatible agent call web-data tools without custom integration. Check each provider's MCP docs or the registry at registry.modelcontextprotocol.io.

Clean **Markdown is the standard interchange** between web-data tools and agent context. Prefer it over raw HTML to minimize token cost.

When your agent is the crawler, respect robots.txt and AI-crawler policies: see /resources/ai-crawler-policy.

Treat all fetched web content as untrusted — prompt injection is a real attack surface. See /resources/agentic-security-checklist, sections 1 and 4.

Web data that feeds a retrieval system connects to the RAG layer: see /resources/rag-retrieval-for-agents.

## Practical guidance

- **Prefer reader/Markdown over raw HTML** — token cost difference is often 5–10x.
- **Cache aggressively** — a 1-hour TTL covers most agent use cases and cuts cost and latency substantially.
- **JS-rendered vs static** — static pages work with lightweight extractors (trafilatura, Readability). JS-heavy sites require a headless browser (Playwright, Browserbase, Bright Data Scraping Browser).
- **Rate-limit and identify your crawler honestly** — set a recognizable `User-Agent` with a contact URL; back off on 429; honor `Crawl-delay` in robots.txt.
- **For scale** — managed services (Firecrawl, Apify, Bright Data) handle proxy rotation and anti-bot. Self-hosted stacks (Crawlee + Playwright) give more control at higher ops cost.

## Verified sources

- Jina Reader API: https://jina.ai/reader/
- Jina Reader GitHub (jina-ai/reader): https://github.com/jina-ai/reader
- Firecrawl homepage: https://www.firecrawl.dev/
- Firecrawl GitHub (mendableai/firecrawl, AGPL-3.0): https://github.com/mendableai/firecrawl
- Crawlee GitHub (apify/crawlee, Apache 2.0): https://github.com/apify/crawlee
- Crawlee Python GitHub: https://github.com/apify/crawlee-python
- Apify platform: https://apify.com/
- Trafilatura docs: https://trafilatura.readthedocs.io/
- Mozilla Readability GitHub: https://github.com/mozilla/readability
- Bright Data Web MCP blog: https://brightdata.com/blog/ai/web-scraping-with-mcp
- Tavily docs: https://docs.tavily.com/
- Exa Search API docs: https://exa.ai/docs/reference/search-api-guide
- Brave Search API: https://brave.com/search/api/
- Brave Search API growth announcement: https://brave.com/blog/search-api-growth/
- Perplexity Sonar API docs: https://docs.perplexity.ai/
- Anthropic web_search tool docs: https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool
- OpenAI web search (Responses API): https://platform.openai.com/docs/guides/tools-web-search
- Google Gemini grounding with Search: https://ai.google.dev/gemini-api/docs/google-search
