ChangeGamer

← All resources

Web Data and Scraping for Agents

Reference · updated 2026-06-16 · Markdown variant

Tool landscape for agent web-data pipelines: reader/URL-to-Markdown APIs, crawl/scrape services, and search APIs — with MCP exposure, OSS/SaaS classification, and practical guidance.


Agents cannot use raw HTML efficiently: it bloats the context window, embeds navigation noise, and costs 3–10x more tokens than clean Markdown of the same content. A web-data layer transforms live web content into agent-consumable form. Three distinct jobs require different tools.

Three jobs — pick the right tool for each

Job What you need Tools
1. Read a known URL → clean text URL-to-Markdown conversion Jina Reader, Firecrawl /scrape, trafilatura, Mozilla Readability
2. Crawl a site / scrape at scale Multi-URL crawl + JS rendering + anti-bot Firecrawl /crawl, Apify, Crawlee, Browserbase, Bright Data, ScrapingBee
3. Search the web Search query → ranked URLs + snippets or answers Tavily, Exa, Brave Search API, Serper, Perplexity Sonar API, built-in provider tools

Job 1: Reader / URL-to-Markdown

Jina Reader (SaaS + OSS) — prefix any URL with https://r.jina.ai/ and receive clean Markdown optimized for LLMs. No key required for basic usage; optional API key for higher rate limits. The extraction model is ReaderLM-v2 (1.5B). Supports PDF and MS Office documents via direct POST. Free tier available; OSS branch at github.com/jina-ai/reader. Exposes an MCP server via the Jina AI MCP (smithery.ai registry).

Firecrawl /scrape (SaaS + AGPL self-host) — one-URL scrape endpoint returning Markdown, HTML, or structured JSON. Handles JS-rendered pages, proxy rotation, and CAPTCHA. Free tier (500 credits/month); paid from $19/month. GitHub: github.com/mendableai/firecrawl. Exposes an official MCP server.

Self-hosted options (OSS):

When to self-host vs use a service: self-hosted options are free and private but require infrastructure and cannot solve CAPTCHAs. Services handle anti-bot at scale out of the box.

Job 2: Crawl / scrape at scale

Firecrawl /crawl (SaaS + AGPL self-host) — crawls an entire site and returns all pages as Markdown. Same service as the /scrape endpoint; the /crawl endpoint accepts a root URL and traverses all sub-URLs. Handles JS rendering, rate limiting, and proxy rotation automatically.

Apify (SaaS) — managed cloud platform with 30,000+ community-built Actors (preconfigured scrapers for common targets) plus a proxy network and storage layer. Actors run serverlessly; pricing is pay-per-compute-unit. Homepage: apify.com.

Crawlee (Apache 2.0, OSS by Apify) — open-source TypeScript/JavaScript (and Python) web-scraping library. Supports Cheerio, JSDOM, Playwright, and Puppeteer crawlers with auto proxy rotation, fingerprinting, and autoscaling. Can run locally or deploy to Apify. Python port stable since September 2025. GitHub: github.com/apify/crawlee.

Browserbase (SaaS) — managed cloud headless browsers (Playwright/Puppeteer API) optimized for AI agents. Handles CAPTCHA, stealth, and session recording. Priced per session. Homepage: browserbase.com.

Bright Data (SaaS) — enterprise proxy + scraping stack. Web MCP server (free tier: 5,000 requests/month) exposes Web Unlocker, SERP API, and Scraping Browser directly to MCP-compatible agents. Homepage: brightdata.com.

ScrapingBee (SaaS) — headless browser scraping API; handles JS rendering and proxy rotation. Acquired by Oxylabs in 2025; operates as an independent brand. Homepage: scrapingbee.com.

Job 3: Search APIs

Tavily (SaaS) — agent-native search API: Search, Extract, Map, and Crawl endpoints. Returns structured results optimized for RAG. Sub-200ms p50 latency; 100M+ monthly requests. Joined Nebius (AI infrastructure) in February 2026. MCP server available. Docs: docs.tavily.com.

Exa (SaaS) — formerly Metaphor; neural/embedding-based search designed for AI agents. Retrieves pages by semantic meaning, not keyword matching. Raised $85M at $700M valuation (September 2025). APIs: Search, Contents, Answer, Find Similar, Websets. Contents (up to 10 results) included free with each Search call as of March 2026. Docs: exa.ai/docs.

Brave Search API (SaaS) — REST API over Brave's own independent web index (30B+ pages). Does not license from Google or Bing. SOC 2 Type II attested (October 2025). Supplies real-time search data to several major LLMs. Docs: brave.com/search/api.

Serper (SaaS) — fast Google SERP API. Returns real-time Google results (web, news, images, maps) in JSON. ~2.87s latency; $0.30–$1.00/1k queries at scale. 2,500 free queries/month. MCP server available. Homepage: serper.dev.

Perplexity Sonar API (SaaS) — LLM-generated answers with inline web citations. Four model tiers: Sonar, Sonar Pro, Sonar Reasoning, and Deep Research. $14–$22 per 1,000 Pro Search queries. Docs: docs.perplexity.ai.

Built-in provider search tools — all three major providers expose native web-search tools that run server-side (no extra API key needed):

The agent angle

Several services expose MCP servers (Jina, Firecrawl, Bright Data, Tavily, Serper), letting any MCP-compatible agent call web-data tools without custom integration. Check each provider's MCP docs or the registry at registry.modelcontextprotocol.io.

Clean Markdown is the standard interchange between web-data tools and agent context. Prefer it over raw HTML to minimize token cost.

When your agent is the crawler, respect robots.txt and AI-crawler policies: see /resources/ai-crawler-policy.

Treat all fetched web content as untrusted — prompt injection is a real attack surface. See /resources/agentic-security-checklist, sections 1 and 4.

Web data that feeds a retrieval system connects to the RAG layer: see /resources/rag-retrieval-for-agents.

Practical guidance

Verified sources

#web-scraping #crawling #search-api #markdown #rag #agents #mcp #tools

Category: Reference