AI Crawler Policy: robots.txt and User-Agents

Reference · updated 2026-06-15 · Markdown variant

Canonical reference table of major AI crawler user-agent tokens, their purpose, robots.txt semantics, and the WAF/edge layer that sits above robots.txt — written from real operator experience blocking and then re-allowing AI crawlers at the Cloudflare edge.

robots.txt is advisory: a compliant crawler reads it before fetching and honours Disallow rules, but nothing forces compliance. WAF and firewall rules, by contrast, are enforced at the network edge on every request, including the request for robots.txt itself. If your WAF blocks a UA, robots.txt Allow rules have no effect, because the crawler never gets to read them. This page covers both layers.

Canonical user-agent token table

Token	Vendor	Purpose	robots.txt honoured?
`GPTBot`	OpenAI	Training data collection for GPT models	Yes
`OAI-SearchBot`	OpenAI	Indexing for ChatGPT Search (not training)	Yes
`ChatGPT-User`	OpenAI	Live user-triggered page fetch	Advisory only — may ignore Disallow
`ClaudeBot`	Anthropic	Training data collection for Claude models	Yes
`Claude-SearchBot`	Anthropic	Indexing for Claude search results	Yes
`Claude-User`	Anthropic	Live user-triggered page fetch	Yes (Anthropic states all three honour it)
`Google-Extended`	Google	Training opt-out token for Gemini/Vertex AI — NOT a separate crawler; Googlebot fetches, this token controls downstream use	Yes (training opt-out only)
`Googlebot`	Google	Google Search indexing; its fetches also supply the content that the Google-Extended token governs	Yes
`Google-CloudVertexBot`	Google	Crawls at site-owner request during Vertex AI Agent development	Yes
`PerplexityBot`	Perplexity	Indexing for Perplexity search answers	Yes
`Perplexity-User`	Perplexity	Live user-triggered page fetch	No — ignores robots.txt by design
`Amazonbot`	Amazon	Crawling for Amazon product/AI improvement	Yes
`Applebot`	Apple	Apple Search (Spotlight, Siri) indexing	Yes
`Applebot-Extended`	Apple	Training opt-out token for Apple Intelligence / foundation models — NOT a separate crawler; Applebot fetches, this token controls training use	Yes (training opt-out only)
`Bytespider`	ByteDance	AI training data collection (Doubao LLM)	Disputed — documented violations
`CCBot`	Common Crawl	Open web archive used to train most major LLMs	Yes
`Meta-ExternalAgent`	Meta	Training data for Llama models and Meta AI products (launched July 2024)	Stated yes; compliance disputed
`MistralAI-User`	Mistral	User-triggered fetch in Le Chat; not used for training	Yes

robots.txt syntax — per-UA examples

# Allow ChatGPT Search indexing; block training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

# Block Anthropic training; allow user fetches and search
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Keep Google Search; block AI training use
User-agent: Google-Extended
Disallow: /

# Block Apple Intelligence training; keep Apple Search
User-agent: Applebot-Extended
Disallow: /

# Block Common Crawl (source data for most LLMs)
User-agent: CCBot
Disallow: /
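
Rules like the ones above can be sanity-checked offline before they are deployed. The following is a minimal sketch using Python's standard-library `urllib.robotparser`, with a subset of the rules pasted in as a string. The helper name `allowed_tokens` is illustrative, and the standard-library parser only approximates how each vendor's crawler actually interprets the file.

```python
from urllib.robotparser import RobotFileParser

# A subset of the per-UA rules shown above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: CCBot
Disallow: /
"""

def allowed_tokens(robots_txt: str, tokens: list[str], path: str = "/") -> dict[str, bool]:
    """Return {token: may_fetch} for each user-agent token against one path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in tokens}

result = allowed_tokens(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "CCBot"],
)
for token, ok in result.items():
    print(f"{token:18} {'allowed' if ok else 'blocked'}")
```

This only confirms what the file says. It does not confirm that a crawler can reach the file, which is the concern of the WAF/edge section further down.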

The advisory-only limit

robots.txt binds only crawlers that voluntarily read and respect it. Practical gaps:

User-triggered fetchers (ChatGPT-User, Perplexity-User) fetch a specific URL on behalf of a live user who requested it. Perplexity explicitly states that Perplexity-User ignores robots.txt. OpenAI says ChatGPT-User "may not follow" it.
Crawlers with contested compliance (Bytespider, Meta-ExternalAgent) have documented or disputed histories of ignoring Disallow. For these, IP-range blocking and WAF rules provide enforcement that does not depend on the crawler's cooperation.
Spoofed UAs: any client can send a fake User-Agent header, so the UA string alone proves nothing. Verifying the source IP against vendor-published ranges (or via reverse DNS where the vendor documents it) is the only reliable way to confirm a crawler is authentic. All major vendors publish IP ranges (see verified sources below).
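
The IP-range check described in the last item can be sketched with Python's standard-library `ipaddress` module. The CIDR blocks below are placeholders taken from the RFC 5737 documentation ranges, not real vendor ranges, and both `HYPOTHETICAL_VENDOR_RANGES` and `is_authentic` are illustrative names. A real deployment would load the lists each vendor publishes.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges), NOT real vendor
# ranges. In practice, load the list that each vendor publishes.
HYPOTHETICAL_VENDOR_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/25"],
}

def is_authentic(token: str, remote_ip: str, ranges=HYPOTHETICAL_VENDOR_RANGES) -> bool:
    """True only if remote_ip falls inside a range listed for the claimed token."""
    address = ipaddress.ip_address(remote_ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in ranges.get(token, [])
    )

print(is_authentic("GPTBot", "192.0.2.17"))   # inside the placeholder range
print(is_authentic("GPTBot", "203.0.113.9"))  # outside: treat the UA as spoofed
```

A token with no listed ranges is never treated as authentic, which is the safer default when a vendor's list is unavailable.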

Training opt-out tokens vs real crawlers

Two tokens are semantic policy signals, not user agents of separate crawlers:

Google-Extended — The physical crawler is still Googlebot. Disallowing Google-Extended tells Google not to use already-crawled content to train Gemini and Vertex AI. It does not affect Google Search inclusion or ranking.
Applebot-Extended — The physical crawler is Applebot. Disallowing Applebot-Extended tells Apple not to use already-crawled content to train Apple Intelligence and foundation models. Apple Search / Spotlight inclusion is unaffected.
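
The separation between the opt-out tokens and the fetching crawlers can be illustrated at the robots.txt group level. This is a sketch using Python's standard-library `urllib.robotparser`. It only shows that a group for an `-Extended` token does not match the search crawler's token, and it says nothing about how Google or Apple apply the signal internally.

```python
from urllib.robotparser import RobotFileParser

# Only the training opt-out tokens are disallowed; no group names the
# search crawlers themselves.
OPT_OUT_ONLY = """\
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(OPT_OUT_ONLY.splitlines())

checks = {
    token: parser.can_fetch(token, "/")
    for token in ("Googlebot", "Google-Extended", "Applebot", "Applebot-Extended")
}
for token, ok in checks.items():
    print(f"{token:18} {'allowed' if ok else 'disallowed'}")
```

Googlebot and Applebot remain allowed because no group names them, which matches the statement above that search inclusion is unaffected.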

The WAF/edge layer: sits above robots.txt

ChangeGamer's own experience (BACKLOG item 0, June 2026): Cloudflare's managed rule "Manage AI bots" (firewallManaged) was silently returning 403 to GPTBot, ChatGPT-User, OAI-SearchBot, PerplexityBot, CCBot, and Google-CloudVertexBot, including on / and /sitemap.xml, even though robots.txt explicitly allowed them. In addition, Cloudflare Browser Integrity Check (BIC, enabled by default) returned 403 to any client without standard browser headers, which broke the Google Search Console sitemap fetch.

Fixes: BIC turned off; AI Crawl Control set to Allow for all crawlers; and a WAF custom rule, "Allow AI crawlers" (action: Skip all managed rules; match: user agent; logging on), so that managed rules cannot re-block them.

Key lesson: after any WAF or security-rule change, verify actual crawler access with a spoofed-UA curl against your live domain — do not assume robots.txt Allow is sufficient:

# -A sets the User-Agent header; -I sends a HEAD request and prints only the response headers.
curl -A "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)" \
  -I https://yourdomain.com/

curl -A "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://anthropic.com/aup)" \
  -I https://yourdomain.com/

Expect HTTP 200. robots.txt cannot produce a 403 (it is only a file that crawlers choose to read), so a 403 means the WAF or BIC is blocking the request at the edge.
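
The same check can be scripted so it runs after every rule change. The following is a minimal sketch using only Python's standard library. The function names `probe` and `interpret` are illustrative, and the mapping from status code to cause simply restates the rule of thumb above rather than any Cloudflare-documented behaviour.

```python
import urllib.error
import urllib.request

def probe(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return error.code

def interpret(status: int) -> str:
    """Map a status code to the layer most likely responsible."""
    if status == 200:
        return "reachable"
    if status == 403:
        return "blocked at the edge (WAF or BIC), not by robots.txt"
    return f"unexpected status {status}: inspect manually"
```

For example, `interpret(probe("https://yourdomain.com/", "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"))` reports the result for the GPTBot string used in the curl command. Running it for `/`, `/robots.txt`, and `/sitemap.xml` covers the paths that were blocked in the incident described above.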

Emerging access-control and monetization signals

robots.txt is the established baseline, but newer mechanisms layer on top:

RSL (Really Simple Licensing) — an XML-based machine-readable license document (a standalone .xml file, e.g. /license.xml) that declares licensing terms, usage boundaries, and compensation requirements. It is discovered via a License: directive in robots.txt (and HTTP headers, RSS, or HTML <link>). Spec at rslstandard.org. Announced 2025; early adoption stage as of June 2026. ChangeGamer publishes its own at /license.xml.
HTTP 402 / pay-per-crawl — direct programmatic payment gate on individual resource requests. See /resources/paying-for-access-402 and /resources/access-and-pricing.
Cloudflare Pay Per Crawl — Cloudflare's 402-based per-crawl pricing at the CDN layer (private beta as of June 2026). See /resources/access-and-pricing for current status.
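
To illustrate the RSL discovery step mentioned above, here is a minimal sketch that extracts `License:` directive values from a robots.txt body. It assumes the directive is a simple `License: <url>` line and treats the key case-insensitively. Both are assumptions made for illustration, so consult the RSL spec for the authoritative grammar. The function name `license_urls` and the example.com URL are hypothetical.

```python
def license_urls(robots_txt: str) -> list[str]:
    """Collect the value of every 'License:' line in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments. Note this would also cut a URL fragment.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "license" and value.strip():
            urls.append(value.strip())
    return urls

# Hypothetical robots.txt body for illustration only.
SAMPLE = """\
User-agent: *
Allow: /

License: https://example.com/license.xml
"""

print(license_urls(SAMPLE))
```

An agent that finds such a URL would then fetch and parse the XML license document. That step is left out here because it depends on the RSL schema.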

For how agents should respond to a 402 gate, see /resources/paying-for-access-402. For how ChangeGamer publishes its own machine-readable content index, see /resources/llms-txt-explained.

Verified sources

OpenAI crawler overview: https://developers.openai.com/api/docs/bots
OpenAI publishers FAQ: https://help.openai.com/en/articles/12627856-publishers-and-developers-faq
Anthropic crawler support page: https://support.anthropic.com/en/articles/8896518
Google common crawlers: https://developers.google.com/crawling/docs/crawlers-fetchers/google-common-crawlers
Applebot support page: https://support.apple.com/en-us/119829
Perplexity bots guide: https://docs.perplexity.ai/guides/bots
Amazon Amazonbot: https://developer.amazon.com/amazonbot
Common Crawl CCBot: https://commoncrawl.org/ccbot
Mistral AI robots doc: https://docs.mistral.ai/robots
Cloudflare AI Crawl Control bot reference: https://developers.cloudflare.com/ai-crawl-control/reference/bots/
Cloudflare Browser Integrity Check: https://developers.cloudflare.com/waf/tools/browser-integrity-check/
RSL (Really Simple Licensing): https://rslstandard.org/
