AI Crawler Policy: robots.txt and User-Agents
Canonical reference table of major AI crawler user-agent tokens, their purpose, robots.txt semantics, and the WAF/edge layer that sits above robots.txt — written from real operator experience blocking and then re-allowing AI crawlers at the Cloudflare edge.
robots.txt is advisory. A compliant crawler reads it before fetching and honours Disallow rules — but WAF and firewall rules are enforced earlier, at the network edge, before a crawler can even retrieve robots.txt. If your WAF blocks a UA, robots.txt Allow rules have no effect. This page covers both layers.
Canonical user-agent token table
| Token | Vendor | Purpose | robots.txt honoured? |
|---|---|---|---|
| GPTBot | OpenAI | Training data collection for GPT models | Yes |
| OAI-SearchBot | OpenAI | Indexing for ChatGPT Search (not training) | Yes |
| ChatGPT-User | OpenAI | Live user-triggered page fetch | Advisory only; may ignore Disallow |
| ClaudeBot | Anthropic | Training data collection for Claude models | Yes |
| Claude-SearchBot | Anthropic | Indexing for Claude search results | Yes |
| Claude-User | Anthropic | Live user-triggered page fetch | Yes (Anthropic states all three honour it) |
| Google-Extended | Google | Training opt-out token for Gemini/Vertex AI. NOT a separate crawler; Googlebot fetches, this token controls downstream use | Yes (training opt-out only) |
| Googlebot | Google | Google Search indexing; also executes Google-Extended policy | Yes |
| Google-CloudVertexBot | Google | Crawls at site-owner request during Vertex AI Agent development | Yes |
| PerplexityBot | Perplexity | Indexing for Perplexity search answers | Yes |
| Perplexity-User | Perplexity | Live user-triggered page fetch | No; ignores robots.txt by design |
| Amazonbot | Amazon | Crawling for Amazon product/AI improvement | Yes |
| Applebot | Apple | Apple Search (Spotlight, Siri) indexing | Yes |
| Applebot-Extended | Apple | Training opt-out token for Apple Intelligence / foundation models. NOT a separate crawler; Applebot fetches, this token controls training use | Yes (training opt-out only) |
| Bytespider | ByteDance | AI training data collection (Doubao LLM) | Disputed; documented violations |
| CCBot | Common Crawl | Open web archive used to train most major LLMs | Yes |
| Meta-ExternalAgent | Meta | Training data for Llama models and Meta AI products (launched July 2024) | Stated yes; compliance disputed |
| MistralAI-User | Mistral | User-triggered fetch in Le Chat; not used for training | Yes |
robots.txt syntax — per-UA examples
# Allow ChatGPT Search indexing; block training
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
# Block Anthropic training; allow user fetches and search
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
# Keep Google Search; block AI training use
User-agent: Google-Extended
Disallow: /
# Block Apple Intelligence training; keep Apple Search
User-agent: Applebot-Extended
Disallow: /
# Block Common Crawl (source data for most LLMs)
User-agent: CCBot
Disallow: /
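One way to sanity-check rules like these before deploying them is to run them through a parser. The sketch below feeds a subset of the rules above to Python's standard-library urllib.robotparser. The URL https://example.com/ is a placeholder, and the parser's matching is only an approximation of how any given vendor's crawler reads the file, so treat the result as a syntax check rather than proof of crawler behaviour.

```python
from urllib.robotparser import RobotFileParser

# A subset of the per-UA rules shown above.
RULES = """\
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-User
Allow: /
User-agent: Google-Extended
Disallow: /
"""


def allowed_tokens(rules: str, tokens: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent token to whether the rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


result = allowed_tokens(
    RULES,
    ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-User",
     "Google-Extended", "Googlebot"],
    "https://example.com/",  # placeholder URL, not a real target
)
for token, ok in result.items():
    print(f"{token:16} {'allowed' if ok else 'disallowed'}")
```

Googlebot has no group of its own here, so the parser treats it as allowed, which matches the point that disallowing Google-Extended leaves Google Search crawling untouched.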
The advisory-only limit
robots.txt binds only crawlers that voluntarily read and respect it. Practical gaps:
- User-triggered fetchers (ChatGPT-User, Perplexity-User) are sent by live users who requested a specific URL. Perplexity explicitly states Perplexity-User ignores robots.txt. OpenAI says ChatGPT-User "may not follow" it.
- Non-compliant crawlers (Bytespider, Meta-ExternalAgent) have documented or disputed histories of ignoring Disallow. IP-range blocking and WAF rules provide a harder layer.
- Spoofed UAs: any actor can send a fake UA. Vendor IP-range verification is the only way to confirm a crawler is authentic. All major vendors publish IP ranges (see verified sources below).
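The IP-range check for spoofed UAs can be sketched with Python's standard-library ipaddress module. This is a minimal illustration only: the CIDR blocks below are reserved documentation ranges used as placeholders, not any vendor's real ranges, and the function name is hypothetical. In practice the list would be loaded from the vendor's published source.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), NOT real vendor data.
# Replace with the ranges the vendor actually publishes.
HYPOTHETICAL_VENDOR_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_published_ranges(client_ip: str, cidrs: list[str]) -> bool:
    """Return True if client_ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(client_ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if addr.version == network.version and addr in network:
            return True
    return False


# A request claiming to be a vendor crawler is only credible if its source
# IP is inside that vendor's published ranges.
print(ip_in_published_ranges("192.0.2.10", HYPOTHETICAL_VENDOR_RANGES))
print(ip_in_published_ranges("203.0.113.5", HYPOTHETICAL_VENDOR_RANGES))
```

A request whose User-Agent names a vendor crawler but whose source IP fails this check can be treated as spoofed, whatever robots.txt says.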
Training opt-out tokens vs real crawlers
Two tokens are semantic policy signals, not user agents of separate crawlers:
- Google-Extended: The physical crawler is still Googlebot. Disallowing Google-Extended tells Google not to use already-crawled content to train Gemini and Vertex AI. It does not affect Google Search inclusion or ranking.
- Applebot-Extended: The physical crawler is Applebot. Disallowing Applebot-Extended tells Apple not to use already-crawled content to train Apple Intelligence and foundation models. Apple Search / Spotlight inclusion is unaffected.
The WAF/edge layer: sits above robots.txt
ChangeGamer's own experience (BACKLOG item 0, June 2026): Cloudflare's managed rule "Manage AI bots" (firewallManaged) was silently 403ing GPTBot, ChatGPT-User, OAI-SearchBot, PerplexityBot, CCBot, and Google-CloudVertexBot — including on / and /sitemap.xml — even though robots.txt explicitly allowed them. Cloudflare Browser Integrity Check (BIC, enabled by default) additionally 403'd any client without standard browser headers, breaking the Google Search Console sitemap fetch.
Fixes: BIC off; AI Crawl Control set to Allow for all crawlers; WAF custom rule "Allow AI crawlers" (Skip all managed rules, UA-match, logging on) to ensure managed rules cannot re-block them.
Key lesson: after any WAF or security-rule change, verify actual crawler access with a spoofed-UA curl against your live domain — do not assume robots.txt Allow is sufficient:
curl -A "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)" \
-I https://yourdomain.com/
curl -A "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://anthropic.com/aup)" \
-I https://yourdomain.com/
Expect HTTP 200. A 403 means the WAF or BIC is blocking at the edge, not robots.txt.
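When several crawler UAs need checking after each rule change, the same probe can be scripted. The sketch below is one possible approach using only Python's standard library. The function names are illustrative, the UA strings are the ones from the curl commands above, and the status interpretation simply restates the 200 versus 403 rule. Redirects are followed, so the status reported is that of the final response.

```python
import urllib.error
import urllib.request

# User-Agent strings reused from the curl commands above.
PROBE_AGENTS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://anthropic.com/aup)",
}


def head_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is carried on the exception.
        return err.code


def interpret(status: int) -> str:
    """Translate a status code into the diagnosis described above."""
    if status == 200:
        return "reachable"
    if status == 403:
        return "blocked at the edge (WAF or BIC)"
    return f"unexpected status {status}"


def check_all(url: str) -> dict[str, str]:
    """Probe url once per spoofed crawler UA and summarise each result."""
    return {name: interpret(head_status(url, ua)) for name, ua in PROBE_AGENTS.items()}
```

Calling check_all("https://yourdomain.com/") after each WAF or security-rule change gives a per-crawler summary. Because the UA is spoofed, this only exercises UA-based rules, not rules keyed on verified vendor IP ranges.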
Emerging access-control and monetization signals
robots.txt is the established baseline, but newer mechanisms layer on top:
- RSL (Really Simple Licensing): an XML-based machine-readable license document (a standalone .xml file, e.g. /license.xml) that declares licensing terms, usage boundaries, and compensation requirements. It is discovered via a License: directive in robots.txt (and HTTP headers, RSS, or HTML <link>). Spec at rslstandard.org. Announced 2025; early adoption stage as of June 2026. ChangeGamer publishes its own at /license.xml.
- HTTP 402 / pay-per-crawl: direct programmatic payment gate on individual resource requests. See /resources/paying-for-access-402 and /resources/access-and-pricing.
- Cloudflare Pay Per Crawl: Cloudflare's 402-based per-crawl pricing at the CDN layer (private beta as of June 2026). See /resources/access-and-pricing for current status.
For how agents should respond to a 402 gate, see /resources/paying-for-access-402. For how ChangeGamer publishes its own machine-readable content index, see /resources/llms-txt-explained.
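To illustrate the robots.txt discovery path for RSL mentioned above, the following sketch pulls License: values out of a robots.txt body. It is a minimal illustration under stated assumptions: the function name and sample URL are hypothetical, and matching the directive name case-insensitively and stripping # comments are choices made here, not behaviour confirmed against the RSL spec.

```python
def find_license_urls(robots_txt: str) -> list[str]:
    """Collect the values of License: directives found in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Assumption: '#' starts a comment, as elsewhere in robots.txt.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # Assumption: the directive name is matched case-insensitively.
        if sep and key.strip().lower() == "license" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content; example.com is a placeholder.
SAMPLE = """\
User-agent: *
Allow: /
License: https://example.com/license.xml
"""

license_urls = find_license_urls(SAMPLE)
print(license_urls)
```

An agent would then fetch each discovered URL and parse the XML license document before deciding how to use the content.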
Verified sources
- OpenAI crawler overview: https://developers.openai.com/api/docs/bots
- OpenAI publishers FAQ: https://help.openai.com/en/articles/12627856-publishers-and-developers-faq
- Anthropic crawler support page: https://support.anthropic.com/en/articles/8896518
- Google common crawlers: https://developers.google.com/crawling/docs/crawlers-fetchers/google-common-crawlers
- Applebot support page: https://support.apple.com/en-us/119829
- Perplexity bots guide: https://docs.perplexity.ai/guides/bots
- Amazon Amazonbot: https://developer.amazon.com/amazonbot
- Common Crawl CCBot: https://commoncrawl.org/ccbot
- Mistral AI robots doc: https://docs.mistral.ai/robots
- Cloudflare AI Crawl Control bot reference: https://developers.cloudflare.com/ai-crawl-control/reference/bots/
- Cloudflare Browser Integrity Check: https://developers.cloudflare.com/waf/tools/browser-integrity-check/
- RSL (Really Simple Licensing): https://rslstandard.org/