ChangeGamer

AI Crawler Policy: robots.txt and User-Agents

Reference · updated 2026-06-15 · Markdown variant

Canonical reference table of major AI crawler user-agent tokens, their purpose, robots.txt semantics, and the WAF/edge layer that sits above robots.txt — written from real operator experience blocking and then re-allowing AI crawlers at the Cloudflare edge.


robots.txt is advisory. A compliant crawler reads it before fetching and honours Disallow rules — but WAF and firewall rules are enforced earlier, at the network edge, before a crawler can even retrieve robots.txt. If your WAF blocks a UA, robots.txt Allow rules have no effect. This page covers both layers.
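The ordering of the two layers can be modelled in a few lines. This is a minimal sketch of the decision order only, not how any particular WAF is implemented; `evaluate_access`, `Outcome`, and the three boolean inputs are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    page_fetched: bool
    reason: str


def evaluate_access(waf_blocks_ua: bool,
                    robots_allows: bool,
                    crawler_honours_robots: bool = True) -> Outcome:
    """Model the order in which the two layers take effect (illustrative only)."""
    # Layer 1: the edge. A block here happens before robots.txt can be read.
    if waf_blocks_ua:
        return Outcome(False, "blocked at edge; robots.txt never consulted")
    # Layer 2: robots.txt. It only matters if the crawler chooses to honour it.
    if crawler_honours_robots and not robots_allows:
        return Outcome(False, "crawler declined per robots.txt")
    return Outcome(True, "fetched")


# An Allow rule in robots.txt does not help when the WAF blocks the UA.
print(evaluate_access(waf_blocks_ua=True, robots_allows=True))
# A crawler that ignores robots.txt fetches despite a Disallow.
print(evaluate_access(waf_blocks_ua=False, robots_allows=False,
                      crawler_honours_robots=False))
```

The point of the model is that the WAF check comes first and is unconditional, while the robots.txt check depends on the crawler's cooperation.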

Canonical user-agent token table

| Token | Vendor | Purpose | robots.txt honoured? |
|---|---|---|---|
| GPTBot | OpenAI | Training data collection for GPT models | Yes |
| OAI-SearchBot | OpenAI | Indexing for ChatGPT Search (not training) | Yes |
| ChatGPT-User | OpenAI | Live user-triggered page fetch | Advisory only; may ignore Disallow |
| ClaudeBot | Anthropic | Training data collection for Claude models | Yes |
| Claude-SearchBot | Anthropic | Indexing for Claude search results | Yes |
| Claude-User | Anthropic | Live user-triggered page fetch | Yes (Anthropic states all three honour it) |
| Google-Extended | Google | Training opt-out token for Gemini/Vertex AI. NOT a separate crawler; Googlebot fetches, this token controls downstream use | Yes (training opt-out only) |
| Googlebot | Google | Google Search indexing; also executes Google-Extended policy | Yes |
| Google-CloudVertexBot | Google | Crawls at site-owner request during Vertex AI Agent development | Yes |
| PerplexityBot | Perplexity | Indexing for Perplexity search answers | Yes |
| Perplexity-User | Perplexity | Live user-triggered page fetch | No; ignores robots.txt by design |
| Amazonbot | Amazon | Crawling for Amazon product/AI improvement | Yes |
| Applebot | Apple | Apple Search (Spotlight, Siri) indexing | Yes |
| Applebot-Extended | Apple | Training opt-out token for Apple Intelligence / foundation models. NOT a separate crawler; Applebot fetches, this token controls training use | Yes (training opt-out only) |
| Bytespider | ByteDance | AI training data collection (Doubao LLM) | Disputed; documented violations |
| CCBot | Common Crawl | Open web archive used to train most major LLMs | Yes |
| Meta-ExternalAgent | Meta | Training data for Llama models and Meta AI products (launched July 2024) | Stated yes; compliance disputed |
| MistralAI-User | Mistral | User-triggered fetch in Le Chat; not used for training | Yes |

robots.txt syntax — per-UA examples

# Allow ChatGPT Search indexing; block training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

# Block Anthropic training; allow user fetches and search
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Keep Google Search; block AI training use
User-agent: Google-Extended
Disallow: /

# Block Apple Intelligence training; keep Apple Search
User-agent: Applebot-Extended
Disallow: /

# Block Common Crawl (source data for most LLMs)
User-agent: CCBot
Disallow: /
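
One way to sanity-check rules like these before deploying them is Python's standard-library `urllib.robotparser`. The sketch below parses a subset of the example and asks how a compliant parser would read it. `example.com` and the helper name `allowed_tokens` are placeholders, and the library's matching is only an approximation of what each vendor's crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Subset of the per-UA example, copied verbatim.
ROBOTS_TXT = """\
# Allow ChatGPT Search indexing; block training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

# Block Anthropic training; allow user fetches and search
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Block Common Crawl (source data for most LLMs)
User-agent: CCBot
Disallow: /
"""


def allowed_tokens(robots_txt, tokens, url="https://example.com/"):
    """Return {token: bool} for whether each UA token may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return {token: parser.can_fetch(token, url) for token in tokens}


results = allowed_tokens(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "ClaudeBot",
     "Claude-SearchBot", "Claude-User", "CCBot"],
)
for token, ok in results.items():
    print(f"{token}: {'allowed' if ok else 'disallowed'}")
```

Pass the bare token (for example `GPTBot`), not the full user-agent header: `can_fetch` only looks at the text before the first `/` in the string it is given.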

The advisory-only limit

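robots.txt binds only crawlers that voluntarily read and respect it. The practical gaps are visible in the table above: ChatGPT-User treats robots.txt as advisory and may ignore Disallow; Perplexity-User ignores it by design; Bytespider has documented violations; and Meta-ExternalAgent states that it complies, but that compliance is disputed.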

Training opt-out tokens vs real crawlers

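Two tokens are semantic policy signals, not user agents of separate crawlers: Google-Extended and Applebot-Extended. Googlebot and Applebot perform the actual fetches; the tokens only control whether the fetched content may be used for training. Disallowing either token therefore leaves Google Search and Apple Search indexing unaffected.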
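
To illustrate with the Google-Extended and Applebot-Extended rows from the table, the sketch below uses Python's standard-library `urllib.robotparser` to show that disallowing the opt-out tokens leaves the underlying search crawlers allowed. `example.com` and the helper name `verdict` are placeholders; the parser stands in for a compliant crawler and is not Google's or Apple's implementation.

```python
from urllib.robotparser import RobotFileParser

# Only the training opt-out tokens are disallowed.
RULES = """\
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse in memory, no network access


def verdict(token):
    """True if rules addressed to `token` permit fetching the sample URL."""
    return parser.can_fetch(token, "https://example.com/article")


search_crawlers = {t: verdict(t) for t in ("Googlebot", "Applebot")}
optout_tokens = {t: verdict(t) for t in ("Google-Extended", "Applebot-Extended")}

print("search crawlers:", search_crawlers)  # no group matches, so allowed
print("opt-out tokens:", optout_tokens)     # matched and disallowed
```

One parser quirk to keep in mind: `urllib.robotparser` matches tokens by substring, so a group addressed to `Applebot` would also be applied when querying `Applebot-Extended`. That is behaviour of this library, not a statement about how Apple resolves the two tokens.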

The WAF/edge layer: sits above robots.txt

ChangeGamer's own experience (BACKLOG item 0, June 2026): Cloudflare's managed rule "Manage AI bots" (firewallManaged) was silently 403ing GPTBot, ChatGPT-User, OAI-SearchBot, PerplexityBot, CCBot, and Google-CloudVertexBot — including on / and /sitemap.xml — even though robots.txt explicitly allowed them. Cloudflare Browser Integrity Check (BIC, enabled by default) additionally 403'd any client without standard browser headers, breaking the Google Search Console sitemap fetch.

Fixes: BIC off; AI Crawl Control set to Allow for all crawlers; WAF custom rule "Allow AI crawlers" (Skip all managed rules, UA-match, logging on) to ensure managed rules cannot re-block them.
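
The UA-match part of such a custom rule is an expression over the request's user-agent. As a rough sketch, the helper below builds one from the tokens named in the incident above. It assumes the Cloudflare Rules language `http.user_agent` field and `contains` operator; `build_ua_expression` is a hypothetical helper, the Skip action and logging are configured separately and not shown, and the output should be checked against Cloudflare's current documentation before use.

```python
# Tokens that the managed rule was blocking in the incident described above.
ALLOWED_TOKENS = [
    "GPTBot",
    "ChatGPT-User",
    "OAI-SearchBot",
    "PerplexityBot",
    "CCBot",
    "Google-CloudVertexBot",
]


def build_ua_expression(tokens):
    """Join one `contains` clause per token with `or` (assumed rule syntax)."""
    clauses = []
    for token in tokens:
        # Refuse characters that would break out of the quoted string.
        if '"' in token or "\\" in token:
            raise ValueError(f"unsafe character in token: {token!r}")
        clauses.append(f'(http.user_agent contains "{token}")')
    return " or ".join(clauses)


print(build_ua_expression(ALLOWED_TOKENS))
```

Generating the expression from a list keeps the WAF rule in step with the token table when crawlers are added or renamed.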

Key lesson: after any WAF or security-rule change, verify actual crawler access with a spoofed-UA curl against your live domain — do not assume robots.txt Allow is sufficient:

curl -A "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)" \
  -I https://yourdomain.com/

curl -A "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://anthropic.com/aup)" \
  -I https://yourdomain.com/

Expect HTTP 200. A 403 means the WAF or BIC is blocking the request at the edge; robots.txt is advisory and never produces a 403 itself.
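
The same check can be scripted so it runs after every rule change. This is a minimal sketch using only the Python standard library; `fetch_status`, `interpret`, and `SPOOFED_UAS` are hypothetical names, the UA strings are the ones from the curl commands above, and `yourdomain.com` remains a placeholder.

```python
import urllib.error
import urllib.request

# Same spoofed user-agent strings as the curl commands above.
SPOOFED_UAS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://anthropic.com/aup)",
}


def fetch_status(url, user_agent, timeout=10.0):
    """Send a HEAD request (like `curl -I`) and return the HTTP status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def interpret(status):
    """Map a status code to the reading given in the text above."""
    if status == 200:
        return "ok: crawler UA reaches the site"
    if status == 403:
        return "blocked at edge: check WAF managed rules and BIC"
    return f"unexpected status {status}: investigate"
```

A run against a live site would loop over `SPOOFED_UAS` and print `interpret(fetch_status("https://yourdomain.com/", ua))` for each entry. Note that `urlopen` follows redirects, so the status reported is that of the final response.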

Emerging access-control and monetization signals

robots.txt is the established baseline, but newer mechanisms layer on top:

For how agents should respond to a 402 gate, see /resources/paying-for-access-402. For how ChangeGamer publishes its own machine-readable content index, see /resources/llms-txt-explained.
