# AI Crawler Policy: robots.txt and User-Agents

> Canonical reference table of major AI crawler user-agent tokens, their purpose, robots.txt semantics, and the WAF/edge layer that sits above robots.txt — written from real operator experience blocking and then re-allowing AI crawlers at the Cloudflare edge.

Category: Reference · Updated: 2026-06-15 · Tags: crawlers, robots.txt, user-agents, cloudflare, access-control
Canonical: https://changegamer.ai/resources/ai-crawler-policy

robots.txt is advisory. A compliant crawler reads it before fetching and honours Disallow rules — but WAF and firewall rules are enforced earlier, at the network edge, before a crawler can even retrieve robots.txt. If your WAF blocks a UA, robots.txt Allow rules have no effect. This page covers both layers.

## Canonical user-agent token table

| Token | Vendor | Purpose | robots.txt honoured? |
|---|---|---|---|
| `GPTBot` | OpenAI | Training data collection for GPT models | Yes |
| `OAI-SearchBot` | OpenAI | Indexing for ChatGPT Search (not training) | Yes |
| `ChatGPT-User` | OpenAI | Live user-triggered page fetch | Advisory only — may ignore Disallow |
| `ClaudeBot` | Anthropic | Training data collection for Claude models | Yes |
| `Claude-SearchBot` | Anthropic | Indexing for Claude search results | Yes |
| `Claude-User` | Anthropic | Live user-triggered page fetch | Yes (Anthropic states all three honour it) |
| `Google-Extended` | Google | Training opt-out token for Gemini/Vertex AI — NOT a separate crawler; Googlebot fetches, this token controls downstream use | Yes (training opt-out only) |
| `Googlebot` | Google | Google Search indexing; the content it fetches is what the `Google-Extended` opt-out applies to | Yes |
| `Google-CloudVertexBot` | Google | Crawls at site-owner request during Vertex AI Agent development | Yes |
| `PerplexityBot` | Perplexity | Indexing for Perplexity search answers | Yes |
| `Perplexity-User` | Perplexity | Live user-triggered page fetch | No — ignores robots.txt by design |
| `Amazonbot` | Amazon | Crawling for Amazon product/AI improvement | Yes |
| `Applebot` | Apple | Apple Search (Spotlight, Siri) indexing | Yes |
| `Applebot-Extended` | Apple | Training opt-out token for Apple Intelligence / foundation models — NOT a separate crawler; Applebot fetches, this token controls training use | Yes (training opt-out only) |
| `Bytespider` | ByteDance | AI training data collection (Doubao LLM) | Disputed — documented violations |
| `CCBot` | Common Crawl | Open web archive used to train most major LLMs | Yes |
| `Meta-ExternalAgent` | Meta | Training data for Llama models and Meta AI products (launched July 2024) | Stated yes; compliance disputed |
| `MistralAI-User` | Mistral | User-triggered fetch in Le Chat; not used for training | Yes |
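
One way to keep a robots.txt in step with the table above is to generate it from a small role map. The sketch below is illustrative only. It assumes a hypothetical `ROLES` dict that encodes the Purpose column for a subset of tokens (treating `CCBot` and the two opt-out tokens as "training" is a simplification) and a hypothetical `render_robots` helper. The policy shown, which blocks training and allows search and user fetches, is one example rather than a recommendation.

```python
# Hypothetical role map derived from the Purpose column above (subset only).
# Classifying CCBot and the opt-out tokens as "training" is an assumption.
ROLES = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "user",
    "Google-Extended": "training",
    "Applebot-Extended": "training",
    "CCBot": "training",
    "PerplexityBot": "search",
}


def render_robots(roles, blocked_roles):
    """Return robots.txt text that disallows every token whose role is blocked."""
    groups = []
    for token, role in roles.items():
        rule = "Disallow: /" if role in blocked_roles else "Allow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    # Blank line between groups, trailing newline at end of file.
    return "\n\n".join(groups) + "\n"


robots_txt = render_robots(ROLES, blocked_roles={"training"})
print(robots_txt)
```

Changing `blocked_roles` to, say, `{"training", "user"}` regenerates the file with user-triggered fetchers disallowed as well, although the table shows that some of those fetchers may not honour the rule.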

## robots.txt syntax — per-UA examples

```
# Allow ChatGPT Search indexing; block training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

# Block Anthropic training; allow user fetches and search
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Keep Google Search; block AI training use
User-agent: Google-Extended
Disallow: /

# Block Apple Intelligence training; keep Apple Search
User-agent: Applebot-Extended
Disallow: /

# Block Common Crawl (source data for most LLMs)
User-agent: CCBot
Disallow: /
```
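
Rules like these can be sanity-checked offline before deployment with Python's standard-library `urllib.robotparser`. The sketch below parses a shortened copy of the example and asks which tokens may fetch `/`. The `ROBOTS_TXT` string and the `example.com` URL are placeholders, and the stdlib parser only approximates how each vendor interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the example rules above; example.com is a placeholder.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser whether each token may fetch the site root.
results = {
    token: parser.can_fetch(token, "https://example.com/")
    for token in ("GPTBot", "OAI-SearchBot", "Google-Extended", "Googlebot")
}
for token, allowed in results.items():
    print(f"{token}: {'allowed' if allowed else 'disallowed'}")
```

`Googlebot` has no group of its own in this file and there is no `*` group, so the parser falls through to its default of allowed. This matches the intent of the example: disallowing `Google-Extended` leaves Search crawling untouched.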

## The advisory-only limit

robots.txt binds only crawlers that voluntarily read and respect it. Practical gaps:

- **User-triggered fetchers** (`ChatGPT-User`, `Perplexity-User`) are sent by live users who requested a specific URL. Perplexity explicitly states `Perplexity-User` ignores robots.txt. OpenAI says `ChatGPT-User` "may not follow" it.
- **Non-compliant crawlers** (`Bytespider`, `Meta-ExternalAgent`) have documented or disputed histories of ignoring Disallow. IP-range blocking and WAF rules provide a harder layer.
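- **Spoofed UAs**: any actor can send a fake UA string. Checking the source IP against the vendor's published ranges (or reverse DNS, where the vendor documents it) is the only reliable way to confirm that a crawler is authentic. Most major vendors publish IP ranges (see verified sources below).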
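
The IP-range check mentioned in the last bullet can be sketched with Python's standard `ipaddress` module. The `VENDOR_RANGES` table and the `is_authentic` helper are hypothetical, and the CIDR blocks are RFC 5737 documentation ranges standing in for real vendor data, which would have to be loaded from each vendor's published list.

```python
import ipaddress

# Placeholder CIDR blocks taken from the RFC 5737 documentation ranges.
# Real values would come from each vendor's published IP list.
VENDOR_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_authentic(token, remote_ip, ranges=VENDOR_RANGES):
    """Return True only if remote_ip falls inside a range listed for token."""
    address = ipaddress.ip_address(remote_ip)
    networks = (ipaddress.ip_network(cidr) for cidr in ranges.get(token, []))
    return any(address in network for network in networks)


print(is_authentic("GPTBot", "192.0.2.44"))   # inside the placeholder range
print(is_authentic("GPTBot", "203.0.113.9"))  # outside it: likely a spoofed UA
```

A token with no entry in the table is treated as unverified, which is the conservative default for a claimed crawler identity that cannot be checked.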

## Training opt-out tokens vs real crawlers

Two tokens are semantic policy signals, not user agents of separate crawlers:

- **`Google-Extended`** — The physical crawler is still `Googlebot`. Disallowing `Google-Extended` tells Google not to use already-crawled content to train Gemini and Vertex AI. It does not affect Google Search inclusion or ranking.
- **`Applebot-Extended`** — The physical crawler is `Applebot`. Disallowing `Applebot-Extended` tells Apple not to use already-crawled content to train Apple Intelligence and foundation models. Apple Search / Spotlight inclusion is unaffected.

## The WAF/edge layer: sits above robots.txt

ChangeGamer's own experience (BACKLOG item 0, June 2026): Cloudflare's managed rule "Manage AI bots" (`firewallManaged`) was silently returning 403 to `GPTBot`, `ChatGPT-User`, `OAI-SearchBot`, `PerplexityBot`, `CCBot`, and `Google-CloudVertexBot`, including on `/` and `/sitemap.xml`, even though robots.txt explicitly allowed them. Cloudflare's Browser Integrity Check (BIC, enabled by default) additionally returned 403 to any client without standard browser headers, which broke the Google Search Console sitemap fetch.

Fixes applied:

- BIC turned off.
- AI Crawl Control set to Allow for all crawlers.
- A WAF custom rule, "Allow AI crawlers" (action: Skip all managed rules; match: user agent; logging on), so that managed rules cannot re-block them.
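
The UA-match part of such a custom rule can be written as a Cloudflare rule expression on the `http.user_agent` field. The sketch below builds one from a token list. The `build_ua_expression` helper is hypothetical, the token list simply repeats the crawlers named in the incident above, and the Skip action and logging are still configured on the rule itself, not in the expression.

```python
# Tokens that the managed rule was blocking in the incident described above.
ALLOWED_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "PerplexityBot", "CCBot", "Google-CloudVertexBot",
]


def build_ua_expression(tokens):
    """Join one `http.user_agent contains` clause per token with `or`."""
    clauses = [f'(http.user_agent contains "{token}")' for token in tokens]
    return " or ".join(clauses)


expression = build_ua_expression(ALLOWED_TOKENS)
print(expression)
```

Because the match is on the user-agent string alone, a rule like this also lets spoofed UAs skip managed rules, so the logging it produces is worth reviewing.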

**Key lesson:** after any WAF or security-rule change, verify actual crawler access with a spoofed-UA curl against your live domain — do not assume robots.txt Allow is sufficient:

```bash
curl -A "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)" \
  -I https://yourdomain.com/

curl -A "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://anthropic.com/aup)" \
  -I https://yourdomain.com/
```

Expect HTTP 200. A 403 means the WAF or BIC is blocking at the edge. robots.txt is never enforced by the server, so it cannot be the cause of a 403.
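
To repeat the same check across several tokens after each rule change, the curl commands can be wrapped in a short script. This is a minimal sketch using only the Python standard library. The `USER_AGENTS` map reuses the simplified UA strings from the curl examples, the `interpret`, `check`, and `report` names are hypothetical, and `yourdomain.com` remains a placeholder.

```python
import urllib.error
import urllib.request

# Simplified UA strings, copied from the curl examples above.
USER_AGENTS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://anthropic.com/aup)",
}


def interpret(status):
    """Map an HTTP status to the layer most likely responsible."""
    if status == 200:
        return "ok"
    if status == 403:
        return "blocked at the edge (WAF or BIC)"
    return f"unexpected status {status}"


def check(url, user_agent):
    """Send a HEAD request (like `curl -I`) and return the final status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the status code is still on the error.
        return error.code


def report(url):
    """Print one line per token; performs live network requests."""
    for token, user_agent in USER_AGENTS.items():
        print(f"{token}: {interpret(check(url, user_agent))}")
```

Calling `report("https://yourdomain.com/")` against the live domain prints one verdict per token. Unlike a bare `curl -I`, `urllib` follows redirects, so the status reported is that of the final response.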

## Emerging access-control and monetization signals

robots.txt is the established baseline, but newer mechanisms layer on top:

- **RSL (Really Simple Licensing)** — an XML-based machine-readable license document (a standalone `.xml` file, e.g. `/license.xml`) that declares licensing terms, usage boundaries, and compensation requirements. It is discovered via a `License:` directive in robots.txt (and HTTP headers, RSS, or HTML `<link>`). Spec at `rslstandard.org`. Announced 2025; early adoption stage as of June 2026. ChangeGamer publishes its own at /license.xml.
- **HTTP 402 / pay-per-crawl** — direct programmatic payment gate on individual resource requests. See /resources/paying-for-access-402 and /resources/access-and-pricing.
- **Cloudflare Pay Per Crawl** — Cloudflare's 402-based per-crawl pricing at the CDN layer (private beta as of June 2026). See /resources/access-and-pricing for current status.

For how agents should respond to a 402 gate, see /resources/paying-for-access-402.
For how ChangeGamer publishes its own machine-readable content index, see /resources/llms-txt-explained.
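
As an illustration of RSL discovery through robots.txt, the sketch below extracts `License:` values from a robots.txt body. It assumes the directive takes the form `License: <URL>` on its own line and that the key is matched case-insensitively; both are assumptions, and the RSL spec is the authoritative source for the grammar. The `find_license_urls` helper and the `example.com` URL are hypothetical.

```python
def find_license_urls(robots_txt):
    """Return the value of every `License:` line in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        # Assumption: the key is matched case-insensitively.
        if sep and key.strip().lower() == "license" and value.strip():
            urls.append(value.strip())
    return urls


SAMPLE = """\
# example.com is a placeholder
License: https://example.com/license.xml

User-agent: *
Allow: /
"""

print(find_license_urls(SAMPLE))
```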

## Verified sources

- OpenAI crawler overview: https://developers.openai.com/api/docs/bots
- OpenAI publishers FAQ: https://help.openai.com/en/articles/12627856-publishers-and-developers-faq
- Anthropic crawler support page: https://support.anthropic.com/en/articles/8896518
- Google common crawlers: https://developers.google.com/crawling/docs/crawlers-fetchers/google-common-crawlers
- Applebot support page: https://support.apple.com/en-us/119829
- Perplexity bots guide: https://docs.perplexity.ai/guides/bots
- Amazon Amazonbot: https://developer.amazon.com/amazonbot
- Common Crawl CCBot: https://commoncrawl.org/ccbot
- Mistral AI robots doc: https://docs.mistral.ai/robots
- Cloudflare AI Crawl Control bot reference: https://developers.cloudflare.com/ai-crawl-control/reference/bots/
- Cloudflare Browser Integrity Check: https://developers.cloudflare.com/waf/tools/browser-integrity-check/
- RSL (Really Simple Licensing): https://rslstandard.org/
