How the Web Is Configuring Robots.txt for AI in 2026
Robots.txt has become the primary public lever publishers use to control AI access. As of May 2026, the picture is a fast-moving balance between blocking training bots (predominantly GPTBot, CCBot, and ClaudeBot) and allowing search and assistant bots (OAI-SearchBot, PerplexityBot, Google-Extended). This page consolidates the headline robots.txt adoption statistics for the AI crawler era, refreshed as of May 2026.
Top-Site Blocking Rates (May 2026)
| AI Bot | % Blocked on Top 1000 Sites | Trend |
|---|---|---|
| GPTBot | ~25% (up from 5% in early 2023) | Flat in 2026 after rapid 2024 growth |
| ClaudeBot | ~20% and rising | Fastest-growing block target in Q1-Q2 2026 |
| CCBot (Common Crawl) | ~18% | Overtaken by ClaudeBot in April 2026 |
| Google-Extended | ~12% | Slower growth than dedicated AI bots |
| PerplexityBot | ~9% | Lower because it drives clicks |
| OAI-SearchBot | ~5% | Largely allowed; drives ChatGPT search citations |
| Bytespider (ByteDance) | ~22% | Above-average block rate; security concerns commonly cited |
| Applebot-Extended | ~7% | Newer; lower awareness |
Site-Wide Aggregate (Cloudflare Network Data)
| Metric | Value |
|---|---|
| Sites blocking GPTBot (broad Cloudflare network) | ~49% |
| ClaudeBot share of all DISALLOW rules (March 2026) | 10.1% (up from 9.6% in January) |
| ClaudeBot pages crawled per referral returned | 20,583 to 1 |
| AI crawler traffic that is training or mixed-purpose | 89.4% |
| AI crawler traffic that is search-related | 8% |
| AI crawler traffic responding to actual user queries | 2.2% |
Common Site Strategies in 2026
| Strategy | Pattern | Adoption |
|---|---|---|
| Block all AI | Disallow GPTBot, ClaudeBot, CCBot, Google-Extended, Bytespider, Applebot-Extended, PerplexityBot | ~8% of top 1000 (mostly publishers) |
| Middle path | Block training bots (GPTBot, CCBot, ClaudeBot, Google-Extended); allow search bots (OAI-SearchBot, PerplexityBot) | ~30% of top 1000 (dominant pattern) |
| Allow all | No AI-specific blocks | ~50% of top 1000 (default-do-nothing) |
| Conditional / paid | Cloudflare pay-per-crawl or Tollbit license | ~5% of top 1000 (growing) |
| Selective vertical block | Block specific bots based on alignment / values (e.g., publishers blocking ByteDance) | ~7% of top 1000 |
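The dominant middle-path strategy from the table above can be sketched as a robots.txt fragment. The user-agent tokens are the bots' published names, but the exact directive set any given site needs will vary; treat this as an illustration, not a recommended policy:

```txt
# Middle path (sketch): block training bots, allow search/assistant bots.

# Training crawlers: disallow everything
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Search/assistant crawlers: explicitly allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

Note that bots with no matching stanza fall back to the `User-agent: *` rules (or to allow-all if none exist), so a middle-path file only constrains the bots it names.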
Six Things the Robots.txt Data Tells You
- GPTBot blocking has plateaued at ~25 percent of top sites. The block rate grew rapidly from 5 percent in early 2023 to 25 percent through 2024 and has been flat through 2026. The plateau suggests publishers inclined to block have already done so; the remaining ~75 percent either allow deliberately or have left the default-allow configuration untouched.
- ClaudeBot is the fastest-growing block target. Share of DISALLOW rules rose from 9.6 percent in January 2026 to 10.1 percent in March, overtaking CCBot in April. The growth reflects publisher frustration with Anthropic's 20,583-to-1 crawl-to-referral ratio (the worst in the industry).
- The middle-path strategy is dominant. Approximately 30 percent of top sites now block training bots (GPTBot, CCBot, ClaudeBot, Google-Extended) while allowing search bots (OAI-SearchBot, PerplexityBot). This is the most common single configuration in 2026 and reflects publishers wanting AI visibility for traffic-driving citations but not for training data extraction.
- 89.4 percent of AI crawler traffic is training or mixed-purpose, not search. Only 8 percent is search-related and just 2.2 percent responds to actual user queries. The asymmetry explains why publishers block training bots aggressively while leaving search bots alone: training crawls don't return human visitors.
- ByteDance Bytespider is blocked at 22 percent, higher than ClaudeBot. The above-average block rate reflects security and geopolitical concerns. Bytespider is a non-AI-specific crawler (predates the AI boom) but the AI training implications drive its higher-than-expected block rate among top US and EU sites.
- Paid alternatives (Cloudflare pay-per-crawl, Tollbit) are growing. Approximately 5 percent of top sites now use a paid-crawl model in 2026, up from approximately 1 percent in 2024. The model lets publishers monetise crawls instead of just blocking; expect adoption to expand through 2026-2027 as the per-crawl pricing model matures.
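Auditing which of these bots a given robots.txt blocks can be done with Python's standard-library parser. A minimal sketch; the bot list is drawn from the tables above and the sample file is hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Illustrative bot list drawn from the tables above; extend as needed.
AI_BOTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended",
           "PerplexityBot", "OAI-SearchBot", "Bytespider"]

def blocked_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI bots that may not fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, path)]

# Hypothetical middle-path file: training bots blocked, search bot allowed.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

print(blocked_bots(sample))  # GPTBot and CCBot blocked; unlisted bots fall through to allow
```

In practice you would fetch `https://example.com/robots.txt` and feed its body to the same function; parsing from a string keeps the sketch self-contained.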
What This Means for AI Visibility
For brand-visibility programs, robots.txt configuration matters because brands blocking AI bots exclude themselves from those vendors' future training data. The brand-visibility cost of blocking GPTBot is delayed but real: current ChatGPT visibility may be unaffected, while visibility in GPT-6 or GPT-7 will be affected. Brands should review their robots.txt against their AI-visibility strategy explicitly; the default-allow position keeps options open, while the default-block position locks in exclusion.
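One way to make that review explicit is to render the robots.txt your declared policy implies and compare it against what the site actually serves. A minimal sketch, assuming the training/search bot groupings from the tables above (the function and bot lists are illustrative, not a standard API):

```python
# Illustrative bot groupings based on the strategy table above.
TRAINING_BOTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]
SEARCH_BOTS = ["OAI-SearchBot", "PerplexityBot"]

def render_policy(block: list[str], allow: list[str]) -> str:
    """Render a robots.txt that disallows `block` and explicitly allows `allow`."""
    stanzas = [f"User-agent: {bot}\nDisallow: /" for bot in block]
    stanzas += [f"User-agent: {bot}\nAllow: /" for bot in allow]
    return "\n\n".join(stanzas) + "\n"

# Middle path: block training bots, allow search bots.
print(render_policy(TRAINING_BOTS, SEARCH_BOTS))
```

Diffing this rendered file against the live one surfaces drift between stated strategy and actual configuration, which is the review the paragraph above recommends.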
Methodology
Statistics aggregated May 15, 2026 from TechnologyChecker.io robots.txt analysis across Cloudflare's network, ALM Corp's 66-billion-bot-request OpenAI search crawler analysis, Paul Calvano's August 2025 AI bots and robots.txt research, and Search Engine Journal coverage of Anthropic's ClaudeBot DISALLOW growth. Refreshed monthly.
How Presenc AI Helps
Presenc AI tracks brand-mention rates inside major AI platforms and correlates them against publisher robots.txt posture. For brands evaluating whether to block specific AI crawlers, our instrumentation surfaces the downstream brand-visibility impact so the decision can be made with evidence rather than principle alone.