
State of Robots.txt for AI, May 2026

How websites are using robots.txt to block AI crawlers in 2026: roughly 25 percent of the top 1000 sites block GPTBot, ClaudeBot is the fastest-growing block target, and a middle-path strategy lets search bots through while blocking training crawlers.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

How the Web Is Configuring Robots.txt for AI in 2026

Robots.txt has become the primary public lever publishers use to control AI access. The picture in May 2026 is a fast-moving balance between blocking training bots (predominantly GPTBot, CCBot, and ClaudeBot, plus the Google-Extended training token) and allowing search or assistant bots (OAI-SearchBot, PerplexityBot). This page consolidates the headline robots.txt adoption statistics for the AI crawler era as of May 2026.

Top-Site Blocking Rates (May 2026)

AI Bot | % Blocked on Top 1000 Sites | Trend
GPTBot | ~25% (up from 5% in early 2023) | Flat in 2026 after rapid 2024 growth
ClaudeBot | ~20% and rising | Fastest-growing block target in Q1-Q2 2026
CCBot (Common Crawl) | ~18% | Overtaken by ClaudeBot in April 2026
Google-Extended | ~12% | Slower growth than dedicated AI bots
PerplexityBot | ~9% | Lower because it drives clicks
OAI-SearchBot | ~5% | Largely allowed; drives ChatGPT search citations
Bytespider (ByteDance) | ~22% | High block rate; cited security concerns
Applebot-Extended | ~7% | Newer; lower awareness

Site-Wide Aggregate (Cloudflare Network Data)

Metric | Value
Sites blocking GPTBot (broad Cloudflare network) | ~49%
ClaudeBot share of all DISALLOW rules (March 2026) | 10.1% (up from 9.6% in January)
ClaudeBot pages crawled per referral returned | 20,583 to 1
AI crawler traffic that is training or mixed-purpose | 89.4%
AI crawler traffic that is search-related | 8%
AI crawler traffic responding to actual user queries | 2.2%
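
The ratio and share figures in this table are simple quotients. Below is a minimal sketch of the arithmetic. It assumes hypothetical raw counts, because the sources report only the final 20,583-to-1 ratio and the percentage shares; the `pages_crawled` and `referrals` values are illustrative, not measured.

```python
def crawl_to_referral_ratio(pages_crawled: int, referrals: int) -> float:
    """Pages crawled per referral sent back to the site."""
    if referrals <= 0:
        raise ValueError("ratio is undefined without at least one referral")
    return pages_crawled / referrals


# Percentage shares of AI crawler traffic, as listed in the table above.
TRAFFIC_SHARES = {
    "training_or_mixed": 89.4,
    "search": 8.0,
    "user_query": 2.2,
}


def training_to_user_query_multiple(shares: dict[str, float]) -> float:
    """How many times larger the training share is than the user-query share."""
    return shares["training_or_mixed"] / shares["user_query"]


if __name__ == "__main__":
    # Hypothetical counts, chosen only to reproduce the reported ratio.
    print(crawl_to_referral_ratio(pages_crawled=2_058_300, referrals=100))
    # Sum of the three reported shares, rounded to one decimal place.
    print(round(sum(TRAFFIC_SHARES.values()), 1))
    print(round(training_to_user_query_multiple(TRAFFIC_SHARES), 1))
```

The three shares sum to 99.6 percent; the sources do not say how the remaining 0.4 percent is classified. On these figures, training or mixed-purpose traffic is roughly 40 times the traffic that responds to user queries.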

Common Site Strategies in 2026

Strategy | Pattern | Adoption
Block all AI | Disallow GPTBot, ClaudeBot, CCBot, Google-Extended, Bytespider, Applebot-Extended, PerplexityBot | ~8% of top 1000 (mostly publishers)
Middle path | Block training bots (GPTBot, CCBot, ClaudeBot, Google-Extended); allow search bots (OAI-SearchBot, PerplexityBot) | ~30% of top 1000 (dominant AI-specific pattern)
Allow all | No AI-specific blocks | ~50% of top 1000 (default-do-nothing)
Conditional / paid | Cloudflare pay-per-crawl or Tollbit license | ~5% of top 1000 (growing)
Selective vertical block | Block specific bots based on alignment / values (e.g., publishers blocking ByteDance) | ~7% of top 1000
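
The middle-path row can be sketched as a concrete robots.txt. The example below holds one possible file as a Python string and checks it with the standard library's `urllib.robotparser`. The rule grouping, the `example.com` URL, and the helper name `is_allowed` are assumptions for illustration, not a configuration taken from any surveyed site.

```python
from urllib.robotparser import RobotFileParser

# One possible middle-path robots.txt (illustrative, not from a real site).
MIDDLE_PATH_ROBOTS_TXT = """\
# Training crawlers: blocked site-wide
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

# Search and assistant crawlers: explicitly allowed
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "CCBot", "ClaudeBot", "Google-Extended",
                "OAI-SearchBot", "PerplexityBot", "Googlebot"):
        verdict = "allowed" if is_allowed(MIDDLE_PATH_ROBOTS_TXT, bot) else "blocked"
        print(f"{bot}: {verdict}")
```

Agents with no matching group, such as Googlebot here, fall through to the default and are allowed. Robots.txt is advisory, so the file expresses a preference that compliant crawlers honour rather than an enforced block.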

Six Things the Robots.txt Data Tells You

  1. GPTBot blocking has plateaued at ~25 percent of top sites. The block rate grew rapidly from 5 percent in early 2023 to 25 percent through 2024 and has been flat through 2026. The plateau suggests that publishers who intended to block have already done so; the remaining ~75 percent either allow GPTBot deliberately or have left the default in place.
  2. ClaudeBot is the fastest-growing block target. Its share of DISALLOW rules rose from 9.6 percent in January 2026 to 10.1 percent in March, and it overtook CCBot in April. The growth reflects publisher frustration with Anthropic's 20,583-to-1 crawl-to-referral ratio (the worst in the industry).
  3. The middle-path strategy is the dominant active configuration. Approximately 30 percent of top sites now block training bots (GPTBot, CCBot, ClaudeBot, Google-Extended) while allowing search bots (OAI-SearchBot, PerplexityBot). This is the most common AI-specific configuration in 2026 (only the do-nothing default, at ~50 percent, is more widespread) and reflects publishers wanting AI visibility for traffic-driving citations but not training data extraction.
  4. 89.4 percent of AI crawler traffic is training or mixed-purpose, not search. Only 8 percent is search-related and just 2.2 percent responds to actual user queries. This asymmetry is why publishers block training bots aggressively while leaving search bots alone: training crawls don't return human visitors.
  5. ByteDance Bytespider is blocked at 22 percent, higher than ClaudeBot. The above-average block rate reflects security and geopolitical concerns. Bytespider is not an AI-specific crawler (it predates the AI boom), but its AI training implications drive its higher-than-expected block rate among top US and EU sites.
  6. Paid alternatives (Cloudflare pay-per-crawl, Tollbit) are growing. Approximately 5 percent of top sites now use a paid-crawl model in 2026, up from approximately 1 percent in 2024. The model lets publishers monetise crawls instead of just blocking; expect adoption to expand through 2026-2027 as the per-crawl pricing model matures.

What This Means for AI Visibility

For brand-visibility programs, robots.txt configuration matters because brands blocking AI bots exclude themselves from those vendors' future training data. The brand-visibility cost of blocking GPTBot is delayed (current ChatGPT visibility may be unaffected; visibility in GPT-6 or GPT-7 will be) but real. Brands should review their robots.txt against their AI-visibility strategy explicitly; the default-allow position keeps options open while the default-block position locks in exclusion.
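
A review of this kind can start with a quick check of which posture a robots.txt currently expresses. The sketch below is one hedged way to do that with `urllib.robotparser`. The bot groupings follow the middle-path row in the strategies table, the labels and the function names `blocked_bots` and `classify_posture` are hypothetical, and "block all" is simplified to these six user-agents. It probes only the site root and cannot detect paid or conditional arrangements, which are not visible in robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Groupings assumed from the middle-path pattern described in this article.
TRAINING_BOTS = ("GPTBot", "CCBot", "ClaudeBot", "Google-Extended")
SEARCH_BOTS = ("OAI-SearchBot", "PerplexityBot")


def blocked_bots(robots_txt: str, bots: tuple[str, ...],
                 probe_url: str = "https://example.com/") -> set[str]:
    """Return the subset of bots that may not fetch probe_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot for bot in bots if not parser.can_fetch(bot, probe_url)}


def classify_posture(robots_txt: str) -> str:
    """Label a robots.txt with a simplified strategy name.

    A blanket 'User-agent: *' / 'Disallow: /' also yields 'block all',
    even though it is not AI-specific.
    """
    training = blocked_bots(robots_txt, TRAINING_BOTS)
    search = blocked_bots(robots_txt, SEARCH_BOTS)
    all_training = len(training) == len(TRAINING_BOTS)
    all_search = len(search) == len(SEARCH_BOTS)

    if not training and not search:
        return "allow all"
    if all_training and all_search:
        return "block all"
    if all_training and not search:
        return "middle path"
    return "selective"


if __name__ == "__main__":
    sample = "User-agent: GPTBot\nDisallow: /\n"
    print(classify_posture(sample))
```

Comparing the returned label with the intended AI-visibility strategy makes any mismatch explicit, for example a site that intends the middle path but has blocked only GPTBot.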

Methodology

Statistics aggregated May 15, 2026 from TechnologyChecker.io robots.txt analysis across Cloudflare's network, ALM Corp's 66-billion-bot-request OpenAI search crawler analysis, Paul Calvano's August 2025 AI bots and robots.txt research, and Search Engine Journal coverage of Anthropic's ClaudeBot DISALLOW growth. Refreshed monthly.

How Presenc AI Helps

Presenc AI tracks brand-mention rates inside major AI platforms and correlates them against publisher robots.txt posture. For brands evaluating whether to block specific AI crawlers, our instrumentation surfaces the downstream brand-visibility impact so the decision can be made with evidence rather than principle alone.

Frequently Asked Questions

What percentage of websites block GPTBot?
Approximately 25 percent of the top 1000 websites block GPTBot, up from 5 percent in early 2023. Across the broader Cloudflare network the figure is closer to 49 percent. Growth plateaued in 2024-2026: the publishers who were going to block GPTBot have blocked it, and the remaining ~75 percent of top sites either allow it deliberately or have left the default in place.

Which AI crawler is blocked the most?
GPTBot in absolute count, but ClaudeBot is the fastest-growing block target and overtook CCBot in April 2026 to become the second-most-blocked AI user-agent. The growth reflects publisher frustration with Anthropic's 20,583-to-1 crawl-to-referral ratio, the worst in the industry. Bytespider (ByteDance) is blocked at 22 percent of top sites despite not being an AI-specific crawler.

Should I block AI crawlers in robots.txt?
It depends on your business model. Publishers monetising primarily through ad-supported referral traffic see negative ROI from training-only crawlers and tend to block them. Publishers building brand awareness or product visibility benefit from training-corpus inclusion and tend to allow. The most common AI-specific pattern in 2026 is the middle path: block training bots (GPTBot, CCBot, ClaudeBot) while allowing search bots (OAI-SearchBot, PerplexityBot).

What is the middle-path robots.txt strategy?
Block dedicated training crawlers (GPTBot, CCBot, ClaudeBot, Google-Extended) while allowing search-and-assistant bots (OAI-SearchBot, PerplexityBot). Approximately 30 percent of top 1000 sites use this configuration in 2026. The pattern lets publishers maintain visibility in AI search results while preventing their content from being used as training data without compensation. It is the most common AI-specific configuration today.
