
AI Crawler Behavior on the Top 1,000 Sites 2026

How GPTBot, ClaudeBot, PerplexityBot, and CCBot crawl the top 1,000 websites in 2026. 25% of sites block GPTBot, and ClaudeBot crawls 20,583 pages per referral. Snapshot for 2026-05-15.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

What this is

The top 1,000 sites by traffic are where AI assistants do most of their reading. Their robots.txt files and server logs are therefore a strong leading indicator of how the open web is rebalancing around AI crawlers. This page summarises a 2026-05-15 snapshot of blocking rates, crawl frequency, and crawl-to-refer ratio for the four AI bots that matter most (GPTBot, ClaudeBot, PerplexityBot, CCBot), with a few related crawlers shown for comparison.

Blocking Rate Across Top 1,000 Sites

| Bot | Blocked by | Earlier baseline (2024 unless dated) | QoQ direction |
|---|---|---|---|
| GPTBot | 25% | ~5% (Jan 2023) | Flat Q1 2026 |
| ClaudeBot | 20% | ~7% (Jan 2024) | ↑ from 9.6% to 10.1% share of DISALLOW |
| CCBot | 18% | ~12% (Jan 2024) | ↑ modestly |
| Google-Extended | 12% | ~9% | Flat |
| PerplexityBot | 9% | ~3% | ↑ as paywalled-content disputes grow |
| Bytespider | 22% | ~6% | ↑ on geopolitical concerns |
| OAI-SearchBot | 5% | n/a (newer) | ↑ slowly; treated separately from GPTBot |

Crawl Frequency (median hits/day per site that allows the bot)

| Bot | Hits per allowed site per day | Aggressiveness rank |
|---|---|---|
| GPTBot | ~4,200 | 1 (most aggressive) |
| ClaudeBot | ~1,800 | 2 |
| PerplexityBot | ~980 | 3 |
| CCBot | ~640 | 4 |
| Google-Extended | ~3,100* | n/a (overlaps Googlebot) |

*Google-Extended is a robots.txt control token rather than a separate crawler, so its traffic cannot be told apart from Googlebot on the wire; the figure is an estimate based on Cloudflare aggregate data.
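One way to read the figures above: for each bot, take only the sites that allow it, count that bot's requests per day on each, and report the median. The sketch below assumes a hypothetical in-memory list of `(site, bot, hits_per_day)` records and a hypothetical set of allowed `(site, bot)` pairs; it is not the actual Cloudflare or Presenc AI log format, and the sample numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import median


def median_hits_per_bot(records, allowed):
    """Median daily hits per bot, counting only sites that allow that bot.

    records: iterable of (site, bot, hits_per_day) tuples (hypothetical shape).
    allowed: set of (site, bot) pairs whose robots.txt permits the bot.
    """
    per_bot = defaultdict(list)
    for site, bot, hits in records:
        if (site, bot) in allowed:
            per_bot[bot].append(hits)
    return {bot: median(values) for bot, values in per_bot.items()}


# Invented sample data, not measurements from this report.
sample = [
    ("a.example", "GPTBot", 4000),
    ("b.example", "GPTBot", 4400),
    ("c.example", "GPTBot", 9000),  # site blocks GPTBot, so it is excluded
    ("a.example", "CCBot", 600),
]
allowed = {
    ("a.example", "GPTBot"),
    ("b.example", "GPTBot"),
    ("a.example", "CCBot"),
}

result = median_hits_per_bot(sample, allowed)
```

Using the median rather than the mean keeps a handful of very heavily crawled sites from dominating the per-bot figure.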

Crawl-to-Refer Ratio (higher = more parasitic)

| Bot | Pages crawled per referral | Source |
|---|---|---|
| ClaudeBot | ~20,583 | SEOmator GEO Data Report 2026 |
| GPTBot | ~1,500 | SEOmator GEO Data Report 2026 |
| PerplexityBot | ~210 | SEOmator GEO Data Report 2026 |
| OAI-SearchBot | ~85 | Presenc AI internal |
| Googlebot (baseline) | ~6 | Industry consensus |
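The ratio is simply pages crawled divided by referral visits returned, so a higher number means more pages consumed per visitor sent back. The sketch below shows that arithmetic and compares two bots using the ratios reported in the table; the function names and the raw counts passed to `crawl_to_refer` are illustrative assumptions, not part of any published methodology.

```python
def crawl_to_refer(pages_crawled: int, referrals: int) -> float:
    """Pages crawled per referral visit; infinite if the bot never refers."""
    if referrals == 0:
        return float("inf")
    return pages_crawled / referrals


# Ratios as reported in the table above (pages crawled per referral).
REPORTED = {
    "ClaudeBot": 20583,
    "GPTBot": 1500,
    "PerplexityBot": 210,
    "OAI-SearchBot": 85,
    "Googlebot": 6,
}


def relative_yield(bot_a: str, bot_b: str, ratios=REPORTED) -> float:
    """How many times more referrals per crawled page bot_a returns than bot_b."""
    return ratios[bot_b] / ratios[bot_a]


# Hypothetical raw counts chosen to reproduce the ClaudeBot ratio.
example_ratio = crawl_to_refer(pages_crawled=41_166, referrals=2)

# PerplexityBot vs ClaudeBot: 20,583 / 210, i.e. roughly 98x.
perplexity_vs_claude = relative_yield("PerplexityBot", "ClaudeBot")
```

Because the table mixes sources (SEOmator, Presenc AI internal, industry consensus), cross-bot comparisons like this are best treated as order-of-magnitude estimates.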

Six Things the Data Tells You

  1. GPTBot blocking has plateaued at 25%. Three-quarters of the top sites do not block it, whether by deliberate choice or by default; the holdouts are mainly publishers and paywalled platforms.
  2. ClaudeBot's crawl-to-refer ratio is the worst in class. At 20,583 pages crawled per referral, ClaudeBot is the most parasitic of the bots measured here, which likely contributes to its rising block rate.
  3. PerplexityBot returns roughly 100x more referrals per page crawled than ClaudeBot (about 210 vs 20,583 pages per referral). If you care about visible traffic, PerplexityBot is one of the most reciprocal allow-list candidates, behind only OAI-SearchBot.
  4. Bytespider blocking exceeds GPTBot in some segments on geopolitical grounds (US federal, financial services, defence-adjacent).
  5. OAI-SearchBot is treated separately. 5% block rate vs 25% for GPTBot. Publishers are increasingly differentiating training crawls from answer-engine crawls.
  6. All four major bots (GPTBot, ClaudeBot, PerplexityBot, CCBot) complied with robots.txt in essentially 100% of observed cases on the top 1,000 sites in our sample, so robots.txt remains a meaningful control surface despite occasional anecdotes to the contrary.

What This Means for AI Visibility

The top-1,000 baseline is the implicit policy other sites benchmark against. If you allow more crawlers than the top-1,000 median, you are a net contributor of training data; if you allow fewer, you are shrinking your own AI citation surface. For most brands the optimum in 2026 is to block training-only crawls (GPTBot, and ClaudeBot where it is not serving search) while allowing the answer-engine bots (OAI-SearchBot, PerplexityBot, Google-Extended).
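As an illustration of that split policy, the sketch below renders a robots.txt body from two lists of user-agent tokens. The helper name `render_robots_txt`, the site-wide `Disallow: /` and `Allow: /` rules, and the permissive default group for all other agents are assumptions made for the example; a real policy would usually be scoped per path and reviewed against each vendor's current crawler documentation.

```python
def render_robots_txt(block, allow):
    """Build a robots.txt body that disallows `block` bots site-wide
    and explicitly allows `allow` bots (hypothetical helper)."""
    lines = []
    if block:
        # Consecutive User-agent lines share the rules that follow them.
        lines += [f"User-agent: {bot}" for bot in block]
        lines += ["Disallow: /", ""]
    if allow:
        lines += [f"User-agent: {bot}" for bot in allow]
        lines += ["Allow: /", ""]
    # Assumed default: every other crawler stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


policy = render_robots_txt(
    block=["GPTBot", "ClaudeBot"],
    allow=["OAI-SearchBot", "PerplexityBot", "Google-Extended"],
)
print(policy)
```

Listing the allowed bots explicitly is redundant with the default group, but it documents intent and keeps the policy readable if the default is later tightened.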

Methodology

Block rates are scraped from robots.txt on the Tranco top 1,000 sites as of 2026-05-12. Crawl frequency is the median across Cloudflare and Presenc AI Worker logs for sites in the same set. Crawl-to-refer ratios are sourced from the SEOmator GEO Data Report 2026 and the Cloudflare network robots.txt analysis, with Presenc AI internal data for OAI-SearchBot. Crawler aggressiveness is drawn from Digital Applied's 30-day site log study.
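A block-rate measurement of this kind can be sketched with Python's standard `urllib.robotparser`. The example assumes that "blocked" means the bot may not fetch the site root (`/`), which is a simplification and not necessarily the rule used for the figures on this page; the robots.txt bodies are invented and held in memory rather than fetched.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, bot: str) -> bool:
    """Assumption: a bot counts as blocked if it may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(bot, "/")


def block_rate(robots_by_site: dict, bot: str) -> float:
    """Share of sites (0.0 to 1.0) whose robots.txt blocks `bot`."""
    if not robots_by_site:
        raise ValueError("no sites to measure")
    blocked = sum(is_blocked(body, bot) for body in robots_by_site.values())
    return blocked / len(robots_by_site)


# Invented robots.txt bodies keyed by hypothetical site names.
sample = {
    "a.example": "User-agent: GPTBot\nDisallow: /\n",
    "b.example": "User-agent: *\nDisallow: /private/\n",
    "c.example": "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n",
    "d.example": "",  # empty file: nothing is disallowed
}

gptbot_rate = block_rate(sample, "GPTBot")  # blocked on a.example and c.example
ccbot_rate = block_rate(sample, "CCBot")    # blocked on c.example only
```

Sites that restrict only part of their path space (like `b.example` here) count as not blocked under this definition, so a stricter or looser rule would shift the reported percentages.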

How Presenc AI Helps

Presenc AI ingests your edge logs and benchmarks your crawler mix against the top-1,000 baseline so you can see, per bot, whether you are over-blocking (missing AI citation surface) or under-blocking (giving away training data with no return). The crawl-to-refer ratio is computed in real time per bot, so allow/disallow decisions are data-driven rather than vibes-driven.

Frequently Asked Questions

GPTBot is blocked by 25% of the top 1,000 sites as of May 2026, up from about 5% in early 2023. Blocking has plateaued for two quarters and is unlikely to climb significantly higher without a triggering legal or commercial event.
ClaudeBot has the worst crawl-to-refer ratio, at roughly 20,583 pages crawled for every referral returned. Among the bots in the SEOmator data, PerplexityBot is best at around 210:1, and OAI-SearchBot does better still at about 85:1 on Presenc AI internal data.
Blocking every AI crawler is not the norm. The most common 2026 strategy among the top 1,000 is to block training-only bots (GPTBot, ClaudeBot, CCBot) while allowing answer-engine bots (OAI-SearchBot, PerplexityBot, Google-Extended). This preserves citation surface while limiting training-data extraction.
The four major bots (GPTBot, ClaudeBot, PerplexityBot, CCBot) do respect robots.txt: compliance is essentially 100% on the top 1,000 sites we monitored. Stray reports of non-compliance usually involve unrelated scrapers using spoofed user agents.
