What This Is
The top 1,000 sites by traffic are where AI assistants do most of their reading. Their robots.txt files and server logs are therefore the best leading indicator of how the open web is rebalancing around AI crawlers. This page summarises a 2026-05-15 snapshot of blocking rates, crawl frequency, and crawl-to-refer ratio for the AI bots that matter most.
Blocking Rate Across Top 1,000 Sites
| Bot | Blocked by | 2024 baseline | QoQ direction |
|---|---|---|---|
| GPTBot | 25% | ~5% (Jan 2024) | Flat Q1 2026 |
| ClaudeBot | 20% | ~7% (Jan 2024) | ↑ from 9.6% to 10.1% share of DISALLOW |
| CCBot | 18% | ~12% (Jan 2024) | ↑ modestly |
| Google-Extended | 12% | ~9% | Flat |
| PerplexityBot | 9% | ~3% | ↑ as paywalled-content disputes grow |
| Bytespider | 22% | ~6% | ↑ on geopolitical concerns |
| OAI-SearchBot | 5% | n/a (newer) | ↑ slowly; treated separately from GPTBot |
Crawl Frequency (median hits/day per site that allows the bot)
| Bot | Hits per allowed site per day | Aggressiveness rank |
|---|---|---|
| GPTBot | ~4,200 | 1 (most aggressive) |
| ClaudeBot | ~1,800 | 2 |
| PerplexityBot | ~980 | 3 |
| CCBot | ~640 | 4 |
| Google-Extended | ~3,100* | n/a (overlaps Googlebot) |
*Google-Extended is a robots.txt control token rather than a separate crawler, so its traffic cannot be isolated on the wire; the figure is an estimate based on the Cloudflare aggregate.
Crawl-to-Refer Ratio (higher = more parasitic)
| Bot | Pages crawled per referral | Source |
|---|---|---|
| ClaudeBot | ~20,583 | SEOmator GEO Data Report 2026 |
| GPTBot | ~1,500 | SEOmator GEO Data Report 2026 |
| PerplexityBot | ~210 | SEOmator GEO Data Report 2026 |
| OAI-SearchBot | ~85 | Presenc AI internal |
| Googlebot (baseline) | ~6 | Industry consensus |
Six Things the Data Tells You
- GPTBot blocking has plateaued at 25%. Three-quarters of the top sites do not block it, which suggests they tolerate open-web AI training; the holdouts are mainly publishers and paywalled platforms.
- ClaudeBot's crawl-to-refer ratio is the worst in class. At ~20,583 pages crawled per referral, ClaudeBot is the most parasitic bot in the table, which likely contributes to its rising block rate.
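- PerplexityBot returns roughly 100x more referrals per crawl than ClaudeBot (~210 vs ~20,583 pages crawled per referral). If you care about visible traffic, PerplexityBot is one of the most reciprocal allow-list candidates, behind only OAI-SearchBot (~85) among the AI bots in the table.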
- Bytespider blocking exceeds GPTBot in some segments on geopolitical grounds (US federal, financial services, defence-adjacent).
- OAI-SearchBot is treated separately from GPTBot: its block rate is 5% versus 25% for GPTBot. Publishers are increasingly differentiating training crawls from answer-engine crawls.
- All four major bots respect robots.txt 100% of the time on the top 1,000 sites in our sample, so robots.txt remains a meaningful control surface despite occasional anecdotes to the contrary.
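The ratio comparisons above come down to simple arithmetic on the table figures. The following is a minimal sketch that copies the "pages crawled per referral" values from the crawl-to-refer table and converts them to referrals per 1,000 crawls; the function names are illustrative and not part of any Presenc AI tooling.

```python
# Pages crawled per referral, copied from the crawl-to-refer table above.
PAGES_PER_REFERRAL = {
    "ClaudeBot": 20583,
    "GPTBot": 1500,
    "PerplexityBot": 210,
    "OAI-SearchBot": 85,
    "Googlebot": 6,
}


def referrals_per_1000_crawls(pages_per_referral: float) -> float:
    """Invert the ratio: how many referrals come back per 1,000 pages crawled."""
    return 1000 / pages_per_referral


def relative_return(bot_a: str, bot_b: str) -> float:
    """How many times more referrals per crawl bot_a returns than bot_b."""
    return PAGES_PER_REFERRAL[bot_b] / PAGES_PER_REFERRAL[bot_a]


if __name__ == "__main__":
    # List bots from most to least reciprocal (lowest pages per referral first).
    for bot, ratio in sorted(PAGES_PER_REFERRAL.items(), key=lambda kv: kv[1]):
        print(f"{bot:15s} {referrals_per_1000_crawls(ratio):8.2f} referrals per 1,000 crawls")
    print(round(relative_return("PerplexityBot", "ClaudeBot")))  # → 98
```

Read this way, a lower "pages crawled per referral" figure means more traffic returned for the same crawl load.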
What This Means for AI Visibility
The top-1,000 baseline is the implicit policy other sites benchmark against. If you allow more crawlers than the top-1,000 median, you are a net contributor of training data; if you allow fewer, you are limiting your own AI citation surface. For most brands the optimum in 2026 is to block training-only bots (GPTBot, and ClaudeBot where it is not serving search) while allowing the answer-engine bots (OAI-SearchBot, PerplexityBot). Google-Extended needs separate handling: it is a robots.txt token that governs use of content for Gemini training and grounding, and per Google's crawler documentation it does not affect inclusion in Google Search, so it is not the switch for AI Overviews.
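One way to sanity-check such a policy before deploying it is to parse a draft robots.txt with Python's standard `urllib.robotparser` and ask which bots it admits. The sketch below is illustrative only: the robots.txt body is a hypothetical example covering a subset of the bots discussed, `example.com` is a placeholder, and the grouping of bots into "block" and "allow" simply follows the paragraph above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft policy: block training-oriented bots, allow answer-engine bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def allowed_bots(robots_txt: str, bots: list[str], url: str = "https://example.com/") -> dict[str, bool]:
    """Return, per bot token, whether the given robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    bots = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"]
    for bot, allowed in allowed_bots(ROBOTS_TXT, bots).items():
        print(f"{bot:15s} {'allowed' if allowed else 'blocked'}")
```

Tokens for other products would be added as further `User-agent` groups in the same way. Note that the standard-library parser matches user agents by substring, which may differ in edge cases from how an individual crawler interprets the file.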
Methodology
Block rates are scraped from robots.txt on the Tranco top 1,000 sites as of 2026-05-12. Crawl frequency is the median across Cloudflare and Presenc AI Worker logs for sites in the same set. Crawl-to-refer ratios are sourced from the SEOmator GEO Data Report 2026 and the Cloudflare network robots.txt analysis, except where the table cites Presenc AI internal data (OAI-SearchBot) or industry consensus (Googlebot). Crawler aggressiveness is drawn from Digital Applied's 30-day site log study.
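The two aggregate statistics described above can be sketched in a few lines. This is not the pipeline used for the snapshot; it assumes the robots.txt bodies and per-site hit counts have already been collected, counts a bot as "blocked" when it may not fetch the site root, and uses hypothetical site names and numbers as sample data.

```python
from statistics import median
from urllib.robotparser import RobotFileParser


def blocks_bot(robots_txt: str, bot: str) -> bool:
    """True if this robots.txt forbids `bot` from fetching the site root.

    A blanket `User-agent: *` / `Disallow: /` also counts as blocking the bot.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(bot, "/")


def block_rate(robots_by_site: dict[str, str], bot: str) -> float:
    """Share of sites (0.0 to 1.0) whose robots.txt blocks `bot`."""
    if not robots_by_site:
        return 0.0
    blocked = sum(blocks_bot(text, bot) for text in robots_by_site.values())
    return blocked / len(robots_by_site)


def median_hits_per_day(hits_by_site: dict[str, int]) -> float:
    """Median daily hits across the sites that allow the bot."""
    return median(hits_by_site.values())


# Hypothetical sample data, for illustration only.
SAMPLE_ROBOTS = {
    "a.example": "User-agent: GPTBot\nDisallow: /\n",
    "b.example": "User-agent: *\nDisallow: /private/\n",
    "c.example": "",
    "d.example": "User-agent: ClaudeBot\nDisallow: /\n",
}
SAMPLE_HITS = {"b.example": 100, "c.example": 300, "d.example": 200}

if __name__ == "__main__":
    print(block_rate(SAMPLE_ROBOTS, "GPTBot"))   # → 0.25
    print(median_hits_per_day(SAMPLE_HITS))      # → 200
```

Using the median rather than the mean keeps a handful of very heavily crawled sites from dominating the per-bot figure.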
How Presenc AI Helps
Presenc AI ingests your edge logs and benchmarks your crawler mix against the top-1,000 baseline so you can see, per bot, whether you are over-blocking (missing AI citation surface) or under-blocking (giving away training data with no return). The crawl-to-refer ratio is computed in real time per bot, so allow/disallow decisions are data-driven rather than vibes-driven.