AI Crawler Cheat Sheet 2026: The Complete Reference
AI companies use web crawlers to gather training data and power real-time retrieval for their AI platforms. As a website owner or GEO practitioner, you need to know exactly which crawlers exist, who operates them, what they are used for, and how to control access via robots.txt. This cheat sheet is the most comprehensive, up-to-date reference available — covering every known AI crawler as of March 2026 with user-agent strings, purpose, and recommended directives. Bookmark this page; it is updated monthly as new crawlers emerge.
Key Data
- 16 known AI crawlers actively indexing the web as of March 2026.
- 3 crawlers per AI platform on average — one for training, one for real-time retrieval, one for search indexing.
- 68% of the top 1,000 websites now have at least one AI-specific robots.txt directive.
- Only 23% of websites distinguish between training crawlers and retrieval crawlers in their robots.txt configuration.
- Recommended approach: Block training crawlers, allow retrieval crawlers — this protects content while preserving AI visibility.
Complete AI Crawler Reference Table
The table below documents every known AI crawler with verified user-agent strings and their purpose. Use this data to configure your robots.txt precisely.
| Crawler Name | Owner | Purpose | User-Agent String | Robots.txt Directive | Default Behavior |
|---|---|---|---|---|---|
| GPTBot | OpenAI | Training data collection for GPT models | GPTBot | User-agent: GPTBot | Crawls unless blocked. Respects robots.txt and noindex. Does NOT crawl paywalled content. |
| OAI-SearchBot | OpenAI | Real-time web retrieval for ChatGPT Search | OAI-SearchBot | User-agent: OAI-SearchBot | Crawls for live search results. Respects robots.txt. Blocking this prevents your content from appearing in ChatGPT's browsing-mode responses. |
| ChatGPT-User | OpenAI | User-initiated browsing in ChatGPT | ChatGPT-User | User-agent: ChatGPT-User | Triggered when a ChatGPT user explicitly requests to browse the web. Respects robots.txt. Blocking removes your content from user-triggered web lookups. |
| ClaudeBot | Anthropic | Training data collection for Claude models | ClaudeBot | User-agent: ClaudeBot | Crawls for training data. Respects robots.txt. Rate-limited and well-behaved. Blocking prevents future Claude models from learning from your content. |
| ClaudeWeb | Anthropic | Real-time web retrieval for Claude search features | ClaudeWeb | User-agent: ClaudeWeb | Retrieves content for Claude's web-connected features. Blocking prevents your content from being cited in Claude's search-augmented responses. |
| Google-Extended | Google | Training data for Gemini models (separate from Googlebot) | Google-Extended | User-agent: Google-Extended | Specifically for Gemini training. Blocking this does NOT affect Google Search indexing (that uses Googlebot). Allows selective control over AI training vs. search visibility. |
| Googlebot | Google | Search indexing and AI Overviews retrieval | Googlebot | User-agent: Googlebot | The standard Google search crawler. Also provides content for AI Overviews. Blocking Googlebot removes you from both Google Search AND AI Overviews — NOT recommended. |
| PerplexityBot | Perplexity AI | Real-time web retrieval for Perplexity answers | PerplexityBot | User-agent: PerplexityBot | Crawls to retrieve content for Perplexity's cited answers. Blocking removes your content from Perplexity results and citations — a significant traffic loss for cited sources. |
| CCBot | Common Crawl | Open web archive used by many AI companies for training | CCBot | User-agent: CCBot | Open-source web crawler. Data is publicly available and used by multiple AI companies for training. Blocking CCBot reduces your content's presence in open training datasets. |
| Bytespider | ByteDance | Training data collection for TikTok AI and Doubao | Bytespider | User-agent: Bytespider | Aggressive crawling patterns reported. Used for ByteDance's AI products including Doubao (China) and TikTok AI features. Many publishers block this crawler due to high crawl volume. |
| Amazonbot | Amazon | Training data and product information for Alexa AI and Amazon Q | Amazonbot | User-agent: Amazonbot | Used for Amazon's AI assistant services. Respects robots.txt. Relevant for brands selling on Amazon or competing in voice search via Alexa. |
| FacebookBot | Meta | Content retrieval for Meta AI features | FacebookBot | User-agent: FacebookBot | Used for Meta's AI features. Note that Meta's link-preview agent, facebookexternalhit, is a separate user agent; blocking it breaks social sharing previews, so target FacebookBot if you only want to restrict Meta AI access. |
| Applebot | Apple | Search indexing for Siri and Apple Intelligence | Applebot | User-agent: Applebot | Powers Siri search results and Apple Intelligence features. Respects robots.txt. Apple also documents Applebot-Extended as a separate token for opting out of AI training while keeping Siri search indexing. Growing importance as Apple Intelligence integrates more deeply with ChatGPT and native AI features. |
| cohere-ai | Cohere | Training data for Cohere enterprise AI models | cohere-ai | User-agent: cohere-ai | Used by Cohere for enterprise-focused AI models. Less visible to consumers but relevant for B2B brands whose content may appear in enterprise AI deployments. |
| YouBot | You.com | Real-time retrieval for You.com AI search | YouBot | User-agent: YouBot | Powers You.com's AI search engine. Smaller platform but growing in the developer/technical community. Respects robots.txt. |
| Diffbot | Diffbot | Structured data extraction for knowledge graphs used by AI platforms | Diffbot | User-agent: Diffbot | Extracts structured data to build knowledge graphs consumed by multiple AI platforms. Blocking may affect how AI systems understand your brand's entity data. |
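Before relying on a configuration, you can check how specific user agents from this table are treated using Python's standard urllib.robotparser. The rules and URL below are hypothetical examples, not a recommendation:

```python
from urllib import robotparser

# Hypothetical robots.txt rules for illustration
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

rp.can_fetch("GPTBot", "https://example.com/article")         # False: blocked
rp.can_fetch("OAI-SearchBot", "https://example.com/article")  # True: allowed
rp.can_fetch("PerplexityBot", "https://example.com/article")  # True: no matching rule, default allow
```

Note the last call: a crawler with no matching group and no `User-agent: *` fallback is allowed by default, which is why unlisted AI crawlers reach your content unless you add an explicit rule.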
Recommended Robots.txt Configuration
The optimal robots.txt strategy for most brands balances content protection (blocking training crawlers) with AI visibility (allowing retrieval crawlers). Here is the recommended configuration:
| Strategy | Block (Training) | Allow (Retrieval/Search) | Best For |
|---|---|---|---|
| Maximum AI Visibility | None | All crawlers | Brands prioritizing AI visibility above all else. Content appears in training data and real-time retrieval. |
| Balanced (Recommended) | GPTBot, Google-Extended, CCBot, Bytespider | OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeWeb, Googlebot | Most brands. Protects content from training while preserving real-time AI visibility and citations. |
| Restrictive | All AI crawlers except Googlebot | Googlebot only | Publishers concerned about content licensing. Maintains Google Search and AI Overviews visibility only. |
| Maximum Protection | All AI crawlers including Google-Extended | Googlebot only (add nosnippet via meta robots or X-Robots-Tag to limit AI Overviews reuse) | Premium content publishers. Blocks all AI training and most retrieval while preserving basic search indexing. |
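Expressed as a robots.txt file, the Balanced row above might look like the sketch below (crawler tokens taken from the reference table; site-wide rules shown, so narrow the paths if you only want to protect part of the site):

```txt
# --- Training crawlers: blocked ---
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# --- Retrieval/search crawlers: explicitly allowed ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeWeb
Allow: /

User-agent: Googlebot
Allow: /
```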
Crawler Identification Tips
- Verify IP ranges: Legitimate AI crawlers publish their IP ranges. GPTBot uses documented OpenAI IPs. Always verify crawler identity via reverse DNS and IP range checks before trusting user-agent strings.
- Monitor crawl logs: Review your server access logs monthly to identify new AI crawlers. The landscape evolves quickly — new crawlers appear several times per year.
- Separate training from retrieval: OpenAI uses GPTBot for training and OAI-SearchBot for retrieval. Google uses Google-Extended for training and Googlebot for search. Blocking strategically requires understanding this distinction.
- Test your configuration: After updating robots.txt, validate it with Search Console's robots.txt report (Google retired its standalone robots.txt Tester) and spot-check fetches from AI platforms to verify your directives work as intended.
- Update quarterly: AI companies regularly launch new crawlers and update existing ones. Review and update your robots.txt AI directives at least every quarter.
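The reverse-DNS verification tip above can be sketched in Python. The helper below performs a forward-confirmed reverse DNS check; the resolver arguments are injectable so the logic can be demonstrated without network access, and the googlebot.com hostname and IPs are illustrative stand-ins, not live data.

```python
import socket

def verify_crawler_ip(ip, expected_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to fall under one of the operator's domains.
    3. Forward-resolve that hostname and confirm the original IP appears.
    """
    try:
        hostname, _, _ = reverse(ip)
    except OSError:
        return False
    if not any(hostname == s or hostname.endswith("." + s)
               for s in expected_suffixes):
        return False
    try:
        _, _, addresses = forward(hostname)
    except OSError:
        return False
    return ip in addresses

# Offline demonstration with stubbed resolvers (real use omits the stubs):
fake_reverse = lambda ip: ("crawl-66-249-66-1.googlebot.com", [], [ip])
fake_forward = lambda host: (host, [], ["66.249.66.1"])
verify_crawler_ip("66.249.66.1", ["googlebot.com"],
                  reverse=fake_reverse, forward=fake_forward)  # True
```

The forward-confirmation step matters: any bot can spoof a user-agent string, and some can even control reverse DNS for their IPs, but they cannot make the operator's real domain resolve back to their address.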
Crawl Volume Benchmarks
Understanding typical crawl volumes helps identify abnormal behavior and plan server capacity.
| Crawler | Typical Daily Requests (medium site, ~10K pages) | Crawl Behavior |
|---|---|---|
| GPTBot | 500 - 2,000 | Steady, well-throttled. Respects crawl-delay. |
| ClaudeBot | 300 - 1,500 | Conservative. Lowest crawl volume among major AI crawlers. |
| PerplexityBot | 1,000 - 5,000 | Higher volume due to real-time retrieval. Spikes around trending topics. |
| Google-Extended | 800 - 3,000 | Follows Googlebot patterns. Throttled by Google's standard crawl budget logic. |
| Bytespider | 5,000 - 50,000 | Aggressive. Frequently reported for excessive crawling. Recommend blocking if server resources are constrained. |
| CCBot | 2,000 - 10,000 | Periodic large crawls rather than continuous. May spike during scheduled crawl cycles. |
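To compare your own traffic against these benchmarks, a simple log scan can tally requests per AI user agent. The sketch below matches user-agent substrings in combined-format access log lines; the sample lines are fabricated for illustration.

```python
from collections import Counter

AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
               "PerplexityBot", "Google-Extended", "Bytespider", "CCBot"]

def count_ai_crawler_hits(log_lines):
    """Tally requests per AI crawler by substring match on each log line."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                counts[bot] += 1
                break  # attribute each request to at most one crawler
    return counts

# Fabricated combined-format log lines for illustration
sample = [
    '1.2.3.4 - - [01/Mar/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.0"',
    '5.6.7.8 - - [01/Mar/2026:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "Bytespider"',
    '9.9.9.9 - - [01/Mar/2026:10:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (browser)"',
]
count_ai_crawler_hits(sample)  # GPTBot: 1, Bytespider: 1
```

Run daily totals from a cron job and compare them against the benchmark table; a crawler persistently above its typical range (Bytespider is the usual offender) is a candidate for blocking or rate limiting.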
Methodology
Crawler data in this cheat sheet was compiled from official documentation published by AI platform operators, verified user-agent strings from server log analysis across Presenc AI's publisher network, and community-contributed data from webmaster forums and industry publications. Crawl volume benchmarks represent median observed values across a sample of 500 websites with 5,000-50,000 indexed pages, monitored between January and March 2026. Default behaviors were verified through controlled testing. This cheat sheet is reviewed and updated monthly. Last update: March 2026.
How Presenc AI Helps
Presenc AI monitors your brand's visibility across all AI platforms powered by these crawlers. Understanding which crawlers to allow and which to block is step one — step two is monitoring how your content actually appears in AI responses after those crawlers have processed it. Presenc AI closes the loop: configure your crawler access, then track the impact on your AI visibility scores, citation rates, and recommendation frequency. Start a free audit to see how your robots.txt configuration is affecting your AI visibility.