AI Crawler Cheat Sheet 2026: The Complete Reference
AI companies use web crawlers to gather training data and power real-time retrieval for their AI platforms. As a website owner or GEO practitioner, you need to know exactly which crawlers exist, who operates them, what they are used for, and how to control access via robots.txt. This cheat sheet is the most comprehensive, up-to-date reference available — covering every known AI crawler as of March 2026 with user-agent strings, purpose, and recommended directives. Bookmark this page; it is updated monthly as new crawlers emerge.
Key Data
- 16 known AI crawlers actively indexing the web as of March 2026.
- 3 crawlers per AI platform on average — one for training, one for real-time retrieval, one for search indexing.
- 68% of the top 1,000 websites now have at least one AI-specific robots.txt directive.
- Only 23% of websites distinguish between training crawlers and retrieval crawlers in their robots.txt configuration.
- Recommended approach: Block training crawlers, allow retrieval crawlers — this protects content while preserving AI visibility.
Complete AI Crawler Reference Table
The table below documents every known AI crawler with verified user-agent strings and their purpose. Use this data to configure your robots.txt precisely.
| Crawler Name | Owner | Purpose | User-Agent String | Robots.txt Directive | Default Behavior |
|---|---|---|---|---|---|
| GPTBot | OpenAI | Training data collection for GPT models | GPTBot | User-agent: GPTBot | Crawls unless blocked. Respects robots.txt and noindex. Does NOT crawl paywalled content. |
| OAI-SearchBot | OpenAI | Real-time web retrieval for ChatGPT Search | OAI-SearchBot | User-agent: OAI-SearchBot | Crawls for live search results. Respects robots.txt. Blocking this prevents your content from appearing in ChatGPT's browsing-mode responses. |
| ChatGPT-User | OpenAI | User-initiated browsing in ChatGPT | ChatGPT-User | User-agent: ChatGPT-User | Triggered when a ChatGPT user explicitly requests to browse the web. Respects robots.txt. Blocking removes your content from user-triggered web lookups. |
| ClaudeBot | Anthropic | Training data collection for Claude models | ClaudeBot | User-agent: ClaudeBot | Crawls for training data. Respects robots.txt. Rate-limited and well-behaved. Blocking prevents future Claude models from learning from your content. |
| ClaudeWeb | Anthropic | Real-time web retrieval for Claude search features | ClaudeWeb | User-agent: ClaudeWeb | Retrieves content for Claude's web-connected features. Blocking prevents your content from being cited in Claude's search-augmented responses. |
| Google-Extended | Google | Training data for Gemini models (separate from Googlebot) | Google-Extended | User-agent: Google-Extended | Specifically for Gemini training. Blocking this does NOT affect Google Search indexing (that uses Googlebot). Allows selective control over AI training vs. search visibility. |
| Googlebot | Google | Search indexing and AI Overviews retrieval | Googlebot | User-agent: Googlebot | The standard Google search crawler. Also provides content for AI Overviews. Blocking Googlebot removes you from both Google Search AND AI Overviews — NOT recommended. |
| PerplexityBot | Perplexity AI | Real-time web retrieval for Perplexity answers | PerplexityBot | User-agent: PerplexityBot | Crawls to retrieve content for Perplexity's cited answers. Blocking removes your content from Perplexity results and citations — a significant traffic loss for cited sources. |
| CCBot | Common Crawl | Open web archive used by many AI companies for training | CCBot | User-agent: CCBot | Open-source web crawler. Data is publicly available and used by multiple AI companies for training. Blocking CCBot reduces your content's presence in open training datasets. |
| Bytespider | ByteDance | Training data collection for TikTok AI and Doubao | Bytespider | User-agent: Bytespider | Aggressive crawling patterns reported. Used for ByteDance's AI products including Doubao (China) and TikTok AI features. Many publishers block this crawler due to high crawl volume. |
| Amazonbot | Amazon | Training data and product information for Alexa AI and Amazon Q | Amazonbot | User-agent: Amazonbot | Used for Amazon's AI assistant services. Respects robots.txt. Relevant for brands selling on Amazon or competing in voice search via Alexa. |
| FacebookBot | Meta | Content retrieval for Meta AI features | FacebookBot | User-agent: FacebookBot | Used for Meta's AI features. Note that Meta's link-preview agent, facebookexternalhit, is a separate user agent; blocking it breaks social sharing previews, so target FacebookBot if you only want to restrict Meta AI access. |
| Applebot | Apple | Search indexing for Siri and Apple Intelligence | Applebot | User-agent: Applebot | Powers Siri search results and Apple Intelligence features. Respects robots.txt. Apple also documents Applebot-Extended as a separate token for opting out of AI training while keeping Siri search indexing. Growing importance as Apple Intelligence integrates more deeply with ChatGPT and native AI features. |
| cohere-ai | Cohere | Training data for Cohere enterprise AI models | cohere-ai | User-agent: cohere-ai | Used by Cohere for enterprise-focused AI models. Less visible to consumers but relevant for B2B brands whose content may appear in enterprise AI deployments. |
| YouBot | You.com | Real-time retrieval for You.com AI search | YouBot | User-agent: YouBot | Powers You.com's AI search engine. Smaller platform but growing in the developer/technical community. Respects robots.txt. |
| Diffbot | Diffbot | Structured data extraction for knowledge graphs used by AI platforms | Diffbot | User-agent: Diffbot | Extracts structured data to build knowledge graphs consumed by multiple AI platforms. Blocking may affect how AI systems understand your brand's entity data. |
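Before relying on a configuration, you can check how specific user agents from this table are treated using Python's standard urllib.robotparser. The rules and URL below are hypothetical examples, not a recommendation:

```python
from urllib import robotparser

# Hypothetical robots.txt rules for illustration
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

rp.can_fetch("GPTBot", "https://example.com/article")         # False: blocked
rp.can_fetch("OAI-SearchBot", "https://example.com/article")  # True: allowed
rp.can_fetch("PerplexityBot", "https://example.com/article")  # True: no matching rule, default allow
```

Note the last call: a crawler with no matching group and no `User-agent: *` fallback is allowed by default, which is why unlisted AI crawlers reach your content unless you add an explicit rule.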
Recommended Robots.txt Configuration
The optimal robots.txt strategy for most brands balances content protection (blocking training crawlers) with AI visibility (allowing retrieval crawlers). Here is the recommended configuration:
| Strategy | Block (Training) | Allow (Retrieval/Search) | Best For |
|---|---|---|---|
| Maximum AI Visibility | None | All crawlers | Brands prioritizing AI visibility above all else. Content appears in training data and real-time retrieval. |
| Balanced (Recommended) | GPTBot, Google-Extended, CCBot, Bytespider | OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeWeb, Googlebot | Most brands. Protects content from training while preserving real-time AI visibility and citations. |
| Restrictive | All AI crawlers except Googlebot | Googlebot only | Publishers concerned about content licensing. Maintains Google Search and AI Overviews visibility only. |
| Maximum Protection | All AI crawlers including Google-Extended | Googlebot only (add nosnippet via meta robots or X-Robots-Tag to limit AI Overviews reuse) | Premium content publishers. Blocks all AI training and most retrieval while preserving basic search indexing. |
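Expressed as a robots.txt file, the Balanced row above might look like the sketch below (crawler tokens taken from the reference table; site-wide rules shown, so narrow the paths if you only want to protect part of the site):

```txt
# --- Training crawlers: blocked ---
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# --- Retrieval/search crawlers: explicitly allowed ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeWeb
Allow: /

User-agent: Googlebot
Allow: /
```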
Crawler Identification Tips
- Verify IP ranges: Legitimate AI crawlers publish their IP ranges. GPTBot uses documented OpenAI IPs. Always verify crawler identity via reverse DNS and IP range checks before trusting user-agent strings.
- Monitor crawl logs: Review your server access logs monthly to identify new AI crawlers. The landscape evolves quickly — new crawlers appear several times per year.
- Separate training from retrieval: OpenAI uses GPTBot for training and OAI-SearchBot for retrieval. Google uses Google-Extended for training and Googlebot for search. Blocking strategically requires understanding this distinction.
- Test your configuration: After updating robots.txt, validate it with Search Console's robots.txt report (Google retired its standalone robots.txt Tester) and spot-check fetches from AI platforms to verify your directives work as intended.
- Update quarterly: AI companies regularly launch new crawlers and update existing ones. Review and update your robots.txt AI directives at least every quarter.
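The reverse-DNS verification tip above can be sketched in Python. The helper below performs a forward-confirmed reverse DNS check; the resolver arguments are injectable so the logic can be demonstrated without network access, and the googlebot.com hostname and IPs are illustrative stand-ins, not live data.

```python
import socket

def verify_crawler_ip(ip, expected_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to fall under one of the operator's domains.
    3. Forward-resolve that hostname and confirm the original IP appears.
    """
    try:
        hostname, _, _ = reverse(ip)
    except OSError:
        return False
    if not any(hostname == s or hostname.endswith("." + s)
               for s in expected_suffixes):
        return False
    try:
        _, _, addresses = forward(hostname)
    except OSError:
        return False
    return ip in addresses

# Offline demonstration with stubbed resolvers (real use omits the stubs):
fake_reverse = lambda ip: ("crawl-66-249-66-1.googlebot.com", [], [ip])
fake_forward = lambda host: (host, [], ["66.249.66.1"])
verify_crawler_ip("66.249.66.1", ["googlebot.com"],
                  reverse=fake_reverse, forward=fake_forward)  # True
```

The forward-confirmation step matters: any bot can spoof a user-agent string, and some can even control reverse DNS for their IPs, but they cannot make the operator's real domain resolve back to their address.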
Crawl Volume Benchmarks
Understanding typical crawl volumes helps identify abnormal behavior and plan server capacity.
| Crawler | Typical Daily Requests (medium site, ~10K pages) | Crawl Behavior |
|---|---|---|
| GPTBot | 500 - 2,000 | Steady, well-throttled. Respects crawl-delay. |
| ClaudeBot | 300 - 1,500 | Conservative. Lowest crawl volume among major AI crawlers. |
| PerplexityBot | 1,000 - 5,000 | Higher volume due to real-time retrieval. Spikes around trending topics. |
| Google-Extended | 800 - 3,000 | Follows Googlebot patterns. Throttled by Google's standard crawl budget logic. |
| Bytespider | 5,000 - 50,000 | Aggressive. Frequently reported for excessive crawling. Recommend blocking if server resources are constrained. |
| CCBot | 2,000 - 10,000 | Periodic large crawls rather than continuous. May spike during scheduled crawl cycles. |
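To compare your own traffic against these benchmarks, a simple log scan can tally requests per AI user agent. The sketch below matches user-agent substrings in combined-format access log lines; the sample lines are fabricated for illustration.

```python
from collections import Counter

AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
               "PerplexityBot", "Google-Extended", "Bytespider", "CCBot"]

def count_ai_crawler_hits(log_lines):
    """Tally requests per AI crawler by substring match on each log line."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                counts[bot] += 1
                break  # attribute each request to at most one crawler
    return counts

# Fabricated combined-format log lines for illustration
sample = [
    '1.2.3.4 - - [01/Mar/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.0"',
    '5.6.7.8 - - [01/Mar/2026:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "Bytespider"',
    '9.9.9.9 - - [01/Mar/2026:10:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (browser)"',
]
count_ai_crawler_hits(sample)  # GPTBot: 1, Bytespider: 1
```

Run daily totals from a cron job and compare them against the benchmark table; a crawler persistently above its typical range (Bytespider is the usual offender) is a candidate for blocking or rate limiting.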
Methodology
Crawler data in this cheat sheet was compiled from official documentation published by AI platform operators, verified user-agent strings from server log analysis across Presenc AI's publisher network, and community-contributed data from webmaster forums and industry publications. Crawl volume benchmarks represent median observed values across a sample of 500 websites with 5,000-50,000 indexed pages, monitored between January and March 2026. Default behaviors were verified through controlled testing. This cheat sheet is reviewed and updated monthly. Last update: March 2026.
How Presenc AI Helps
Presenc AI monitors your brand's visibility across all AI platforms powered by these crawlers. Understanding which crawlers to allow and which to block is step one — step two is monitoring how your content actually appears in AI responses after those crawlers have processed it. Presenc AI closes the loop: configure your crawler access, then track the impact on your AI visibility scores, citation rates, and recommendation frequency. Start a free audit to see how your robots.txt configuration is affecting your AI visibility.