AI Crawler Cheat Sheet 2026

Complete reference of all AI crawlers in 2026: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and more. User-agent strings, robots.txt directives, and default behaviors.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: March 2026

AI Crawler Cheat Sheet 2026: The Complete Reference

AI companies use web crawlers to gather training data and power real-time retrieval for their AI platforms. As a website owner or GEO practitioner, you need to know exactly which crawlers exist, who operates them, what they are used for, and how to control access via robots.txt. This cheat sheet is the most comprehensive, up-to-date reference available — covering every known AI crawler as of March 2026 with user-agent strings, purpose, and recommended directives. Bookmark this page; it is updated monthly as new crawlers emerge.

Key Data

  • 16 known AI crawlers actively indexing the web as of March 2026.
  • 3 crawlers per AI platform on average — one for training, one for real-time retrieval, one for search indexing.
  • 68% of the top 1,000 websites now have at least one AI-specific robots.txt directive.
  • Only 23% of websites distinguish between training crawlers and retrieval crawlers in their robots.txt configuration.
  • Recommended approach: Block training crawlers, allow retrieval crawlers — this protects content while preserving AI visibility.

Complete AI Crawler Reference Table

The table below documents every known AI crawler with verified user-agent strings and their purpose. Use this data to configure your robots.txt precisely.

| Crawler Name | Owner | Purpose | User-Agent String | Robots.txt Directive | Default Behavior |
| --- | --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training data collection for GPT models | GPTBot | User-agent: GPTBot | Crawls unless blocked. Respects robots.txt and noindex. Does NOT crawl paywalled content. |
| OAI-SearchBot | OpenAI | Real-time web retrieval for ChatGPT Search | OAI-SearchBot | User-agent: OAI-SearchBot | Crawls for live search results. Respects robots.txt. Blocking this prevents your content from appearing in ChatGPT's browsing-mode responses. |
| ChatGPT-User | OpenAI | User-initiated browsing in ChatGPT | ChatGPT-User | User-agent: ChatGPT-User | Triggered when a ChatGPT user explicitly requests to browse the web. Respects robots.txt. Blocking removes your content from user-triggered web lookups. |
| ClaudeBot | Anthropic | Training data collection for Claude models | ClaudeBot | User-agent: ClaudeBot | Crawls for training data. Respects robots.txt. Rate-limited and well-behaved. Blocking prevents future Claude models from learning from your content. |
| ClaudeWeb | Anthropic | Real-time web retrieval for Claude search features | ClaudeWeb | User-agent: ClaudeWeb | Retrieves content for Claude's web-connected features. Blocking prevents your content from being cited in Claude's search-augmented responses. |
| Google-Extended | Google | Training data for Gemini models (separate from Googlebot) | Google-Extended | User-agent: Google-Extended | Specifically for Gemini training. Blocking this does NOT affect Google Search indexing (that uses Googlebot), allowing selective control over AI training vs. search visibility. |
| Googlebot | Google | Search indexing and AI Overviews retrieval | Googlebot | User-agent: Googlebot | The standard Google search crawler. Also provides content for AI Overviews. Blocking Googlebot removes you from both Google Search AND AI Overviews; NOT recommended. |
| PerplexityBot | Perplexity AI | Real-time web retrieval for Perplexity answers | PerplexityBot | User-agent: PerplexityBot | Crawls to retrieve content for Perplexity's cited answers. Blocking removes your content from Perplexity results and citations, a significant traffic loss for cited sources. |
| CCBot | Common Crawl | Open web archive used by many AI companies for training | CCBot | User-agent: CCBot | Open-source web crawler. Data is publicly available and used by multiple AI companies for training. Blocking CCBot reduces your content's presence in open training datasets. |
| Bytespider | ByteDance | Training data collection for TikTok AI and Doubao | Bytespider | User-agent: Bytespider | Aggressive crawling patterns reported. Used for ByteDance's AI products, including Doubao (China) and TikTok AI features. Many publishers block this crawler due to high crawl volume. |
| Amazonbot | Amazon | Training data and product information for Alexa AI and Amazon Q | Amazonbot | User-agent: Amazonbot | Used for Amazon's AI assistant services. Respects robots.txt. Relevant for brands selling on Amazon or competing in voice search via Alexa. |
| FacebookBot | Meta | Content retrieval for Meta AI features | FacebookExternalHit | User-agent: FacebookExternalHit | Previously only for link previews, now also feeds Meta AI. Blocking affects both social sharing previews and Meta AI content access. |
| Applebot | Apple | Search indexing for Siri and Apple Intelligence | Applebot | User-agent: Applebot | Powers Siri search results and Apple Intelligence features. Respects robots.txt. Growing in importance as Apple Intelligence integrates more deeply with ChatGPT and native AI features. |
| cohere-ai | Cohere | Training data for Cohere enterprise AI models | cohere-ai | User-agent: cohere-ai | Used by Cohere for enterprise-focused AI models. Less visible to consumers but relevant for B2B brands whose content may appear in enterprise AI deployments. |
| YouBot | You.com | Real-time retrieval for You.com AI search | YouBot | User-agent: YouBot | Powers You.com's AI search engine. Smaller platform but growing in the developer/technical community. Respects robots.txt. |
| Diffbot | Diffbot | Structured data extraction for knowledge graphs used by AI platforms | Diffbot | User-agent: Diffbot | Extracts structured data to build knowledge graphs consumed by multiple AI platforms. Blocking may affect how AI systems understand your brand's entity data. |

Recommended Robots.txt Configuration

The optimal robots.txt strategy for most brands balances content protection (blocking training crawlers) with AI visibility (allowing retrieval crawlers). Here is the recommended configuration:

| Strategy | Block (Training) | Allow (Retrieval/Search) | Best For |
| --- | --- | --- | --- |
| Maximum AI Visibility | None | All crawlers | Brands prioritizing AI visibility above all else. Content appears in training data and real-time retrieval. |
| Balanced (Recommended) | GPTBot, Google-Extended, CCBot, Bytespider | OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeWeb, Googlebot | Most brands. Protects content from training while preserving real-time AI visibility and citations. |
| Restrictive | All AI crawlers except Googlebot | Googlebot only | Publishers concerned about content licensing. Maintains Google Search and AI Overviews visibility only. |
| Maximum Protection | All AI crawlers including Google-Extended | Googlebot (search only; use nosnippet to limit AI Overviews) | Premium content publishers. Blocks all AI training and most retrieval while preserving basic search indexing. |
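The Balanced strategy can be written out as a robots.txt file. This is a minimal sketch, not a definitive configuration: crawler names come from the reference table above, and you should adapt the Disallow/Allow paths to your own site.

```
# --- Block training crawlers ---
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# --- Explicitly allow retrieval/search crawlers ---
# (Unlisted crawlers are allowed by default; this group just makes
# the intent explicit and survives a later catch-all block.)
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: ClaudeWeb
User-agent: Googlebot
Allow: /
```

Grouping several User-agent lines over one rule set is valid under the Robots Exclusion Protocol (RFC 9309); each crawler matches the most specific group that names it.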

Crawler Identification Tips

  • Verify IP ranges: Legitimate AI crawlers publish their IP ranges. GPTBot uses documented OpenAI IPs. Always verify crawler identity via reverse DNS and IP range checks before trusting user-agent strings.
  • Monitor crawl logs: Review your server access logs monthly to identify new AI crawlers. The landscape evolves quickly — new crawlers appear several times per year.
  • Separate training from retrieval: OpenAI uses GPTBot for training and OAI-SearchBot for retrieval. Google uses Google-Extended for training and Googlebot for search. Blocking strategically requires understanding this distinction.
  • Test your configuration: After updating robots.txt, use Google Search Console's robots.txt report and submit test fetches from AI platforms to verify your directives work as intended.
  • Update quarterly: AI companies regularly launch new crawlers and update existing ones. Review and update your robots.txt AI directives at least every quarter.
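The reverse-DNS check from the first tip above can be sketched in Python. The two-step pattern (PTR lookup, then forward-confirm) is the standard verification method operators document; the suffix list here is illustrative only, so consult each operator's documentation for the authoritative verification domains.

```python
import socket

# Illustrative suffixes -- verify against each operator's own docs.
CRAWLER_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Applebot": (".applebot.apple.com",),
}

def hostname_matches(hostname: str, suffixes: tuple) -> bool:
    """Pure check: does the PTR hostname end in an expected suffix?"""
    host = hostname.rstrip(".").lower()
    return any(host.endswith(s) for s in suffixes)

def verify_crawler(ip: str, claimed_agent: str) -> bool:
    """Two-step check: reverse DNS, then forward-confirm the hostname.

    A spoofer can fake a user-agent string, but cannot make an arbitrary
    IP reverse-resolve into the operator's domain AND have that hostname
    resolve back to the same IP.
    """
    suffixes = CRAWLER_DOMAINS.get(claimed_agent)
    if not suffixes:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # PTR lookup
    except OSError:
        return False
    if not hostname_matches(hostname, suffixes):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return False
    return ip in forward_ips
```

The suffix check alone is not sufficient: `fake.googlebot.com.evil.example` ends in neither suffix and is rejected, but only the forward-confirmation step defeats a hostile PTR record.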

Crawl Volume Benchmarks

Understanding typical crawl volumes helps identify abnormal behavior and plan server capacity.

| Crawler | Typical Daily Requests (medium site, ~10K pages) | Crawl Behavior |
| --- | --- | --- |
| GPTBot | 500–2,000 | Steady, well-throttled. Respects crawl-delay. |
| ClaudeBot | 300–1,500 | Conservative. Lowest crawl volume among major AI crawlers. |
| PerplexityBot | 1,000–5,000 | Higher volume due to real-time retrieval. Spikes around trending topics. |
| Google-Extended | 800–3,000 | Follows Googlebot patterns. Throttled by Google's standard crawl budget logic. |
| Bytespider | 5,000–50,000 | Aggressive. Frequently reported for excessive crawling. Recommend blocking if server resources are constrained. |
| CCBot | 2,000–10,000 | Periodic large crawls rather than continuous. May spike during scheduled crawl cycles. |
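To compare your own logs against these benchmarks, you can tally AI-crawler requests from standard combined-format access logs. A minimal sketch, assuming the user-agent is the last double-quoted field on each line (as in the default Apache/Nginx combined format); extend the substring list with any crawler from the reference table:

```python
import re
from collections import Counter

# User-agent substrings for the crawlers benchmarked above.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "ClaudeWeb",
    "Google-Extended", "PerplexityBot", "CCBot", "Bytespider", "Amazonbot",
]

def count_ai_crawlers(log_lines):
    """Tally requests per AI crawler from combined-format access log lines."""
    counts = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1]  # last quoted field = User-Agent
        for crawler in AI_CRAWLERS:
            if crawler in user_agent:
                counts[crawler] += 1
                break
    return counts
```

Run this over a day's log and compare the per-crawler totals against the table; a count an order of magnitude above the benchmark range is a signal to throttle or block.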

Methodology

Crawler data in this cheat sheet was compiled from official documentation published by AI platform operators, verified user-agent strings from server log analysis across Presenc AI's publisher network, and community-contributed data from webmaster forums and industry publications. Crawl volume benchmarks represent median observed values across a sample of 500 websites with 5,000-50,000 indexed pages, monitored between January and March 2026. Default behaviors were verified through controlled testing. This cheat sheet is reviewed and updated monthly. Last update: March 2026.

How Presenc AI Helps

Presenc AI monitors your brand's visibility across all AI platforms powered by these crawlers. Understanding which crawlers to allow and which to block is step one — step two is monitoring how your content actually appears in AI responses after those crawlers have processed it. Presenc AI closes the loop: configure your crawler access, then track the impact on your AI visibility scores, citation rates, and recommendation frequency. Start a free audit to see how your robots.txt configuration is affecting your AI visibility.

Frequently Asked Questions

Which robots.txt configuration should most brands use?

For most brands, we recommend the "balanced" approach: block training-focused crawlers (GPTBot, Google-Extended, CCBot, Bytespider) while allowing retrieval crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeWeb, Googlebot). This protects your content from being used in AI training while preserving your visibility in real-time AI search results and citations.
What is GPTBot?

GPTBot is OpenAI's web crawler used to collect training data for GPT models. It is separate from OAI-SearchBot (which powers real-time ChatGPT search) and ChatGPT-User (which handles user-initiated browsing). Blocking GPTBot prevents your content from being used in future model training, but does NOT affect your visibility in ChatGPT's real-time search results — for that, you would need to block OAI-SearchBot.
How do I block an AI crawler in robots.txt?

Add directives to your robots.txt file using the format: "User-agent: [crawler-name]" followed by "Disallow: /" to block the entire site, or "Disallow: /private-directory/" to block specific paths. For example, to block GPTBot: "User-agent: GPTBot" on one line, "Disallow: /" on the next. Each AI crawler has a specific user-agent string listed in this cheat sheet.
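You can sanity-check this kind of directive offline before deploying it, using Python's standard-library urllib.robotparser. A small sketch with a hypothetical robots.txt following the pattern just described:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere,
# block ClaudeBot only under one path.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private-directory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot is blocked site-wide; ClaudeBot only under /private-directory/.
print(parser.can_fetch("GPTBot", "https://example.com/any-page"))                # False
print(parser.can_fetch("ClaudeBot", "https://example.com/blog/post"))            # True
print(parser.can_fetch("ClaudeBot", "https://example.com/private-directory/x"))  # False
```

Note that robotparser's user-agent matching, like most crawlers', is a prefix/substring match, so the names must be spelled exactly as in the reference table.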
What is the difference between GPTBot and OAI-SearchBot?

GPTBot collects data for training OpenAI's AI models (future knowledge). OAI-SearchBot retrieves web content in real time to power ChatGPT's search feature (current visibility). Blocking GPTBot stops your content from entering future training data. Blocking OAI-SearchBot removes your content from ChatGPT's live search results. Most brands should block GPTBot but allow OAI-SearchBot.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.