AI Crawlers and Their Purposes
Perplexity AI Crawlers
Perplexity operates two distinct crawlers with different purposes:
PerplexityBot – This is the automated web crawler that indexes website content so that pages can be surfaced and linked in Perplexity’s search results. Perplexity’s documentation states that it is not used to crawl content for AI foundation model training.
Perplexity-User – This agent supports user actions within Perplexity. When a user asks a question, it may visit specific web pages to help provide an accurate answer and include links to those pages in the response. It is triggered by a user’s request for real-time information rather than by scheduled crawling [1].
OpenAI Crawlers
OpenAI operates three distinct crawlers for different purposes:
GPTBot – Used for crawling content that may be used in training OpenAI’s generative AI foundation models. It avoids paywalled content and focuses on publicly available data to improve model accuracy and capabilities [3].
ChatGPT-User – This crawler is dispatched when users ask ChatGPT or Custom GPTs to visit a specific web page. It’s not used for automatic crawling or AI training, but rather for direct user requests [3].
OAI-SearchBot – Specifically designed for search functionality within ChatGPT’s search features. It links to and surfaces websites in search results but is not used to crawl content for AI model training.
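Because the three OpenAI agents use separate tokens, a site can treat them differently in robots.txt. The following is a minimal sketch, assuming a hypothetical site at example.com that wants to opt out of GPTBot while leaving ChatGPT-User and OAI-SearchBot unrestricted. It uses Python's standard urllib.robotparser to check how the rules would be read; the helper name allowed_agents is made up for illustration. Honoring robots.txt is up to each crawler, so this shows the stated policy rather than guaranteed behavior.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of GPTBot, leave everything else open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: bool} showing whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(
        ROBOTS_TXT,
        ["GPTBot", "ChatGPT-User", "OAI-SearchBot"],
        "https://example.com/article",  # hypothetical URL
    )
    for agent, ok in result.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

With these rules only GPTBot is reported as blocked, which matches the split between training and user-facing agents described above.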
Anthropic Claude Crawlers
Anthropic operates three crawlers with distinct roles:
ClaudeBot – The main crawler that helps enhance the utility and safety of generative AI models by collecting web content for potential training datasets. It’s been noted as particularly aggressive in its crawling behavior.
Claude-User – Supports Claude AI users by accessing websites when individuals ask questions to Claude. It allows real-time content retrieval in response to user queries.
Claude-SearchBot – Navigates the web to improve search result quality for users, analyzing online content to enhance the relevance and accuracy of search responses.
Meta AI Crawlers
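Meta External Agent (user-agent token meta-externalagent) – Meta’s relatively new web crawler, launched in mid-2024 to collect data for its AI models. It is similar in purpose to OpenAI’s GPTBot, and some reports describe it as a successor to the older “Facebook External Hit” crawler [10][11].
FacebookBot – An older Meta crawler. Meta’s documentation describes it as crawling public pages to improve language models for speech recognition; link previews for shared URLs are generated by the separate facebookexternalhit agent.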
Google AI Crawlers
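Google-Extended – A standalone product token that controls whether a site’s content helps improve Gemini Apps (formerly Bard) and Vertex AI generative APIs. It is a robots.txt control rather than a separate crawler, so it does not appear as its own user agent in server logs. Importantly, blocking Google-Extended does not impact Google Search rankings [13].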
GoogleOther – Used for various Google services beyond traditional search.
Apple AI Crawlers
Applebot – Apple’s web crawler that powers search technology integrated into Spotlight, Siri, and Safari. The data may also be used to train Apple foundation models for Apple Intelligence features.
Applebot-Extended – Specifically for training Apple’s generative AI foundation models. Publishers can opt out of this while still allowing regular Applebot crawling.
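Google-Extended and Applebot-Extended are robots.txt product tokens, so opting out means adding rules for those tokens while leaving Googlebot and Applebot untouched. A minimal sketch, assuming a hypothetical helper named build_opt_out_rules (not part of any library) that generates the rules:

```python
def build_opt_out_rules(tokens):
    """Build robots.txt text that disallows each token site-wide.

    Agents that are not listed are unaffected by the generated rules.
    """
    blocks = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Opt out of AI training use only; regular search crawling is not mentioned.
    print(build_opt_out_rules(["Google-Extended", "Applebot-Extended"]), end="")
```

The generated text would be appended to an existing robots.txt; since it names only the "-Extended" tokens, rules for the regular search crawlers stay as they were.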
ByteDance/TikTok Crawler
Bytespider – ByteDance’s web crawler, reported to scrape at roughly 25 times the rate of OpenAI’s GPTBot and 3,000 times the rate of Anthropic’s ClaudeBot. It is suspected of gathering data for ByteDance’s LLM development.
Other AI Crawlers
Cohere-ai and cohere-training-data-crawler – Operated by Cohere to download training data for their enterprise-focused LLMs.
Amazonbot – Amazon’s crawler for AI search and data collection purposes.
DuckAssistBot – DuckDuckGo’s AI assistant crawler.
PanguBot – Huawei’s AI crawler, associated with its PanGu models.
YouBot – You.com’s AI search crawler.
Traditional Search Engine Crawlers
Major Search Engines
Googlebot – The world’s most active web crawler, indexing content for Google Search. It represents the largest portion of legitimate web crawler traffic.
Bingbot – Microsoft’s search crawler for Bing, the second-largest search engine crawler by volume.
Baiduspider – The crawler for Baidu, China’s dominant search engine; essential for visibility in Chinese markets.
YandexBot – Russia’s primary search engine crawler, using advanced AI algorithms for understanding user intent.
Slurp – Yahoo’s legacy crawler, still active but less common than in previous years.
SEO Tool Crawlers
AhrefsBot – Currently the second most active web crawler globally and the most active in the “Search Engine Optimization” category. It crawls for backlink analysis and SEO insights.
SemrushBot-SWA and SemrushBot-OCOB – Semrush’s crawlers for SEO analysis and competitive intelligence.
MJ12bot – Majestic’s crawler for building its backlink database and link metrics.
MozBot – Moz’s crawler for its SEO tools and Domain Authority calculations; in server logs, Moz’s crawlers identify themselves as rogerbot and DotBot.
Understanding the Difference Between Reference and User Click
Your specific question about PerplexityBot vs Perplexity-User highlights an important distinction:
- PerplexityBot appears in logs when Perplexity’s automated systems crawl and index your content for its search index.
- Perplexity-User appears when a Perplexity user asks a question that requires fetching your specific page. A user who clicks a link in a Perplexity response arrives through their own browser, so that visit shows the browser’s user agent (often with a Perplexity referrer) rather than Perplexity-User.
This same pattern applies to other AI services:
- ChatGPT-User vs GPTBot
- Claude-User vs ClaudeBot
- Perplexity-User vs PerplexityBot
The “-User” variants represent human-initiated actions, while the base bot names represent automated crawling for training or indexing purposes.
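To apply this distinction to server logs, one option is to match the user-agent field against the two groups of tokens. The sketch below is illustrative only: the function names, the sample log line, and the version number in its user-agent string are assumptions, and it presumes the common "combined" access-log layout in which the user agent is the last quoted field. User-agent strings can be spoofed, so a match is a hint rather than proof of origin.

```python
import re

# Tokens taken from the pairs listed above. Case-insensitive substring
# matching is an assumption of this sketch, not a documented standard.
USER_TRIGGERED = ("ChatGPT-User", "Claude-User", "Perplexity-User")
AUTOMATED = ("GPTBot", "ClaudeBot", "PerplexityBot")

# In the combined log format the user agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def classify_user_agent(user_agent):
    """Return (category, token) for a raw user-agent string."""
    ua = user_agent.lower()
    for token in USER_TRIGGERED:
        if token.lower() in ua:
            return "user-triggered", token
    for token in AUTOMATED:
        if token.lower() in ua:
            return "automated", token
    return "other", None


def classify_log_line(line):
    """Extract the user agent from one access-log line and classify it."""
    match = UA_FIELD.search(line)
    return classify_user_agent(match.group(1) if match else "")


if __name__ == "__main__":
    # Hypothetical log line for illustration.
    sample = (
        '203.0.113.7 - - [01/Mar/2025:10:00:00 +0000] "GET /post HTTP/1.1" '
        '200 512 "-" "Mozilla/5.0 (compatible; Perplexity-User/1.0)"'
    )
    print(classify_log_line(sample))
```

Anything that falls into the "other" category here, including ordinary browsers and the remaining crawlers in this list, would need its own tokens added to the two tuples.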
This comprehensive list covers the major AI and traditional crawlers active in 2025, each serving specific purposes from AI model training to search indexing and real-time user assistance.