Every major AI company now deploys web crawlers to index the internet. These bots power the "web browsing" features of ChatGPT, Perplexity, Claude, and Google's AI Overviews, and the major ones honor your robots.txt file much the way Googlebot does — though compliance is voluntary, so treat robots.txt as a request, not a lock.
If you've accidentally blocked them, your content won't appear in AI-generated answers. If you've intentionally blocked them without thinking it through, you may be cutting off a growing source of brand visibility. Let me walk you through who these bots are and how to handle them correctly.
The Major AI Crawlers in 2025
GPTBot (OpenAI)
GPTBot is OpenAI's web crawler, used to gather training data for GPT-4o and future models. It was disclosed publicly in August 2023. (OpenAI also operates separate user-agents — OAI-SearchBot and ChatGPT-User — for ChatGPT's search and browsing features, so blocking GPTBot does not by itself block those.)
- User-agent: GPTBot
- IP range: 20.15.0.0/16
- Documentation: openai.com/gptbot
- Purpose: Model training
ClaudeBot (Anthropic)
Anthropic's crawler powers Claude's web access features and contributes to training datasets.
- User-agent: ClaudeBot
- Documentation: anthropic.com/crawlers
- Purpose: Web browsing + training data
PerplexityBot (Perplexity AI)
Perplexity crawls the web in real time to answer user queries. Because Perplexity cites its sources with clickable links, getting cited directly drives referral traffic — which makes it the most immediately valuable AI crawler for many content sites.
- User-agent: PerplexityBot
- Documentation: docs.perplexity.ai/guides/bots
- Purpose: Real-time search indexing
Google-Extended
Google-Extended is not a separate crawler but a robots.txt control token: Googlebot does the actual fetching, and the token tells Google whether that content may be used to train and ground its AI systems, such as Gemini. Blocking it does NOT affect your Google Search rankings.
- User-agent token: Google-Extended
- Purpose: AI training data only
CCBot (Common Crawl)
Common Crawl is a nonprofit that crawls the web and releases public datasets used by most AI companies for training. Blocking CCBot affects your content's presence in many AI training sets, including ones you might actually want to be in.
- User-agent: CCBot
- Purpose: Open web archive for AI training
Other Notable AI Crawlers
- Amazonbot — Amazon's crawler (used by Alexa/AI features)
- Diffbot — Powers knowledge graphs used by AI applications
- YouBot — You.com AI search engine
- Omgili / Webz.io — Data providers for AI systems
Should You Allow or Block AI Crawlers?
When to Allow All AI Crawlers (Recommended for Most Sites)
If your goal is brand visibility and being cited in AI answers, allow all major AI crawlers. This matters most for:
- Businesses that want brand mentions in ChatGPT/Perplexity answers
- Content publishers seeking referral traffic from AI search
- Experts building authority in their field
- E-commerce sites wanting products mentioned in AI shopping recommendations
When Blocking Makes Sense
There are real reasons to restrict certain crawlers. This isn't about being anti-AI — it's about protecting your business:
- Paywalled content — Block training crawlers (GPTBot, Google-Extended) but allow real-time search bots (PerplexityBot)
- Sensitive business data — Internal documentation, pricing strategies you don't want scraped
- Legal constraints — Some industries have specific data handling requirements
- Server load — Aggressive crawlers can genuinely impact performance on smaller hosts
How to Configure robots.txt for AI Crawlers
Allow All AI Crawlers (Most Sites)
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: CCBot
Allow: /
User-agent: Google-Extended
Allow: /
You don't technically need "Allow: /" — crawling is permitted by default when no rule disallows it — but listing each bot explicitly signals to AI companies that you want them there.
Block AI Training Crawlers Only (Allow Search Bots)
# Block AI training data crawlers
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
# Allow real-time AI search crawlers
User-agent: PerplexityBot
Allow: /
Block All AI Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Selective Path Blocking
Block AI crawlers from specific sections only:
User-agent: GPTBot
Disallow: /members/
Disallow: /premium-content/
Allow: /blog/
Allow: /
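Before deploying path rules like these, you can sanity-check them locally with Python's standard-library robots.txt parser. This is a quick sketch using the exact rules and paths from the example above (example.com is a placeholder); note that urllib.robotparser does not support wildcards inside paths, which is fine for prefix rules like these:

```python
# Local check of the selective-blocking rules above. Paste your own
# robots.txt contents into `rules` to test other configurations.
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /members/
Disallow: /premium-content/
Allow: /blog/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Blocked section: GPTBot may not fetch member pages.
print(rp.can_fetch("GPTBot", "https://example.com/members/area"))   # False
# Open section: the blog stays crawlable.
print(rp.can_fetch("GPTBot", "https://example.com/blog/my-post"))   # True
```

Rules are matched as path prefixes in order, so the two Disallow lines win for their sections while everything else falls through to the Allow lines.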
The Wildcard Trap
I've seen this hurt sites more than any other robots.txt mistake. A wildcard disallow blocks every bot that doesn't have its own user-agent group — which in practice means every AI crawler you haven't listed explicitly:
# DANGEROUS — blocks ALL bots including AI crawlers
User-agent: *
Disallow: /
If you have this in your robots.txt, you're blocking everything. Use specific user-agents instead, or apply wildcard rules only to specific paths:
User-agent: *
Disallow: /wp-admin/
Disallow: /private/
Allow: /
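The precedence behavior is easy to verify for yourself. In this sketch (standard library only, example.com as a placeholder), the bare wildcard file blocks an AI crawler outright, while adding a bot-specific group overrides the wildcard for that one bot:

```python
# Demonstrates the wildcard trap with Python's stdlib robots.txt parser.
import urllib.robotparser

def parser_for(rules: str) -> urllib.robotparser.RobotFileParser:
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp

# The dangerous file: a bare wildcard disallow blocks every bot...
dangerous = parser_for("User-agent: *\nDisallow: /\n")
print(dangerous.can_fetch("GPTBot", "https://example.com/blog/"))  # False

# ...but a bot-specific group takes precedence over the wildcard group,
# so an explicitly allowed crawler gets through even alongside it.
mixed = parser_for(
    "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nAllow: /\n"
)
print(mixed.can_fetch("GPTBot", "https://example.com/blog/"))      # True
print(mixed.can_fetch("ClaudeBot", "https://example.com/blog/"))   # False
```

In other words, a bot reads only the most specific group that matches its user-agent; the wildcard group applies only to bots with no group of their own.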
Checking If AI Crawlers Can Access Your Site
Use our free AI Crawlability Checker to see which AI bots are blocked on your site. You can also run your URL through our GEO Readiness Score, which checks AI crawler access as one of 12 signals.
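If you'd rather script the check yourself, a minimal version needs only the standard library. In this sketch the helper name and bot list are my own (covering the crawlers above); you pass in robots.txt text and a URL, and it reports which AI user-agents the rules would admit — here tested against the "block training crawlers, allow search bots" file from earlier:

```python
# Minimal DIY crawlability check: given robots.txt text and a URL, report
# which major AI crawlers the rules allow to fetch that URL.
import urllib.robotparser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Google-Extended"]

def check_ai_access(robots_txt: str, url: str) -> dict:
    """Return {bot_name: allowed?} for each AI crawler against `url`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in AI_BOTS}

# The "block training crawlers, allow real-time search" configuration:
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""
print(check_ai_access(sample, "https://example.com/blog/post"))
```

Note that ClaudeBot comes back as allowed even though the file never mentions it — a bot with no matching group and no wildcard group is permitted by default, which is worth remembering when you audit a "block list" style robots.txt.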
Robots.txt vs. Meta Robots Tags
robots.txt controls crawl access. Meta robots tags (<meta name="robots" content="noindex">) control indexing. For AI systems, robots.txt is the primary control mechanism, since the major AI crawlers document support for it; noindex support varies by crawler, but adding it to pages you don't want cited is still worth doing.
One thing worth remembering: allowing AI crawlers doesn't guarantee citations. You still need strong content, structured data, and the other GEO signals. But blocking them guarantees you won't be cited. Start by making sure the door is open.