robots.txt is one of the oldest files on the web and still one of the most misunderstood. Get it wrong and you can accidentally block Googlebot from your entire site. I've seen this happen to a six-figure-traffic site after a developer added one careless line during a staging-to-production migration.
This guide covers everything: syntax, directives, common patterns, and the mistakes that cost rankings.
What Is robots.txt?
robots.txt is a plain text file at your domain root (https://yourdomain.com/robots.txt) that tells web crawlers which pages they can and cannot access. It's part of the Robots Exclusion Protocol, first proposed in 1994.
Key things to understand from the start:
- robots.txt controls crawl access, not indexing. A page can be indexed even if it's blocked in robots.txt if Google has other signals about it.
- Well-behaved bots respect it. Malicious bots ignore it entirely.
- Blocking Googlebot from a page doesn't remove it from the index; use noindex meta tags for that.
robots.txt Syntax
The file is made up of groups called "records." Each record applies to one or more user agents:
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html
Crawl-delay: 2
Sitemap: https://yourdomain.com/sitemap.xml
User-agent
Specifies which crawler the rules apply to. * is a wildcard that applies to all bots:
User-agent: *
Disallow: /wp-admin/
You can have multiple user-agent rules in one file. More specific rules override the wildcard for that bot. Common crawlers you might target by name:
- Googlebot: Google's main crawler
- Googlebot-Image: Google Image Search
- Bingbot: Microsoft Bing
- GPTBot: OpenAI / ChatGPT
- ClaudeBot: Anthropic / Claude
- PerplexityBot: Perplexity AI
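To sanity-check how a rule set treats different bots, Python's standard-library urllib.robotparser can parse it directly. A minimal sketch (note that Python's parser implements the classic 1994 spec, not every Google extension):

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory rule set instead of fetching a live file
rules = """\
User-agent: *
Disallow: /wp-admin/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The wildcard group applies to Googlebot; the GPTBot group overrides it
print(rp.can_fetch("Googlebot", "https://yourdomain.com/blog/"))      # True
print(rp.can_fetch("Googlebot", "https://yourdomain.com/wp-admin/"))  # False
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog/"))         # False
```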
Disallow
Blocks the user-agent from crawling the specified path and everything below it:
User-agent: *
Disallow: /admin/ # Blocks /admin/ and all pages under it
Disallow: /private.html # Blocks a specific page
Disallow: / # Blocks your ENTIRE site (dangerous)
An empty Disallow means nothing is blocked:
User-agent: *
Disallow: # Allow everything (same as no robots.txt)
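Plain Disallow rules match by URL-path prefix, which can be sketched in a few lines (simplified: wildcards are ignored here, and is_blocked is a hypothetical helper, not part of any library):

```python
from urllib.parse import urlparse

def is_blocked(url: str, disallow_path: str) -> bool:
    """True if a plain (wildcard-free) Disallow path covers this URL.
    An empty Disallow value blocks nothing."""
    if not disallow_path:
        return False
    # Prefix match against the URL's path component only
    return urlparse(url).path.startswith(disallow_path)

print(is_blocked("https://yourdomain.com/admin/users", "/admin/"))         # True
print(is_blocked("https://yourdomain.com/private.html", "/private.html"))  # True
print(is_blocked("https://yourdomain.com/blog/", ""))                      # False
print(is_blocked("https://yourdomain.com/anything", "/"))                  # True
```

The last call shows why Disallow: / is so dangerous: every path starts with /.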
Allow
Overrides a Disallow for a specific path. Used when you want to block a directory but allow certain pages within it:
User-agent: Googlebot
Disallow: /members/
Allow: /members/free-preview/
Specificity wins: when both an Allow and a Disallow match a URL, Google applies the rule with the longest path, and on a tie the less restrictive Allow wins. So Allow: /members/free-preview/ beats Disallow: /members/ for pages under that path.
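That precedence rule can be sketched as a small resolver (a simplified model of Google's documented longest-match behavior; wildcards and percent-encoding are ignored, and resolve is a hypothetical helper):

```python
def resolve(path: str, rules: list[tuple[str, str]]) -> bool:
    """Crawlable? The matching rule with the longest path wins; on a
    tie, Allow wins. `rules` holds ("allow" | "disallow", path) pairs;
    if no rule matches, the path is crawlable."""
    best_len, allowed = -1, True
    for kind, rule_path in rules:
        if rule_path and path.startswith(rule_path):
            is_allow = kind == "allow"
            if len(rule_path) > best_len or (len(rule_path) == best_len and is_allow):
                best_len, allowed = len(rule_path), is_allow
    return allowed

rules = [("disallow", "/members/"), ("allow", "/members/free-preview/")]
print(resolve("/members/free-preview/intro.html", rules))  # True
print(resolve("/members/premium/index.html", rules))       # False
```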
Crawl-delay
Tells crawlers to wait a set number of seconds between requests, reducing server load. Google ignores Crawl-delay in robots.txt entirely; Bingbot respects it.
User-agent: Bingbot
Crawl-delay: 3
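You can read a Crawl-delay value programmatically with urllib.robotparser, which returns None for agents whose group doesn't set one:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: Bingbot
Crawl-delay: 3
""".splitlines())

print(rp.crawl_delay("Bingbot"))    # 3
print(rp.crawl_delay("Googlebot"))  # None (no matching group)
```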
Sitemap Declaration
robots.txt is a convenient place to declare your XML sitemap location, making it easier for any crawler to find it:
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-images.xml
This goes at the end of the file and isn't part of a user-agent record. You can list multiple sitemaps.
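Because Sitemap lines sit outside any user-agent record, parsers expose them separately; Python's urllib.robotparser (3.8+) returns them via site_maps():

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-images.xml
""".splitlines())

# All declared sitemaps, in file order
print(rp.site_maps())
# ['https://yourdomain.com/sitemap.xml', 'https://yourdomain.com/sitemap-images.xml']
```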
A Practical robots.txt Example
# Allow all bots, block common non-public areas
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /api/
Allow: /wp-admin/admin-ajax.php
# Block AI training crawlers (optional, see GEO guide)
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Declare sitemaps
Sitemap: https://yourdomain.com/sitemap.xml
What Should You Actually Block?
Block paths that waste crawl budget or expose things you don't want indexed:
- Admin areas (/wp-admin/, /admin/)
- Duplicate content (/tag/ archives, /?s= search result pages)
- Cart, checkout, and account pages on e-commerce sites
- Internal search results (/search?q=)
- Print versions of pages (/print/)
- API endpoints
Don't block CSS, JavaScript, or fonts. Google needs to render your pages to understand them properly. Blocking these files is a common legacy mistake that can actually hurt rankings.
Common robots.txt Mistakes
The Full-Site Block
The most catastrophic error. This one line blocks every crawler from every page:
User-agent: *
Disallow: /
I've seen this appear in production after someone copied a staging robots.txt. Check your live robots.txt right now if you haven't recently.
Blocking CSS and JavaScript
# Don't do this
Disallow: /wp-content/
Disallow: /*.js
Disallow: /*.css
Google renders your pages before indexing them. If it can't load your CSS and JS, it sees a broken page.
Wildcard Pattern Confusion
Google supports * (wildcard) and $ (end of URL) in paths:
Disallow: /*.pdf$ # Blocks all URLs ending in .pdf
Disallow: /search?* # Blocks all search query URLs
Some bots don't support wildcards the same way. Check how Google reads your file with the robots.txt report in Search Console.
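Under Google's semantics, a path pattern can be translated into a regular expression: * becomes "any run of characters" and a trailing $ becomes an end anchor. A sketch (pattern_to_regex is a hypothetical helper; real crawlers also percent-decode URLs before matching):

```python
import re

def pattern_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern to a regex: `*` matches any
    run of characters, a trailing `$` anchors the end of the URL, and
    everything else is a literal prefix."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape literal segments, then rejoin with the regex wildcard
    regex = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

pdf_rule = pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))      # True
print(bool(pdf_rule.match("/files/report.pdf?v=2")))  # False: $ anchors the end

search_rule = pattern_to_regex("/search?*")
print(bool(search_rule.match("/search?q=robots")))    # True
```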
Thinking robots.txt Prevents Indexing
It doesn't. If Google can't crawl a page but knows it exists from a link, it may still index the URL (showing "Indexed, though blocked by robots.txt" in Search Console). For pages you never want indexed, use <meta name="robots" content="noindex"> on the page itself.
Testing Your robots.txt
- Google Search Console's robots.txt report (under Settings) shows the file Google last fetched and flags any rules it couldn't parse
- Visit https://yourdomain.com/robots.txt directly to confirm it's live and formatted correctly
- Use our Robots.txt Generator to build a clean file with the right syntax
Check your robots.txt every time you do a major site migration, platform change, or when traffic unexpectedly drops. It's a five-second check that can diagnose catastrophic problems.