robots.txt is one of the oldest files on the web and still one of the most misunderstood. Get it wrong and you can accidentally block Googlebot from your entire site. I've seen this happen to a six-figure-traffic site after a developer added one careless line during a staging-to-production migration.
This guide covers everything: syntax, directives, common patterns, and the mistakes that cost rankings.
What Is robots.txt?
robots.txt is a plain text file at your domain root (https://yourdomain.com/robots.txt) that tells web crawlers which pages they can and cannot access. It's part of the Robots Exclusion Protocol, first proposed in 1994.
Key things to understand from the start:
- robots.txt controls crawl access, not indexing. A page can be indexed even if it's blocked in robots.txt if Google has other signals about it.
- Well-behaved bots respect it. Malicious bots ignore it entirely.
- Blocking Googlebot from a page doesn't remove it from the index; use noindex meta tags for that.
robots.txt Syntax
The file is made up of groups called "records." Each record applies to one or more user agents:
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html
Crawl-delay: 2
Sitemap: https://yourdomain.com/sitemap.xml
User-agent
Specifies which crawler the rules apply to. * is a wildcard that applies to all bots:
User-agent: *
Disallow: /wp-admin/
You can have multiple user-agent rules in one file. More specific rules override the wildcard for that bot. Common crawlers you might target by name:
- Googlebot: Google's main crawler
- Googlebot-Image: Google Image Search
- Bingbot: Microsoft Bing
- GPTBot: OpenAI / ChatGPT
- ClaudeBot: Anthropic / Claude
- PerplexityBot: Perplexity AI
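To sanity-check how a rule set treats different bots, Python's standard-library urllib.robotparser can parse it directly. A minimal sketch (note that Python's parser implements the classic 1994 spec, not every Google extension):

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory rule set instead of fetching a live file
rules = """\
User-agent: *
Disallow: /wp-admin/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The wildcard group applies to Googlebot; the GPTBot group overrides it
print(rp.can_fetch("Googlebot", "https://yourdomain.com/blog/"))      # True
print(rp.can_fetch("Googlebot", "https://yourdomain.com/wp-admin/"))  # False
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog/"))         # False
```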
Disallow
Blocks the user-agent from crawling the specified path and everything below it:
User-agent: *
Disallow: /admin/ # Blocks /admin/ and all pages under it
Disallow: /private.html # Blocks a specific page
Disallow: / # Blocks your ENTIRE site (dangerous)
An empty Disallow means nothing is blocked:
User-agent: *
Disallow: # Allow everything (same as no robots.txt)
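Plain Disallow rules match by URL-path prefix, which can be sketched in a few lines (simplified: wildcards are ignored here, and is_blocked is a hypothetical helper, not part of any library):

```python
from urllib.parse import urlparse

def is_blocked(url: str, disallow_path: str) -> bool:
    """True if a plain (wildcard-free) Disallow path covers this URL.
    An empty Disallow value blocks nothing."""
    if not disallow_path:
        return False
    # Prefix match against the URL's path component only
    return urlparse(url).path.startswith(disallow_path)

print(is_blocked("https://yourdomain.com/admin/users", "/admin/"))         # True
print(is_blocked("https://yourdomain.com/private.html", "/private.html"))  # True
print(is_blocked("https://yourdomain.com/blog/", ""))                      # False
print(is_blocked("https://yourdomain.com/anything", "/"))                  # True
```

The last call shows why Disallow: / is so dangerous: every path starts with /.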
Allow
Overrides a Disallow for a specific path. Used when you want to block a directory but allow certain pages within it:
User-agent: Googlebot
Disallow: /members/
Allow: /members/free-preview/
Specificity wins: when both an Allow and a Disallow match a URL, Google applies the rule with the longest path, and on a tie the less restrictive Allow wins. So Allow: /members/free-preview/ beats Disallow: /members/ for pages under that path.
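That precedence rule can be sketched as a small resolver (a simplified model of Google's documented longest-match behavior; wildcards and percent-encoding are ignored, and resolve is a hypothetical helper):

```python
def resolve(path: str, rules: list[tuple[str, str]]) -> bool:
    """Crawlable? The matching rule with the longest path wins; on a
    tie, Allow wins. `rules` holds ("allow" | "disallow", path) pairs;
    if no rule matches, the path is crawlable."""
    best_len, allowed = -1, True
    for kind, rule_path in rules:
        if rule_path and path.startswith(rule_path):
            is_allow = kind == "allow"
            if len(rule_path) > best_len or (len(rule_path) == best_len and is_allow):
                best_len, allowed = len(rule_path), is_allow
    return allowed

rules = [("disallow", "/members/"), ("allow", "/members/free-preview/")]
print(resolve("/members/free-preview/intro.html", rules))  # True
print(resolve("/members/premium/index.html", rules))       # False
```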
Crawl-delay
Tells crawlers to wait a set number of seconds between requests, reducing server load. Google ignores Crawl-delay in robots.txt entirely; Bingbot respects it.
User-agent: Bingbot
Crawl-delay: 3
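You can read a Crawl-delay value programmatically with urllib.robotparser, which returns None for agents whose group doesn't set one:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: Bingbot
Crawl-delay: 3
""".splitlines())

print(rp.crawl_delay("Bingbot"))    # 3
print(rp.crawl_delay("Googlebot"))  # None (no matching group)
```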
Sitemap Declaration
robots.txt is a convenient place to declare your XML sitemap location, making it easier for any crawler to find it:
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-images.xml
This goes at the end of the file and isn't part of a user-agent record. You can list multiple sitemaps.
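Because Sitemap lines sit outside any user-agent record, parsers expose them separately; Python's urllib.robotparser (3.8+) returns them via site_maps():

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-images.xml
""".splitlines())

# All declared sitemaps, in file order
print(rp.site_maps())
# ['https://yourdomain.com/sitemap.xml', 'https://yourdomain.com/sitemap-images.xml']
```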
A Practical robots.txt Example
# Allow all bots, block common non-public areas
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /api/
Allow: /wp-admin/admin-ajax.php
# Block AI training crawlers (optional, see GEO guide)
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Declare sitemaps
Sitemap: https://yourdomain.com/sitemap.xml
What Should You Actually Block?
Block paths that waste crawl budget or expose things you don't want indexed:
- Admin areas (/wp-admin/, /admin/)
- Duplicate content (/tag/ archives, /?s= search result pages)
- Cart, checkout, and account pages on e-commerce sites
- Internal search results (/search?q=)
- Print versions of pages (/print/)
- API endpoints
Don't block CSS, JavaScript, or fonts. Google needs to render your pages to understand them properly. Blocking these files is a common legacy mistake that can actually hurt rankings.
Common robots.txt Mistakes
The Full-Site Block
The most catastrophic error. This one line blocks every crawler from every page:
User-agent: *
Disallow: /
I've seen this appear in production after someone copied a staging robots.txt. Check your live robots.txt right now if you haven't recently.
Blocking CSS and JavaScript
# Don't do this
Disallow: /wp-content/
Disallow: /*.js
Disallow: /*.css
Google renders your pages before indexing them. If it can't load your CSS and JS, it sees a broken page.
Wildcard Pattern Confusion
Google supports * (wildcard) and $ (end of URL) in paths:
Disallow: /*.pdf$ # Blocks all URLs ending in .pdf
Disallow: /search?* # Blocks all search query URLs
Some bots don't support wildcards the same way. Check how Google reads your file with the robots.txt report in Search Console.
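Under Google's semantics, a path pattern can be translated into a regular expression: * becomes "any run of characters" and a trailing $ becomes an end anchor. A sketch (pattern_to_regex is a hypothetical helper; real crawlers also percent-decode URLs before matching):

```python
import re

def pattern_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern to a regex: `*` matches any
    run of characters, a trailing `$` anchors the end of the URL, and
    everything else is a literal prefix."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape literal segments, then rejoin with the regex wildcard
    regex = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

pdf_rule = pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))      # True
print(bool(pdf_rule.match("/files/report.pdf?v=2")))  # False: $ anchors the end

search_rule = pattern_to_regex("/search?*")
print(bool(search_rule.match("/search?q=robots")))    # True
```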
Thinking robots.txt Prevents Indexing
It doesn't. If Google can't crawl a page but knows it exists from a link, it may still index the URL (showing "Indexed, though blocked by robots.txt" in Search Console). For pages you never want indexed, use <meta name="robots" content="noindex"> on the page itself.
Testing Your robots.txt
- Google Search Console's robots.txt report (under Settings) shows the file Google last fetched and flags any rules it couldn't parse
- Visit https://yourdomain.com/robots.txt directly to confirm it's live and formatted correctly
- Use our Robots.txt Generator to build a clean file with the right syntax
Check your robots.txt every time you do a major site migration, platform change, or when traffic unexpectedly drops. It's a five-second check that can diagnose catastrophic problems.