Robots.txt Configuration Guide: Directives, Rules, and Common Mistakes
The robots.txt file is one of the oldest and most fundamental tools in a webmaster's arsenal. Located at the root of your domain (e.g., https://example.com/robots.txt), this plain-text file communicates crawling instructions to search engine bots and other automated agents. Despite its simplicity, misconfigured robots.txt files are one of the most common causes of indexing problems, with studies showing that approximately 25% of websites have at least one robots.txt issue that negatively impacts their search visibility.
Proper robots.txt configuration is a foundational element of technical SEO, giving you direct control over how search engine crawlers interact with your site's URL space.
How Robots.txt Works
When a crawler like Googlebot arrives at your domain, the first thing it does is request /robots.txt. The file contains rules organized by user-agent (the name of the crawler) that specify which URL paths the crawler is allowed or disallowed from accessing. Crawlers that follow the Robots Exclusion Protocol (REP) will respect these directives, though compliance is voluntary, not enforced.
It is critical to understand what robots.txt does and does not do. It controls crawling, not indexing. Blocking a URL in robots.txt prevents the crawler from fetching the page's content, but it does not prevent the URL from appearing in search results. If other pages link to a disallowed URL, Google may still index it with a "URL is blocked by robots.txt" note and display limited information in search results. To prevent indexing, you need a noindex meta tag or X-Robots-Tag HTTP header on the page itself, which requires the page to be crawlable.
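For illustration, the page-level alternative looks like this; the directive must live on a page that crawlers are permitted to fetch:

```html
<!-- In the page's <head>: the URL stays crawlable, but compliant crawlers
     are asked not to index it -->
<meta name="robots" content="noindex">
```

The X-Robots-Tag header mentioned above works the same way for non-HTML resources such as PDFs.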
Robots.txt Syntax and Directives
User-agent
The User-agent: directive specifies which crawler the following rules apply to. Use User-agent: * for rules that apply to all crawlers. For crawler-specific rules, use the exact bot name, such as User-agent: Googlebot or User-agent: Bingbot. In 2026, you may also want specific rules for AI training crawlers like GPTBot, Google-Extended, ClaudeBot, and CCBot.
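A crawler obeys only the group that best matches its name, falling back to the * group if no specific group exists. A brief sketch, with placeholder paths:

```
# Applies to every crawler without a more specific group below
User-agent: *
Disallow: /tmp/

# Applies only to OpenAI's GPTBot, which ignores the * group above
User-agent: GPTBot
Disallow: /
```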
Disallow
The Disallow: directive blocks crawling of the specified URL path and anything below it. Examples:
- Disallow: /admin/ — Blocks all URLs starting with /admin/
- Disallow: /search — Blocks /search and any URL starting with /search (including /search-results and /search?q=test)
- Disallow: / — Blocks the entire site
- Disallow: (empty value) — Allows everything (effectively no restriction)
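Put together in a single file, rules like these might look as follows (the paths are illustrative):

```
User-agent: *
Disallow: /admin/    # blocks /admin/, /admin/login, /admin/users?id=3
Disallow: /search    # blocks /search, /search-results, /search?q=test
```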
Allow
The Allow: directive permits crawling of a specific path that would otherwise be blocked by a broader Disallow rule. This is useful for creating exceptions. For example, if you disallow /private/ but want one subdirectory to be crawlable, you would add Allow: /private/public-reports/ alongside the Disallow: /private/ rule; for Google and Bing, the more specific (longer) path wins regardless of the order of the lines.
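A minimal sketch of that exception, with placeholder directory names:

```
User-agent: *
Disallow: /private/
Allow: /private/public-reports/   # longer path, so it overrides the Disallow above
```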
Sitemap
The Sitemap: directive tells crawlers where to find your XML sitemap. You can list multiple sitemaps. The directive is not user-agent specific and applies globally, so it can appear anywhere in the file; by convention it is usually placed at the bottom.
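For example, with placeholder sitemap URLs:

```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
```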
Pattern Matching with Wildcards
Google and Bing support two wildcard characters in robots.txt rules:
- * (asterisk) — Matches any sequence of characters. Example: Disallow: /*.pdf$ blocks all URLs ending in .pdf.
- $ (dollar sign) — Matches the end of the URL. Without $, Disallow: /file would block /file, /file.html, and /files/. Adding $ makes the match exact.
These wildcards enable precise control. For instance, Disallow: /*?*sort= blocks any URL containing a "sort" query parameter, which is useful for preventing crawling of faceted navigation URLs that create duplicate content.
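A short sketch using both wildcards; the patterns are examples, not a recommendation for every site:

```
User-agent: *
Disallow: /*.pdf$      # any URL that ends in .pdf
Disallow: /*?*sort=    # any URL whose query string contains "sort="
```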
Common Robots.txt Mistakes
Over years of site auditing, certain robots.txt mistakes appear repeatedly. Avoiding these saves significant debugging time.
- Blocking CSS, JavaScript, or image files. Googlebot needs to access these resources to render your pages properly. Blocking them prevents Google from understanding your page layout and content, which can severely hurt rankings. Always allow access to /wp-content/, /static/, /assets/, and similar resource directories.
- Using robots.txt to hide pages from search results. As explained above, Disallow prevents crawling but not indexing. Pages blocked by robots.txt can still appear in search results with minimal information. Use noindex meta tags instead.
- Blocking entire subdirectories accidentally. A trailing slash matters: Disallow: /blog blocks /blog, /blog/, and /blog-post-title. If you only meant to block the /blog/ directory, use Disallow: /blog/ specifically.
- Forgetting to update after site migrations. When you redesign or restructure your site, the robots.txt from the old site may block important paths on the new site. Always review robots.txt as part of any migration checklist.
- Missing or inaccessible robots.txt. If /robots.txt returns a 5xx error, Google treats it as a temporary restriction and may delay crawling. A 404 is fine (it means no restrictions), but a server error is problematic.
- Conflicting rules without priority understanding. When multiple rules match a URL, Google uses the most specific matching rule (the one with the longest path), not the first one listed. Understand this precedence to avoid unintended behavior; see the sketch after this list.
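For instance, in the following sketch the Allow rule wins for Google even though it appears second (the file path is hypothetical):

```
User-agent: *
Disallow: /downloads/
Allow: /downloads/whitepaper.pdf
# /downloads/whitepaper.pdf remains crawlable: the Allow path is longer (more
# specific) than the Disallow path, so it takes precedence regardless of order.
```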
Robots.txt for AI Crawlers in 2026
The rise of AI-powered search and large language models has introduced a new category of web crawlers. Many website operators now want to control whether their content is used for AI training while still allowing traditional search indexing. Major AI companies have introduced dedicated user agents:
- GPTBot — OpenAI's crawler for gathering training data
- Google-Extended — Google's control token for Gemini AI training (not a separate crawler; it is honored by Google's existing crawlers and is independent of Googlebot's search crawling)
- ClaudeBot — Anthropic's web crawler
- CCBot — Common Crawl's bot, whose dataset is used by many AI companies
You can block these bots individually while keeping your site fully accessible to Googlebot and Bingbot. Be aware that blocking these crawlers does not retroactively remove your content from existing AI training datasets. It only prevents future crawling.
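A sketch of a policy that opts out of these AI crawlers while leaving traditional search crawlers unrestricted; adapt the list of user agents to your own policy:

```
# Traditional search crawlers: no restrictions
User-agent: *
Disallow:

# AI training crawlers: blocked site-wide
User-agent: GPTBot
User-agent: Google-Extended
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /
```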
Testing and Validating Your Robots.txt
Always test your robots.txt before deploying changes to production. Google Search Console's robots.txt report shows which robots.txt files Google has found, when they were last crawled, and any parsing warnings or errors, and the URL Inspection tool reports whether a specific URL is blocked by robots.txt. For programmatic testing, Google's open-source robots.txt parser library (available on GitHub) can validate rules in your CI/CD pipeline.
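As a lighter-weight sketch, Python's standard-library parser can run per-URL checks in a test suite. The domain and paths below are placeholders, and note that this parser implements the original REP and does not understand Google-style * and $ wildcards:

```python
from urllib import robotparser

# Fetch and parse the live robots.txt, then check a few representative URLs
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for path in ("/blog/post-1", "/admin/login", "/search?q=test"):
    allowed = rp.can_fetch("Googlebot", f"https://example.com{path}")
    print(f"{path}: {'allowed' if allowed else 'blocked'} for Googlebot")
```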
A well-configured robots.txt file is a powerful gatekeeper. It directs crawlers toward your valuable content and away from technical endpoints, duplicate pages, and private areas. But its power also means that a single misplaced rule can hide your entire site from search engines.
Review your robots.txt at least quarterly, after every site migration, and whenever you add new sections or functionality to your site. Keep the file as simple as possible, document your rules with comments (lines starting with #), and always validate changes before deploying them live.