The robots.txt file is a plain-text file at your domain root that tells search engine crawlers which pages they may and may not crawl. The newer llms.txt file plays a complementary role for AI assistants: rather than controlling access, it describes your site to them.
## robots.txt Basics
Place robots.txt at your domain root (e.g., https://example.com/robots.txt). It uses a simple syntax:
```
# Allow all crawlers access to everything
# except the admin and private directories
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

# Sitemap location
Sitemap: https://example.com/sitemap.xml
```
## Key robots.txt Rules

- `User-agent` — Specifies which crawler the rules apply to (`*` means all)
- `Disallow` — Blocks the specified path from being crawled
- `Allow` — Explicitly allows a path (useful for overriding broader `Disallow` rules)
- `Sitemap` — Points to your XML sitemap location
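These rules can be checked programmatically with Python's standard-library `urllib.robotparser` — a minimal sketch, assuming illustrative rules and URLs (not this site's actual robots.txt):

```python
import urllib.robotparser

# Illustrative rules, mirroring the example above
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Any crawler may fetch the blog, but not the admin area
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/users")) # False
```

One caveat: `urllib.robotparser` applies rules in file order (first match wins), unlike Google's longest-match semantics, so results can differ for groups that mix `Allow` and `Disallow` lines.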
## Common Crawlers
| Crawler | User-agent | Owner |
|---|---|---|
| Googlebot | Googlebot | Google |
| Bingbot | Bingbot | Microsoft |
| GPTBot | GPTBot | OpenAI |
| ClaudeBot | ClaudeBot | Anthropic |
| Google AI | Google-Extended | Google (AI training) |
| PerplexityBot | PerplexityBot | Perplexity |
## Controlling AI Bot Access
You can selectively allow or block AI crawlers:
```
# Allow search engines, block AI training
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow Perplexity (real-time citation)
User-agent: PerplexityBot
Allow: /
```
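A quick way to sanity-check a policy like this is to run each user-agent from the table above through `urllib.robotparser` — a sketch using the same illustrative rules:

```python
import urllib.robotparser

policy = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(policy.splitlines())

# Confirm search and citation bots get in while AI-training bots do not
url = "https://example.com/blog/"
for agent in ["Googlebot", "GPTBot", "Google-Extended", "PerplexityBot"]:
    print(agent, parser.can_fetch(agent, url))
# Googlebot True
# GPTBot False
# Google-Extended False
# PerplexityBot True
```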
## The llms.txt Standard

llms.txt is an emerging standard (proposed in 2024) that provides structured, markdown-formatted information to AI assistants about your site. It lives at your domain root (e.g., https://example.com/llms.txt). Unlike robots.txt, which controls access, llms.txt describes your site's content and policies.
```
# llms.txt
# Site: example.com
# Purpose: Help AI assistants understand our content

## About

We are an SEO tools company providing free
website analysis and optimization guides.

## Key Pages

- / : Homepage with SEO analyzer tool
- /blog/ : SEO tutorials and guides
- /dashboard/ : SEO analysis dashboard
- /schema/ : Schema markup generator

## Content Policies

- AI citation: Encouraged with link attribution
- Content scraping: Not permitted
- Training data: Opt-out (see robots.txt)

## Contact

- Website: https://example.com
- Support: support@example.com
```
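Because llms.txt is plain text, it is easy to generate from site metadata at build time. A hypothetical sketch — the `pages` and `policies` structures are illustrative, not part of any spec:

```python
# Hypothetical build-time generator for a llms.txt file;
# the section names mirror the example above.
pages = {
    "/": "Homepage with SEO analyzer tool",
    "/blog/": "SEO tutorials and guides",
}
policies = {
    "AI citation": "Encouraged with link attribution",
    "Training data": "Opt-out (see robots.txt)",
}

lines = ["# llms.txt", "", "## Key Pages", ""]
lines += [f"- {path} : {desc}" for path, desc in pages.items()]
lines += ["", "## Content Policies", ""]
lines += [f"- {name}: {value}" for name, value in policies.items()]

llms_txt = "\n".join(lines) + "\n"
print(llms_txt)
```

Writing the result to `/llms.txt` in your deploy step keeps the file in sync with your actual page inventory.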
## Common robots.txt Mistakes

- Blocking CSS/JS — Don't block stylesheets or scripts; Google needs them to render your pages
- Blocking the entire site — A stray `Disallow: /` under `User-agent: *` will de-index your entire site
- Using robots.txt for security — The file is publicly readable and is NOT a security measure; use proper authentication instead
- Forgetting the trailing slash — `Disallow: /admin` blocks `/admin-page` too; use `Disallow: /admin/` to block only the directory
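The trailing-slash pitfall is easy to demonstrate with `urllib.robotparser` (the paths here are illustrative):

```python
import urllib.robotparser

def blocked(rule, path):
    """Return True if `path` is blocked for all agents by the single `rule`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(["User-agent: *", rule])
    return not parser.can_fetch("*", "https://example.com" + path)

# Without the trailing slash, the prefix match also catches /admin-page
print(blocked("Disallow: /admin", "/admin-page"))   # True
# With the trailing slash, only the directory is blocked
print(blocked("Disallow: /admin/", "/admin-page"))  # False
print(blocked("Disallow: /admin/", "/admin/users")) # True
```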
## Testing robots.txt

DarnItSEO checks your robots.txt configuration as part of its technical SEO analysis. You can also use the robots.txt report in Google Search Console (which replaced the standalone robots.txt Tester) to verify your rules.