What should every robots.txt include?

At minimum: a Sitemap directive pointing to your XML sitemap, and a sensible Disallow list for any paths you don't want crawled (admin panels, checkout flows, search result pages). You should also explicitly allow or disallow major AI crawlers like GPTBot and ClaudeBot based on your content strategy.

Should I use Crawl-delay?

Only if your server is overwhelmed by bot traffic. Googlebot ignores Crawl-delay and adjusts its crawl rate automatically based on how your server responds; the legacy crawl rate limiter in Search Console was retired in early 2024. For other bots that honor the directive, a crawl delay of 1 to 10 seconds can reduce server load without significantly affecting SEO.

How do I allow ChatGPT to index my site?

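Add 'User-agent: GPTBot' followed by 'Allow: /' to your robots.txt. If that user-agent is already present with Disallow rules, remove or replace them. Note that GPTBot is OpenAI's training crawler; OpenAI documents separate user agents, OAI-SearchBot for ChatGPT search and ChatGPT-User for user-initiated browsing, so allow those as well if the goal is to appear in ChatGPT answers. Then verify with our Robots.txt Tester to confirm access is enabled.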

Where should I upload the robots.txt file?

It must live at the root of your domain at https://yoursite.com/robots.txt. Subdirectories are not supported. Each subdomain (example.com vs www.example.com vs blog.example.com) needs its own robots.txt at its own root. Subdomain robots.txt files do not inherit from the main domain.

Can I disallow specific URL parameters?

Yes, using wildcards. 'Disallow: /*?' blocks any URL containing a question mark, which covers all parameter URLs. To target a specific parameter, use 'Disallow: /*?utm_' together with 'Disallow: /*&utm_' (blocks tracking parameters whether they come first or later in the query string) or 'Disallow: /search?'. A trailing asterisk is unnecessary because rules already match by prefix. Test with the Robots.txt Tester to confirm wildcards behave the way you intend, since support varies across crawlers.

Should I block staging or test environments via robots.txt?

No. Robots.txt is public, so listing 'Disallow: /staging/' tells anyone the path exists. For staging, use HTTP authentication (basic auth), send an 'X-Robots-Tag: noindex' response header at the server level, or restrict access by IP range. Robots.txt should only be used for genuinely public paths where you want to control crawler access.

Will my robots.txt changes take effect immediately?

Not always. Googlebot generally caches robots.txt for up to 24 hours. Bing and others cache similarly. To prompt a refresh, open the robots.txt report in Google Search Console (under Settings) and request a recrawl of the file. For high-stakes changes (unblocking pages after a deindex), you can sometimes get faster propagation by also requesting reindexing of the affected URLs through the URL Inspection tool.

Robots.txt Generator — Free Online Tool

Why a generated robots.txt beats one you wrote at 2am

Almost every robots.txt disaster I have cleaned up started with a senior engineer typing rules into a text editor at the end of a long sprint. A forgotten asterisk, a Disallow line copied from staging, a sitemap URL pointing at the dev domain. The rules look right, ship to production, and three weeks later organic traffic falls 40 percent because Googlebot stopped crawling product pages. A robots.txt generator removes the typo class of bugs entirely. You pick the patterns you want, the tool emits valid syntax, and you ship a file that does what the form said it would.

The other reason to use a robots.txt builder is that the crawler landscape keeps moving. Five years ago the only crawlers worth thinking about were Googlebot and Bingbot. Today there are at least a dozen AI bot user agents with slightly different conventions. A generator stays current so you do not have to memorize them, and it produces a file that matches current practice.

Hand-written files also drift. Someone adds a rule for one campaign, nobody removes it, and a year later your robots.txt is a graveyard of expired Disallows nobody understands. Generating from known inputs gives you a clean baseline you can version-control and regenerate.

The minimum viable robots.txt

If you do nothing else, ship this. A User-agent line that names the crawler (or an asterisk for all), zero or more Disallow lines for paths you want bots to skip, and a Sitemap line pointing to your XML sitemap. Three directives, plain text, UTF-8 encoded. That is the entire protocol Google, Bing, DuckDuckGo, and most respectable bots actually follow.

The simplest valid robots.txt template is User-agent: * on one line, Disallow: on the next (an empty value means nothing is blocked), and Sitemap: https://yoursite.com/sitemap.xml on the third. That file says every crawler is welcome everywhere, and here is a list of URLs. For most marketing sites that is the right answer. Add complexity only when you have real reasons to keep specific paths out of search.
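As a rough sketch of that template, the snippet below builds the three-line file and checks it with Python's standard-library urllib.robotparser. The function name build_minimal_robots is made up for illustration, and the sitemap URL is the placeholder from the text.

```python
from urllib.robotparser import RobotFileParser


def build_minimal_robots(sitemap_url: str) -> str:
    """Return the three-line allow-everything file (hypothetical helper)."""
    lines = [
        "User-agent: *",
        "Disallow:",  # empty value: nothing is blocked
        f"Sitemap: {sitemap_url}",
    ]
    return "\n".join(lines) + "\n"


robots_txt = build_minimal_robots("https://yoursite.com/sitemap.xml")

# Sanity check: parse the generated text and ask whether a page may be fetched.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(robots_txt)
print(parser.can_fetch("Googlebot", "https://yoursite.com/pricing"))
```

The standard-library parser is only a convenience here; it does not implement every rule real search crawlers apply, so treat it as a smoke test rather than proof of how Googlebot will behave.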

Robots.txt syntax is forgiving in some ways and strict in others. Comments start with a hash sign and run to the end of the line. Blank lines are allowed and useful for grouping. Field names like User-agent and Disallow are case insensitive, but path values are matched case sensitively, so /Admin and /admin are different rules and a Disallow for one does not cover the other.
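To make those syntax rules concrete, here is a small sketch of how a reader of the file might split one line into a field and a value. The helper parse_line is hypothetical and deliberately simplified; real parsers handle more edge cases.

```python
from typing import Optional, Tuple


def parse_line(raw: str) -> Optional[Tuple[str, str]]:
    """Split one robots.txt line into (field, value); None for blanks and comments."""
    text = raw.split("#", 1)[0].strip()  # a comment runs from '#' to end of line
    if not text or ":" not in text:
        return None
    field, value = text.split(":", 1)  # split once so URLs keep their own colons
    # Field names are case insensitive, so normalise them; values keep their case.
    return field.strip().lower(), value.strip()


print(parse_line("DISALLOW: /Admin  # legacy panel"))
print(parse_line("# just a comment"))
```

Because the value keeps its original case, /Admin and /admin come out as two different paths, which is the behavior the paragraph above warns about.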

Patterns you will use over and over

Block your admin area. Disallow: /admin/ and Disallow: /wp-admin/ are the most common rules on the internet because every CMS ships an authenticated dashboard nobody needs in search results. Add Disallow: /login and Disallow: /account/ for the same reason. Just remember robots.txt is public, so do not list URLs whose existence you want to hide.

Allow your CSS and JavaScript. This sounds redundant since the default is allow, but if you have a broad Disallow: /assets/ rule for some legacy reason, you need an Allow: /assets/*.css and Allow: /assets/*.js underneath it so Googlebot can render properly. A rendered page that looks broken to Google ranks worse than one that renders cleanly.

Block parameter URLs that create infinite crawl space. Faceted navigation on ecommerce sites generates thousands of URLs like /shoes?color=red&size=10&sort=price that all show the same listing. Disallow: /*?sort= and Disallow: /*?color= keep Googlebot from spending crawl budget on duplicates, but only when that parameter comes first in the query string; add Disallow: /*&sort= and Disallow: /*&color= to catch it in later positions. Finally, list your sitemap with the full absolute URL including protocol, because relative paths are not valid there.
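One wrinkle with parameter rules is that a pattern such as /*?sort= only matches when sort is the first parameter, so a second pattern with an ampersand is usually needed for later positions. A minimal sketch of generating both forms, assuming a hypothetical helper named param_disallow_rules:

```python
def param_disallow_rules(params):
    """Return Disallow lines covering each parameter in first or later position."""
    rules = []
    for name in params:
        rules.append(f"Disallow: /*?{name}=")  # parameter directly after '?'
        rules.append(f"Disallow: /*&{name}=")  # parameter after another one
    return rules


for line in param_disallow_rules(["sort", "color"]):
    print(line)
```

For the example URL /shoes?color=red&size=10&sort=price, the color rule with a question mark and the sort rule with an ampersand are the ones that would match.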

AI bot user agents you should actually know

GPTBot is OpenAI's crawler for collecting training data; OpenAI documents separate agents, OAI-SearchBot and ChatGPT-User, for ChatGPT search and user-initiated browsing. ClaudeBot belongs to Anthropic and feeds Claude. PerplexityBot powers Perplexity AI's answer engine. Google-Extended is a token Google added so you can opt out of Gemini training without blocking normal Google search. CCBot is Common Crawl, the dataset behind many open source models. Bytespider is ByteDance, used by Doubao and TikTok.

To keep content out of AI training, the GPTBot disallow rule is User-agent: GPTBot followed by Disallow: /. Repeat that block for each AI bot you want to exclude. If you want to be cited in ChatGPT and Perplexity answers (free distribution, basically), leave them allowed. The default is allow, so silence equals consent. A good AI bot setup in robots.txt makes the decision explicit either way. News publishers usually block aggressively; most SaaS and ecommerce sites allow everything because being quoted in an AI answer with a link is high-intent traffic that converts well.
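The "repeat that block for each bot" step can be sketched in a few lines. The list below uses the user agents named in this section; block_bots is a hypothetical helper, and the check at the end uses Python's standard-library parser, which only approximates how each real crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the section above.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot", "Bytespider"]


def block_bots(agents):
    """Emit one 'User-agent / Disallow: /' group per bot, separated by blank lines."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(blocks) + "\n"


robots_txt = block_bots(AI_BOTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://yoursite.com/blog/"))     # named bot is blocked
print(parser.can_fetch("Googlebot", "https://yoursite.com/blog/"))  # unnamed bot falls back to allow
```

Any crawler that is not named keeps the default of allow, which is why leaving a bot out of the file amounts to permitting it.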

Crawl-delay, and why you usually should not bother

Crawl-delay asks crawlers to wait a specified number of seconds between requests. Bingbot respects it; Yandex used to, but now points site owners to the crawl rate setting in Yandex Webmaster instead. Googlebot ignores it completely and adjusts its crawl rate automatically based on how your server responds. Writing Crawl-delay: 10 does nothing for Google but tells Bing to slow to one request every ten seconds, which can hurt your Bing indexing without you realizing it.
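Crawl-delay is advisory: it only has an effect if the crawler chooses to read it. As an illustration of the crawler's side, the sketch below uses Python's standard-library urllib.robotparser, whose crawl_delay method returns the value for a given user agent or None when nothing applies. The sample file and the delay_for helper are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Sample file for illustration: only bingbot gets a delay.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def delay_for(agent, default=0.0):
    """Seconds a polite crawler would wait between requests for this agent."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


print(delay_for("bingbot"))
print(delay_for("Googlebot"))
```

A crawler that never calls something like delay_for simply never sees the directive, which is effectively what happens with Googlebot.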

Use Crawl-delay only if you have a documented server load problem caused by a specific bot, and even then prefer setting the rate in that bot's official tools (Bing Webmaster Tools, Yandex Webmaster). Googlebot has no equivalent manual setting now that Search Console's crawl rate limiter has been retired; Google's documented way to slow it temporarily is to return 503 or 429 responses. Otherwise, leave Crawl-delay out entirely.

Wildcards, dollar signs, and other syntax that bites

Two pattern operators are widely supported. The asterisk matches any sequence of characters, including the empty string. The dollar sign anchors a pattern to the end of the URL. So Disallow: /*.pdf$ blocks any URL ending in .pdf, while Disallow: /*.pdf without the dollar sign blocks any URL containing .pdf anywhere, which can match /reports/document.pdf-old as well, probably not what you wanted.
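The two operators are simple enough to express as a translation to a regular expression. The following is a minimal sketch of that idea, not the matcher any particular crawler uses; rule_matches is a hypothetical name.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern ('*' wildcard, optional trailing '$') against a URL path."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape everything literally, then let each '*' stand for any run of characters.
    regex = re.compile(".*".join(re.escape(part) for part in body.split("*")))
    if anchored:
        return regex.fullmatch(path) is not None  # must match through the end of the URL
    return regex.match(path) is not None          # prefix match is enough


print(rule_matches("/*.pdf$", "/reports/document.pdf"))
print(rule_matches("/*.pdf$", "/reports/document.pdf-old"))
print(rule_matches("/*.pdf", "/reports/document.pdf-old"))
```

Without the dollar sign a rule is a prefix match, which is why /*.pdf also catches the -old URL.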

When both an Allow and a Disallow rule match the same URL, the most specific rule (the one with the longest matching path) wins, and Allow wins if the two are equally specific. So Disallow: /private/ followed by Allow: /private/public-page lets that one URL through while keeping the rest of the directory blocked. This precedence behavior is consistent across Googlebot and Bingbot, but smaller crawlers may interpret it differently, so test anything tricky.
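The precedence rule can be sketched as "longest matching path wins, Allow wins ties". The example below uses plain prefix matching to keep it short (no wildcards), and is_allowed is a hypothetical helper rather than any crawler's actual code.

```python
def is_allowed(rules, path):
    """rules is a list of (directive, path_prefix) pairs, e.g. ("disallow", "/private/")."""
    best = None
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        # Compare by prefix length first; on a tie, True (allow) sorts above False.
        candidate = (len(prefix), directive == "allow")
        if best is None or candidate > best:
            best = candidate
    return True if best is None else best[1]  # no matching rule means allowed


rules = [("disallow", "/private/"), ("allow", "/private/public-page")]
print(is_allowed(rules, "/private/public-page"))
print(is_allowed(rules, "/private/secret"))
print(is_allowed(rules, "/about"))
```

Note that the result does not depend on the order in which the two rules were written, only on how specific each one is.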

Common mistakes include putting multiple paths on one Disallow line (the line is read as a single pattern, so it will not block the paths you meant), using regex characters other than * and $ (none are supported), and forgetting that rules match against the URL path, not the full URL. Disallow: yoursite.com/admin does not work; Disallow: /admin does.
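These mistakes are mechanical enough to catch with a small check before deploy. The sketch below is a heuristic only: lint_rule is a hypothetical helper, and the set of characters it flags as regex syntax is an assumption that may produce false positives on unusual but valid URLs.

```python
import re


def lint_rule(line):
    """Return a list of warnings for one Allow/Disallow line (heuristic checks only)."""
    warnings = []
    field, _, value = line.partition(":")
    value = value.split("#", 1)[0].strip()  # drop any trailing comment
    if field.strip().lower() not in ("allow", "disallow") or not value:
        return warnings
    if re.search(r"\s", value):
        warnings.append("multiple paths on one line")
    if not value.startswith(("/", "*")):
        warnings.append("value should be a path starting with /")
    if re.search(r"[\[\]()|\\^{}]", value):
        warnings.append("regex syntax other than * and $ is not supported")
    return warnings


print(lint_rule("Disallow: /admin /login"))
print(lint_rule("Disallow: yoursite.com/admin"))
print(lint_rule("Disallow: /admin"))
```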

Per-bot blocks versus the global block, and why ordering matters

A robots.txt is organized into groups, each beginning with one or more User-agent lines followed by the rules that apply to them. A crawler reads the file, finds the single most specific group that names it, and obeys only that group. It does not merge the global asterisk group with its own named group. This single fact causes more generator-versus-handwritten confusion than anything else. If you write a User-agent: * block with your Disallow rules, then add a separate User-agent: GPTBot block that only says Disallow: /, GPTBot will follow only its own block and completely ignore every rule you put under the asterisk.
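A rough model of that group selection, assuming the file has already been parsed into a dictionary keyed by user-agent token. The select_group helper and the substring matching it uses are simplifications for illustration; real crawlers have their own token-matching rules.

```python
def select_group(groups, user_agent):
    """groups maps a user-agent token (or "*") to its list of rule lines."""
    ua = user_agent.lower()
    named = [token for token in groups if token != "*" and token.lower() in ua]
    if named:
        # Obey only the most specific (longest) named group; nothing is merged in.
        return groups[max(named, key=len)]
    return groups.get("*", [])


groups = {
    "*": ["Disallow: /admin/", "Disallow: /checkout/"],
    "GPTBot": ["Disallow: /"],
}
print(select_group(groups, "GPTBot"))
print(select_group(groups, "Bingbot"))
```

GPTBot gets only its own single rule, while Bingbot, which has no named group, falls back to the asterisk group.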

A good robots.txt builder handles this by repeating the shared rules inside each named group when you need them to apply, rather than assuming bots inherit from the wildcard. When you generate a file that blocks AI crawlers but allows search engines, check that your search-critical Disallow and Allow rules still appear under the groups for the bots that need them. The generator keeps the groups internally consistent so a named exception does not accidentally hand a bot a blank slate.
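One way a builder could repeat shared rules inside each named group is sketched below. This is an illustration of the idea, not a description of how this particular generator is implemented; render_groups and its parameters are hypothetical.

```python
def render_groups(shared_rules, per_bot_rules):
    """Emit a '*' group plus one group per named bot, each repeating the shared rules."""
    blocks = [["User-agent: *", *shared_rules]]
    for agent, extra in per_bot_rules.items():
        # Named groups do not inherit from '*', so copy the shared rules in explicitly.
        blocks.append([f"User-agent: {agent}", *shared_rules, *extra])
    return "\n\n".join("\n".join(block) for block in blocks) + "\n"


print(render_groups(
    ["Disallow: /admin/"],
    {"Googlebot": ["Allow: /assets/*.css"]},
))
```

The duplication looks redundant in the output, but it is what keeps a named bot from starting with a blank slate.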

Order within a group matters less than people fear on Googlebot and Bingbot, which use most-specific-match-wins, but it matters a great deal on simpler crawlers that read top to bottom. Generating the file means the Allow and Disallow lines come out in a predictable, sorted order, so the same rules behave the same way across the widest set of bots instead of depending on whichever line you happened to type first.
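To see why a sorted order helps, consider a simple crawler that stops at the first matching rule. Sorting rules with the longest path first is one possible ordering (an assumption for this sketch, not necessarily the one this generator uses) that makes such a crawler reach the same answer as a most-specific-match crawler. Both helpers are hypothetical and use plain prefix matching.

```python
def sort_rules(rules):
    """Order (directive, path) pairs longest path first, Allow before Disallow on ties."""
    return sorted(rules, key=lambda rule: (-len(rule[1]), rule[0] != "allow"))


def first_match_allowed(rules, path):
    """Top-to-bottom evaluation, as a simple crawler might do it."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True  # nothing matched, so the path is allowed


typed = [("disallow", "/private/"), ("allow", "/private/public-page")]
print(first_match_allowed(typed, "/private/public-page"))              # order as typed
print(first_match_allowed(sort_rules(typed), "/private/public-page"))  # after sorting
```

With the rules as typed, the broad Disallow is hit first and the exception is never reached; after sorting, the more specific Allow comes first.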

Where to upload it and how to test before shipping

Robots.txt must live at the root of your domain. yoursite.com/robots.txt is the only location any crawler will check. It cannot be in a subdirectory; each subdomain needs its own file; and the redirect chain must be no longer than five hops. The file must return a 200 status, be served as text/plain, and be accessible to anonymous bots without authentication.
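Because the location is fixed, the robots.txt URL that governs any page can be derived from the page URL alone: keep the scheme and host, replace everything else with /robots.txt. A minimal sketch using Python's standard library (robots_url is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Same scheme and host, fixed path, no query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://blog.example.com/posts/1?ref=x"))
print(robots_url("https://www.example.com/a/b/"))
```

The two calls return different URLs, which reflects the point above that each subdomain is governed by its own file.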

Once live, test it. Open Search Console, paste a URL into URL Inspection, and confirm the Crawl section shows the URL is allowed. Repeat for a path you intentionally blocked. Search Console has a robots.txt report under Settings that flags syntax errors and shows the last fetched version, useful when changes seem not to take effect (Google caches up to 24 hours).

Updating safely without nuking your traffic

Treat robots.txt changes the same way you treat database migrations. Make one change at a time. Generate the new version, diff it against the live version, and confirm only the lines you intended to touch have changed. Run the new file through a robots.txt tester before deploy, not after. After deploy, watch Search Console crawl stats for the next seven days; a sudden drop in pages crawled is the early warning that something blocks more than you intended.
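The diff step is easy to automate. The sketch below uses Python's standard difflib to list only the lines that were added or removed between the live file and the candidate; changed_lines is a hypothetical helper, and fetching the live file is left out.

```python
import difflib


def changed_lines(live: str, candidate: str):
    """Return added ('+') and removed ('-') lines between two robots.txt versions."""
    diff = difflib.unified_diff(
        live.splitlines(),
        candidate.splitlines(),
        fromfile="live",
        tofile="candidate",
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    )
    # Skip the '--- live' / '+++ candidate' header lines and the '@@' hunk markers.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++ ", "--- "))
    ]


live = "User-agent: *\nDisallow: /admin/\n"
candidate = "User-agent: *\nDisallow: /admin/\nDisallow: /checkout/\n"
print(changed_lines(live, candidate))
```

If the returned list contains anything beyond the lines you meant to touch, stop and investigate before deploying.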

Keep the file in version control alongside your application code, with commit messages that explain why each rule exists. Future you (or the person who replaces you) will need to know whether Disallow: /campaign-2024/ is still relevant or safe to remove. A robots.txt without context is a robots.txt that nobody dares change, and stale rules silently cost you rankings as your site evolves around them.

