Why a generated robots.txt beats one you wrote at 2am
Almost every robots.txt disaster I have cleaned up started with a senior engineer typing rules into a text editor at the end of a long sprint. A forgotten asterisk, a Disallow line copied from staging, a sitemap URL pointing at the dev domain. The rules look right, ship to production, and three weeks later organic traffic falls 40 percent because Googlebot stopped crawling product pages. A robots.txt generator removes the typo class of bugs entirely. You pick the patterns you want, the tool emits valid syntax, and you ship a file that does what the form said it would.
The other reason to use a robots.txt builder is that the crawler landscape keeps moving. Five years ago the only crawlers worth thinking about were Googlebot and Bingbot. Today there are at least a dozen AI bot user agents with slightly different conventions. A generator stays current so you do not have to memorize them, and it produces a file that matches current practice.
Hand-written files also drift. Someone adds a rule for one campaign, nobody removes it, and a year later your robots.txt is a graveyard of expired Disallows nobody understands. Generating from known inputs gives you a clean baseline you can version-control and regenerate.
The minimum viable robots.txt
If you do nothing else, ship this. A User-agent line that names the crawler (or an asterisk for all), zero or more Disallow lines for paths you want bots to skip, and a Sitemap line pointing to your XML sitemap. Three directives, plain text, UTF-8 encoded. That is the entire protocol Google, Bing, DuckDuckGo, and most respectable bots actually follow.
The simplest valid robots.txt template is User-agent: * on one line, Disallow: on the next (empty value means nothing is blocked), and Sitemap: https://yoursite.com/sitemap.xml on the third. That file says every crawler is welcome everywhere, here is a list of URLs. For most marketing sites that is the right answer. Add complexity only when you have real reasons to keep specific paths out of search.
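One way to sanity-check that three-line file is to feed it to Python's standard-library urllib.robotparser. This is a minimal sketch, not a substitute for a search engine's own tester; yoursite.com is the placeholder domain from the text, and the /pricing path is invented for illustration.

```python
from urllib.robotparser import RobotFileParser

MINIMAL_ROBOTS = """\
User-agent: *
Disallow:
Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(MINIMAL_ROBOTS.splitlines())

# An empty Disallow blocks nothing, so any crawler may fetch any path.
everything_allowed = parser.can_fetch("Googlebot", "https://yoursite.com/pricing")

# site_maps() (Python 3.8+) returns the Sitemap URLs listed in the file.
sitemaps = parser.site_maps()

print(everything_allowed, sitemaps)
```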
Robots.txt syntax is forgiving in some ways and strict in others. Comments start with a hash sign and run to end of line. Blank lines are allowed and useful for grouping. Field names like User-agent and Disallow are case insensitive, but path values are case sensitive on most servers, so /Admin and /admin are not the same URL.
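The syntax rules above are easy to see in action with Python's standard-library parser. Treat this as a sketch of how one compliant parser reads the file, assuming a single catch-all group; the /admin paths and yoursite.com are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
# Keep the dashboard out of search results.
USER-AGENT: *      # field names are case insensitive
disallow: /admin   # path values are case sensitive
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# Comments are stripped and the upper/lower-case field names still parse.
lower_blocked = not parser.can_fetch("Googlebot", "https://yoursite.com/admin")

# /Admin is a different path, so the rule does not cover it.
upper_allowed = parser.can_fetch("Googlebot", "https://yoursite.com/Admin")

print(lower_blocked, upper_allowed)
```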
Patterns you will use over and over
Block your admin area. Disallow: /admin/ and Disallow: /wp-admin/ are the most common rules on the internet because every CMS ships an authenticated dashboard nobody needs in search results. Add Disallow: /login and Disallow: /account/ for the same reason. Just remember robots.txt is public, so do not list URLs whose existence you want to hide.
Allow your CSS and JavaScript. This sounds redundant since the default is allow, but if you have a broad Disallow: /assets/ rule for some legacy reason, you need an Allow: /assets/*.css and Allow: /assets/*.js underneath it so Googlebot can render properly. A rendered page that looks broken to Google ranks worse than one that renders cleanly.
Block parameter URLs that create infinite crawl space. Faceted navigation on ecommerce sites generates thousands of URLs like /shoes?color=red&size=10&sort=price that all show the same listing. Disallow: /*?sort= and Disallow: /*?color= keep Googlebot from spending crawl budget on duplicates. Note that those two patterns only match when the parameter comes first in the query string; add Disallow: /*&sort= and Disallow: /*&color= if you need to catch it in any position. Finally, list your sitemap with the full absolute URL including protocol, because relative paths are not valid there.
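Here is roughly what "generate it from known inputs" can look like for the patterns in this section. The build_robots_txt helper, its name, and its signature are hypothetical, made up for this sketch; a real generator would add more validation.

```python
def build_robots_txt(groups, sitemap_url):
    """Render robots.txt text from a list of (user_agent, rules) groups.

    Each rule is a (directive, path) pair such as ("Disallow", "/admin/").
    This helper is illustrative only, not a standard API.
    """
    if not sitemap_url.startswith(("http://", "https://")):
        raise ValueError("Sitemap must be an absolute URL including protocol")

    lines = []
    for user_agent, rules in groups:
        lines.append(f"User-agent: {user_agent}")
        for directive, path in rules:
            if directive not in ("Allow", "Disallow"):
                raise ValueError(f"Unsupported directive: {directive}")
            lines.append(f"{directive}: {path}")
        lines.append("")  # blank line separates groups
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    groups=[
        ("*", [
            ("Disallow", "/admin/"),
            ("Disallow", "/wp-admin/"),
            ("Disallow", "/login"),
            ("Disallow", "/account/"),
            ("Disallow", "/assets/"),
            ("Allow", "/assets/*.css"),
            ("Allow", "/assets/*.js"),
            ("Disallow", "/*?sort="),
            ("Disallow", "/*?color="),
        ]),
    ],
    sitemap_url="https://yoursite.com/sitemap.xml",
)
print(robots_txt)
```

Because the input is plain data, it can live in version control and be regenerated whenever a rule changes, which is the whole point of not typing the file by hand.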
AI bot user agents you should actually know
GPTBot is OpenAI's crawler for collecting training data; ChatGPT's live browsing uses a separate user agent, ChatGPT-User. ClaudeBot belongs to Anthropic and feeds Claude. PerplexityBot powers Perplexity AI's answer engine. Google-Extended is a token Google added so you can opt out of Gemini training without blocking normal Google search. CCBot is Common Crawl, the dataset behind many open source models. Bytespider is ByteDance's crawler, used by Doubao and TikTok.
To keep content out of AI training, the GPTBot disallow rule is User-agent: GPTBot followed by Disallow: /. Repeat that block for each AI bot you want to exclude. If you want to be cited in ChatGPT and Perplexity answers (free distribution, basically), leave them allowed. The default is allow, so silence equals consent. A good AI bot setup in robots.txt makes the decision explicit. News publishers usually block aggressively; most SaaS and ecommerce sites allow everything because being quoted in an AI answer with a link is high-intent traffic that converts well.
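If you do decide to block, repeating the same block per bot is exactly the kind of thing worth generating. A minimal sketch, assuming you want to block every token named above and leave everyone else alone; the ai_block_rules helper is a made-up name, and the check uses Python's standard-library parser rather than any vendor's own tooling.

```python
from urllib.robotparser import RobotFileParser

# Tokens named in the text; extend or trim to match your own policy.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "CCBot", "Bytespider"]


def ai_block_rules(bots):
    """Return robots.txt text with one 'Disallow: /' group per bot."""
    groups = [f"User-agent: {bot}\nDisallow: /\n" for bot in bots]
    # Everyone else stays allowed via the catch-all group.
    groups.append("User-agent: *\nDisallow:\n")
    return "\n".join(groups)


robots_txt = ai_block_rules(AI_BOTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

gptbot_allowed = parser.can_fetch("GPTBot", "https://yoursite.com/blog/post")
googlebot_allowed = parser.can_fetch("Googlebot", "https://yoursite.com/blog/post")
print(gptbot_allowed, googlebot_allowed)
```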
Crawl-delay, and why you usually should not bother
Crawl-delay asks crawlers to wait a specified number of seconds between requests. Bingbot respects it, and Yandex has historically done so. Googlebot ignores it completely and adjusts its crawl rate automatically based on how your server responds. Writing Crawl-delay: 10 does nothing for Google but tells Bing to slow to one request every ten seconds, which can hurt your Bing indexing without you realizing it.
Use Crawl-delay only if you have a documented server load problem caused by a specific bot, and even then prefer setting the rate in that bot's official tools (Bing Webmaster Tools, Yandex Webmaster). For Googlebot there is no longer a manual setting: Google retired the Search Console crawl rate limiter in early 2024, and the short-term fix it documents for an overloaded server is to return 503 or 429 responses until load recovers. Otherwise, leave Crawl-delay out entirely.
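If you do end up needing it, scope the directive to the one bot causing trouble instead of putting it under the asterisk. The sketch below uses Python's standard-library parser to show how a compliant reader would interpret such a file; it says nothing about how any particular search engine behaves, and DuckDuckBot is just a stand-in for "some other crawler".

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# The delay applies only to the group that declares it.
bing_delay = parser.crawl_delay("bingbot")
other_delay = parser.crawl_delay("DuckDuckBot")

print(bing_delay, other_delay)
```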
Wildcards, dollar signs, and other syntax that bites
Two pattern operators are widely supported. The asterisk matches any sequence of characters, including the empty string. The dollar sign anchors a pattern to the end of the URL. So Disallow: /*.pdf$ blocks any URL ending in .pdf, while Disallow: /*.pdf without the dollar sign blocks any URL containing .pdf anywhere, which can match /reports/document.pdf-old as well, probably not what you wanted.
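A quick way to build intuition for the two operators is to translate a pattern into a regular expression. This is a sketch of the matching semantics described above, not the code any crawler actually runs; pattern_to_regex and matches are names invented here.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Only '*' (any run of characters) and a trailing '$' (end of URL)
    are special; everything else matches literally from the start.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if the robots.txt pattern covers the given URL path."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/reports/document.pdf"))      # anchored, real PDF
print(matches("/*.pdf$", "/reports/document.pdf-old"))  # anchored, not at end
print(matches("/*.pdf", "/reports/document.pdf-old"))   # unanchored
```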
When both an Allow and a Disallow match the same URL, the most specific rule wins, meaning the one with the longest matching path; if the two tie, Google applies the less restrictive Allow. So Disallow: /private/ followed by Allow: /private/public-page lets that one URL through while keeping the rest of the directory blocked. This precedence behavior is consistent across Googlebot and Bingbot, but smaller crawlers may interpret it differently (some simply apply the first matching rule in file order), so test anything tricky.
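The precedence rule is small enough to write down. A minimal sketch, assuming plain prefix rules (wildcards left out for brevity) and the longest-match behavior described above; is_allowed is a hypothetical helper, not part of any library.

```python
def is_allowed(rules, path):
    """Decide access using longest-match precedence.

    rules is a list of (directive, prefix) pairs. The longest matching
    prefix wins; on a tie, Allow beats Disallow. No match means allowed.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


RULES = [
    ("Disallow", "/private/"),
    ("Allow", "/private/public-page"),
]

print(is_allowed(RULES, "/private/public-page"))
print(is_allowed(RULES, "/private/secret"))
```

Python's own urllib.robotparser is a handy example of a parser that disagrees: it applies the first matching rule in file order, so with the two rules above in that order it would report /private/public-page as blocked.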
Common mistakes include putting multiple paths on one Disallow line (the whole value is read as a single path, so the rule does not block what you meant), using regex characters other than * and $ (none are supported), and forgetting that rules match against the URL path, not the full URL. Disallow: yoursite.com/admin does not work; Disallow: /admin does.
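Two of those mistakes, several paths on one line and a domain in the path, are easy to catch mechanically. The lint_robots function below is a hypothetical sketch that checks only those two things; it does not validate the rest of the file.

```python
def lint_robots(text):
    """Return (line_number, message) pairs for common path mistakes."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() not in ("allow", "disallow") or not value:
            continue
        if " " in value or "\t" in value:
            problems.append((number, "multiple paths on one line"))
        elif not value.startswith(("/", "*")):
            problems.append((number, "path must start with / (no domain)"))
    return problems


SAMPLE = """\
User-agent: *
Disallow: /admin /login
Disallow: yoursite.com/admin
Disallow: /admin
"""

for number, message in lint_robots(SAMPLE):
    print(f"line {number}: {message}")
```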
Where to upload it and how to test before shipping
Robots.txt must live at the root of your domain. Crawlers check exactly one location, yoursite.com/robots.txt. The file cannot sit in a subdirectory, each subdomain needs its own file, and any redirect chain in front of it must be no longer than five hops. The file must return a 200 status, be served as text/plain, and be accessible to anonymous bots without authentication.
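Those placement and serving rules can be turned into a pre-flight check. In this sketch both function names are invented, and check_response takes the status, content type, and redirect count as plain arguments on the assumption that you fetch the file with whatever HTTP client you already use.

```python
from urllib.parse import urlsplit


def robots_url_for(page_url):
    """Return the one robots.txt location that applies to page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def check_response(status, content_type, redirect_hops):
    """List problems with how robots.txt was served."""
    problems = []
    if redirect_hops > 5:
        problems.append("more than five redirect hops")
    if status != 200:
        problems.append(f"status {status}, expected 200")
    if content_type.split(";")[0].strip().lower() != "text/plain":
        problems.append(f"content type {content_type!r}, expected text/plain")
    return problems


# Each subdomain resolves to its own file at its own root.
print(robots_url_for("https://blog.yoursite.com/posts/1?x=2"))
print(check_response(200, "text/plain; charset=utf-8", 0))
```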
Once live, test it. Open Search Console, paste a URL into URL Inspection, and confirm the Crawl section shows the URL is allowed. Repeat for a path you intentionally blocked. Search Console has a robots.txt report under Settings that flags syntax errors and shows the last fetched version, useful when changes seem not to take effect (Google caches up to 24 hours).
Updating safely without nuking your traffic
Treat robots.txt changes the same way you treat database migrations. Make one change at a time. Generate the new version, diff it against the live version, and confirm only the lines you intended to touch have changed. Run the new file through a robots.txt tester before deploy, not after. After deploy, watch Search Console crawl stats for the next seven days; a sudden drop in pages crawled is the early warning that something blocks more than you intended.
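The diff step is a few lines with Python's standard-library difflib. A minimal sketch, assuming you already have the live and candidate files as strings; changed_lines is a made-up helper name and the /account/ rule is just an example change.

```python
import difflib


def changed_lines(live_text, new_text):
    """Return only the added ('+ ') and removed ('- ') lines."""
    diff = difflib.ndiff(live_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


LIVE = """\
User-agent: *
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml
"""

NEW = """\
User-agent: *
Disallow: /admin/
Disallow: /account/
Sitemap: https://yoursite.com/sitemap.xml
"""

# Expect exactly the one line you meant to add, and nothing else.
for line in changed_lines(LIVE, NEW):
    print(line)
```

If the list contains anything you did not intend to touch, stop and find out why before deploying.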
Keep the file in version control alongside your application code, with commit messages that explain why each rule exists. Future you (or the person who replaces you) will need to know whether Disallow: /campaign-2024/ is still relevant or safe to remove. A robots.txt without context is a robots.txt that nobody dares change, and stale rules silently cost you rankings as your site evolves around them.