The robots.txt file is a plain-text file at your domain root that tells search engine crawlers which pages they may and may not crawl. The newer llms.txt file plays a complementary role for AI assistants: rather than controlling access, it describes your site to them.
## robots.txt Basics
Place robots.txt at your domain root (e.g., https://example.com/robots.txt). It uses a simple syntax:
```
# Allow all crawlers access to everything
# except the admin and private directories
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

# Sitemap location
Sitemap: https://example.com/sitemap.xml
```
## Key robots.txt Rules

- `User-agent` — Specifies which crawler the rules apply to (`*` means all)
- `Disallow` — Blocks the specified path from being crawled
- `Allow` — Explicitly allows a path (useful for overriding broader `Disallow` rules)
- `Sitemap` — Points to your XML sitemap location
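These rules can be checked programmatically with Python's standard-library `urllib.robotparser` — a minimal sketch, assuming illustrative rules and URLs (not this site's actual robots.txt):

```python
import urllib.robotparser

# Illustrative rules, mirroring the example above
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Any crawler may fetch the blog, but not the admin area
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/users")) # False
```

One caveat: `urllib.robotparser` applies rules in file order (first match wins), unlike Google's longest-match semantics, so results can differ for groups that mix `Allow` and `Disallow` lines.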
## Common Crawlers
| Crawler | User-agent | Owner |
|---|---|---|
| Googlebot | Googlebot | Google |
| Bingbot | Bingbot | Microsoft |
| GPTBot | GPTBot | OpenAI |
| ClaudeBot | ClaudeBot | Anthropic |
| Google AI | Google-Extended | Google (AI training) |
| PerplexityBot | PerplexityBot | Perplexity |
## Controlling AI Bot Access
You can selectively allow or block AI crawlers:
```
# Allow search engines, block AI training
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow Perplexity (real-time citation)
User-agent: PerplexityBot
Allow: /
```
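A quick way to sanity-check a policy like this is to run each user-agent from the table above through `urllib.robotparser` — a sketch using the same illustrative rules:

```python
import urllib.robotparser

policy = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(policy.splitlines())

# Confirm search and citation bots get in while AI-training bots do not
url = "https://example.com/blog/"
for agent in ["Googlebot", "GPTBot", "Google-Extended", "PerplexityBot"]:
    print(agent, parser.can_fetch(agent, url))
# Googlebot True
# GPTBot False
# Google-Extended False
# PerplexityBot True
```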
## The llms.txt Standard

llms.txt is an emerging standard (proposed in 2024) that provides structured, markdown-formatted information to AI assistants about your site. It lives at your domain root (e.g., https://example.com/llms.txt). Unlike robots.txt, which controls access, llms.txt describes your site's content and policies.
```
# llms.txt
# Site: example.com
# Purpose: Help AI assistants understand our content

## About

We are an SEO tools company providing free
website analysis and optimization guides.

## Key Pages

- / : Homepage with SEO analyzer tool
- /blog/ : SEO tutorials and guides
- /dashboard/ : SEO analysis dashboard
- /schema/ : Schema markup generator

## Content Policies

- AI citation: Encouraged with link attribution
- Content scraping: Not permitted
- Training data: Opt-out (see robots.txt)

## Contact

- Website: https://example.com
- Support: support@example.com
```
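Because llms.txt is plain text, it is easy to generate from site metadata at build time. A hypothetical sketch — the `pages` and `policies` structures are illustrative, not part of any spec:

```python
# Hypothetical build-time generator for a llms.txt file;
# the section names mirror the example above.
pages = {
    "/": "Homepage with SEO analyzer tool",
    "/blog/": "SEO tutorials and guides",
}
policies = {
    "AI citation": "Encouraged with link attribution",
    "Training data": "Opt-out (see robots.txt)",
}

lines = ["# llms.txt", "", "## Key Pages", ""]
lines += [f"- {path} : {desc}" for path, desc in pages.items()]
lines += ["", "## Content Policies", ""]
lines += [f"- {name}: {value}" for name, value in policies.items()]

llms_txt = "\n".join(lines) + "\n"
print(llms_txt)
```

Writing the result to `/llms.txt` in your deploy step keeps the file in sync with your actual page inventory.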
## Common robots.txt Mistakes

- Blocking CSS/JS — Don't block stylesheets or scripts; Google needs them to render your pages
- Blocking the entire site — A stray `Disallow: /` under `User-agent: *` will de-index your entire site
- Using robots.txt for security — The file is publicly readable and is NOT a security measure; use proper authentication instead
- Forgetting the trailing slash — `Disallow: /admin` blocks `/admin-page` too; use `Disallow: /admin/` to block only the directory
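The trailing-slash pitfall is easy to demonstrate with `urllib.robotparser` (the paths here are illustrative):

```python
import urllib.robotparser

def blocked(rule, path):
    """Return True if `path` is blocked for all agents by the single `rule`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(["User-agent: *", rule])
    return not parser.can_fetch("*", "https://example.com" + path)

# Without the trailing slash, the prefix match also catches /admin-page
print(blocked("Disallow: /admin", "/admin-page"))   # True
# With the trailing slash, only the directory is blocked
print(blocked("Disallow: /admin/", "/admin-page"))  # False
print(blocked("Disallow: /admin/", "/admin/users")) # True
```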
## Testing robots.txt

DarnItSEO checks your robots.txt configuration as part of its technical SEO analysis. You can also use the robots.txt report in Google Search Console (which replaced the standalone robots.txt Tester) to verify your rules.