Free LLM Optimization Checklist — llms.txt, RAG Readiness & AI Access (2026)

llms.txt & llms-full.txt

Publish a /llms.txt file at your domain root
The llms.txt proposal puts a curated, LLM-friendly map of your best content at /llms.txt so models can find what matters at inference time.
llms.txt Generator
Start the file with a single H1 project/site name
The H1 title is the only strictly required element of the spec; without it parsers can't identify the document.
llms.txt Validator
Add a blockquote summary directly under the H1
A short blockquote gives the model the key context needed to understand the rest of the file before reading any links.
llms.txt Validator
Group links under H2 section headers (e.g. Docs, Guides, About)
H2-delimited file lists are how the spec organises links so an LLM can pick the right section for a query.
llms.txt Validator
Give each link a descriptive title and a short note after the colon
The `[title](url): note` format lets the model judge relevance without fetching every page.
llms.txt Validator
Put skippable links under an '## Optional' section
The spec reserves an Optional section for secondary content that can be dropped when a shorter context window is needed.
llms.txt Validator
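The structural rules above (a single H1, a blockquote summary, H2 sections, `[title](url): note` links) can be checked mechanically. The following is a minimal sketch, not the spec's reference parser; the sample file, its URLs and the `check_llms_txt` helper are hypothetical, and the link pattern is deliberately strict.

```python
import re

# Hypothetical sample file; the site name, URLs and notes are placeholders.
SAMPLE = """\
# Example Site

> Example Site publishes guides on widget maintenance.

## Docs

- [Quick start](https://example.com/docs/start.md): Install and first run
- [API reference](https://example.com/docs/api.md): Endpoints and parameters

## Optional

- [Changelog](https://example.com/changelog.md): Release history
"""

# Matches "- [title](url)" with an optional ": note" suffix.
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?:: (?P<note>.+))?$"
)


def check_llms_txt(text: str) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    lines = [ln.rstrip() for ln in text.splitlines()]
    non_empty = [ln for ln in lines if ln]

    # The first non-empty line must be the H1, and there must be only one.
    if not non_empty or not non_empty[0].startswith("# "):
        problems.append("file does not start with an H1 title")
    if sum(1 for ln in lines if ln.startswith("# ")) > 1:
        problems.append("more than one H1 title")

    # The next non-empty line should be the blockquote summary.
    if len(non_empty) < 2 or not non_empty[1].startswith("> "):
        problems.append("no blockquote summary directly under the H1")

    section = None
    for ln in lines:
        if ln.startswith("## "):
            section = ln[3:].strip()
        elif ln.startswith("- "):
            if section is None:
                problems.append(f"link outside any H2 section: {ln}")
            match = LINK_RE.match(ln)
            if match is None:
                problems.append(f"malformed link line: {ln}")
            elif not match.group("note"):
                problems.append(f"link has no note after the colon: {ln}")
    return problems
```

Calling `check_llms_txt(SAMPLE)` returns an empty list for the sample above; a file that opens with an H2 or lists bare links gets one message per problem.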
Curate to your highest-value pages, not your whole sitemap
llms.txt is a hand-picked guide; dumping every URL dilutes it and wastes the model's limited context.
llms.txt Generator
Publish /llms-full.txt with full-text content in one file
llms-full.txt concatenates your actual content so a model can ingest everything in a single fetch (Anthropic, Cloudflare and Vercel ship one).
llms-full.txt Generator
Serve llms.txt as static text/plain with no redirects
If the file is not returned as text/plain, or sits behind redirects, rewrites or a stale CDN cache, clients may not process it correctly.
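One way to check this is to look at the status, the Content-Type header and whether the final URL differs from the requested one. The sketch below uses only the Python standard library; `check_response` and `fetch_and_check` are hypothetical helper names, and the text/plain expectation simply mirrors this checklist item.

```python
import urllib.error
import urllib.request


def check_response(status: int, content_type: str, redirected: bool) -> list[str]:
    """Compare one HTTP response against the checklist expectations."""
    problems = []
    if redirected:
        problems.append("request was redirected")
    if status != 200:
        problems.append(f"unexpected status {status}")
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"unexpected media type {media_type!r}")
    return problems


def fetch_and_check(url: str) -> list[str]:
    """Fetch a URL and report problems; urlopen follows redirects itself."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            content_type = resp.headers.get("Content-Type", "")
            return check_response(resp.status, content_type, resp.geturl() != url)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions that still carry headers.
        return check_response(err.code, err.headers.get("Content-Type", ""), False)
```

For example, `fetch_and_check("https://example.com/llms.txt")` (a placeholder URL) returns an empty list when the file is served directly as text/plain with a 200 status.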
Keep llms.txt in sync with your live content
A stale llms.txt that links to moved or deleted pages sends LLMs to dead ends and erodes trust in the file.
llms.txt Diff Tool
Offer a clean Markdown copy of pages at the same URL + .md
The spec recommends a `.md` (or `index.html.md`) twin of each page so models read pure content instead of parsing HTML.
Page-to-Markdown Exporter
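The URL rule is simple enough to express in a few lines: append `.md`, or `index.html.md` when the URL has no file name. A minimal sketch, where `markdown_twin` is a hypothetical helper and the example.com URLs are placeholders:

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_twin(url: str) -> str:
    """Return the URL where the Markdown copy of a page would live."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        # No file name in the URL: use index.html.md.
        path += "index.html.md"
    else:
        path += ".md"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

With this helper, `https://example.com/docs/intro.html` maps to `https://example.com/docs/intro.html.md` and `https://example.com/docs/` maps to `https://example.com/docs/index.html.md`.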

AI Crawler Access (robots.txt)

Have a valid robots.txt at the domain root
robots.txt is still the primary, widely-honoured signal AI crawlers check before fetching your pages.
Robots.txt Generator
Decide deliberately whether each AI bot is allowed or blocked
Allowing training/search bots feeds AI answers; blocking them protects content — either way it should be a choice, not an accident.
AI Bot robots.txt Checker
Set rules for OpenAI's GPTBot, OAI-SearchBot and ChatGPT-User
OpenAI splits training (GPTBot), search indexing (OAI-SearchBot) and live user fetches (ChatGPT-User) — each needs an explicit directive.
AI Bot robots.txt Generator
Set rules for ClaudeBot, Claude-SearchBot and Claude-User
Anthropic's three bots cover training, in-product search and user-initiated fetches, giving granular robots.txt control.
AI Bot robots.txt Generator
Configure Google-Extended separately from Googlebot
Google-Extended controls whether your content is used for Gemini training and grounding without affecting normal Google Search indexing; AI Overviews are part of Search and follow your Googlebot rules.
AI Bot robots.txt Generator
Cover PerplexityBot, CCBot, Amazonbot, Applebot-Extended, Meta-ExternalAgent
These bots feed Perplexity, Common Crawl, Alexa, Apple Intelligence and Meta AI — the rest of the AI citation surface.
AI Bot robots.txt Generator
Confirm you aren't blanket-blocking all bots with a wildcard Disallow
A stray `User-agent: *` followed by `Disallow: /` blocks every crawler that has no more specific group of its own, including the AI bots you want to reach your site.
Robots.txt Tester
Test that key URLs are actually crawlable for AI user-agents
Path-level rules interact in non-obvious ways; testing real URLs against each bot confirms intent matches reality.
Robots.txt URL Simulator
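A bot-by-URL matrix makes these interactions visible. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the example.com URLs and the `crawl_matrix` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block training, allow search, keep /private/ closed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""


def crawl_matrix(robots_txt: str, bots: list[str], urls: list[str]) -> dict:
    """Map each (bot, url) pair to True if the rules allow the fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(bot, url): parser.can_fetch(bot, url) for bot in bots for url in urls}


matrix = crawl_matrix(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "ClaudeBot"],
    ["https://example.com/docs/", "https://example.com/private/report"],
)
for (bot, url), allowed in matrix.items():
    print(f"{bot:15} {'allow' if allowed else 'block':6} {url}")
```

Note that a bot with its own group ignores the `*` group entirely, so OAI-SearchBot is allowed into `/private/` here. Python's parser also applies rules in file order rather than by longest match, so results for overlapping Allow/Disallow paths may differ from a given crawler's own implementation.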
Reference your XML sitemap from robots.txt
A sitemap directive helps every crawler — AI included — discover your full canonical URL set.
XML Sitemap Validator
Ensure llms.txt and robots.txt don't contradict each other
Pointing LLMs to a page in llms.txt while blocking it in robots.txt sends a confused, self-defeating signal.
llms.txt vs robots.txt Consistency Checker
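A contradiction of this kind can be found by extracting every link from llms.txt and testing it against robots.txt for a given bot. A minimal sketch with the standard library; `blocked_llms_links` is a hypothetical helper, and the URL pattern assumes absolute `http(s)` links in Markdown link syntax.

```python
import re
from urllib.robotparser import RobotFileParser


def blocked_llms_links(llms_txt: str, robots_txt: str, bot: str) -> list[str]:
    """Return llms.txt URLs that robots.txt disallows for the given bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pull the URL out of each "[title](url)" link.
    urls = re.findall(r"\]\((https?://[^)\s]+)\)", llms_txt)
    return [url for url in urls if not parser.can_fetch(bot, url)]
```

Any URL this returns is being advertised in one file and blocked in the other, so either the link or the rule should change.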

Machine-Readable & Clean HTML

Serve critical content in static server-rendered HTML
An analysis of 500M+ GPTBot fetches found no JavaScript execution; most AI crawlers read the raw HTML response and do not wait for client-side rendering.
Don't hide primary content behind client-side JavaScript
Content injected by JS in SPAs is completely invisible to GPTBot, ClaudeBot and PerplexityBot.
Page-to-Markdown Exporter
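A quick test is to extract the text from the raw HTML response, without running any scripts, and see whether a key sentence is present. The sketch below uses Python's `html.parser`; `raw_html_text` and both sample pages are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def raw_html_text(html: str) -> str:
    """Return the text a non-rendering crawler could read from the HTML."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Server-rendered page: the sentence is in the HTML itself.
SSR_PAGE = "<html><body><h1>Pricing</h1><p>Plans start at $10.</p></body></html>"
# Client-rendered page: the sentence only exists inside a script.
SPA_PAGE = (
    '<html><body><div id="root"></div>'
    '<script>render("Plans start at $10.")</script></body></html>'
)
```

Here `raw_html_text(SSR_PAGE)` contains the pricing sentence, while `raw_html_text(SPA_PAGE)` is empty, which is roughly what a crawler that does not execute JavaScript would see.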
Verify each page converts cleanly to Markdown
If a page collapses into clean Markdown, an LLM can ingest it faithfully; if it turns to noise, content is being lost.
Page-to-Markdown Exporter
Use semantic HTML (heading hierarchy, lists, tables, articles)
Header-based structure lets retrieval systems split content along real topic boundaries instead of guessing.
Semantic Structure Analyzer
Keep a logical, sequential heading order (no skipped levels)
A clean H1→H2→H3 outline gives chunkers a reliable map of your document's structure.
Semantic Structure Analyzer
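Skipped levels are easy to detect by walking the headings in document order. A minimal sketch using `html.parser`; `skipped_levels` is a hypothetical helper, and it treats a page whose first heading is below H1 as a skip as well.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Record the level of every h1..h6 start tag in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.levels.append(int(tag[1]))


def skipped_levels(html: str) -> list[tuple[int, int]]:
    """Return (previous_level, level) pairs where the outline jumps deeper by more than one."""
    collector = HeadingCollector()
    collector.feed(html)
    jumps = []
    previous = 0  # 0 means "no heading seen yet"
    for level in collector.levels:
        if level > previous + 1:
            jumps.append((previous, level))
        previous = level
    return jumps
```

Moving back up the outline (an H3 followed by an H2) is not flagged; only jumps that go deeper by more than one level are reported.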
Maintain a high text-to-HTML ratio (content over markup bloat)
Heavy markup, inline scripts and tracking code bury the actual text models care about and waste their context.
Text-to-HTML Ratio Checker
Use descriptive link text instead of 'click here'
Link anchors are strong context signals; meaningful anchors help models understand where a link leads.
Semantic Structure Analyzer
Give images and diagrams meaningful alt/caption text
Text-only crawlers can't see images; alt text and captions are the only way that information reaches an LLM.
Set canonical tags so models ingest one authoritative version
Duplicate URLs split signals and risk an AI citing a parameterised or stale copy of your page.

Structured Data & Entities

Add JSON-LD structured data for your key entities
JSON-LD is the structured-data format Google recommends, and AI answer engines such as ChatGPT, Perplexity and Google AI Overviews increasingly draw on it.
Schema Markup Tester
Fill out schema fully (not just the minimum required fields)
Richer, complete markup gives machines an unambiguous description of who and what you are.
Schema Completeness Scorer
Define an Organization entity with a stable @id
A persistent @id lets you reference the same entity across pages and tie everything back to one identity.
Schema Completeness Scorer
Link entities to Wikidata/Wikipedia via sameAs
sameAs links to your Wikidata Q-ID and Wikipedia page tie your brand to the knowledge-base entries that search and answer engines commonly use to disambiguate entities.
Wikidata Entity Presence Checker
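An Organization entity with a stable `@id` and `sameAs` links might look like the output of the sketch below. The helper names, the `#organization` fragment convention and every value passed in are assumptions for illustration; the Wikidata Q-ID is a placeholder, not a real entity.

```python
import json


def organization_jsonld(name: str, site_url: str, same_as: list[str]) -> dict:
    """Build a schema.org Organization node with a stable @id."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        # One fixed identifier that every page can reference.
        "@id": site_url.rstrip("/") + "/#organization",
        "name": name,
        "url": site_url,
        "sameAs": list(same_as),
    }


def as_script_tag(data: dict) -> str:
    """Wrap the node in the script element used to embed JSON-LD in HTML."""
    return (
        '<script type="application/ld+json">'
        + json.dumps(data, indent=2)
        + "</script>"
    )


org = organization_jsonld(
    "Example Co",
    "https://example.com/",
    [
        "https://www.wikidata.org/wiki/Q00000000",  # placeholder Q-ID
        "https://en.wikipedia.org/wiki/Example_Co",  # placeholder article
    ],
)
print(as_script_tag(org))
```

Other pages can then point at the same entity with `{"@id": "https://example.com/#organization"}` instead of repeating the full description.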
Make sure your brand exists as a Wikidata entity
If there's no Q-ID to point to, models can't disambiguate your brand from similarly-named ones.
Wikidata Entity Presence Checker
Cover the entities and topics LLMs expect for your niche
Filling entity gaps signals topical authority that retrieval and answer engines reward.
Entity Coverage Gap Analyzer

Retrieval / RAG Readiness

Write self-contained passages that make sense out of context
Retrieved chunks often start mid-argument; self-contained passages stop the model from hedging or hallucinating.
Passage Chunk Analyzer
Keep sections roughly chunk-sized (~300–500 words)
Recursive ~512-token chunks topped a Feb-2026 benchmark; sections near that size retrieve cleanly without splitting mid-thought.
Passage Chunk Analyzer
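Section length can be measured directly from a Markdown copy of the page. The sketch below counts words per heading and flags sections outside the ~300–500 word range from this item; the helper names are hypothetical, and words are only a rough proxy for tokens.

```python
import re


def section_word_counts(markdown: str) -> list[tuple[str, int]]:
    """Return (heading, word_count) for each section that contains text."""
    sections = []
    title, words = "(preamble)", 0
    for line in markdown.splitlines():
        heading = re.match(r"^#{1,6}\s+(.*)", line)
        if heading:
            # Close the previous section before starting a new one.
            if words:
                sections.append((title, words))
            title, words = heading.group(1).strip(), 0
        else:
            words += len(line.split())
    if words:
        sections.append((title, words))
    return sections


def off_target(sections, low: int = 300, high: int = 500):
    """Return the sections whose word count falls outside [low, high]."""
    return [(title, count) for title, count in sections if not low <= count <= high]
```

Short sections are not necessarily a problem on their own; the output is more useful as a list of candidates to merge or split than as a pass/fail result.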
Keep one topic per section under a clear heading
Header-based 'by title' chunking keeps each topic in its own retrievable unit, boosting precision.
Passage Chunk Analyzer
Define key terms in plain, standalone sentences
Clean definition blocks are easy for models to extract and quote verbatim as answers.
Definition Block Detector
Front-load the direct answer at the top of each section
Answer-first writing means the most quotable sentence sits where retrieval and summarisation grab it.
Passage Chunk Analyzer
Back claims with statistics and concrete data
Specific numbers are highly citable and signal substance LLMs prefer to quote.
Statistic & Citation Density Scorer
Use explicit question-and-answer formatting where natural
Q&A pairs map directly onto user prompts, making your content an easy retrieval match.
Semantic Structure Analyzer
Measure an overall LLM-optimization score and track it
A single score turns the dozens of signals here into a number you can watch improve over time.
LLMO Score Analyzer

Licensing & AI Usage Policy

Declare your AI usage policy (e.g. an ai.txt file)
A machine-readable policy file states up front how AI systems may use your content, beyond simple allow/block.
ai.txt Generator
Reserve text-and-data-mining rights via TDMRep (tdmrep.json / headers / meta)
TDMRep is a W3C Community Group protocol for expressing the machine-readable rights reservation provided for in CDSM Article 4(3), which the EU AI Act requires general-purpose AI providers to respect.
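A minimal sketch of generating the file follows. It assumes the field names `location`, `tdm-reservation` and `tdm-policy` and the `/.well-known/tdmrep.json` location as described in the TDMRep report; the paths, the policy URL and the `build_tdmrep` helper are hypothetical, and the output should be checked against the current specification before publishing.

```python
import json


def build_tdmrep(rules) -> str:
    """Serialise (location, reserved, policy_url) tuples as tdmrep.json content."""
    entries = []
    for location, reserved, policy_url in rules:
        # 1 reserves text-and-data-mining rights, 0 does not.
        entry = {"location": location, "tdm-reservation": 1 if reserved else 0}
        if policy_url:
            entry["tdm-policy"] = policy_url
        entries.append(entry)
    return json.dumps(entries, indent=2)


# Hypothetical rules: reserve rights on articles, leave press releases open.
TDMREP_JSON = build_tdmrep(
    [
        ("/articles/*", True, "https://example.com/tdm-policy.json"),
        ("/press/*", False, None),
    ]
)
print(TDMREP_JSON)  # intended to be served at /.well-known/tdmrep.json
```

The same reservation can, under the same assumptions, be sent per response as a `tdm-reservation: 1` HTTP header or declared per page as a `tdm-reservation` meta tag.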
Consider an RSL license to set machine-readable terms (attribution / pay-per-crawl)
RSL 1.0 was published in 2025 with backing from publishers including Reddit, Yahoo, Quora and Medium as a standard for declaring AI usage terms and compensation.
Distinguish AI-training vs AI-search vs indexing in your policy
RSL's ai-all / ai-input / ai-index categories let you allow search citation while opting out of training.
Keep policy signals consistent across robots.txt, headers and meta tags
Different crawlers check different signals, so the same intent should appear in every place a bot might look.

Verification & Monitoring

Verify which AI crawlers can actually reach your site
Confirming real reachability catches firewall, WAF or CDN rules that silently block bots your robots.txt allows.
AI Crawler Accessibility Checker
Monitor server logs for AI bot visits (GPTBot, ClaudeBot, PerplexityBot)
Server logs are the most direct evidence of whether AI crawlers are fetching your content and how often.
AI Bot Crawl Log Parser
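Counting hits per bot takes only a regular expression over the access log. The sketch below assumes the common "combined" log format, where the user-agent is the last quoted field; the bot list, the sample lines and the `count_ai_bot_hits` helper are illustrative, and the user-agent strings are simplified.

```python
import re
from collections import Counter

AI_BOTS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot",
)

# Combined log format tail: "request" status size "referer" "user-agent"
UA_RE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"$')


def count_ai_bot_hits(log_lines) -> Counter:
    """Count requests per AI bot by matching names in the user-agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_RE.search(line.strip())
        if not match:
            continue  # not in the assumed log format
        user_agent = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[bot] += 1
                break
    return hits


# Made-up sample lines using documentation IP addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET /docs/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [01/Jan/2026:00:00:02 +0000] "GET /llms.txt HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.10 - - [01/Jan/2026:00:00:03 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.5 - - [01/Jan/2026:00:00:04 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
```

Because this matches on the user-agent string alone, the counts include any client that merely claims to be one of these bots.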
Verify real bots by reverse DNS / published IP ranges
User-agent strings are easily spoofed; OpenAI publishes the IP ranges its crawlers use, so a request claiming to be GPTBot can be checked against them.
AI Bot Crawl Log Parser
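Checking a client address against a vendor's published ranges can be done with Python's `ipaddress` module. In the sketch below, `ip_in_ranges` is a hypothetical helper and the CIDR block is a documentation range used as a stand-in; real ranges have to be loaded from the list each vendor publishes.

```python
import ipaddress


def ip_in_ranges(ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if the address falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# Placeholder only: 192.0.2.0/24 is reserved for documentation,
# not an actual crawler range.
PLACEHOLDER_RANGES = ["192.0.2.0/24"]
```

A request whose user-agent names a bot but whose address is outside the published ranges can be treated as unverified.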
Confirm llms.txt and llms-full.txt return 200 (not 404/redirect)
A broken or redirected file means none of your llms.txt work is reaching models at all.
llms.txt Validator
Don't treat llms.txt as an access-control mechanism
No major AI vendor enforces llms.txt as permission; use robots.txt, headers and licensing for control.
