What is an XML sitemap and why does it matter?

An XML sitemap is a structured file that lists your website's URLs, helping search engines discover and prioritize content. It doesn't guarantee ranking or indexing, but it can speed up discovery. It matters most for large sites, new pages, and content buried deep in your site structure.

How many URLs can an XML sitemap have?

A single sitemap file supports up to 50,000 URLs and a file size of 50MB uncompressed. For larger sites, you'll need a sitemap index file that points to multiple individual sitemaps. Our validator handles both.

Should I include noindex pages in my sitemap?

No. Including noindex pages in your sitemap sends conflicting signals to Google. Your sitemap should only contain canonical, indexable URLs you actually want appearing in search results.

Do priority and changefreq still matter?

No. Google has explicitly said since 2017 that it ignores both fields. They were used by older crawlers but provide no benefit today. You can safely omit them. Keep loc and lastmod, since lastmod is still actively read.

Why does my sitemap show 'discovered but not indexed' in Search Console?

Google knows the URL exists but has not crawled it yet. (A page that was crawled and then left out of the index shows up under a different status, "Crawled, currently not indexed.") Common causes: crawl budget pressure, low quality content across the site (thin, duplicate, scraped), poor internal linking signaling unimportance, or simply low domain authority for new sites. The sitemap is doing its job by surfacing the URL; the crawling and indexing decisions come from elsewhere.

What should I do with deleted pages in my sitemap?

Remove them. A sitemap should only contain URLs you want indexed. Deleted URLs should return 410 (or 404) and be excluded from sitemap regeneration. Listing dead URLs wastes crawl budget and may lead Google to trust the file less as a source of live pages.

Should I separate my sitemap by content type?

Yes, for large sites. Split sitemaps by type (pages, blog, products, images) so you can monitor indexation rate per category in Search Console. A drop in only the blog sitemap's coverage tells you exactly where the issue is. Use a sitemap index file to reference all of them.

XML Sitemap Validator — Free Online

Why an invalid sitemap silently breaks indexation

A broken sitemap is the kind of problem that does not announce itself. Pages keep loading, users keep visiting, but Google is rejecting your sitemap.xml entirely, falling back to whatever URLs it can discover through internal links, and ignoring the new product pages or blog posts you keep wondering why nobody can find.

The reason this fails so quietly is that Google does not email you when your file has a syntax error. It logs a status in Search Console, often under "Couldn't fetch" or, worse, "Success" with zero discovered URLs. If you are not checking the Sitemaps report regularly, you find out weeks later when an editor asks why a launched page does not show up for its own brand name.

This XML sitemap checker pulls your live file, parses it against the sitemaps.org protocol that Google follows, and tells you what is wrong. Malformed tags, encoding mismatches, invalid lastmod values, unescaped ampersands: all of it gets flagged with a line number.

The sitemap protocol, in one section

The Google sitemap protocol is the sitemaps.org spec, jointly adopted by Google, Yahoo, and Microsoft in 2006. A valid sitemap.xml starts with an XML declaration, opens a urlset element with the correct namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"), and lists each page inside a url block.

Inside every url block you put a loc tag, the absolute URL of the page. That is the only required element. Optional siblings include lastmod, changefreq, and priority. The loc must be URL encoded, must match the protocol and host registered in Search Console, and must be shorter than 2,048 characters.
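The structure described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not how this validator or any particular CMS generates its file; the function name build_sitemap and the example URL are assumptions made up for the sketch.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap from (loc, lastmod_or_None) tuples and return it as bytes.

    build_sitemap is a hypothetical helper written for this example.
    """
    # Register the sitemap namespace as the default so the output uses
    # xmlns="..." instead of a generated ns0: prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # loc is the only required child; ElementTree entity-escapes the text.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    # xml_declaration=True (Python 3.8+) emits the <?xml ...?> prolog.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    print(build_sitemap([("https://example.com/", "2024-05-01")]).decode("utf-8"))
```

Building the tree through an XML library rather than string concatenation means ampersands and angle brackets in URLs are escaped for you.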

Most validators stop at "is this well formed XML." This sitemap.xml validator goes further. It checks the namespace, whether each loc resolves to a real 200 response rather than a 404 or redirect chain, and whether your lastmod values are plausible.
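To make the difference between "well formed" and "valid" concrete, here is a rough sketch of the structural checks that do not need the network: the root element and namespace, the presence of loc, whether it is an absolute http(s) URL, and its length. The function name check_structure and the wording of the messages are assumptions for illustration; the status-code and lastmod checks are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_structure(xml_bytes):
    """Return a list of human-readable problems (empty list means none found).

    check_structure is a hypothetical helper; it assumes the input is
    already well-formed XML and raises ParseError otherwise.
    """
    problems = []
    root = ET.fromstring(xml_bytes)
    # A missing or wrong namespace shows up as a different qualified tag name.
    if root.tag != f"{{{NS}}}urlset":
        problems.append(f"unexpected root element: {root.tag}")
        return problems
    for i, url in enumerate(root.findall(f"{{{NS}}}url"), start=1):
        loc = url.findtext(f"{{{NS}}}loc")
        if loc is None or not loc.strip():
            problems.append(f"url #{i}: missing loc")
            continue
        loc = loc.strip()
        parts = urlsplit(loc)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"url #{i}: loc is not an absolute http(s) URL")
        if len(loc) >= 2048:
            problems.append(f"url #{i}: loc is 2,048 characters or longer")
    return problems
```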

Sitemap index files and the 50,000 URL limit

A single sitemap.xml holds a maximum of 50,000 URLs or 50MB uncompressed, whichever comes first. Above that, you split into multiple files and reference them from a sitemap index. The index uses a sitemapindex root element instead of urlset, and each child is a sitemap entry with a loc pointing to one of the child files.

The index itself is also capped at 50,000 entries. Large publishers typically run a flat index pointing to fifty or a hundred segmented child sitemaps grouped by content type or date. This makes it easier to spot which segment is failing, and lets you regenerate one segment without rewriting the whole thing. If you are gzipping, the 50MB limit applies to the uncompressed size.
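The splitting logic is simple enough to sketch. The example below chunks a URL list at the 50,000 limit and builds the index that references the child files. It only enforces the URL count, not the 50MB size cap, and the names chunk_urls and build_index are assumptions for the sketch.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000  # per-file limit for both sitemaps and sitemap indexes


def chunk_urls(urls, size=MAX_ENTRIES):
    """Yield consecutive slices of at most `size` URLs (hypothetical helper)."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_index(child_locs):
    """Build a sitemapindex document pointing at the given child sitemap URLs."""
    if len(child_locs) > MAX_ENTRIES:
        raise ValueError("a sitemap index is also capped at 50,000 entries")
    ET.register_namespace("", NS)
    root = ET.Element(f"{{{NS}}}sitemapindex")
    for loc in child_locs:
        entry = ET.SubElement(root, f"{{{NS}}}sitemap")
        ET.SubElement(entry, f"{{{NS}}}loc").text = loc
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

In practice you would chunk by content type or date first, as described above, and fall back to count-based splitting only when a single segment grows past the limit.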

Common sitemap errors and what they mean

The most frequent failure is malformed XML. An unclosed tag, a missing quote, a stray less-than sign inside a URL parameter; any of these and the parser stops cold. There is no recovery and no best effort rendering. One bad character and the entire sitemap fails to load.
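A strict parser shows the all-or-nothing behavior directly. The sketch below uses Python's standard ElementTree, which stops at the first syntax error and reports where it happened; the helper name find_syntax_error is an assumption, and other parsers word their messages differently.

```python
import xml.etree.ElementTree as ET


def find_syntax_error(xml_bytes):
    """Return None if the document is well formed, else the error location.

    find_syntax_error is a hypothetical helper for illustration.
    """
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        # ParseError.position is a (line, column) tuple for the first error.
        line, column = exc.position
        return {"line": line, "column": column, "message": str(exc)}
    return None


if __name__ == "__main__":
    broken = b"<urlset>\n<url><loc>https://example.com/?a=1&b=2</loc></url>\n</urlset>"
    print(find_syntax_error(broken))
```

Here the unescaped ampersand on the second line is enough to make the whole document unparseable, even though every other character is fine.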

Encoding mismatches are next. Your sitemap.xml syntax declares UTF-8 in the prolog, but the file gets saved with BOM, or as Windows-1252, or with mixed encodings if you concatenated outputs from different scripts. Google rejects the file or treats accented characters as garbage, producing URLs that do not match anything on your server.
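Both symptoms can be detected from the raw bytes before any XML parsing happens. This is a minimal sketch, assuming the file is supposed to be UTF-8; check_encoding is a made-up name, and the sketch reports a byte order mark without taking a position on how any particular crawler treats it.

```python
import codecs


def check_encoding(raw):
    """Return a list of encoding issues found in the raw sitemap bytes.

    check_encoding is a hypothetical helper for illustration.
    """
    issues = []
    offset = 0
    if raw.startswith(codecs.BOM_UTF8):
        issues.append("file starts with a UTF-8 byte order mark")
        offset = len(codecs.BOM_UTF8)
    try:
        raw[offset:].decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first undecodable byte in the slice.
        issues.append(f"byte at offset {offset + exc.start} is not valid UTF-8")
    return issues
```

A file saved as Windows-1252 typically fails at the first accented character, because a byte such as 0xE9 is not a complete UTF-8 sequence on its own.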

Unescaped characters cause silent damage. URLs with ampersands, single quotes, double quotes, and angle brackets must be entity encoded inside loc tags. Invalid lastmod values are the quietest killer: if the timestamp is in the wrong format (Google expects W3C datetime), the field is ignored, the sitemap parses fine, and your update signals are gone.
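Both checks are easy to sketch. The example below entity-escapes a URL for use inside a loc tag and tests whether a lastmod string has the shape of a W3C datetime. The names escape_loc and looks_like_w3c_datetime are assumptions, and the regular expression only checks the shape of the value, so an impossible date such as month 13 would still pass.

```python
import re
from xml.sax.saxutils import escape

# Shapes allowed by W3C datetime: YYYY, YYYY-MM, YYYY-MM-DD, and a full
# timestamp, which must carry a time zone designator (Z or +hh:mm / -hh:mm).
W3C_DATETIME = re.compile(
    r"\d{4}"
    r"(-\d{2}"
    r"(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"
    r")?)?"
)


def escape_loc(url):
    """Entity-escape &, <, >, and both quote characters for use inside <loc>."""
    return escape(url, {"'": "&apos;", '"': "&quot;"})


def looks_like_w3c_datetime(value):
    """Return True if the value has a W3C datetime shape (format check only)."""
    return W3C_DATETIME.fullmatch(value) is not None
```

Note that a timestamp without a time zone, such as 2024-05-01T10:30:00, fails the check: W3C datetime requires the zone designator whenever a time is present.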

Why the URLs inside the file matter more than the XML

Passing an XML well-formedness check is the low bar. Plenty of sitemaps are perfectly valid XML and still drag your indexation down, because the problem is not the markup, it is what the loc tags point at. This validator fetches a sample of the URLs and reports their real status, which is where the interesting failures live. A sitemap stuffed with URLs that 301 to a new path tells Google to crawl the redirect, follow it, and then index the destination, wasting crawl budget on a round trip that should not exist. Only final, canonical, 200-returning URLs belong in the file.

The two worst offenders are redirects and 404s. Redirects in a sitemap mean you migrated URLs and never regenerated the file, so the sitemap still advertises the old addresses. 404s mean you deleted pages but left their URLs in the sitemap, which trains Google to distrust the file as a source of live, indexable pages. A third, subtler problem is including URLs that carry a noindex tag or canonicalize to a different page. Those send Google a contradictory signal: the sitemap says this URL is important, the page says do not index me or index someone else instead. The validator flags these mismatches so the file says exactly one thing about each URL.
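The decision logic behind those flags can be sketched as a small pure function. It assumes something else has already fetched the page and extracted the status code, the canonical link, and the robots meta directive; classify_entry and its labels are made up for the example and are not this validator's actual output.

```python
def classify_entry(loc, status, canonical=None, noindex=False):
    """Classify one sitemap URL from already-fetched facts about the page.

    classify_entry is a hypothetical helper. `canonical` is the page's
    rel=canonical target if it has one; `noindex` is True when the page
    carries a noindex directive.
    """
    if 300 <= status < 400:
        return "redirect"  # sitemap still advertises an old address
    if status in (404, 410):
        return "gone"  # deleted page left in the file
    if status != 200:
        return "error"
    if noindex:
        return "noindex"  # sitemap says index me, page says do not
    if canonical and canonical != loc:
        return "canonicalized elsewhere"
    return "ok"
```

Only entries that come back as "ok" say exactly one thing about the URL; everything else is a candidate for removal or for fixing the page.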

Host and protocol consistency is the last URL-level check. Every loc must use the same scheme and hostname as the property the sitemap is submitted under. A sitemap served on https that lists http URLs, or one on the apex domain that lists www URLs, can have a large share of its entries quietly ignored. These mismatches never show up as XML errors, which is exactly why an automated check earns its place.
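A consistency check of this kind needs nothing more than URL parsing. The sketch below compares the scheme and hostname of each loc against the property URL; find_host_mismatches is a hypothetical name, and the comparison is deliberately strict, so www and apex hosts count as different.

```python
from urllib.parse import urlsplit


def find_host_mismatches(property_url, locs):
    """Return the locs whose scheme or host differs from the property URL.

    find_host_mismatches is a hypothetical helper for illustration.
    """
    expected = urlsplit(property_url)
    want = (expected.scheme.lower(), expected.netloc.lower())
    mismatches = []
    for loc in locs:
        got = urlsplit(loc)
        if (got.scheme.lower(), got.netloc.lower()) != want:
            mismatches.append(loc)
    return mismatches


if __name__ == "__main__":
    print(find_host_mismatches(
        "https://example.com/",
        ["https://example.com/a", "http://example.com/b", "https://www.example.com/c"],
    ))
```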

The lastmod field, and why most sites get it wrong

Of all the optional elements in the sitemap protocol, lastmod is the one Google actually uses. John Mueller and Gary Illyes have both said publicly that lastmod is a meaningful signal for recrawl scheduling, but only when it is honest. Most sites lie to it by accident.

The classic mistake is setting lastmod to the deploy timestamp. Every time you ship a new build, the sitemap regenerates and every URL gets the same fresh lastmod, even though only three pages actually changed. After a few cycles of seeing every URL claim it changed last Tuesday, Google stops trusting your lastmod entirely.

The correct value is the last time the visible content of that specific page meaningfully changed. A typo fix counts. Updating an outdated statistic counts. Reordering navigation across the site does not count for every page. If your CMS does not track this, store a content_updated_at column and only touch it when the body, title, or main content is edited.
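One way to keep the timestamp honest, sketched here under the assumption that each page is stored as a record you control, is to fingerprint the content and only move the date when the fingerprint changes. The record is a plain dict standing in for a database row; the fingerprint field and the function names are hypothetical, while content_updated_at is the column suggested above.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(title, body):
    """Hash the parts of the page that count as visible content."""
    return hashlib.sha256(f"{title}\n{body}".encode("utf-8")).hexdigest()


def refresh_lastmod(record, title, body, now=None):
    """Update content_updated_at only when the title or body actually changed.

    `record` is a dict standing in for a database row (an assumption made
    for this sketch). A redeploy that leaves the content untouched leaves
    the timestamp untouched too.
    """
    fingerprint = content_fingerprint(title, body)
    if record.get("fingerprint") != fingerprint:
        record["fingerprint"] = fingerprint
        moment = now or datetime.now(timezone.utc)
        # An aware UTC datetime serializes as e.g. 2024-05-01T00:00:00+00:00.
        record["content_updated_at"] = moment.isoformat(timespec="seconds")
    return record
```

The sitemap generator then reads content_updated_at for lastmod instead of stamping every URL with the build time.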

changefreq and priority, and why they no longer matter

The spec defines changefreq (always, hourly, daily, weekly, monthly, yearly, never) and priority (0.0 to 1.0). Both sound useful. Both are completely ignored by Google. Gary Illyes said it directly in 2017, and Google has reaffirmed it since. The fields exist because the spec is old, but the modern crawler treats them as decorative.

You can include them, leave them out, or set them to nonsense and your rankings will be identical. If you want to influence crawl behavior in a way that works, focus on an accurate lastmod, a clean internal link structure that surfaces important pages within a few clicks of the homepage, and fast server response (a sitemap with 100,000 URLs is only useful if Googlebot can crawl them before its budget runs out).

News, image, and video sitemap extensions

Google supports three sitemap extensions. The Google News sitemap (xmlns:news) lets approved publishers feed articles into Google News with publication date, title, and language. Articles must be less than 48 hours old, and the file must contain no more than 1,000 URLs.
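The two News constraints translate into a simple filter. The sketch below keeps only articles published within the last 48 hours and caps the result at 1,000, newest first. The article dicts with a published key and the function name select_news_entries are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone


def select_news_entries(articles, now=None, max_age=timedelta(hours=48), limit=1000):
    """Return the articles eligible for a News sitemap, newest first.

    Each article is assumed to be a dict with a timezone-aware `published`
    datetime (a hypothetical shape chosen for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    fresh = [a for a in articles if now - a["published"] < max_age]
    fresh.sort(key=lambda a: a["published"], reverse=True)
    return fresh[:limit]
```

Running this on every regeneration means older articles drop out of the News sitemap on their own while staying in the regular one.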

The image extension (xmlns:image) lets you list up to 1,000 images per URL block. Google deprecated the optional caption, geo location, title, and license tags in 2022, so the image loc is the tag that still carries weight. The video extension (xmlns:video) helps Google show video thumbnails and rich snippets; you provide the thumbnail URL, title, description, content or player URL, and optional metadata. It is not the only route, since Google can also read video structured data on the page, but without either one Google often fails to associate a video with its page.

Submitting, monitoring, and the "discovered but not indexed" trap

Once your sitemap validates clean, submit it in Google Search Console under Sitemaps using the absolute URL. Bing Webmaster Tools accepts the same file. You can also reference the sitemap from robots.txt with a Sitemap: line; it can go anywhere in the file, though most sites put it at the bottom.
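If you want to confirm what your robots.txt actually advertises, the Sitemap: lines are easy to pull out. This is a rough sketch that reads the directive case-insensitively and ignores comments; sitemap_urls_from_robots is a made-up name and the sketch does not fetch anything.

```python
def sitemap_urls_from_robots(robots_txt):
    """Return every URL declared on a Sitemap: line in a robots.txt body.

    sitemap_urls_from_robots is a hypothetical helper for illustration.
    """
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split the directive from its value
        # at the first colon (the URL itself contains later colons).
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    example = "User-agent: *\nDisallow: /admin/\n\nSitemap: https://example.com/sitemap.xml\n"
    print(sitemap_urls_from_robots(example))
```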

Search Console takes hours to days to process the file. The Sitemaps report shows how many URLs were submitted, how many were discovered, and whether the fetch succeeded. Pair it with the Pages report to see which URLs are indexed, excluded, or stuck in limbo.

The most common limbo state is "Discovered, currently not indexed." Google has seen the URL via your sitemap but has not crawled it yet, usually because of crawl budget pressure or thin content signals on the page itself. The fix is rarely the sitemap. It is usually internal links pointing to the orphan page, faster server response, and stronger on-page content. Run this XML sitemap checker any time you ship redesigns or migrations. A clean sitemap will not rank you, but a broken one will quietly cap how much of your site Google can see.

How it works

Enter Your Sitemap URL

Validate Syntax and Structure

Review Issues and Warnings

