Why an invalid sitemap silently breaks indexation
A broken sitemap is the kind of problem that does not announce itself. Pages keep loading, users keep visiting, but Google is rejecting your sitemap.xml entirely, falling back to whatever URLs it can discover through internal links, and ignoring the new product pages or blog posts you keep wondering why nobody can find.
The reason this fails so quietly is that Google does not email you when your file has a syntax error. It logs a status in Search Console, often under "Couldn't fetch" or, worse, "Success" with zero discovered URLs. If you are not checking the Sitemaps report regularly, you find out weeks later when an editor asks why a launched page does not show up for its own brand name.
This XML sitemap checker pulls your live file, parses it against the sitemap protocol Google follows, and tells you what is wrong. Malformed tags, encoding mismatches, invalid lastmod values, and unescaped ampersands all get flagged with a line number.
The sitemap protocol, in one section
What Google calls the sitemap protocol is the sitemaps.org spec, which Google, Yahoo, and Microsoft jointly adopted in November 2006. A valid sitemap.xml starts with an XML declaration, opens a urlset element with the correct namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"), and lists each page inside a url block.
Inside every url block you put a loc tag, the absolute URL of the page. That is the only required element. Optional siblings include lastmod, changefreq, and priority. The loc must be URL encoded, must match the protocol and host registered in Search Console, and must be shorter than 2,048 characters.
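To make the structure concrete, here is a minimal sketch in Python that parses a two-URL sitemap and checks the root element, the namespace, and each loc. The example.com URLs and the check_structure helper are illustrative assumptions, not part of this tool, and the checks cover only the rules described above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Placeholder URLs; swap in your own pages.
MINIMAL_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/launch</loc>
  </url>
</urlset>
"""


def check_structure(xml_text):
    """Return a list of problems found in a urlset sitemap (empty list = OK)."""
    problems = []
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ElementTree reports namespaced tags as "{namespace}name".
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append(f"unexpected root element: {root.tag}")
        return problems
    for i, url in enumerate(root.findall(f"{{{SITEMAP_NS}}}url"), start=1):
        loc = (url.findtext(f"{{{SITEMAP_NS}}}loc") or "").strip()
        if not loc:
            problems.append(f"url #{i}: missing loc")
        elif not loc.startswith(("http://", "https://")):
            problems.append(f"url #{i}: loc is not absolute")
        elif len(loc) >= 2048:
            problems.append(f"url #{i}: loc is 2,048 characters or longer")
    return problems


print(check_structure(MINIMAL_SITEMAP))
```

A urlset without the xmlns attribute fails the first check, because the parser then sees a plain urlset tag rather than one in the sitemaps.org namespace.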
Most validators stop at "is this well formed XML." This sitemap.xml validator goes further. It checks the namespace, whether each loc resolves to a real 200 response rather than a 404 or redirect chain, and whether your lastmod values are plausible.
Sitemap index files and the 50,000 URL limit
A single sitemap.xml holds a maximum of 50,000 URLs or 50MB uncompressed, whichever comes first. Above that, you split into multiple files and reference them from a sitemap index. The index uses a sitemapindex root element instead of urlset, and each child is a sitemap entry with a loc pointing to one of the child files.
The index itself is also capped at 50,000 entries. Large publishers typically run a flat index pointing to fifty or a hundred segmented child sitemaps grouped by content type or date. This makes it easier to spot which segment is failing, and lets you regenerate one segment without rewriting the whole thing. If you are gzipping, the 50MB limit applies to the uncompressed size.
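As a rough sketch of the splitting step, the Python below chunks a URL list into groups of 50,000 and builds a sitemap index that points at the child files. The function names and the child file URLs are assumptions for illustration, and the sketch enforces only the URL count, not the 50MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` URLs, one per child sitemap."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_index(child_sitemap_urls):
    """Return sitemap index XML (bytes) listing each child sitemap file."""
    # Serialize the sitemaps.org namespace as the default xmlns, without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for child_url in child_sitemap_urls:
        entry = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = child_url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Hypothetical site with 120,001 pages, which needs three child files.
urls = [f"https://www.example.com/p/{i}" for i in range(120_001)]
chunks = list(chunk_urls(urls))
children = [
    f"https://www.example.com/sitemaps/pages-{n}.xml"
    for n in range(1, len(chunks) + 1)
]
index_xml = build_index(children)
```

Grouping by content type or date instead of by position works the same way: build one URL list per segment, then chunk each list separately.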
Common sitemap errors and what they mean
The most frequent failure is malformed XML. An unclosed tag, a missing quote, or a stray less-than sign inside a URL parameter is enough to stop the parser cold. There is no recovery and no best effort rendering. One bad character and the entire sitemap fails to load.
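You can reproduce this all-or-nothing behavior with any strict XML parser. The sketch below uses Python's standard library parser, which is an assumption standing in for whatever parser Google runs, and reports the line where parsing stopped. The find_xml_error name is made up for this example.

```python
import xml.etree.ElementTree as ET


def find_xml_error(xml_bytes):
    """Return None if the document is well formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based, column is 0-based
        return line, column, str(err)
    return None


# The raw ampersand in the query string on line 2 makes the whole file unparseable.
broken = (
    b"<urlset>\n"
    b"<url><loc>https://www.example.com/?a=1&b=2</loc></url>\n"
    b"</urlset>"
)
print(find_xml_error(broken))
```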
Encoding mismatches are next. Your sitemap.xml declares UTF-8 in the prolog, but the file gets saved with a byte order mark (BOM), or as Windows-1252, or with mixed encodings if you concatenated outputs from different scripts. Google may reject the file or treat accented characters as garbage, producing URLs that do not match anything on your server.
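A quick way to catch these cases before you upload is to inspect the raw bytes. This is a minimal sketch, assuming the file is meant to be UTF-8. The check_encoding helper is hypothetical and only detects a leading BOM and byte sequences that are not valid UTF-8.

```python
import codecs


def check_encoding(raw):
    """Return a list of encoding problems found in raw sitemap bytes."""
    problems = []
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 byte order mark")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that is not valid UTF-8.
        problems.append(f"invalid UTF-8 at byte offset {err.start}")
    return problems


# An accented URL saved as Windows-1252 instead of UTF-8.
saved_wrong = "<loc>https://www.example.com/café</loc>".encode("cp1252")
print(check_encoding(saved_wrong))
```

A file that passes both checks can still be wrong if the prolog names a different encoding, so compare the declared encoding against what your generator actually writes.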
Escape characters cause silent damage. Ampersands, single quotes, double quotes, and angle brackets in a URL must be entity encoded inside loc tags. Invalid lastmod values are the quietest killer: if the timestamp is in the wrong format (Google expects W3C datetime), the field is ignored, the sitemap parses fine, and your update signals are gone.
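Both problems are easy to guard against in the generator. The sketch below escapes a URL for use inside a loc tag and checks that a lastmod string has the shape of a W3C datetime. The helper names are assumptions, and the pattern checks the format only, so a value such as month 13 would still pass.

```python
import re
from xml.sax.saxutils import escape

# Shapes allowed by W3C datetime: YYYY, YYYY-MM, YYYY-MM-DD, or a full
# timestamp, which must carry a time zone (Z or +hh:mm / -hh:mm).
W3C_DATETIME = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?$"
)


def escape_loc(url):
    """Entity-encode the five characters that are unsafe inside XML text."""
    # escape() handles &, < and > by default; quotes are added explicitly.
    return escape(url, {"'": "&apos;", '"': "&quot;"})


def is_w3c_datetime(value):
    """Return True if value has one of the W3C datetime shapes."""
    return bool(W3C_DATETIME.match(value))


print(escape_loc("https://www.example.com/search?q=a&page=2"))
for candidate in ("2024-05-01", "2024-05-01T14:30:00+00:00", "05/01/2024"):
    print(candidate, is_w3c_datetime(candidate))
```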
The lastmod field, and why most sites get it wrong
Of all the elements in the sitemap protocol, lastmod is the one Google actually uses. John Mueller and Gary Illyes have both said publicly that lastmod is a meaningful signal for recrawl scheduling, but only when it is honest. Most sites feed it false values by accident.
The classic mistake is setting lastmod to the deploy timestamp. Every time you ship a new build, the sitemap regenerates and every URL gets the same fresh lastmod, even though only three pages actually changed. After a few cycles of seeing every URL claim it changed last Tuesday, Google stops trusting your lastmod entirely.
The correct value is the last time the visible content of that specific page meaningfully changed. A typo fix counts. Updating an outdated statistic counts. Reordering navigation across the site does not count for every page. If your CMS does not track this, store a content_updated_at column and only touch it when the body, title, or main content is edited.
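One way to wire this up, sketched in Python under the assumption that each page record carries the content_updated_at value described above. The Page class, the tracked field names, and the apply_edit helper are hypothetical and not tied to any specific CMS.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Fields whose edits count as a content change (an assumption; adjust per CMS).
TRACKED_FIELDS = ("title", "body")


@dataclass
class Page:
    url: str
    title: str
    body: str
    content_updated_at: datetime  # timezone-aware, touched only on content edits


def apply_edit(page, now, **changes):
    """Apply field changes; bump content_updated_at only if tracked content changed."""
    content_changed = any(
        name in TRACKED_FIELDS and getattr(page, name) != value
        for name, value in changes.items()
    )
    for name, value in changes.items():
        setattr(page, name, value)
    if content_changed:
        page.content_updated_at = now


def lastmod_for(page):
    """Format the page's own content timestamp as a W3C datetime in UTC."""
    utc = page.content_updated_at.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S+00:00")


page = Page(
    url="https://www.example.com/pricing",
    title="Pricing",
    body="Plans start at 10 per month.",
    content_updated_at=datetime(2024, 3, 2, 9, 15, tzinfo=timezone.utc),
)
print(lastmod_for(page))
```

The sitemap generator then calls lastmod_for on every build, so a deploy that touches no content leaves every lastmod unchanged.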
changefreq and priority, and why they no longer matter
The spec defines changefreq (always, hourly, daily, weekly, monthly, yearly, never) and priority (0.0 to 1.0). Both sound useful. Both are completely ignored by Google. Gary Illyes said it directly in 2017, and Google has reaffirmed it since. The fields exist because the spec is old, but the modern crawler treats them as decorative.
You can include them, leave them out, or set them to nonsense and your rankings will be identical. If you want to influence crawl behavior in a way that works, focus on an accurate lastmod, a clean internal link structure that surfaces important pages within a few clicks of the homepage, and fast server response (a sitemap index listing 100,000 URLs is only useful if Googlebot can crawl them before its budget runs out).
News, image, and video sitemap extensions
Google supports three sitemap extensions. The Google News sitemap (xmlns:news) lets news publishers feed articles into Google News with publication name, language, publication date, and title. Only articles published in the last 48 hours belong in it, and the file must contain no more than 1,000 URLs.
The image extension (xmlns:image) lets you list up to 1,000 images per URL block. The extension also defined optional caption, geo location, title, and license tags, but Google deprecated them in 2022 and now reads only the image loc. The video extension (xmlns:video) is the most reliable way to get video thumbnails and rich snippets; you provide the thumbnail URL, title, description, content or player URL, and optional metadata. Without it, Google often fails to associate a video with its page.
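The two News limits are simple to check mechanically. Below is a minimal Python sketch that parses a news sitemap and flags a file with more than 1,000 URLs or an article older than 48 hours. The sample URLs, the publication name, and the check_news_sitemap helper are illustrative assumptions, and date-only values are treated as midnight UTC.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}
MAX_NEWS_URLS = 1000
MAX_AGE = timedelta(hours=48)

NEWS_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.example.com/news/fresh</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-05-02T08:00:00+00:00</news:publication_date>
      <news:title>Fresh story</news:title>
    </news:news>
  </url>
  <url>
    <loc>https://www.example.com/news/stale</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-04-20</news:publication_date>
      <news:title>Stale story</news:title>
    </news:news>
  </url>
</urlset>
""".encode("utf-8")


def check_news_sitemap(xml_bytes, now):
    """Return problems with the URL count or article age; `now` must be timezone-aware."""
    problems = []
    urls = ET.fromstring(xml_bytes).findall("sm:url", NS)
    if len(urls) > MAX_NEWS_URLS:
        problems.append(f"{len(urls)} URLs exceeds the {MAX_NEWS_URLS} limit")
    for url in urls:
        loc = url.findtext("sm:loc", default="(no loc)", namespaces=NS)
        raw = url.findtext("news:news/news:publication_date", namespaces=NS)
        if raw is None:
            problems.append(f"{loc}: missing news:publication_date")
            continue
        # Older Python versions do not accept a trailing Z in fromisoformat.
        published = datetime.fromisoformat(raw.strip().replace("Z", "+00:00"))
        if published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        if now - published > MAX_AGE:
            problems.append(f"{loc}: published more than 48 hours ago")
    return problems


print(check_news_sitemap(NEWS_SAMPLE, datetime(2024, 5, 3, tzinfo=timezone.utc)))
```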
Submitting, monitoring, and the "discovered but not indexed" trap
Once your sitemap validates clean, submit it in Google Search Console under Sitemaps using the absolute URL. Bing Webmaster Tools accepts the same file. You can also reference the sitemap from robots.txt with a Sitemap: line, which can sit anywhere in the file and must use the absolute URL.
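To confirm that the robots.txt reference is readable, you can parse the file the way a crawler would. This sketch uses Python's standard robots.txt parser on an inline example. The file contents and the sitemaps_from_robots name are assumptions, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents.
ROBOTS_TXT = """User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap_index.xml
"""


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in robots.txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


print(sitemaps_from_robots(ROBOTS_TXT))
```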
Search Console takes hours to days to process the file. The Sitemaps report shows how many URLs were submitted, how many were discovered, and whether the fetch succeeded. Pair it with the Pages report to see which URLs are indexed, excluded, or stuck in limbo.
The most common limbo state is "Discovered, currently not indexed." Google has seen the URL via your sitemap but has not crawled it yet, usually because of crawl budget pressure or thin content signals on the page itself. The fix is rarely the sitemap. It is usually a mix of adding internal links that point to the orphan page, speeding up server response, and strengthening on-page content.
Run this XML sitemap checker any time you ship redesigns or migrations. A clean sitemap will not rank you, but a broken one will quietly cap how much of your site Google can see.