Who actually needs a sitemap
A sitemap generator is one of those tools people reach for out of habit, even when they do not need one. Google's documentation has long said that small, well-linked sites of roughly five hundred pages or fewer are usually discovered in full without one. Before you spin up an XML sitemap creator, work out whether you genuinely benefit.
Large sites are the obvious case. If you publish a thousand product pages, ten thousand programmatic landing pages, or a deep blog archive, a sitemap is the cheapest way to make sure crawlers see every URL. New sites come next, because you have no inbound links yet. Third are sites with weak internal linking, where deep pages are orphaned or buried four or five clicks behind faceted navigation.
Fourth are sites with non-HTML content: PDFs in a documentation folder, images on a CDN subdomain, video embedded inside React components, podcast episodes loaded through an audio player. None of these get discovered reliably from internal anchors. A sitemap lists them explicitly, which is often the difference between being indexed and being invisible.
Static versus dynamic sitemap generation
There are two ways to create sitemap files. Static generation builds at deploy time, ships a flat sitemap.xml in your public folder, and serves it as a plain asset. Dynamic generation builds at request time through a server route that queries your database, formats the urlset, and streams the response. Both work. Neither is universally better.
Static makes sense for marketing sites, documentation, and anything where content changes a few times per week. The build is cheap, the file is cacheable at the CDN edge, and you avoid the risk of a database hiccup taking your sitemap offline. The trade-off is staleness: if you publish between deploys, the sitemap lags until the next build.
Dynamic generation wins for ecommerce catalogs that change hourly, news sites that publish all day, and any app where URL inventory is live data. You generate sitemap.xml on demand, set a sensible cache header (an hour is usually fine), and let the server keep it fresh. Pair this with on-publish triggers if you want updates the instant an article goes live.
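A minimal sketch of the dynamic approach, written as a plain WSGI callable in Python. The fetch_entries function is a hypothetical stand-in for your database query and the example.com URLs are placeholders; the one-hour max-age mirrors the cache window suggested above.

```python
from xml.sax.saxutils import escape


def fetch_entries():
    """Hypothetical stand-in for a database query: (absolute URL, lastmod) pairs."""
    return [
        ("https://example.com/", "2024-05-01"),
        ("https://example.com/articles/launch-notes", "2024-05-03"),
    ]


def render_sitemap(entries):
    """Format the urlset as UTF-8 bytes, escaping XML special characters in URLs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for loc, lastmod in entries:
        lines.append(f"  <url><loc>{escape(loc)}</loc><lastmod>{lastmod}</lastmod></url>")
    lines.append("</urlset>")
    return "\n".join(lines).encode("utf-8")


def sitemap_app(environ, start_response):
    """WSGI app that builds the sitemap at request time."""
    body = render_sitemap(fetch_entries())
    headers = [
        ("Content-Type", "application/xml; charset=utf-8"),
        ("Cache-Control", "public, max-age=3600"),  # let caches hold it for an hour
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Any WSGI server can mount sitemap_app at /sitemap.xml; the standard library's wsgiref.simple_server.make_server is enough to try it locally.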
What a valid sitemap actually contains
The format is simple. A urlset element wraps the document, and each URL is a url element containing a loc (the absolute URL) and a lastmod (the date the content last meaningfully changed). The priority and changefreq elements still appear in tutorials, but Google ignores both, so skip them and save bytes.
The XML namespace declaration matters more than people think. Use the standard sitemaps.org namespace, and add the image, video, or news namespaces only if you include those extensions. Files must be UTF-8, URLs must be absolute, and every URL must use the same protocol and host as the location the sitemap itself is served from. Mixing http and https, or www and non-www, is a quiet way to get half your file ignored.
Include canonical, indexable URLs that return 200 and serve real content. That is the entire mental model for a sitemap builder. If a URL would not be a useful search result, it should not be in the file.
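The rules above can be sketched with Python's standard library. This is an illustration under assumptions, not a full validator: build_urlset is a hypothetical helper name, the example.com URLs are placeholders, and the only checks are that each URL is absolute and shares the sitemap's protocol and host.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", SITEMAP_NS)  # serialize as the default namespace


def build_urlset(entries, sitemap_url):
    """Build a urlset from (loc, lastmod) pairs, rejecting off-origin URLs."""
    origin = urlsplit(sitemap_url)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        parts = urlsplit(loc)
        # Relative URLs have an empty scheme and host, so they fail this check too.
        if (parts.scheme, parts.netloc) != (origin.scheme, origin.netloc):
            raise ValueError(f"{loc!r} does not match the sitemap's protocol and host")
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    # Passing an encoding makes tostring return bytes with an XML declaration.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


xml_bytes = build_urlset(
    [("https://example.com/guides/sitemaps", "2024-05-01")],
    "https://example.com/sitemap.xml",
)
```

Raising on a mismatch is a deliberate choice here: a build that fails loudly is easier to notice than a file that crawlers quietly half-ignore.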
The exclude list (where most sitemaps go wrong)
The fastest way to weaken a sitemap is to dump every URL the site can produce into it. Crawlers treat the sitemap as a hint about which pages you consider important. Include junk, and you are telling Google that junk is part of your priority set.
Exclude noindexed pages, because including a URL you told Google to ignore is a contradictory signal. Exclude redirect sources; only the target belongs in the sitemap, never the URL that 301s to it. Exclude parameter URLs that exist for tracking or filtering, like utm_source variants or session ids. They bloat the file and create duplicate content patterns crawlers waste budget on.
Exclude paginated archives whose content is consolidated elsewhere. If pages two through fifty of a category just rehash links to articles that have their own canonical home, listing them adds nothing. The article URLs themselves belong in the sitemap; the pagination wrappers usually do not. The same logic applies to tag pages, internal search results, and faceted filter combinations. If a URL exists primarily for navigation rather than as a destination, leave it out.
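One way to encode the exclude list is a single predicate that every candidate URL must pass. The sketch below is illustrative only: the Page record, the parameter names in EXCLUDED_PARAMS, and the path prefixes in NAVIGATION_PREFIXES are all assumptions about a hypothetical site and should be replaced with your own URL scheme.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qsl, urlsplit

# Assumed parameter names and path prefixes; adjust to your own site.
EXCLUDED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "page", "sort", "filter"}
NAVIGATION_PREFIXES = ("/tag/", "/search")


@dataclass
class Page:
    """Hypothetical crawl record for one URL."""
    url: str
    status: int = 200
    noindex: bool = False
    canonical: Optional[str] = None


def belongs_in_sitemap(page: Page) -> bool:
    if page.status != 200:  # redirect sources, 404s, server errors
        return False
    if page.noindex:  # contradictory signal if listed
        return False
    if page.canonical is not None and page.canonical != page.url:
        return False  # only the canonical target belongs in the file
    parts = urlsplit(page.url)
    params = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    if params & EXCLUDED_PARAMS:  # tracking, session, pagination, filter variants
        return False
    if parts.path.startswith(NAVIGATION_PREFIXES):  # tag pages, internal search
        return False
    return True
```

For example, belongs_in_sitemap(Page("https://example.com/articles/a")) passes, while the same URL with ?utm_source=newsletter appended, or with status=301, is filtered out.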
Image and video sitemaps for media-heavy sites
If images or video are central to your value (ecommerce store, photography portfolio, recipe site, streaming platform), a regular sitemap leaves discovery on the table. Image and video extensions let you describe media that lives on a page, even when the media is loaded by JavaScript or lives on a separate CDN host. Video entries carry metadata such as a title and description alongside the media URL. For images, Google dropped support for the title, caption, and license tags in 2022, so the asset URL is the part that matters.
For image sitemaps, list up to one thousand images per page entry, with image:loc pointing at the full asset URL. This is useful when you serve images through next-gen formats or responsive srcsets where the canonical asset is not obvious from the HTML. For video, the schema is richer: you supply a title, thumbnail, content or player URL, duration, publication date, and description. Done well, this drives video carousels and image pack appearances that pure HTML rarely earns.
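A short sketch of an image sitemap entry, using the Google image extension namespace. The image_sitemap helper and the example.com and cdn.example.com URLs are hypothetical; the sketch emits only image:loc and enforces the thousand-image cap per page.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
MAX_IMAGES_PER_PAGE = 1000

ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def image_sitemap(pages):
    """pages maps a page URL to the list of image asset URLs shown on that page."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        if len(image_urls) > MAX_IMAGES_PER_PAGE:
            raise ValueError(f"{page_url} lists more than {MAX_IMAGES_PER_PAGE} images")
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


image_xml = image_sitemap({
    "https://example.com/products/widget": [
        "https://cdn.example.com/img/widget-front.webp",
        "https://cdn.example.com/img/widget-side.webp",
    ],
})
```

Note that the page URL stays on the site's own host while the image URLs point at the CDN, which is the separate-host case described above.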
News sitemaps and the 48-hour publication rule
News sitemaps are a separate format with one strict constraint: only articles published within the last forty-eight hours belong. Anything older gets removed, because the news index is built around freshness. Older pieces do not belong in the file, and leaving them in runs against Google's guidelines for news sitemaps.
The schema requires a publication name, language, publication date in ISO 8601 format, and article title. Keep it under a thousand URLs per file and regenerate on a tight cadence. Many publishers rebuild every few minutes, because the gap between being listed now and being listed an hour from now is the gap between Top Stories placement and missing the cycle. Standard sitemaps handle long-tail discovery; news sitemaps handle the breaking moment.
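The freshness window and the required fields can be sketched as follows. This assumes each article is a dict with url, title, and a timezone-aware published datetime; those field names, the helper names, and the "Example Times" publication are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("news", NEWS_NS)


def fresh_articles(articles, now, window=timedelta(hours=48), limit=1000):
    """Keep articles published inside the window, newest first, capped at limit."""
    cutoff = now - window
    fresh = [a for a in articles if a["published"] >= cutoff]
    fresh.sort(key=lambda a: a["published"], reverse=True)
    return fresh[:limit]


def news_sitemap(articles, publication, language, now):
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for article in fresh_articles(articles, now):
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["url"]
        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        pub = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(pub, f"{{{NEWS_NS}}}name").text = publication
        ET.SubElement(pub, f"{{{NEWS_NS}}}language").text = language
        # isoformat() on an aware datetime yields an ISO 8601 timestamp with offset.
        ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = article["published"].isoformat()
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = article["title"]
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


now = datetime(2024, 5, 3, 12, 0, tzinfo=timezone.utc)
articles = [
    {"url": "https://example.com/news/new", "title": "New story", "published": now - timedelta(hours=2)},
    {"url": "https://example.com/news/old", "title": "Old story", "published": now - timedelta(hours=72)},
]
news_xml = news_sitemap(articles, "Example Times", "en", now)
```

Passing now in explicitly, rather than reading the clock inside the function, keeps the cutoff testable and makes a rebuild every few minutes deterministic.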
Submitting your sitemap (use both methods)
There are two ways to tell crawlers where your sitemap lives, and the right answer is both. Add a Sitemap line to your robots.txt pointing at the absolute URL of the file. Every well-behaved crawler reads robots.txt on first contact, so this gets you discovery from Google, Bing, Yandex, DuckDuckGo, and AI crawlers like GPTBot and ClaudeBot in one move.
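To confirm the Sitemap line is readable the way a crawler would read it, Python's standard urllib.robotparser can parse the file and report the declared sitemaps (site_maps() requires Python 3.8 or later). The robots.txt content and the declared_sitemaps helper below are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt; the Sitemap line takes an absolute URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search

Sitemap: https://example.com/sitemap.xml
"""


def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when nothing is declared


found = declared_sitemaps(ROBOTS_TXT)
```

A check like this fits well in a deploy pipeline, where a missing or mistyped Sitemap line would otherwise go unnoticed.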
Then submit explicitly in Google Search Console and Bing Webmaster Tools. Search Console unlocks coverage reports showing how many submitted URLs are indexed, excluded, and why. That feedback loop is where you catch canonical conflicts, soft 404s, and crawl budget waste. Robots.txt gets you discovered; Search Console tells you whether discovery is translating into indexation.
Maintenance: lastmod, the 50k cap, and automation
A sitemap is not a one-time deliverable. The most valuable signal in the file is lastmod, and it is also the field most often filled with nonsense. Google has said that if your lastmod is unreliable (for example, every URL updated to today on every build), it stops trusting the field. Update lastmod only when page content has meaningfully changed, not on every redeploy.
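One way to keep lastmod honest is to tie it to a hash of the page's main content, so a redeploy with identical content leaves the date alone. This is a sketch under assumptions: refresh_lastmod and the records dictionary are hypothetical, and deciding what counts as "main content" (for instance, the article body without templates or timestamps) is left to you.

```python
import hashlib
from datetime import date


def refresh_lastmod(records, url, content, today):
    """Update records[url] only when the content hash changes; return the lastmod.

    records maps url -> {"hash": str, "lastmod": str} and is mutated in place.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    previous = records.get(url)
    if previous is None or previous["hash"] != digest:
        records[url] = {"hash": digest, "lastmod": today.isoformat()}
    return records[url]["lastmod"]


records = {}
first = refresh_lastmod(records, "https://example.com/a", "Original body", date(2024, 5, 1))
rebuild = refresh_lastmod(records, "https://example.com/a", "Original body", date(2024, 5, 2))
edited = refresh_lastmod(records, "https://example.com/a", "Revised body", date(2024, 5, 3))
```

Here the second call models a redeploy with no content change, so the date stays at the first value; only the third call, with edited content, moves it forward. In practice records would be persisted between builds, for example as a JSON file or a database table.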
The protocol caps a single sitemap at fifty thousand URLs and fifty megabytes uncompressed. Past that, split into multiple files referenced from a sitemap index. A free sitemap generator should handle this split automatically; an online sitemap tool that returns a bloated sixty-thousand-URL file is broken by definition. Group splits logically (one per content type, or one per language) so coverage reports stay readable.
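The split can be sketched as chunking by count and then writing an index that points at each chunk. The helper names and the https://example.com/sitemaps/ base path are assumptions, and the sketch enforces only the URL count; the fifty-megabyte size limit would need a separate check on the serialized files.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000
ET.register_namespace("", SITEMAP_NS)


def split_urls(urls, size=MAX_URLS_PER_FILE):
    """Slice the URL list into consecutive chunks of at most size entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def plan_files(urls, base="https://example.com/sitemaps/"):
    """Map each child sitemap URL to the chunk of page URLs it should contain."""
    chunks = split_urls(urls)
    names = [f"{base}sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
    return dict(zip(names, chunks))


def sitemap_index(sitemap_urls):
    """Build the sitemapindex document that references every child sitemap."""
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = sitemap_url
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


urls = [f"https://example.com/p/{i}" for i in range(60_000)]
files = plan_files(urls)
index_xml = sitemap_index(list(files))
```

With sixty thousand URLs this plans two child files, one of fifty thousand and one of ten thousand. To group by content type or language instead, call plan_files once per group with a different base or file prefix.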
The last piece is automation. Generating sitemap.xml manually is fine for a portfolio site, but anything past that should regenerate on publish. Wire your CMS to trigger a sitemap rebuild when content goes live, schedule a cron job, or build it into your CI pipeline. The sitemap is only as useful as it is current; the moment it falls behind real content, it stops helping crawlers and starts misleading them.