CONCEPT Cited by 3 sources

Sitemap¶

Definition¶

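A sitemap is an XML file listing the URLs a site wants crawlers to discover (typically all of its public pages), plus optional per-URL metadata (last-modified date, change frequency, priority). The format is defined by the Sitemaps protocol at sitemaps.org, first introduced by Google in 2005. A site typically publishes the sitemap URL in robots.txt via a Sitemap: directive.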
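
To make the shape concrete, here is a minimal sketch in Python's standard library that reads the Sitemap: directive from a robots.txt body and extracts the per-URL fields from a sitemap document. The example.com URLs and both sample documents are made up for illustration; a real agent would fetch them over HTTP.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# XML namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical robots.txt and sitemap bodies (assumptions, not real site data).
ROBOTS_TXT = """\
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-04-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/docs/intro</loc>
    <lastmod>2026-03-15</lastmod>
  </url>
</urlset>
"""


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return every URL advertised by a Sitemap: line in robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: directive is present.
    return parser.site_maps() or []


def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one dict per <url> entry; optional fields are None if absent."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


print(sitemap_urls_from_robots(ROBOTS_TXT))
for entry in parse_sitemap(SITEMAP_XML):
    print(entry["loc"], entry["lastmod"])
```

Only the loc element is required per entry, so the sketch treats the other three fields as optional rather than assuming they are present.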

Why it exists¶

Before sitemaps, crawlers had to traverse the link graph: follow every <a href> from the home page, discover pages, then re-traverse on each crawl cycle. Sitemaps let the site hand the crawler a URL list directly, with freshness metadata that short-circuits re-crawls of unchanged pages.
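
The freshness short-circuit can be sketched as a filter over (URL, lastmod) pairs. This is an illustrative sketch, not how any particular crawler works: the function names, the sample entries, and the choice to compare at calendar-date granularity are all assumptions.

```python
from datetime import date
from typing import Optional


def parse_lastmod(value: Optional[str]) -> Optional[date]:
    """Return the calendar date of a <lastmod> value, or None if it is
    missing or not in a YYYY-MM-DD[...] form."""
    if not value:
        return None
    try:
        # Keep only the date part so full timestamps are handled too.
        return date.fromisoformat(value.strip()[:10])
    except ValueError:
        return None


def urls_to_recrawl(entries, last_crawl: date) -> list[str]:
    """Select URLs whose lastmod is on or after the last crawl date.

    Entries with no usable lastmod are re-crawled, since nothing proves
    they are unchanged. The comparison is inclusive because a page may
    have changed later on the same day it was crawled.
    """
    selected = []
    for loc, lastmod in entries:
        modified = parse_lastmod(lastmod)
        if modified is None or modified >= last_crawl:
            selected.append(loc)
    return selected


# Hypothetical sitemap entries: (loc, lastmod).
entries = [
    ("https://example.com/", "2026-04-01"),
    ("https://example.com/docs/intro", "2026-03-15"),
    ("https://example.com/blog/new", "2026-04-10T08:30:00+00:00"),
    ("https://example.com/about", None),
]

for url in urls_to_recrawl(entries, last_crawl=date(2026, 4, 5)):
    print(url)
```

With a last crawl on 2026-04-05, the first two entries are skipped and only the newer blog post and the undated page are fetched again.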

Role for AI agents¶

A sitemap is the agent's map of everything the site has. Unlike llms.txt (which is curated and compact), sitemaps are exhaustive and verbose. An agent typically consults the sitemap via the pointer in robots.txt, picks candidate URLs by path (the sitemap itself carries no titles), and fetches those with markdown content negotiation where supported.
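
The selection and fetch steps might look like the sketch below. The keyword-in-path scoring, the function names, and the sample URLs are illustrative assumptions, and the Accept: text/markdown header is assumed to be how a markdown-capable server is asked for markdown. The requests are built but not sent.

```python
from urllib.parse import urlparse
from urllib.request import Request


def pick_candidates(urls, keywords, limit=5):
    """Rank sitemap URLs by how many keywords appear in their path."""
    wanted = [k.lower() for k in keywords]
    scored = []
    for url in urls:
        path = urlparse(url).path.lower()
        score = sum(1 for k in wanted if k in path)
        if score:
            scored.append((score, url))
    # Stable sort: ties keep their sitemap order.
    scored.sort(key=lambda pair: -pair[0])
    return [url for _, url in scored[:limit]]


def markdown_request(url):
    """Build a request that prefers markdown and falls back to HTML."""
    return Request(url, headers={"Accept": "text/markdown, text/html;q=0.9"})


# Hypothetical URLs, as if read from a sitemap.
urls = [
    "https://example.com/",
    "https://example.com/docs/api/auth",
    "https://example.com/blog/2026/launch",
    "https://example.com/docs/api/rate-limits",
    "https://example.com/docs/guides/auth-quickstart",
]

for url in pick_candidates(urls, ["auth", "api"]):
    req = markdown_request(url)
    print(req.full_url, req.get_header("Accept"))
```

Path matching is a crude proxy for relevance, which is the point of the limitations listed next: without titles or summaries, the path is all the agent has to go on before fetching.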

Limitations for agents¶

No semantic content — just URLs + metadata; the agent still has to fetch each page to understand what's on it.
Directory-listing pages pollute sitemaps — they're structurally valid URLs with zero semantic value. Cloudflare's 2026-04-17 dogfood explicitly removes ~450 directory-listing pages from its llms.txt for exactly this reason, while leaving them in the sitemap.
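No context-friendly size cap: the protocol limits a single file to 50,000 URLs (50 MB uncompressed), and sitemap index files chain many such files together, so very large sites publish far more than fits in an agent's context window. Sitemaps are designed for machine-scale parsing, not for a single LLM read.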
Flat URL list, no hierarchy or titles — Vercel's 2026-04-21 framing. XML sitemaps give the agent no indication of what each page is about or how pages relate to each other; markdown sitemaps are the hierarchical-titled alternative.
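
One way an agent can cope with a large, flat list is to summarize it by path before reading any of it in full. The sketch below groups URLs by their first path segment. The function name, the "(root)" label, and the sample URLs are assumptions for illustration, and the grouping recovers only the hierarchy implied by URL paths, not titles or descriptions.

```python
from collections import Counter
from urllib.parse import urlparse


def section_counts(urls) -> Counter:
    """Count URLs per top-level path segment, e.g. /docs/... -> "docs"."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        counts[segments[0] if segments else "(root)"] += 1
    return counts


# Hypothetical flat URL list, as if read from a sitemap.
urls = [
    "https://example.com/",
    "https://example.com/docs/a",
    "https://example.com/docs/b",
    "https://example.com/docs/api/c",
    "https://example.com/blog/x",
]

# Print the largest sections first as a compact overview.
for section, count in section_counts(urls).most_common():
    print(f"{section}: {count}")
```

A summary like this fits in a prompt even when the full sitemap does not, at the cost of hiding individual URLs until the agent drills into a section.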

Canonical agent-era role¶

The Agent Readiness Score (concepts/agent-readiness-score) grades sitemap presence under the Agent Discovery dimension — alongside robots.txt and the Link: response header — as one of three discovery primitives an agent can use to enumerate a site's content.

Seen in¶

sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready — canonical Agent-Discovery-dimension instance in the Agent Readiness Score.
sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process: canonical wiki instance of rendering-strategy-neutrality. "Having an updated sitemap.xml significantly reduces, if not eliminates, the time-to-discovery differences between different rendering patterns." With a sitemap, Google doesn't need to link-graph-traverse to find URLs, and SSG / ISR / SSR / CSR all benefit equally from the direct URL list. CSR benefits most because its link-discovery is weakest. Also canonicalises <lastmod> as a re-crawl signal for large sites. Composes with concepts/rendering-strategy-crawl-efficiency-tradeoff.
sources/2026-04-21-vercel-making-agent-friendly-pages-with-content-negotiation — the contrast instance. Vercel frames the flat XML sitemap as semantically thin for LLM agents and introduces the markdown sitemap as the hierarchical-titled alternative for agent discovery.

Related¶

concepts/robots-txt — where sitemaps are typically advertised.
concepts/llms-txt — curated, agent-centric alternative to the exhaustive sitemap.
concepts/markdown-sitemap — the markdown/hierarchical shape for agent consumption.
concepts/agent-readiness-score — where sitemap presence is graded.