CONCEPT Cited by 3 sources
Sitemap¶
Definition¶
A sitemap is an XML file intended to list every crawlable URL on a site, plus
optional per-URL metadata (last-modified, change frequency, priority). The
standard is sitemaps.org (2005). A
site typically publishes the sitemap URL in
robots.txt via a Sitemap: directive.
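To make the shape concrete, here is a minimal sketch of reading a sitemap with Python's standard library. The sample document, the `example.com` URLs, and the `parse_sitemap` helper are illustrative assumptions, not taken from any cited source; only the namespace URI and element names come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by <urlset> documents in the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap body; example.com and its paths are placeholders.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/intro</loc>
    <lastmod>2026-04-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/blog/</loc>
  </url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; optional fields are None when absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


entries = parse_sitemap(SAMPLE_SITEMAP)
```

Only `<loc>` is required per entry, which is why the second entry comes back with `None` for the three metadata fields.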
Why it exists¶
Before sitemaps, crawlers had to traverse the link graph —
follow every <a href> from the home page, discover pages, then
re-traverse on each crawl cycle. Sitemaps let the site hand the
crawler a URL list directly, with freshness metadata that short-
circuits re-crawls of unchanged pages.
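The freshness short-circuit can be sketched as a simple filter over `(loc, lastmod)` pairs. This is an illustration under stated assumptions, not how any particular crawler works: the `urls_to_recrawl` helper is hypothetical, and it assumes `lastmod` values begin with a full `YYYY-MM-DD` date (the protocol also permits coarser forms, which this sketch does not handle).

```python
from datetime import date


def urls_to_recrawl(entries, last_crawl):
    """Select URLs that may have changed since last_crawl.

    entries: iterable of (loc, lastmod) pairs, lastmod a string or None.
    Entries without lastmod are always kept, since nothing vouches for them.
    """
    due = []
    for loc, lastmod in entries:
        if lastmod is None:
            due.append(loc)
            continue
        # Assumption: the value starts with YYYY-MM-DD; any time part is ignored.
        modified = date.fromisoformat(lastmod[:10])
        # Same-day changes are re-fetched: the date alone cannot order them.
        if modified >= last_crawl:
            due.append(loc)
    return due


sample = [
    ("https://example.com/a", "2026-04-10"),
    ("https://example.com/b", "2026-03-01"),
    ("https://example.com/c", None),
    ("https://example.com/d", "2026-04-05T08:00:00+00:00"),
]
due = urls_to_recrawl(sample, date(2026, 4, 1))
```

Here only `/b` is skipped, because its `lastmod` predates the last crawl.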
Role for AI agents¶
The sitemap is the agent's map of everything the site has. Unlike
llms.txt (which is curated and compact),
sitemaps are exhaustive and verbose. An agent typically
finds the sitemap via the pointer in robots.txt, picks
candidate URLs by path (XML sitemaps carry no titles), and fetches those with
markdown content
negotiation where supported.
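The discovery flow above might look roughly like the following, using only the Python standard library. The robots.txt body, the `example.com` URLs, and the three helper names are hypothetical. The `Accept: text/markdown` header is an assumption about how a site exposes markdown negotiation; a real site may use a different mechanism or none at all. Nothing here touches the network: the sketch only parses text and builds a request object.

```python
from urllib import request, robotparser
from urllib.parse import urlparse

# Hypothetical robots.txt body advertising a sitemap.
ROBOTS_TXT = """User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""


def sitemap_urls_from_robots(robots_text):
    """Return the URLs named by Sitemap: directives (empty list if none)."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


def pick_candidates(urls, path_prefix):
    """Keep URLs whose path starts with the given prefix."""
    return [u for u in urls if urlparse(u).path.startswith(path_prefix)]


def markdown_request(url):
    """Build (but do not send) a request that asks for markdown.

    Assumption: the server honours 'Accept: text/markdown'.
    """
    return request.Request(url, headers={"Accept": "text/markdown"})


sitemaps = sitemap_urls_from_robots(ROBOTS_TXT)
candidates = pick_candidates(
    ["https://example.com/docs/intro", "https://example.com/blog/post-1"],
    "/docs/",
)
req = markdown_request(candidates[0])
```

Sending `req` with `urllib.request.urlopen` would be the next step; an agent should still check the response's `Content-Type`, since a server that ignores the header will simply return HTML.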
Limitations for agents¶
- No semantic content: just URLs + metadata; the agent still has to fetch each page to understand what's on it.
- Directory-listing pages pollute sitemaps: they're structurally valid URLs with zero semantic value. Cloudflare's 2026-04-17 dogfood explicitly removes ~450 directory-listing pages from its llms.txt for exactly this reason, while leaving them in the sitemap.
- No agent-scale size cap: the protocol limits a single file to 50,000 URLs or 50 MB uncompressed, but a sitemap index can chain many such files, so very large sites still produce sitemaps too big to fit in an agent's context window; they're designed for machine-scale parsing, not for a single LLM read.
- Flat URL list, no hierarchy or titles: Vercel's 2026-04-21 framing. XML sitemaps give the agent no indication of what each page is about or how pages relate to each other; markdown sitemaps are the hierarchical-titled alternative.
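As a rough illustration of the last point, the only structure an agent can recover from a flat XML sitemap is whatever the URL paths happen to encode. The sketch below buckets URLs by their first path segment; `group_by_section` and the sample URLs are hypothetical, and this is a crude path heuristic, not the markdown sitemap format, which carries real titles and hierarchy.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_section(urls):
    """Bucket URLs by first path segment; the site root goes under '/'."""
    sections = defaultdict(list)
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        key = segments[0] if segments else "/"
        sections[key].append(url)
    return dict(sections)


grouped = group_by_section([
    "https://example.com/",
    "https://example.com/docs/intro",
    "https://example.com/docs/api/auth",
    "https://example.com/blog/post-1",
])
```

The grouping says nothing about what `/docs/api/auth` actually covers, which is the gap the titled, hierarchical alternative is meant to close.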
Canonical agent-era role¶
The Agent Readiness Score
(concepts/agent-readiness-score) grades sitemap presence under
the Agent Discovery dimension — alongside robots.txt and the
Link: response header — as
one of three discovery primitives an agent can use to enumerate a
site's content.
Seen in¶
- sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready: canonical Agent-Discovery-dimension instance in the Agent Readiness Score.
- sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process: canonical wiki instance of rendering-strategy-neutrality. "Having an updated sitemap.xml significantly reduces, if not eliminates, the time-to-discovery differences between different rendering patterns." With a sitemap, Google doesn't need to link-graph-traverse to find URLs, and SSG / ISR / SSR / CSR all benefit equally from the direct URL list. CSR benefits most because its link-discovery is weakest. Also canonicalises <lastmod> as a re-crawl signal for large sites. Composes with concepts/rendering-strategy-crawl-efficiency-tradeoff.
- sources/2026-04-21-vercel-making-agent-friendly-pages-with-content-negotiation: the contrast instance. Vercel frames the flat XML sitemap as semantically thin for LLM agents and introduces the markdown sitemap as the hierarchical-titled alternative for agent discovery.
Related¶
- concepts/robots-txt — where sitemaps are typically advertised.
- concepts/llms-txt — curated, agent-centric alternative to the exhaustive sitemap.
- concepts/markdown-sitemap — the markdown/hierarchical shape for agent consumption.
- concepts/agent-readiness-score — where sitemap presence is graded.