
CONCEPT

Sitemap

Definition

A sitemap is an XML file listing every URL on a site, plus optional per-URL metadata (lastmod, changefreq, priority). The format is defined by the Sitemaps protocol (sitemaps.org, 2005). A site typically advertises its sitemap location in robots.txt via a Sitemap: directive.
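A minimal sitemap under the sitemaps.org 0.9 schema looks like this (the URL and metadata values are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/intro</loc>
    <lastmod>2025-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

The robots.txt pointer is a single line, e.g. "Sitemap: https://example.com/sitemap.xml"; it is case-insensitive and may appear anywhere in the file.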

Why it exists

Before sitemaps, crawlers had to traverse the link graph — follow every <a href> from the home page, discover pages, then re-traverse on each crawl cycle. Sitemaps let the site hand the crawler a URL list directly, with freshness metadata that short-circuits re-crawls of unchanged pages.
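Discovery itself starts from robots.txt. A minimal sketch of extracting the Sitemap: directives from a robots.txt body (the URLs here are placeholders):

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Extract Sitemap: directive values from a robots.txt body.

    The directive is case-insensitive and sits outside any
    User-agent group, so we scan every line.
    """
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


robots = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap-news.xml
"""
print(sitemap_urls(robots))
# → ['https://example.com/sitemap.xml', 'https://example.com/sitemap-news.xml']
```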

Role for AI agents

The sitemap is the agent's map of everything the site has. Unlike llms.txt (which is curated and compact), sitemaps are exhaustive and verbose. An agent typically consults the sitemap via the pointer in robots.txt, picks candidate URLs by path or title, and fetches those with markdown content negotiation where supported.
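The "pick candidate URLs by path" step can be sketched with the standard library alone; the sitemap content below is a made-up fixture, assuming the agent cares about a /docs/ section:

```python
import xml.etree.ElementTree as ET

# The sitemaps.org schema is namespaced; ElementTree needs the map.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def candidate_urls(sitemap_xml: str, path_fragment: str) -> list[str]:
    """Parse a sitemap and keep only URLs containing a path of interest."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS)]
    return [u for u in locs if path_fragment in u]


sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/quickstart</loc></url>
  <url><loc>https://example.com/blog/hello</loc></url>
  <url><loc>https://example.com/docs/api</loc></url>
</urlset>"""

print(candidate_urls(sitemap, "/docs/"))
# → ['https://example.com/docs/quickstart', 'https://example.com/docs/api']
```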

Limitations for agents

  • No semantic content — just URLs + metadata; the agent still has to fetch each page to understand what's on it.
  • Directory-listing pages pollute sitemaps — they're structurally valid URLs with zero semantic value. Cloudflare's 2026-04-17 dogfood explicitly removes ~450 directory-listing pages from its llms.txt for exactly this reason, while leaving them in the sitemap.
  • Scale mismatch — the protocol caps each file at 50,000 URLs and 50 MB, but index files chain arbitrarily many files together, so very large sites produce sitemaps far too big to fit in an agent's context window; they're designed for machine-scale parsing, not for a single LLM read.
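Given that scale mismatch, one mitigation is to pre-filter by the lastmod metadata before spending any fetches. A sketch, assuming the common YYYY-MM-DD date form:

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fresh_urls(sitemap_xml: str, since: date) -> list[str]:
    """Keep only URLs whose <lastmod> is on or after `since`.

    Entries without a lastmod are kept: their freshness is unknown.
    Truncating to 10 chars also accepts full datetime lastmod values.
    """
    out = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None or date.fromisoformat(lastmod[:10]) >= since:
            out.append(loc)
    return out


big_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2025-06-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2023-01-01</lastmod></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

print(fresh_urls(big_sitemap, date(2025, 1, 1)))
# → ['https://example.com/a', 'https://example.com/c']
```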

Canonical agent-era role

The Agent Readiness Score (concepts/agent-readiness-score) grades sitemap presence under the Agent Discovery dimension — alongside robots.txt and the Link: response header — as one of three discovery primitives an agent can use to enumerate a site's content.
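A hypothetical illustration of checking those three discovery primitives — this is not the actual Agent Readiness Score rubric, and it tests only presence, not quality; the Link header value below is a made-up placeholder:

```python
def discovery_signals(robots_txt: str, headers: dict[str, str]) -> dict[str, bool]:
    """Check the three discovery primitives: robots.txt,
    a Sitemap: pointer inside it, and a Link response header."""
    return {
        "robots_txt": bool(robots_txt.strip()),
        "sitemap": "sitemap:" in robots_txt.lower(),
        "link_header": "link" in {k.lower() for k in headers},
    }


print(discovery_signals(
    "User-agent: *\nSitemap: https://example.com/sitemap.xml",
    {"Link": '<https://example.com/llms.txt>; rel="alternate"'},
))
# → {'robots_txt': True, 'sitemap': True, 'link_header': True}
```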
