
CONCEPT

robots.txt

Definition

robots.txt is a text file at the root of a site (https://example.com/robots.txt) that declares crawl rules for automated clients — which user-agents may access which paths. The Robots Exclusion Protocol was informally proposed in 1994 by Martijn Koster and standardised as RFC 9309 in 2022.

Two load-bearing purposes for an agent-era website (2026):

  1. Declare crawl rules — per-user-agent allow/deny on paths, including AI-specific user-agents (GPTBot, CCBot, ClaudeBot, etc.).
  2. Point at sitemaps — Sitemap: directives give crawlers a URL list of all discoverable pages without link-graph traversal.
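Both purposes can be served by one minimal file. The user-agent names below are real crawler identifiers; the domain and sitemap URL are placeholders:

```
# Search crawlers: full access
User-agent: *
Disallow:

# AI training crawlers: blocked site-wide
User-agent: GPTBot
User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

Per RFC 9309, consecutive User-agent lines share the rule group that follows them, so GPTBot and CCBot are both covered by the single Disallow: /.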

Current state (2026-04)

Cloudflare's Radar scan of 200 k top-visited domains (2026-04-17) reports:

  • 78 % of sites serve a robots.txt: widespread, though not universal.
  • The vast majority are written for classical search-engine crawlers, not AI agents. Presence alone does not imply agent-readiness: the rules need to name AI user-agents and/or declare AI-use preferences.
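The second bullet can be checked programmatically with a simple scan for explicitly named AI user-agents. The agent list and the `names_ai_agents` helper are illustrative assumptions, not part of any standard:

```python
# Sketch: does this robots.txt name any well-known AI crawlers?
# The agent set is illustrative; extend it as new crawlers appear.
AI_AGENTS = {"GPTBot", "CCBot", "ClaudeBot", "Google-Extended"}

def names_ai_agents(robots_txt: str) -> set:
    """Return the AI user-agents explicitly addressed by a robots.txt."""
    named = set()
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "user-agent":
            agent = value.split("#", 1)[0].strip()  # drop trailing comments
            if agent in AI_AGENTS:
                named.add(agent)
    return named
```

A file that returns an empty set may still be perfectly valid for search crawlers; it simply says nothing AI-specific.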

Extensions that matter for agents

Content Signals

The Content Signals standard adds a Content-Signal: directive that declares three orthogonal AI preferences per user-agent: ai-train, ai-input, search (see concepts/content-signals).

User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=yes
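A consumer of the directive above might normalise its value into booleans. This `parse_content_signal` helper is a hedged sketch, not an API defined by the standard:

```python
def parse_content_signal(value: str) -> dict:
    """Parse 'ai-train=no, search=yes, ai-input=yes' into {signal: bool}."""
    signals = {}
    for part in value.split(","):
        key, _, setting = part.strip().partition("=")
        if key:  # skip empty fragments from trailing commas
            signals[key] = setting.strip().lower() == "yes"
    return signals
```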

Adoption: 4 % of the top 200 k domains (2026-04). The standard is new; uptake is fast.

Pointer to the authentication directory

Friendly-bot self-authentication via Web Bot Auth consumes a separate /.well-known/http-message-signatures-directory endpoint; sites can cross-link from robots.txt for discoverability.
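There is no standard robots.txt directive for this cross-link, so a comment line is one plausible convention; the lines below are purely illustrative:

```
# Web Bot Auth key directory for this origin:
# https://example.com/.well-known/http-message-signatures-directory

User-agent: *
Disallow:
```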

Failure modes

  • User-agent string is spoofable. robots.txt relies on the bot reporting its identity honestly. Web Bot Auth fixes the spoof problem at the authentication layer.
  • Generalised User-agent: * rules miss AI semantics. A site whose only rule block is a permissive User-agent: * effectively opts into AI training; explicit AI-crawler directives are needed to express ai-train=no.
  • Declared preferences drift out of sync with origin enforcement. robots.txt is advisory; origins that want enforcement pair it with WAF / bot-management rules (e.g. the Cloudflare stack) or with pay-per-crawl for monetisation.
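The wildcard failure mode above can be reproduced with the Python standard library: a permissive `User-agent: *` file answers "allowed" for GPTBot, because robots.txt matching knows nothing about AI semantics. The URLs are placeholders:

```python
import urllib.robotparser

# A typical search-era robots.txt: allow everyone, everywhere.
wildcard_only = "User-agent: *\nDisallow:\n"

rp = urllib.robotparser.RobotFileParser()
rp.parse(wildcard_only.splitlines())
# The AI training crawler is allowed, even though the site owner
# may never have considered AI use at all.
print(rp.can_fetch("GPTBot", "https://example.com/article"))

# Only an explicit per-agent rule changes the answer.
explicit = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow:\n"
rp2 = urllib.robotparser.RobotFileParser()
rp2.parse(explicit.splitlines())
print(rp2.can_fetch("GPTBot", "https://example.com/article"))
```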
