CONCEPT Cited by 2 sources
robots.txt¶
Definition¶
`robots.txt` is a text file at the root of a site (https://example.com/robots.txt) that declares crawl rules for automated clients — which user-agents may access which paths. The Robots Exclusion Protocol was informally proposed in 1994 by Martijn Koster and standardised as RFC 9309 in 2022.
Two load-bearing purposes for an agent-era website (2026):
- Declare crawl rules — per-user-agent allow/deny on paths, including AI-specific user-agents (`GPTBot`, `CCBot`, `ClaudeBot`, etc.).
- Point at sitemaps — `Sitemap:` directives give crawlers a URL list of all discoverable pages without link-graph traversal.
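A minimal `robots.txt` illustrating both purposes; the paths and sitemap URL are illustrative, not prescribed by the spec:

```text
# Per-user-agent crawl rules: deny an AI crawler by its declared UA token,
# allow everyone else outside /private/
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/

# Sitemap pointer: a URL list without link-graph traversal
Sitemap: https://example.com/sitemap.xml
```

Per RFC 9309, a crawler uses the most specific matching `User-agent` group, so `GPTBot` here never falls through to the `*` rules.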
Current state (2026-04)¶
Cloudflare's Radar scan of 200 k top-visited domains (2026-04-17) reports:
- 78 % of sites have a `robots.txt` — nearly universal.
- The vast majority are written for classical search-engine crawlers, not AI agents. Presence alone does not imply agent-ready — the rules need to name AI user-agents and/or declare AI-use preferences.
Extensions that matter for agents¶
Content Signals¶
The Content Signals standard adds a `Content-Signal:` directive that declares three orthogonal AI preferences per user-agent: `ai-train`, `ai-input`, `search` (see concepts/content-signals).
Adoption: 4 % of the top 200 k (2026-04). New standard, fast uptake.
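A sketch of how the directive sits inside an ordinary `robots.txt` group — the comma-separated `key=yes|no` form is assumed here; the exact grammar is defined in the standard (see concepts/content-signals):

```text
User-agent: *
# Assumed syntax: allow search indexing and live AI input,
# but signal no consent for AI training
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```

Because the three signals are orthogonal, a site can stay fully searchable while still expressing `ai-train=no`.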
Pointer to the authentication directory¶
Friendly-bot self-authentication via
Web Bot Auth consumes a separate
/.well-known/http-message-signatures-directory endpoint; sites
can cross-link from robots.txt for discoverability.
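There is no standardised robots.txt directive for this cross-link; one low-tech option is a comment line (a sketch, not part of any spec — crawlers that support Web Bot Auth discover the directory via the well-known path anyway):

```text
# Signature directory for Web Bot Auth verification:
# https://example.com/.well-known/http-message-signatures-directory
User-agent: *
Allow: /
```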
Failure modes¶
- User-agent string is spoofable. `robots.txt` relies on the bot reporting its identity honestly. Web Bot Auth fixes the spoof problem at the authentication layer.
- Generalised `User-agent: *` rules miss AI semantics. A site that allows `*` inadvertently opts into AI training; explicit AI-crawler directives are needed to express `ai-train=no`.
- Out-of-sync between declared preferences and origin enforcement. `robots.txt` is advisory; origins that want enforcement pair it with WAF / bot-management rules (Cloudflare stack) or with pay-per-crawl for monetisation.
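The wildcard failure mode can be demonstrated with Python's stdlib parser — the user-agent tokens and URL below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# A permissive wildcard-only policy: no AI crawler is named explicitly.
wildcard_only = """\
User-agent: *
Allow: /
""".splitlines()

# The same policy plus an explicit AI-crawler group.
with_ai_rule = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

def may_fetch(lines, agent, url="https://example.com/article"):
    rp = RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch(agent, url)

# Under wildcard-only rules an AI crawler falls through to * and is allowed:
print(may_fetch(wildcard_only, "GPTBot"))    # True
# Naming the agent explicitly is what expresses the deny:
print(may_fetch(with_ai_rule, "GPTBot"))     # False
print(may_fetch(with_ai_rule, "Googlebot"))  # True
```

Nothing in the wildcard group says anything about training, so "allow `*`" silently allows every well-behaved AI crawler too — the deny only exists once the agent is named.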
Seen in¶
- sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready — canonical wiki instance of the Radar-measured 78 % adoption + explicit framing that presence does not imply agent-ready. First check in the Access Rules dimension of concepts/agent-readiness-score.
- sources/2025-08-04-cloudflare-perplexity-stealth-undeclared-crawlers — canonical wiki instance of the advisory-protocol failure mode: Cloudflare's controlled brand-new-domain experiment confirmed Perplexity AI's declared crawlers appeared to respect `robots.txt`, but when blocked the operator fell back to an undeclared stealth crawler that either didn't fetch `robots.txt` or ignored it. Counter-example: ChatGPT-User fetched `robots.txt`, honored the `Disallow`, and stopped crawling — full compliance.
Related¶
- concepts/content-signals — the `Content-Signal:` extension that makes `robots.txt` AI-literate.
- concepts/sitemap — the other document-discovery primitive that `robots.txt` typically points at.
- concepts/agent-readiness-score — where `robots.txt` presence + Content Signals are graded under Access Rules.
- concepts/robots-txt-compliance — the operator-side discipline the 2025-08-04 post sharpens (ChatGPT full compliance vs Perplexity stealth-crawler non-compliance).
- concepts/declared-crawler / concepts/stealth-crawler.
- systems/web-bot-auth — cryptographic answer to the user-agent-spoofing problem.