CONCEPT Cited by 2 sources
robots.txt¶
Definition¶
`robots.txt` is a text file at the root of a site (https://example.com/robots.txt) that declares crawl rules for automated clients — which user-agents may access which paths. The Robots Exclusion Protocol was informally proposed in 1994 by Martijn Koster and standardised as RFC 9309 in 2022.
Two load-bearing purposes for an agent-era website (2026):
- Declare crawl rules — per-user-agent allow/deny on paths, including AI-specific user-agents (`GPTBot`, `CCBot`, `ClaudeBot`, etc.).
- Point at sitemaps — `Sitemap:` directives give crawlers a URL list of all discoverable pages without link-graph traversal.
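A minimal `robots.txt` illustrating both purposes; the paths and sitemap URL are illustrative, not prescribed by the spec:

```text
# Per-user-agent crawl rules: deny an AI crawler by its declared UA token,
# allow everyone else outside /private/
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/

# Sitemap pointer: a URL list without link-graph traversal
Sitemap: https://example.com/sitemap.xml
```

Per RFC 9309, a crawler uses the most specific matching `User-agent` group, so `GPTBot` here never falls through to the `*` rules.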
Current state (2026-04)¶
Cloudflare's Radar scan of 200 k top-visited domains (2026-04-17) reports:
- 78 % of sites have a `robots.txt` — nearly universal.
- The vast majority are written for classical search-engine crawlers, not AI agents. Presence alone does not imply agent-ready — the rules need to name AI user-agents and/or declare AI-use preferences.
Extensions that matter for agents¶
Content Signals¶
The Content Signals standard adds a `Content-Signal:` directive that declares three orthogonal AI preferences per user-agent: `ai-train`, `ai-input`, `search` (see concepts/content-signals).
Adoption: 4 % of the top 200 k (2026-04). New standard, fast uptake.
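A sketch of how the directive sits inside an ordinary `robots.txt` group — the comma-separated `key=yes|no` form is assumed here; the exact grammar is defined in the standard (see concepts/content-signals):

```text
User-agent: *
# Assumed syntax: allow search indexing and live AI input,
# but signal no consent for AI training
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```

Because the three signals are orthogonal, a site can stay fully searchable while still expressing `ai-train=no`.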
Pointer to the authentication directory¶
Friendly-bot self-authentication via
Web Bot Auth consumes a separate
/.well-known/http-message-signatures-directory endpoint; sites
can cross-link from robots.txt for discoverability.
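There is no standardised robots.txt directive for this cross-link; one low-tech option is a comment line (a sketch, not part of any spec — crawlers that support Web Bot Auth discover the directory via the well-known path anyway):

```text
# Signature directory for Web Bot Auth verification:
# https://example.com/.well-known/http-message-signatures-directory
User-agent: *
Allow: /
```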
Failure modes¶
- User-agent string is spoofable. `robots.txt` relies on the bot reporting its identity honestly. Web Bot Auth fixes the spoof problem at the authentication layer.
- Generalised `User-agent: *` rules miss AI semantics. A site that allows `*` inadvertently opts into AI training; explicit AI-crawler directives are needed to express `ai-train=no`.
- Out-of-sync between declared preferences and origin enforcement. `robots.txt` is advisory; origins that want enforcement pair it with WAF / bot-management rules (Cloudflare stack) or with pay-per-crawl for monetisation.
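The wildcard failure mode can be demonstrated with Python's stdlib parser — the user-agent tokens and URL below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# A permissive wildcard-only policy: no AI crawler is named explicitly.
wildcard_only = """\
User-agent: *
Allow: /
""".splitlines()

# The same policy plus an explicit AI-crawler group.
with_ai_rule = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

def may_fetch(lines, agent, url="https://example.com/article"):
    rp = RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch(agent, url)

# Under wildcard-only rules an AI crawler falls through to * and is allowed:
print(may_fetch(wildcard_only, "GPTBot"))    # True
# Naming the agent explicitly is what expresses the deny:
print(may_fetch(with_ai_rule, "GPTBot"))     # False
print(may_fetch(with_ai_rule, "Googlebot"))  # True
```

Nothing in the wildcard group says anything about training, so "allow `*`" silently allows every well-behaved AI crawler too — the deny only exists once the agent is named.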
Seen in¶
- sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready — canonical wiki instance of the Radar-measured 78 % adoption + explicit framing that presence does not imply agent-ready. First check in the Access Rules dimension of concepts/agent-readiness-score.
- sources/2025-08-04-cloudflare-perplexity-stealth-undeclared-crawlers — canonical wiki instance of the advisory-protocol failure mode: Cloudflare's controlled brand-new-domain experiment confirmed Perplexity AI's declared crawlers appeared to respect `robots.txt`, but when blocked the operator fell back to an undeclared stealth crawler that either didn't fetch `robots.txt` or ignored it. Counter-example: ChatGPT-User fetched `robots.txt`, honored the `Disallow`, and stopped crawling — full compliance.
Related¶
- concepts/content-signals — the `Content-Signal:` extension that makes `robots.txt` AI-literate.
- concepts/sitemap — the other document-discovery primitive that `robots.txt` typically points at.
- concepts/agent-readiness-score — where `robots.txt` presence + Content Signals are graded under Access Rules.
- concepts/robots-txt-compliance — the operator-side discipline the 2025-08-04 post sharpens (ChatGPT full compliance vs Perplexity stealth-crawler non-compliance).
- concepts/declared-crawler / concepts/stealth-crawler.
- systems/web-bot-auth — cryptographic answer to the user-agent-spoofing problem.