PATTERN Cited by 1 source
Content-use robots.txt extension
Extend the robots.txt standard (and Content Signals specification) with a use field that declares how much content a crawler may retain and reshare, moving beyond the binary crawl/don't-crawl model to express a spectrum of reproduction permissions.
Problem
Traditional robots.txt only expresses "you may crawl this path" or "you may not." It cannot distinguish a search engine that indexes and links back (desirable) from a training crawler that absorbs content permanently (undesirable for many site owners). Site owners want to allow discovery without allowing full reproduction.
Solution
Add a use parameter to the Content-Signal line in robots.txt, for example Content-Signal: search=yes,ai-train=no,use=reference.
Three levels:
- use=immediate — interact, store nothing
- use=reference — index, excerpt, link back (default for managed robots.txt)
- use=full — summarize and reproduce
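The three levels form an ordering from least to most permissive, which is what makes them easy to compare in code. The following is a minimal sketch of reading a Content-Signal line and extracting its use level. The names `UseLevel`, `parse_content_signal`, and `use_level` are hypothetical, and the comma-separated key=value grammar is assumed from the example line cited under Known Uses rather than taken from a formal grammar.

```python
from __future__ import annotations

from enum import IntEnum


class UseLevel(IntEnum):
    """Ordered so that a higher value permits more retention and resharing."""

    IMMEDIATE = 0  # interact, store nothing
    REFERENCE = 1  # index, excerpt, link back
    FULL = 2       # summarize and reproduce


def parse_content_signal(line: str) -> dict[str, str]:
    """Parse one 'Content-Signal: k=v,k=v' line into a dict (assumed grammar)."""
    name, sep, value = line.partition(":")
    if not sep or name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")
    signals: dict[str, str] = {}
    for pair in value.split(","):
        key, eq, val = pair.partition("=")
        if eq:  # skip fragments that are not key=value pairs
            signals[key.strip().lower()] = val.strip().lower()
    return signals


def use_level(signals: dict[str, str]) -> UseLevel | None:
    """Return the declared use level, or None if it is absent or unrecognized."""
    raw = signals.get("use")
    if raw is None:
        return None
    try:
        return UseLevel[raw.upper()]
    except KeyError:
        return None
```

Modelling the levels as an `IntEnum` is a design choice of this sketch: it lets a caller write comparisons such as `level <= UseLevel.REFERENCE` instead of matching strings.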
Combine with bot taxonomy for composable rules: "allow Search bots up to reference but block any bot requesting full."
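One way to picture such a composable rule is as a per-category ceiling combined with a site-wide ceiling. The sketch below is illustrative only: the category names, the `CATEGORY_CEILING` and `SITE_CEILING` tables, the default for unlisted categories, and the idea that a bot's requested level is available to the check are all assumptions, not part of the source.

```python
# Rank of each use level: a higher rank permits more retention and resharing.
USE_RANK = {"immediate": 0, "reference": 1, "full": 2}

# Hypothetical policy tables; category names and ceilings are illustrative.
CATEGORY_CEILING = {"search": "reference"}  # Search bots: up to reference
SITE_CEILING = "reference"                  # no bot may be granted "full"
DEFAULT_CEILING = "immediate"               # assumed default for unlisted categories


def effective_ceiling(bot_category: str) -> str:
    """Return the stricter of the category ceiling and the site-wide ceiling."""
    category_limit = CATEGORY_CEILING.get(bot_category, DEFAULT_CEILING)
    return min(category_limit, SITE_CEILING, key=USE_RANK.__getitem__)


def is_allowed(bot_category: str, requested_use: str) -> bool:
    """Allow a request only if its use level does not exceed the ceiling."""
    if requested_use not in USE_RANK:
        return False  # unrecognized levels are refused in this sketch
    return USE_RANK[requested_use] <= USE_RANK[effective_ceiling(bot_category)]
```

With these tables, `is_allowed("search", "reference")` is true while `is_allowed("search", "full")` is false, which mirrors the quoted rule.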
Consequences
- Graduated control: site owners express nuance beyond binary allow/block.
- Machine-readable preference: automated systems can parse and respect levels programmatically.
- Advisory, not enforcement: like Disallow, the signal is a preference, but platforms like Cloudflare add enforcement by revoking Verified status for violators.
- Ecosystem coordination required: value depends on bot operators choosing to respect the signals.
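Because the signal is advisory, its effect depends on what a cooperating crawler does with it. The sketch below shows one plausible way a bot operator might map a declared level to a retention decision. The `StoredRecord` type, the `retain` function, and the 200-character excerpt limit are hypothetical; the source does not define how long an excerpt may be.

```python
from dataclasses import dataclass
from typing import Optional

EXCERPT_CHARS = 200  # assumption: the source does not specify an excerpt length


@dataclass
class StoredRecord:
    url: str   # kept so the stored text can link back to its origin
    text: str  # excerpt or full body, depending on the declared level


def retain(url: str, body: str, use: str) -> Optional[StoredRecord]:
    """Decide what a cooperating crawler keeps for a page at a given use level."""
    if use == "full":
        return StoredRecord(url, body)                   # may reproduce in full
    if use == "reference":
        return StoredRecord(url, body[:EXCERPT_CHARS])   # excerpt plus link back
    # "immediate" and any unrecognized value: store nothing, the strictest reading
    return None
```

Treating an unrecognized value like `immediate` is a conservative choice made for this sketch, not a rule stated by the source.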
Known Uses
- Cloudflare Managed Robots.txt automatically prepends Content-Signal: search=yes,ai-train=no,use=reference for customers who enable managed content signals (Source: sources/2026-07-01-cloudflare-ai-traffic-options)