
PATTERN Cited by 1 source

Content-use robots.txt extension

Extend the robots.txt standard (and Content Signals specification) with a use field that declares how much content a crawler may retain and reshare, moving beyond the binary crawl/don't-crawl model to express a spectrum of reproduction permissions.

Problem

Traditional robots.txt expresses only "you may crawl this path" or "you may not." It cannot distinguish a search engine that indexes content and links back (desirable) from a training crawler that absorbs content permanently (undesirable for many site owners). Site owners want to allow discovery without allowing full reproduction.

Solution

Add a use parameter to the Content Signals specification in robots.txt:

User-agent: *
Content-Signal: search=yes,ai-train=no,use=reference
Allow: /

Three levels:

  • use=immediate: interact, store nothing
  • use=reference: index, excerpt, link back (default for managed robots.txt)
  • use=full: summarize and reproduce
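A crawler that wants to honor the signal needs to read the Content-Signal line and treat the three levels as an ordered scale. The following is a minimal Python sketch of that idea, not a reference implementation: the names `Use`, `parse_content_signal`, and `use_level` are hypothetical, the parsing rules (comma-separated `key=value` pairs, case-insensitive) are inferred from the example above, and falling back to `reference` when no `use` key is present is an assumption made for illustration.

```python
from enum import IntEnum


class Use(IntEnum):
    """Use levels from the pattern, ordered from least to most permissive."""
    IMMEDIATE = 1  # interact, store nothing
    REFERENCE = 2  # index, excerpt, link back
    FULL = 3       # summarize and reproduce


def parse_content_signal(line: str) -> dict[str, str]:
    """Parse a 'Content-Signal: k=v,k=v' line into a dict of lowercase keys and values."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")
    signals: dict[str, str] = {}
    for pair in value.split(","):
        key, sep, val = pair.partition("=")
        if sep:  # skip malformed entries that have no '='
            signals[key.strip().lower()] = val.strip().lower()
    return signals


def use_level(signals: dict[str, str], default: Use = Use.REFERENCE) -> Use:
    """Return the declared use level; the fallback default is an assumption."""
    raw = signals.get("use")
    if raw is None:
        return default
    return Use[raw.upper()]  # raises KeyError for an unknown level
```

Modeling the levels as an `IntEnum` lets a crawler compare them directly, for example `Use.IMMEDIATE < Use.REFERENCE < Use.FULL`, which is what "up to reference" style rules rely on.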

Combine with bot taxonomy for composable rules: "allow Search bots up to reference but block any bot requesting full."
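The quoted rule can be sketched as a per-category ceiling on the use level. This is an illustrative Python sketch under stated assumptions: the category name `"search"`, the `MAX_USE_BY_CATEGORY` table, and the `is_permitted` helper are all hypothetical, and denying categories that have no entry is a design choice made here, not something the pattern specifies.

```python
from enum import IntEnum


class Use(IntEnum):
    """Use levels ordered from least to most permissive."""
    IMMEDIATE = 1
    REFERENCE = 2
    FULL = 3


# Hypothetical policy table: the highest use level granted to each bot category.
# No category is granted FULL, so any bot requesting full is blocked.
MAX_USE_BY_CATEGORY = {
    "search": Use.REFERENCE,
}


def is_permitted(category: str, requested: Use) -> bool:
    """Allow a request only if it does not exceed the category's ceiling."""
    ceiling = MAX_USE_BY_CATEGORY.get(category)
    if ceiling is None:
        return False  # assumption: categories without an entry are denied
    return requested <= ceiling
```

With this table, `is_permitted("search", Use.REFERENCE)` is allowed while `is_permitted("search", Use.FULL)` is blocked. Adding a category is a one-line change to the table, which is the composability the rule describes.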

Consequences

  • Graduated control: site owners express nuance beyond binary allow/block.
  • Machine-readable preference: automated systems can parse and respect levels programmatically.
  • Advisory, not enforcement: like Disallow, the signal is only a preference, although platforms like Cloudflare add enforcement by revoking Verified status for violators.
  • Ecosystem coordination required: value depends on bot operators choosing to respect the signals.

Known Uses

Seen In
