
CONCEPT

Content Signals

Definition

Content Signals is a proposed extension to robots.txt that lets a site declare what AI systems may do with its content, along three orthogonal dimensions:

  • ai-train — may the content be used to train AI models?
  • ai-input — may it be used as AI input (inference, RAG, grounding, retrieval)?
  • search — may it appear in search results?

Each dimension takes the value yes or no, declared via a Content-Signal: directive within a user-agent group in robots.txt:

User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=yes

Maintained at contentsignals.org.
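The directive format above is simple enough to parse by hand. A minimal Python sketch, assuming the comma-separated key=value syntax shown; the parser is a hypothetical helper, not an official library, and it does not handle groups formed by consecutive User-agent lines:

```python
def parse_content_signals(robots_txt: str) -> dict:
    """Map each User-agent to its {signal: allowed} declarations."""
    signals = {}
    current_agent = None
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            current_agent = value
        elif field == "content-signal" and current_agent is not None:
            for pair in value.split(","):
                key, _, val = pair.strip().partition("=")
                signals.setdefault(current_agent, {})[key] = val.strip().lower() == "yes"
    return signals

example = """\
User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=yes
"""
print(parse_content_signals(example))
# → {'*': {'ai-train': False, 'search': True, 'ai-input': True}}
```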

Why three dimensions

Before Content Signals, robots.txt could only say "allow this crawler" or "deny this crawler" per path. That collapses three distinct business decisions into one:

  • Opt out of training, stay in search is a common publisher stance: you want Google or Bing to drive traffic, but not to see your content absorbed into an LLM.
  • Allow RAG at inference but not training lets search-engine-embedded AI answer questions citing you (read-only, attributed) without your content becoming part of the next model.
  • Allow training, not inference is rarer but expressible: research corpora that shouldn't serve as a live oracle.

Collapsing any of these into allow/deny forces publishers to pick the wrong compromise. Content Signals decomposes the decision.
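Each stance above maps to a distinct signal combination. A sketch of that decomposition, rendered back into directive syntax; the exact yes/no values chosen for dimensions a bullet leaves open (e.g. search for the RAG stance) are illustrative assumptions:

```python
# Three publisher stances encoded as signal combinations (assumed values).
STANCES = {
    "opt out of training, stay in search": {"ai-train": False, "ai-input": True, "search": True},
    "RAG at inference, no training":       {"ai-train": False, "ai-input": True, "search": False},
    "training, no inference":              {"ai-train": True, "ai-input": False, "search": False},
}

def to_directive(stance: dict) -> str:
    """Render a stance as a robots.txt Content-Signal line."""
    return "Content-Signal: " + ", ".join(
        f"{key}={'yes' if allowed else 'no'}" for key, allowed in stance.items()
    )

print(to_directive(STANCES["opt out of training, stay in search"]))
# → Content-Signal: ai-train=no, ai-input=yes, search=yes
```

A plain allow/deny line cannot express any row of this table except all-yes or all-no, which is the compromise the decomposition avoids.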

Adoption (2026-04)

4% of the top 200k domains have declared any Content-Signal values in their robots.txt, per Cloudflare Radar. The standard is new; Cloudflare notes adoption is gaining momentum.

Relationship to pay-per-crawl

Content Signals is declarative — it expresses publisher preference. Enforcement is a separate layer:

  • ai-train=no without enforcement is advisory; bots that ignore it aren't technically breaking any protocol.
  • pay-per-crawl is the enforcement layer that bills AI crawlers for training use, turning ai-train=no into "you can use it, but it'll cost you" or "not at any price."
  • Cloudflare-hosted sites can combine Content Signals as the publisher-preference declaration, the WAF as a hard block, pay-per-crawl for monetisable access, and the bot-management layer for classification.
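How these layers might compose on the origin side can be sketched as follows. The bot classes, the pricing parameter, and the response choices (including using HTTP 402 as the payment challenge) are illustrative assumptions, not Cloudflare APIs:

```python
def respond_to_crawler(bot_class: str, signal_ai_train: bool, price_usd) -> str:
    """Decide a response for one request, layering declaration and enforcement.

    bot_class comes from a hypothetical bot-management classification;
    signal_ai_train is the site's declared ai-train preference;
    price_usd is a per-crawl price if pay-per-crawl is configured, else None.
    """
    if bot_class == "blocked":
        return "403 Forbidden"  # WAF hard block, regardless of signals
    if bot_class == "ai-trainer" and not signal_ai_train:
        if price_usd is not None:
            # pay-per-crawl: "you can use it, but it'll cost you"
            return f"402 Payment Required (${price_usd}/crawl)"
        return "403 Forbidden"  # no price set: "not at any price"
    return "200 OK"  # declaration alone is advisory for everything else

print(respond_to_crawler("ai-trainer", False, 0.01))
# → 402 Payment Required ($0.01/crawl)
```

Without the enforcement branches, the function would return 200 for every non-blocked bot, which is exactly the advisory-only situation described in the first bullet.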
