Skip to content

PATTERN Cited by 1 source

Response status as content policy

Pattern

Use the HTTP response status-code surface as the protocol for per-client-class content policy. When an origin operator wants different policy for different client classes (AI training crawlers vs search engines vs humans vs paying licensees), the answer is not a bespoke signalling layer — it's pick the right status code per class, because that's "how the web communicates policy to crawlers" (Cloudflare, 2026-04-17).

From the 2026-04-17 Redirects for AI Training post:

"A 200 OK means content was served. A 403 Forbidden means access was blocked. A 402 Payment Required tells the client it needs to pay for access. Taken together, the distribution of status codes across AI crawler traffic reveals how the web is actually responding to crawlers at scale."

The status-code policy surface

Status Policy semantics Example
200 OK Content served as-is Default
301 Moved Permanently Here's the canonical version Redirects for AI Training — canonical-tag → 301 for AI Crawler category
402 Payment Required License fee required Pay Per Crawl — [concepts/http-402-payment-required
403 Forbidden Blocked; no negotiation WAF + bot-management hard-block
404 Not Found Gone; nothing here Standard HTTP semantics
451 Unavailable For Legal Reasons Blocked for legal reasons Rare but standardised

Each status carries well-defined HTTP semantics; there is no need for a parallel vocabulary or bespoke headers.

Canonical instance

Cloudflare's 2026-04-17 arc operationalises three status codes as first-class content-policy choices:

  • 301Redirects for AI Training — the 2026-04-17 launch feature. Routes AI training crawlers to the canonical URL. Declarative via <link rel="canonical"> (patterns/canonical-tag-as-crawler-redirect).
  • 402Pay Per Crawl — previously launched. Returns [[concepts/http-402-payment-required|402 Payment Required]] to AI crawlers until the crawler pays. Re-uses an HTTP status code reserved since HTTP/1.1 but essentially dormant in production.
  • 403 → WAF / bot-management hard-block. The "just reject it" policy, distinct from the categorised-redirect and monetised-access variants above.

All three are surfaced through AI Crawl Control and observable on Radar's new "Response Status Code Analysis" AI-Insights surface.

Measurement / observability

Radar AI Insights' 2026-04-17-launched Response Status Code Analysis chart is the measurement complement: which status codes does the web return to AI crawlers in aggregate? Representative numbers from the launch post:

  • Aggregate (top-5 AI crawlers): ~74 % 2xx, ~13.7 % 4xx, ~11.3 % 3xx, ~1.2 % 5xx.
  • GPTBot-specific: ~83 % 2xx, ~10 % 4xx, ~5.1 % 3xx, ~2.2 % 5xx.

Filterable by industry (via Data Explorer) and crawl purpose (Training / Search / Assistant). The "redirect-rate" slice of the 3xx bucket is a first-class measurement primitive for how widespread adoption of canonical-tag-based training-crawler redirection becomes over time.

Why this framing is load-bearing

  • Every HTTP intermediary already understands the status codes. No new parser, library, or crawler-side change is needed — the crawler already knows what 301, 402, 403, 404 mean. The policy signal rides on well-understood HTTP semantics.
  • Web-wide measurability. Status-code distributions can be counted at any vantage point (edge, origin, crawler). Cloudflare's Radar surface demonstrates the measurement leverage.
  • Polyvalent surface. One status code can mean different things in different policy lanes: 403 from a WAF is a hard-block; 403 from a paywall means "you haven't paid." The status code is the substrate; the operational meaning is carried by which ruleset emitted it.
  • Composes with existing advisory layers. robots.txt, Content Signals, noindex are all advisory — they declare intent. Status codes are enforcement — they execute intent. The pattern fills the gap between "here's what I want" and "here's what I will actually do."

Trade-offs

  • Status code semantics are coarse. "Why a 402?" or "which license applies?" requires metadata the status line doesn't carry. Payment / license detail typically rides in response body + headers on top of the status code.
  • Misclassification bites harder on status codes than on advisory signals. A bot misclassified as an AI crawler and served a 301 to canonical content is a user-visible regression; a bot misclassified and served a Content-Signal declaration it ignores is invisible.
  • Some policy answers don't have a status code match. "Ingest this for search indexing but not for model training" is expressible in Content Signals but not in any single status code — the categorical response primitive is per-request, not per-use-case. The status-code-as-policy pattern is about enforcement for the class, not usage-type licensing post-fetch.
  • Operator discretion without transparency. Every status code carries policy choice that's not visible to the client. Cloudflare does not propose a transparency-disclosure layer ("which URLs were 301'd for which crawler class when?") that would let researchers audit the policy surface across operators.
  • The measurement surface captures crawler experience, not policy intent. Radar shows the web returned ~11.3 % 3xx responses to AI crawlers in aggregate — but that bucket mixes operator-intentional redirects (Redirects for AI Training), operator-agnostic redirects (https upgrades, trailing-slash normalisation), and unintended redirects (misconfigured vhosts). Interpretation requires operator context.

Generalises beyond Cloudflare

  • Any origin / edge operator can implement the pattern — pick the right status code per class. The infrastructure exists in every reverse proxy and every application framework.
  • Measurement vantage points: CDN operators (Cloudflare Radar) are one; individual origins with their own telemetry are another; crawler operators themselves (OpenAI, Anthropic, Google, Meta) could publish their own received-status distributions for transparency.

Seen in

Last updated · 200 distilled / 1,178 read