Skip to content

CONCEPT Cited by 2 sources

Agent-training-crawler redirect

Definition

An agent-training-crawler redirect is the practice of serving AI training crawlers a different URL than humans for the same logical content, when the human-visible page would poison future LLM answers if ingested verbatim.

Canonical failure mode: deprecated documentation. A human reading "Wrangler v1 commands" sees the large "this version is deprecated, use v2" banner at the top of the page. An LLM training crawler typically ingests the prose, not the cross-page banner placement semantics — so the training set ends up teaching the model that "Wrangler v1 is the answer" for any Wrangler question.

Canonical instance

Redirects for AI Training (Cloudflare, late 2025 / early 2026): Cloudflare identifies known AI-training-crawler user-agents + IP ranges and serves them a different URL — typically the current version of the content — while humans and non-training crawlers continue to see the original deprecated page.

Cloudflare's framing in the 2026-04-17 post:

"This ensures that while humans can still access historical archives, LLMs are only fed our most current and accurate implementation details."

Mechanics

Two implementation shapes have surfaced in the wiki so far:

Imperative — classify-then-rewrite

  • Classify the request. Is this an AI training crawler? (Web Bot Auth for self-declaring bots; WAF rules + IP reputation for the rest.)
  • Rewrite / redirect. If yes, serve the target URL (current-version doc). If no, serve the originally-requested URL.
  • Don't cross class boundaries. Search crawlers, agents doing live lookups, RSS feeds, and humans all continue to see the original — only the training-crawler class is redirected.

Declarative — canonical-tag-driven

Introduced by Cloudflare's 2026-04-17 Redirects for AI Training launch. The origin's existing <link rel="canonical"> tag (present on 65-69 % of web pages per the 2025 Web Almanac, emitted by default by most CMSes) becomes the declarative source.

  • Verify the class. On Cloudflare, cf.verified_bot_category == "AI Crawler". The verified-bot category distinguishes AI Crawler (training — GPTBot, ClaudeBot, Bytespider) from AI Assistant and AI Search, which are not redirected.
  • Read the origin HTML. Locate <link rel="canonical" href="...">.
  • Apply exclusion filters. Skip self-referencing canonicals (infinite-loop guard) and cross-origin canonicals (scope discipline).
  • Return HTTP 301 Moved Permanently. Location: <canonical>; done before the body is served.

Declarative shape wins on maintenance: new deprecated paths get the behaviour automatically as soon as their canonical tag points at the replacement. See patterns/canonical-tag-as-crawler-redirect.

Why it targets training-time, not inference-time

The alternative to redirecting training crawlers is post-training inference-time correction — either (a) scraping training data out of existing models (intractable: retraining is prohibitively expensive) or (b) handling it at agent prompt level ("always prefer v2") which doesn't scale across millions of agents with no common prompt surface.

Redirecting training-time crawlers fixes the training set, which cascades into future LLM answers. Cheaper, better, more durable — but requires correct crawler classification and depends on the crawler respecting the redirect (modern HTTP client libraries typically do).

Tension with transparency

Two honest concerns with the pattern:

  1. Different truth for different audiences. The site now officially serves different content based on client identity. Reasonable in the deprecated-docs case; the same mechanism used maliciously could hide unfavorable content from training-crawler classes while keeping it for the public archive (or vice versa).
  2. Unilateral origin decision. There's no standard negotiation or disclosure; origins decide what crawlers see. Aligned with the broader Content Signals / Web Bot Auth / pay-per-crawl stack that gives origins more control, but the audit trail of "what did training crawlers see when" is absent.

Empirical motivation (Cloudflare 2026-04-17)

Cloudflare's 2026-04-17 Redirects for AI Training launch post supplies the empirical case for why advisory signals are insufficient and redirects are necessary:

  • Legacy Wrangler v1 docs on developers.cloudflare.com carried the full advisory stack — deprecation banner, noindex meta tag (concepts/noindex-meta-tag), canonical tags — and AI Crawl Control telemetry showed the AI Crawler category visited 4.8 M times in 30 days, "consumed deprecated content at the same rate as current content. The advisory signals made no measurable difference."
  • Ground-truth inference failure: in April 2026 a leading AI assistant answered "How do I write KV values using the Wrangler CLI?" with the deprecated wrangler kv:key put syntax (deprecated in Wrangler 3.60.0; current syntax is wrangler kv key put).
  • After enabling Redirects for AI Training for 7 days, 100 % of AI-training-crawler requests to pages with non-self-referencing canonical tags were redirected.

Seen in

Last updated · 200 distilled / 1,178 read