CONCEPT Cited by 2 sources
Agent-training-crawler redirect¶
Definition¶
An agent-training-crawler redirect is the practice of serving AI training crawlers a different URL than humans for the same logical content, when the human-visible page would poison future LLM answers if ingested verbatim.
Canonical failure mode: deprecated documentation. A human reading "Wrangler v1 commands" sees the large "this version is deprecated, use v2" banner at the top of the page. An LLM training crawler typically ingests the prose, not the cross-page banner placement semantics — so the training set ends up teaching the model that "Wrangler v1 is the answer" for any Wrangler question.
Canonical instance¶
Redirects for AI Training (Cloudflare, late 2025 / early 2026): Cloudflare identifies known AI-training-crawler user-agents + IP ranges and serves them a different URL — typically the current version of the content — while humans and non-training crawlers continue to see the original deprecated page.
Cloudflare's framing in the 2026-04-17 post:
"This ensures that while humans can still access historical archives, LLMs are only fed our most current and accurate implementation details."
Mechanics¶
Two implementation shapes have surfaced in the wiki so far:
Imperative — classify-then-rewrite¶
- Classify the request. Is this an AI training crawler? (Web Bot Auth for self-declaring bots; WAF rules + IP reputation for the rest.)
- Rewrite / redirect. If yes, serve the target URL (current-version doc). If no, serve the originally-requested URL.
- Don't cross class boundaries. Search crawlers, agents doing live lookups, RSS feeds, and humans all continue to see the original — only the training-crawler class is redirected.
Declarative — canonical-tag-driven¶
Introduced by Cloudflare's 2026-04-17 Redirects for AI Training
launch. The origin's existing
<link rel="canonical"> tag (present
on 65-69 % of web pages per the 2025 Web Almanac, emitted
by default by most CMSes) becomes the declarative source.
- Verify the class. On Cloudflare,
cf.verified_bot_category == "AI Crawler". The verified-bot category distinguishes AI Crawler (training — GPTBot, ClaudeBot, Bytespider) from AI Assistant and AI Search, which are not redirected. - Read the origin HTML. Locate
<link rel="canonical" href="...">. - Apply exclusion filters. Skip self-referencing canonicals (infinite-loop guard) and cross-origin canonicals (scope discipline).
- Return
HTTP 301 Moved Permanently.Location: <canonical>; done before the body is served.
Declarative shape wins on maintenance: new deprecated paths get the behaviour automatically as soon as their canonical tag points at the replacement. See patterns/canonical-tag-as-crawler-redirect.
Why it targets training-time, not inference-time¶
The alternative to redirecting training crawlers is post-training inference-time correction — either (a) scraping training data out of existing models (intractable: retraining is prohibitively expensive) or (b) handling it at agent prompt level ("always prefer v2") which doesn't scale across millions of agents with no common prompt surface.
Redirecting training-time crawlers fixes the training set, which cascades into future LLM answers. Cheaper, better, more durable — but requires correct crawler classification and depends on the crawler respecting the redirect (modern HTTP client libraries typically do).
Tension with transparency¶
Two honest concerns with the pattern:
- Different truth for different audiences. The site now officially serves different content based on client identity. Reasonable in the deprecated-docs case; the same mechanism used maliciously could hide unfavorable content from training-crawler classes while keeping it for the public archive (or vice versa).
- Unilateral origin decision. There's no standard negotiation or disclosure; origins decide what crawlers see. Aligned with the broader Content Signals / Web Bot Auth / pay-per-crawl stack that gives origins more control, but the audit trail of "what did training crawlers see when" is absent.
Empirical motivation (Cloudflare 2026-04-17)¶
Cloudflare's 2026-04-17 Redirects for AI Training launch post supplies the empirical case for why advisory signals are insufficient and redirects are necessary:
- Legacy Wrangler v1 docs on developers.cloudflare.com carried
the full advisory stack — deprecation banner,
noindexmeta tag (concepts/noindex-meta-tag), canonical tags — and AI Crawl Control telemetry showed the AI Crawler category visited 4.8 M times in 30 days, "consumed deprecated content at the same rate as current content. The advisory signals made no measurable difference." - Ground-truth inference failure: in April 2026 a leading AI
assistant answered "How do I write KV values using the
Wrangler CLI?" with the deprecated
wrangler kv:key putsyntax (deprecated in Wrangler 3.60.0; current syntax iswrangler kv key put). - After enabling Redirects for AI Training for 7 days, 100 % of AI-training-crawler requests to pages with non-self-referencing canonical tags were redirected.
Seen in¶
- sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content
— canonical wiki primary-source. 2026-04-17 dedicated
launch post for Cloudflare's Redirects for AI Training;
supplies the declarative-via-canonical-tag mechanism
variant, the 4.8 M-visits-per-30-days telemetry,
100 % 7-day redirect-rate dogfood, and the
cf.verified_bot_category→ AI Crawler vs AI Assistant vs AI Search distinction. - sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready — companion post that first introduced the concept in the wiki. Cloudflare's developer docs use Redirects for AI Training on deprecated content (Wrangler v1, Workers Sites).
Related¶
- systems/redirects-for-ai-training — Cloudflare feature.
- systems/ai-crawl-control — parent product.
- patterns/agent-training-crawler-redirect — parent pattern.
- patterns/canonical-tag-as-crawler-redirect — declarative variant.
- concepts/canonical-tag — HTML primitive binding.
- concepts/noindex-meta-tag — adjacent advisory primitive that doesn't stop training-crawler ingest.
- concepts/agent-readiness-score — complementary technique; the agent-training redirect lives alongside the scored checks.
- systems/cloudflare-developer-documentation — reference implementation.