PATTERN Cited by 2 sources
Agent training-crawler redirect¶
Pattern¶
Classify AI-training crawlers at edge + redirect them to a different URL than the one humans see, when the human-visible page would poison future LLM answers if ingested verbatim.
Fixes future answers at training time instead of inference time — a higher-leverage intervention because training-data changes cascade into billions of downstream queries once the next model ships.
Canonical instance¶
Cloudflare's Redirects for AI Training feature applied to deprecated developer-documentation pages:
- Deprecated page (e.g. Wrangler v1 commands) has a big banner at the top visible to humans: "this version is deprecated, use Wrangler v2".
- LLM training crawlers typically ingest the prose and don't re-encode the banner's contextual meaning. Result: the training set teaches the model that "Wrangler v1 is the answer" when asked about Wrangler.
- Fix: when a known AI training crawler requests
/wrangler-v1/..., Cloudflare returns a redirect (typically 301 or an equivalent) to the current/wrangler-v2/...page. The human-facing archive URL keeps its content.
Cloudflare quote (2026-04-17):
"This ensures that while humans can still access historical archives, LLMs are only fed our most current and accurate implementation details."
Mechanics¶
Two shapes have surfaced in the wiki so far — an imperative classify-then-rewrite mechanism and a declarative canonical-tag-driven mechanism.
Imperative — classify-then-rewrite¶
- Classify the request. Known user-agent strings + IP reputation data + (cooperatively) Web Bot Auth signatures where the bot opts in.
- Route by class.
- Human / regular crawler / inference-time agent → original URL, unchanged content.
- Training-time crawler → redirect to current doc, or serve a canonical-content version inline.
- Monitor. Log the redirect rate; verify the training crawlers follow the redirect.
Declarative — canonical-tag-driven (Cloudflare 2026-04-17)¶
Canonicalised by Cloudflare's Redirects for AI Training launch:
- Verify-bot-category classifier. Check
cf.verified_bot_category == "AI Crawler". The field operationally separates AI Crawler (training — GPTBot, ClaudeBot, Bytespider) from AI Assistant (user-directed) and AI Search (indexing) — only AI Crawler is redirected. - Read origin response HTML. Locate the page's
<link rel="canonical">tag. Present on 65-69 % of web pages (2025 Web Almanac). - Apply exclusion filters. Self-referencing canonicals → skip (infinite-loop guard). Cross-origin canonicals → skip by design ("often used for domain consolidation rather than content freshness").
- Emit
HTTP 301 Moved Permanently.Location: <canonical>; body not served.
The declarative shape is maintenance-free: new deprecated paths get the behaviour automatically as long as their canonical tag is set to the current equivalent. See patterns/canonical-tag-as-crawler-redirect for the full treatment.
Why training-time, not inference-time?¶
The alternative is inference-time correction — try to fix bad LLM answers after the model ships:
- Retraining is prohibitively expensive. Once the bad data is in the training set, it's baked in until the next major model version.
- Agent-prompt corrections don't scale. Every agent would need a custom "for Wrangler, always prefer v2" prompt — and no single surface controls all agents.
- RAG-time filtering is fragile. RAG-retrieved passages that contradict the training-set prior lose to the prior unless the retrieval pipeline is heavy-handed about bias correction.
Redirecting training-time crawlers fixes the source. Cascades into all future models. Much higher leverage.
Dependencies¶
- Reliable crawler classification. User-agent is spoofable; Web Bot Auth with signed requests gives cryptographic identity for cooperating bots; uncooperative ones need behavioural detection.
- Cooperative training-crawlers. Well-known training crawlers from major labs honour redirects. Adversarial scrapers may not.
- Careful deprecation taxonomy. You need to know which pages are deprecated — not just a text-match on "deprecated" (which would redirect docs that happen to mention a deprecated feature in passing).
Trade-offs and risks¶
- Different truth for different audiences. The site now officially serves different content based on client class. Benign in the deprecated-docs case; the same mechanism used maliciously could hide unfavorable content from training- crawler classes while keeping it for humans, or vice versa. Trust depends on the site operator.
- No standard transparency. There's no "hey-crawler-here's-the-redirect-reason" header standard. The audit trail of "what did training crawlers see when" is absent unless the operator logs it.
- Regressions. A misclassified client can be served the wrong page; a link to the historical archive from a training-crawler-facing context hits the redirect even if a human wanted the archive via the automated path.
Empirical case (Cloudflare 2026-04-17)¶
The Redirects for AI Training launch post supplies production-scale numbers:
- Advisory signals fail at scale. Legacy Wrangler v1 docs
on developers.cloudflare.com had the full advisory stack
(deprecation banner +
noindexmeta tag + canonical tags). AI Crawl Control telemetry showed the AI Crawler category visited 4.8 M times in 30 days and consumed deprecated content at the same rate as current content — the advisory signals made no measurable difference. - Ground-truth inference failure: in April 2026 a leading
AI assistant answered "How do I write KV values using the
Wrangler CLI?" with the deprecated
wrangler kv:key putsyntax — the colon form was deprecated in Wrangler 3.60.0; the current syntax iswrangler kv key put. - After 7 days of Redirects for AI Training: 100 % of AI-training-crawler requests to pages with non-self- referencing canonical tags were redirected. Cloudflare carefully separates this measured redirect-rate from the hypothesised AI-answer-quality lift, which is a function of each training pipeline's recrawl cadence.
Status-code observability landed alongside:
Radar's new
Response Status Code Analysis
on AI Insights measures aggregate + per-bot 3xx
(redirect) rates; see
patterns/response-status-as-content-policy for the broader
framing.
Seen in¶
- sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content
— canonical primary-source (2026-04-17 launch post).
Supplies the declarative canonical-tag variant, the
cf.verified_bot_categorythree-category distinction, the 4.8 M-visits-per-30-days telemetry, the 100 % 7-day redirect-rate dogfood, and the why-not-redirect-rules analysis. - sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready — companion post that first introduced the pattern in the wiki. Cloudflare developer-docs use Redirects for AI Training on deprecated doc URLs (Wrangler v1, Workers Sites).
Related¶
- concepts/agent-training-crawler-redirect — concept page.
- systems/redirects-for-ai-training — Cloudflare feature.
- systems/ai-crawl-control — parent product.
- patterns/canonical-tag-as-crawler-redirect — declarative-variant specialisation of this pattern.
- patterns/response-status-as-content-policy — broader HTTP-status-code-as-policy framing.
- systems/web-bot-auth — cryptographic crawler classification for cooperating bots.
- patterns/signed-bot-request — the Ed25519 + JWK + RFC 9421 pattern that cooperating bots use.
- concepts/canonical-tag — the HTML primitive the declarative variant binds to.
- concepts/noindex-meta-tag — adjacent advisory primitive that doesn't stop training-crawler ingest.
- systems/cloudflare-developer-documentation — reference implementation.