SYSTEM Cited by 2 sources

Redirects for AI Training¶

Overview¶

Redirects for AI Training (developers.cloudflare.com/ai-crawl-control/reference/redirects-for-ai-training) is a Cloudflare feature inside AI Crawl Control that detects verified AI-training crawlers and issues an HTTP 301 Moved Permanently to the page's canonical URL (per the page's own <link rel="canonical"> tag, RFC 6596) before the body is served — so LLM training datasets get Cloudflare's freshest implementation details rather than the deprecated original.

Launched 2026-04-17 at blog.cloudflare.com/ai-redirects, as a one-toggle feature on all paid Cloudflare plans. First dogfooded on developers.cloudflare.com — deprecated Wrangler v1 and Workers Sites paths now return 301 to the current v2 / equivalent paths for verified AI crawlers.

Problem it solves¶

Deprecated documentation poses a specific LLM-training data problem:

A human reading the page sees a large "this is deprecated, use v2" banner at the top and cognitively discounts the below-banner prose accordingly.
An LLM training crawler ingests the prose without the cross-page cognitive discount — so the training set teaches the model "Wrangler v1 commands are the answer" when asked about Wrangler.
Advisory signals don't stop them. Cloudflare's own docs for legacy Wrangler v1 carried the deprecation banner, the noindex meta tag (concepts/noindex-meta-tag), and canonical tags pointing at v2. AI Crawl Control telemetry showed the AI Crawler category visited 4.8 M times in 30 days and "consumed deprecated content at the same rate as current content. The advisory signals made no measurable difference." (Source: sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content)
Ground truth: in April 2026 a leading AI assistant answered "How do I write KV values using the Wrangler CLI?" with the deprecated wrangler kv:key put syntax — the colon form was deprecated in Wrangler 3.60.0; the current syntax is wrangler kv key put.

Result: LLMs recommend outdated tooling; users act on outdated recommendations; Cloudflare customers hit regressions discovering their agent-generated code uses deprecated APIs.

Mechanism¶

Three inputs + one decision:

cf.verified_bot_category — Cloudflare's Ruleset-Engine field that categorises incoming bot requests. Redirects for AI Training keys on AI Crawler (docs) — GPTBot, ClaudeBot, Bytespider, etc. Distinct from AI Assistant (human-directed agents) and AI Search (AI-powered search indexers), which are not redirected.
<link rel="canonical" href="..."> tag in the origin's response HTML. Cloudflare reads the response body to locate the tag.
Exclusion filters. Self-referencing canonicals (infinite-loop guard) and cross-origin canonicals (scope discipline — "often used for domain consolidation rather than content freshness") don't trigger redirects.

If the request is an AI Crawler, the canonical tag is present, non-self-referencing, and same-origin, Cloudflare returns:

HTTP/1.1 301 Moved Permanently
Location: <canonical URL>

before the response body is served. Example exchange from the launch post:

GET /durable-objects/api/legacy-kv-storage-api/
Host: developers.cloudflare.com
User-Agent: Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)

HTTP/1.1 301 Moved Permanently
Location: https://developers.cloudflare.com/durable-objects/api/sqlite-storage-api/

Human traffic, search indexing, AI Assistant, and AI Search all pass through unchanged.

7-day dogfood result¶

Cloudflare enabled Redirects for AI Training on developers.cloudflare.com. After 7 days:

"100 % of AI training crawler requests to pages with non-self-referencing canonical tags were redirected and were not served with deprecated content."

The post is careful to separate the measured redirect rate (100 %, deterministic) from the hypothesised AI-answer-quality lift ("given the closed nature of training pipelines and variability in recrawl timing, this is a hypothesis we will continue to verify"). What the crawler receives at the point of access has improved immediately; what future models say is a function of each training pipeline's recrawl cadence.

Why the intervention is leveraged at training time¶

The alternative is inference-time correction: patch every agent's prompt to "always prefer Wrangler v2"; filter RAG results; or retrain the LLM. All are expensive, partial, and don't address offline agent use or future models.

Redirecting training-time crawlers fixes the source of truth for future models. Once the next training run happens, the model naturally prefers v2 — no prompt engineering needed. High leverage: one Cloudflare-side config change → every future LLM model's answer quality improves for Cloudflare-related topics.

Why not single redirect rules?¶

Cloudflare's 2026-04-17 post explicitly addresses Single Redirect Rules as the alternative:

Per-path rules don't scale — every new deprecated path requires a rule change.
User-agents must be manually tracked; crawler user-agent lists drift.
Plan-limit rule quota is finite and competes with campaign URLs, domain migrations.
"Manually re-encode[s] what canonical tags already declare and fall[s] out of sync as content changes."

The canonical-tag substrate couples crawler-redirect policy to the same infrastructure that already keeps SEO and RAG-retrieval-in-HTML in sync — one source of truth, not two. See patterns/canonical-tag-as-crawler-redirect.

Ubiquity of the substrate¶

The 2025 Web Almanac SEO chapter reports 65-69 % of web pages carry <link rel="canonical"> tags. Platforms like EmDash, WordPress, and Contentful emit them automatically. Most Cloudflare customers therefore get the behaviour for free once the toggle is flipped — no new authoring step is required.

Dashboard path¶

"In the dashboard: on any domain, go to *AI Crawl Control

Quick Actions > Redirects for AI training > toggle on*."

Path-specific control requires Configuration Rules and, for multi-tenant deployments, Cloudflare for SaaS.

Radar surface¶

Aggregate adoption + its cousin measurements surface on Cloudflare Radar's new Response Status Code Analysis AI-Insights graph — filterable by industry and crawl-purpose, with per-bot detail on the bot directory pages. Representative aggregate: ~11.3 % 3xx responses to AI crawlers (redirects in total); representative per-GPTBot: ~5.1 % 3xx. Over time, the 3xx bucket is expected to grow as more origins enable Redirects for AI Training.

Trade-offs¶

Different content for different classes. Origin unilaterally decides; the same mechanism used maliciously could hide unfavorable content from training-crawler classes. Trust depends on the operator; no standard transparency disclosure exists.
Doesn't fix already-ingested training data. Models trained before the toggle was enabled still have the deprecated content baked in; this intervention only affects future crawl ingests.
Unverified crawlers bypass. Only requests classified as cf.verified_bot_category == AI Crawler are redirected. Adversarial or unclassified scrapers are unaffected.
No-ops on 31-35 % of pages. Pages without canonical tags get no redirect — the feature piggybacks on existing SEO infrastructure.
Self-referencing canonicals are common defaults. Many CMSes emit self-referencing canonicals; deprecated pages must actively set a different canonical URL for the redirect to fire.
Cross-origin canonicals excluded by design. Pages with canonicals pointing at other domains are not redirected.
Depends on crawlers honouring 301. Modern HTTP clients do; adversarial scrapers may not.

Relationship to other Cloudflare content-routing primitives¶

AI Crawl Control — the parent product inside which this feature ships. Telemetry for it lives in AI Crawl Control's dashboard.
Pay Per Crawl — same edge classification layer, different status code (402), different policy intent (monetise instead of redirect).
WAF + bot-management rules — classical hard-block layer (403); Redirects for AI Training runs alongside / below.
Content Signals — publisher-side preference declaration (ai-train=no, ai-input=yes, etc.); Redirects for AI Training is an enforcement mechanism a site can pair with the declaration. Content Signals is advisory; this is enforcement.
patterns/response-status-as-content-policy — the broader framing: 301 for canonical-routing, 402 for pay-per-crawl, 403 for block — HTTP status codes as the policy-enforcement substrate.

Seen in¶

sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content — canonical primary-source (2026-04-17 launch post). Supplies the mechanism (cf.verified_bot_category + HTML canonical-tag parsing → 301), the exclusion filters (self-referencing, cross-origin), the 7-day 100 % redirect-rate dogfood on developers.cloudflare.com, the wrangler kv:key put ground-truth inference failure, the why-not-redirect-rules analysis, and the per-category verified-bot distinction (AI Crawler vs AI Assistant vs AI Search).
sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready — companion Agents-Week post that first mentioned the feature as one dogfood mechanism on developers.cloudflare.com (deprecated Wrangler v1 + Workers Sites docs) alongside the split llms.txt, dynamic /index.md, and hidden agent directives — without launch-post mechanism detail.

companies/cloudflare — parent company.
systems/ai-crawl-control — parent product.
concepts/agent-training-crawler-redirect — concept.
concepts/canonical-tag — the HTML primitive the feature binds to.
concepts/noindex-meta-tag — adjacent advisory primitive shown to be insufficient for AI training crawlers.
patterns/agent-training-crawler-redirect — parent pattern (redirect-training-crawlers in general).
patterns/canonical-tag-as-crawler-redirect — the declarative-via-HTML-canonical-tag specific variant.
patterns/response-status-as-content-policy — broader HTTP-status-code-as-policy framing.
systems/web-bot-auth — cooperating-crawler classification substrate.
systems/cloudflare-developer-documentation — reference deployment.
systems/cloudflare-radar — Response Status Code Analysis surface measuring aggregate + per-bot 3xx (redirect) rates across AI-crawler traffic.