Skip to content

SYSTEM Cited by 2 sources

Redirects for AI Training

Overview

Redirects for AI Training (developers.cloudflare.com/ai-crawl-control/reference/redirects-for-ai-training) is a Cloudflare feature inside AI Crawl Control that detects verified AI-training crawlers and issues an HTTP 301 Moved Permanently to the page's canonical URL (per the page's own <link rel="canonical"> tag, RFC 6596) before the body is served — so LLM training datasets get Cloudflare's freshest implementation details rather than the deprecated original.

Launched 2026-04-17 at blog.cloudflare.com/ai-redirects, as a one-toggle feature on all paid Cloudflare plans. First dogfooded on developers.cloudflare.com — deprecated Wrangler v1 and Workers Sites paths now return 301 to the current v2 / equivalent paths for verified AI crawlers.

Problem it solves

Deprecated documentation poses a specific LLM-training data problem:

  • A human reading the page sees a large "this is deprecated, use v2" banner at the top and cognitively discounts the below-banner prose accordingly.
  • An LLM training crawler ingests the prose without the cross-page cognitive discount — so the training set teaches the model "Wrangler v1 commands are the answer" when asked about Wrangler.
  • Advisory signals don't stop them. Cloudflare's own docs for legacy Wrangler v1 carried the deprecation banner, the noindex meta tag (concepts/noindex-meta-tag), and canonical tags pointing at v2. AI Crawl Control telemetry showed the AI Crawler category visited 4.8 M times in 30 days and "consumed deprecated content at the same rate as current content. The advisory signals made no measurable difference." (Source: sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content)
  • Ground truth: in April 2026 a leading AI assistant answered "How do I write KV values using the Wrangler CLI?" with the deprecated wrangler kv:key put syntax — the colon form was deprecated in Wrangler 3.60.0; the current syntax is wrangler kv key put.

Result: LLMs recommend outdated tooling; users act on outdated recommendations; Cloudflare customers hit regressions discovering their agent-generated code uses deprecated APIs.

Mechanism

Three inputs + one decision:

  1. cf.verified_bot_category — Cloudflare's Ruleset-Engine field that categorises incoming bot requests. Redirects for AI Training keys on AI Crawler (docs) — GPTBot, ClaudeBot, Bytespider, etc. Distinct from AI Assistant (human-directed agents) and AI Search (AI-powered search indexers), which are not redirected.
  2. <link rel="canonical" href="..."> tag in the origin's response HTML. Cloudflare reads the response body to locate the tag.
  3. Exclusion filters. Self-referencing canonicals (infinite-loop guard) and cross-origin canonicals (scope discipline — "often used for domain consolidation rather than content freshness") don't trigger redirects.

If the request is an AI Crawler, the canonical tag is present, non-self-referencing, and same-origin, Cloudflare returns:

HTTP/1.1 301 Moved Permanently
Location: <canonical URL>

before the response body is served. Example exchange from the launch post:

GET /durable-objects/api/legacy-kv-storage-api/
Host: developers.cloudflare.com
User-Agent: Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)

HTTP/1.1 301 Moved Permanently
Location: https://developers.cloudflare.com/durable-objects/api/sqlite-storage-api/

Human traffic, search indexing, AI Assistant, and AI Search all pass through unchanged.

7-day dogfood result

Cloudflare enabled Redirects for AI Training on developers.cloudflare.com. After 7 days:

"100 % of AI training crawler requests to pages with non-self-referencing canonical tags were redirected and were not served with deprecated content."

The post is careful to separate the measured redirect rate (100 %, deterministic) from the hypothesised AI-answer-quality lift ("given the closed nature of training pipelines and variability in recrawl timing, this is a hypothesis we will continue to verify"). What the crawler receives at the point of access has improved immediately; what future models say is a function of each training pipeline's recrawl cadence.

Why the intervention is leveraged at training time

The alternative is inference-time correction: patch every agent's prompt to "always prefer Wrangler v2"; filter RAG results; or retrain the LLM. All are expensive, partial, and don't address offline agent use or future models.

Redirecting training-time crawlers fixes the source of truth for future models. Once the next training run happens, the model naturally prefers v2 — no prompt engineering needed. High leverage: one Cloudflare-side config change → every future LLM model's answer quality improves for Cloudflare-related topics.

Why not single redirect rules?

Cloudflare's 2026-04-17 post explicitly addresses Single Redirect Rules as the alternative:

  • Per-path rules don't scale — every new deprecated path requires a rule change.
  • User-agents must be manually tracked; crawler user-agent lists drift.
  • Plan-limit rule quota is finite and competes with campaign URLs, domain migrations.
  • "Manually re-encode[s] what canonical tags already declare and fall[s] out of sync as content changes."

The canonical-tag substrate couples crawler-redirect policy to the same infrastructure that already keeps SEO and RAG-retrieval-in-HTML in sync — one source of truth, not two. See patterns/canonical-tag-as-crawler-redirect.

Ubiquity of the substrate

The 2025 Web Almanac SEO chapter reports 65-69 % of web pages carry <link rel="canonical"> tags. Platforms like EmDash, WordPress, and Contentful emit them automatically. Most Cloudflare customers therefore get the behaviour for free once the toggle is flipped — no new authoring step is required.

Dashboard path

"In the dashboard: on any domain, go to *AI Crawl Control

Quick Actions > Redirects for AI training > toggle on*."

Path-specific control requires Configuration Rules and, for multi-tenant deployments, Cloudflare for SaaS.

Radar surface

Aggregate adoption + its cousin measurements surface on Cloudflare Radar's new Response Status Code Analysis AI-Insights graph — filterable by industry and crawl-purpose, with per-bot detail on the bot directory pages. Representative aggregate: ~11.3 % 3xx responses to AI crawlers (redirects in total); representative per-GPTBot: ~5.1 % 3xx. Over time, the 3xx bucket is expected to grow as more origins enable Redirects for AI Training.

Trade-offs

  • Different content for different classes. Origin unilaterally decides; the same mechanism used maliciously could hide unfavorable content from training-crawler classes. Trust depends on the operator; no standard transparency disclosure exists.
  • Doesn't fix already-ingested training data. Models trained before the toggle was enabled still have the deprecated content baked in; this intervention only affects future crawl ingests.
  • Unverified crawlers bypass. Only requests classified as cf.verified_bot_category == AI Crawler are redirected. Adversarial or unclassified scrapers are unaffected.
  • No-ops on 31-35 % of pages. Pages without canonical tags get no redirect — the feature piggybacks on existing SEO infrastructure.
  • Self-referencing canonicals are common defaults. Many CMSes emit self-referencing canonicals; deprecated pages must actively set a different canonical URL for the redirect to fire.
  • Cross-origin canonicals excluded by design. Pages with canonicals pointing at other domains are not redirected.
  • Depends on crawlers honouring 301. Modern HTTP clients do; adversarial scrapers may not.

Relationship to other Cloudflare content-routing primitives

  • AI Crawl Control — the parent product inside which this feature ships. Telemetry for it lives in AI Crawl Control's dashboard.
  • Pay Per Crawl — same edge classification layer, different status code (402), different policy intent (monetise instead of redirect).
  • WAF + bot-management rules — classical hard-block layer (403); Redirects for AI Training runs alongside / below.
  • Content Signals — publisher-side preference declaration (ai-train=no, ai-input=yes, etc.); Redirects for AI Training is an enforcement mechanism a site can pair with the declaration. Content Signals is advisory; this is enforcement.
  • patterns/response-status-as-content-policy — the broader framing: 301 for canonical-routing, 402 for pay-per-crawl, 403 for block — HTTP status codes as the policy-enforcement substrate.

Seen in

  • sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-contentcanonical primary-source (2026-04-17 launch post). Supplies the mechanism (cf.verified_bot_category + HTML canonical-tag parsing → 301), the exclusion filters (self-referencing, cross-origin), the 7-day 100 % redirect-rate dogfood on developers.cloudflare.com, the wrangler kv:key put ground-truth inference failure, the why-not-redirect-rules analysis, and the per-category verified-bot distinction (AI Crawler vs AI Assistant vs AI Search).
  • sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready — companion Agents-Week post that first mentioned the feature as one dogfood mechanism on developers.cloudflare.com (deprecated Wrangler v1 + Workers Sites docs) alongside the split llms.txt, dynamic /index.md, and hidden agent directives — without launch-post mechanism detail.
Last updated · 200 distilled / 1,178 read