PATTERN Cited by 1 source

Canonical tag as crawler redirect¶

Pattern¶

Use existing <link rel="canonical"> tags as the declarative source for class-specific HTTP 301 redirects at the edge. Per-class policy (e.g. "redirect AI training crawlers") composes with the tag the origin already ships for SEO — the same infrastructure that answers "which URL is authoritative for this content?" answers "where should a training crawler end up?"

Shape:

Origin (or CMS) already emits <link rel="canonical" href="..."> in HTML <head> as standard SEO practice (RFC 6596, present on 65-69 % of web pages per the 2025 Web Almanac).
Edge proxy classifies the incoming request by client class (e.g. "verified AI training crawler").
If class matches and canonical tag is present, non-self-referencing, and same-origin, edge returns HTTP 301 Moved Permanently + Location: <canonical>.
Other classes (humans, search engines, AI Assistant, AI Search) pass through — the original URL is still served for them.

Canonical instance¶

Cloudflare's Redirects for AI Training (2026-04-17). Verbatim exchange from the launch post:

GET /durable-objects/api/legacy-kv-storage-api/
Host: developers.cloudflare.com
User-Agent: Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)

HTTP/1.1 301 Moved Permanently
Location: https://developers.cloudflare.com/durable-objects/api/sqlite-storage-api/

Key property: Cloudflare authored zero per-path rules for this behaviour. The canonical tags were already there for SEO; the feature is a single toggle that re-uses them.

Why the declarative-via-canonical substrate matters¶

Alternative 1: per-path redirect rules (e.g. Cloudflare's Single Redirect Rules). Explicitly rejected in the 2026-04-17 post as:

"Every new deprecated path requires a change to the rule."
User-agents must be manually tracked; redirect-rule files drift as new AI-training-crawlers appear.
Eats finite plan-rule quota that competes with campaign URLs, domain migrations, etc.
"Manually re-encode[s] what canonical tags already declare and fall[s] out of sync as content changes."

Alternative 2: noindex meta tag (concepts/noindex-meta-tag) — advisory, not enforcement, and does not affect AI-training-crawler ingest at all per Cloudflare's own 4.8 M-visit dogfood.

Alternative 3: robots.txt — crawl-ban-or-allow, per-directory-pattern; doesn't say "here's the current version instead."

The canonical-tag substrate is the only widely-deployed HTML primitive that points at the right answer, which is exactly the signal a crawler-redirect needs to act on.

Mechanics¶

Read origin response. Edge proxy inspects the HTML response body. ([Prerequisite: origin returns HTML that edge can inspect; HTTP Basic-Auth or CORS-limited origins can complicate.])
Parse <link rel="canonical">. Locate the tag; extract href.
Apply exclusion rules.
Self-referencing canonical? Skip — would create infinite redirect loop.
Cross-origin canonical? Skip by design — domain consolidation use case differs from content-freshness use case. (Operator choice; Cloudflare defaults to skip.)
Verify class match. Is the request from the target class? On Cloudflare: cf.verified_bot_category == "AI Crawler".
Emit redirect. HTTP 301 Moved Permanently + Location: <canonical>.
Log. The redirect rate is the measured observable; per Cloudflare's 7-day dogfood, 100 % of AI-training-crawler requests to pages with non-self-referencing canonicals were redirected.

Trade-offs¶

No-ops on 31-35 % of the web (pages without canonical tags) — fallback mechanism needed if you want deterministic per-class routing on every page.
Self-referencing canonicals are common defaults — many CMSes ship self-referencing canonicals by default; deprecated-content pages must actively set a different canonical URL for the pattern to fire.
Origin response must expose the canonical tag in HTML. If the origin renders canonicals via client-side JS only (rare but exists), edge proxies can't read them pre-delivery. Cloudflare's statement "Cloudflare reads the response HTML" implies server-rendered HTML.
Trust in the origin's canonical declaration. The whole mechanism assumes the canonical URL is honestly the current authoritative version. A malicious operator could set canonical URLs arbitrarily to send a specific client class somewhere else; this is the same trust posture as SEO at large.
No standard transparency disclosure. There is no widely-adopted header like AI-Redirect-Reason: canonical that tells the crawler why it was redirected. The signal-vs-enforcement distinction is invisible to the crawler at 301 receive time.
Cross-origin content hierarchies are unsupported by default. Pages with <link rel="canonical" href="https://other-domain.example/..."> need a different primitive (usually operator-authored redirect rules).
301 is a specific contract. Clients treat 301 as "permanent — update your internal link"; crawlers that cache the redirect aggressively may not re-fetch. In the crawler-redirect application this is usually desirable (next crawl goes straight to canonical), but edge operators should be aware.

Generalises beyond Cloudflare¶

The pattern is edge-proxy-general. Any CDN or reverse proxy with (a) HTML-response inspection and (b) a client-class classifier can implement it:

CDN + WAF stacks (Akamai, Fastly, AWS CloudFront) could emit comparable rules on User-Agent + canonical-tag parsing.
Origin-level middleware (NGINX + Lua, Envoy filter, Caddy plugin) can do it if the origin is behind an operator-controlled proxy.
Application-framework middleware (Rails, Django, Go net/http) can do it per-request by reading the rendered HTML or pre-computing the canonical mapping.

The novelty in Cloudflare's 2026-04-17 post isn't the mechanism but the declarative + one-toggle productisation that ships it to every paid customer without per-path rule authoring.

Seen in¶

sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content — canonical wiki instance. Cloudflare's Redirects for AI Training launch post.

concepts/canonical-tag — the HTML primitive this pattern binds to.
patterns/agent-training-crawler-redirect — the parent pattern (redirect-training-crawlers); this is the declarative variant.
systems/redirects-for-ai-training — Cloudflare feature.
systems/ai-crawl-control — parent product inside which the pattern lives.
patterns/response-status-as-content-policy — the broader framing of HTTP status codes as the protocol layer for content-policy signalling.
concepts/noindex-meta-tag — adjacent crawler-policy primitive that this pattern implicitly replaces for the training-crawler use case.