PATTERN Cited by 1 source
Canonical tag as crawler redirect¶
Pattern¶
Use existing <link rel="canonical"> tags as the declarative
source for class-specific HTTP 301 redirects at the edge.
Per-class policy (e.g. "redirect AI training crawlers")
composes with the tag the origin already ships for SEO — the
same infrastructure that answers "which URL is
authoritative for this content?" answers "where should a
training crawler end up?"
Shape:
- Origin (or CMS) already emits
<link rel="canonical" href="...">in HTML<head>as standard SEO practice (RFC 6596, present on 65-69 % of web pages per the 2025 Web Almanac). - Edge proxy classifies the incoming request by client class (e.g. "verified AI training crawler").
- If class matches and canonical tag is present,
non-self-referencing, and same-origin, edge returns
HTTP 301 Moved Permanently+Location: <canonical>. - Other classes (humans, search engines, AI Assistant, AI Search) pass through — the original URL is still served for them.
Canonical instance¶
Cloudflare's Redirects for AI Training (2026-04-17). Verbatim exchange from the launch post:
GET /durable-objects/api/legacy-kv-storage-api/
Host: developers.cloudflare.com
User-Agent: Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)
HTTP/1.1 301 Moved Permanently
Location: https://developers.cloudflare.com/durable-objects/api/sqlite-storage-api/
Key property: Cloudflare authored zero per-path rules for this behaviour. The canonical tags were already there for SEO; the feature is a single toggle that re-uses them.
Why the declarative-via-canonical substrate matters¶
Alternative 1: per-path redirect rules (e.g. Cloudflare's Single Redirect Rules). Explicitly rejected in the 2026-04-17 post as:
- "Every new deprecated path requires a change to the rule."
- User-agents must be manually tracked; redirect-rule files drift as new AI-training-crawlers appear.
- Eats finite plan-rule quota that competes with campaign URLs, domain migrations, etc.
- "Manually re-encode[s] what canonical tags already declare and fall[s] out of sync as content changes."
Alternative 2: noindex meta tag
(concepts/noindex-meta-tag) — advisory, not enforcement,
and does not affect AI-training-crawler ingest at all
per Cloudflare's own 4.8 M-visit dogfood.
Alternative 3: robots.txt — crawl-ban-or-allow,
per-directory-pattern; doesn't say "here's the current
version instead."
The canonical-tag substrate is the only widely-deployed HTML primitive that points at the right answer, which is exactly the signal a crawler-redirect needs to act on.
Mechanics¶
- Read origin response. Edge proxy inspects the HTML response body. ([Prerequisite: origin returns HTML that edge can inspect; HTTP Basic-Auth or CORS-limited origins can complicate.])
- Parse
<link rel="canonical">. Locate the tag; extracthref. - Apply exclusion rules.
- Self-referencing canonical? Skip — would create infinite redirect loop.
- Cross-origin canonical? Skip by design — domain consolidation use case differs from content-freshness use case. (Operator choice; Cloudflare defaults to skip.)
- Verify class match. Is the request from the target
class? On Cloudflare:
cf.verified_bot_category == "AI Crawler". - Emit redirect.
HTTP 301 Moved Permanently+Location: <canonical>. - Log. The redirect rate is the measured observable; per Cloudflare's 7-day dogfood, 100 % of AI-training-crawler requests to pages with non-self-referencing canonicals were redirected.
Trade-offs¶
- No-ops on 31-35 % of the web (pages without canonical tags) — fallback mechanism needed if you want deterministic per-class routing on every page.
- Self-referencing canonicals are common defaults — many CMSes ship self-referencing canonicals by default; deprecated-content pages must actively set a different canonical URL for the pattern to fire.
- Origin response must expose the canonical tag in HTML. If the origin renders canonicals via client-side JS only (rare but exists), edge proxies can't read them pre-delivery. Cloudflare's statement "Cloudflare reads the response HTML" implies server-rendered HTML.
- Trust in the origin's canonical declaration. The whole mechanism assumes the canonical URL is honestly the current authoritative version. A malicious operator could set canonical URLs arbitrarily to send a specific client class somewhere else; this is the same trust posture as SEO at large.
- No standard transparency disclosure. There is no
widely-adopted header like
AI-Redirect-Reason: canonicalthat tells the crawler why it was redirected. The signal-vs-enforcement distinction is invisible to the crawler at301receive time. - Cross-origin content hierarchies are unsupported by
default. Pages with
<link rel="canonical" href="https://other-domain.example/...">need a different primitive (usually operator-authored redirect rules). 301is a specific contract. Clients treat301as "permanent — update your internal link"; crawlers that cache the redirect aggressively may not re-fetch. In the crawler-redirect application this is usually desirable (next crawl goes straight to canonical), but edge operators should be aware.
Generalises beyond Cloudflare¶
The pattern is edge-proxy-general. Any CDN or reverse proxy with (a) HTML-response inspection and (b) a client-class classifier can implement it:
- CDN + WAF stacks (Akamai, Fastly, AWS CloudFront) could
emit comparable rules on
User-Agent+ canonical-tag parsing. - Origin-level middleware (NGINX + Lua, Envoy filter, Caddy plugin) can do it if the origin is behind an operator-controlled proxy.
- Application-framework middleware (Rails, Django, Go
net/http) can do it per-request by reading the rendered HTML or pre-computing the canonical mapping.
The novelty in Cloudflare's 2026-04-17 post isn't the mechanism but the declarative + one-toggle productisation that ships it to every paid customer without per-path rule authoring.
Seen in¶
- sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content — canonical wiki instance. Cloudflare's Redirects for AI Training launch post.
Related¶
- concepts/canonical-tag — the HTML primitive this pattern binds to.
- patterns/agent-training-crawler-redirect — the parent pattern (redirect-training-crawlers); this is the declarative variant.
- systems/redirects-for-ai-training — Cloudflare feature.
- systems/ai-crawl-control — parent product inside which the pattern lives.
- patterns/response-status-as-content-policy — the broader framing of HTTP status codes as the protocol layer for content-policy signalling.
- concepts/noindex-meta-tag — adjacent crawler-policy primitive that this pattern implicitly replaces for the training-crawler use case.