CONCEPT Cited by 1 source

`noindex` meta tag¶

Definition¶

The noindex meta tag is an HTML directive, typically in the page <head>, that tells search engines not to include the page in their index:

<meta name="robots" content="noindex">

Optional per-crawler scoping is possible (e.g. <meta name="googlebot" content="noindex">). The equivalent header-based variant is the X-Robots-Tag: noindex HTTP response header. Canonical reference: Google's block-indexing docs.
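
As a minimal sketch of how a cooperating crawler might evaluate both variants, the Python below checks a fetched page for `noindex` in either the meta tag or the header. The names `RobotsMetaParser` and `is_noindex` are hypothetical, `headers` is assumed to be a plain dict keyed by the exact header name, and the user-agent-prefixed header form (e.g. `googlebot: noindex`) is deliberately not handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and a per-crawler meta tag."""

    def __init__(self, crawler_token="googlebot"):
        super().__init__()
        # Meta names this crawler obeys: the generic one plus its own token.
        self.names = {"robots", crawler_token.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in self.names:
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def is_noindex(html, headers, crawler_token="googlebot"):
    """Return True if the meta tag or the X-Robots-Tag header carries noindex."""
    parser = RobotsMetaParser(crawler_token)
    parser.feed(html)
    # Simplification: treats the header as a bare comma-separated directive list.
    header_value = headers.get("X-Robots-Tag", "")
    header_directives = {part.strip().lower() for part in header_value.split(",")}
    return "noindex" in parser.directives or "noindex" in header_directives
```

Note that both checks need the response in hand: the crawler has already fetched the page by the time it learns the page should not be indexed.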

What it does (and doesn't do)¶

Does: cause search engines that respect it to drop the page from their index (so it won't appear in search results).
Doesn't: prevent crawling. Bots must still fetch the page to read the meta tag, so a page that robots.txt blocks from crawling never has its noindex seen.
Doesn't: prevent AI-training-crawler ingest, as demonstrated by the 2026-04-17 Cloudflare post.
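
Because the tag is only visible after a fetch, it interacts with robots.txt: a compliant crawler that is disallowed from a path never downloads the page and so never reads its `noindex`. The sketch below illustrates this with Python's standard `urllib.robotparser`; the function name `can_see_noindex`, the `ExampleBot` user agent, and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def can_see_noindex(robots_txt, user_agent, url):
    """A compliant crawler reads a page's meta tags only if it may fetch the page."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


blocked = "User-agent: *\nDisallow: /legacy/\n"
open_site = "User-agent: *\nDisallow:\n"
url = "https://example.com/legacy/wrangler-v1.html"

print(can_see_noindex(blocked, "ExampleBot", url))    # False: page is never fetched
print(can_see_noindex(open_site, "ExampleBot", url))  # True: tag can be read
```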

Insufficiency for AI training crawlers (2026-04-17)¶

Cloudflare's own documentation for deprecated Wrangler v1 carried the full advisory stack (deprecation banner, noindex meta tag, and canonical tags pointing to current docs), and yet telemetry from AI Crawl Control showed that AI training crawlers visited 4.8 M times in 30 days and "consumed deprecated content at the same rate as current content. The advisory signals made no measurable difference."

The noindex tag was designed for a search-engine contract that amounts to "we will fetch the page but not include it in search results." AI training crawlers don't have an "include in results" step. They have an ingest-into-training-corpus step, and noindex says nothing about that. Even respectful crawlers have no defined semantics for "don't train on this page."
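
The mismatch can be modelled with a toy pipeline. This is an illustrative sketch only, not a description of any real crawler: `Page`, `SearchCrawler`, and `TrainingCrawler` are hypothetical names, and the training crawler is assumed to have no indexing step at all.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    text: str
    noindex: bool = False


@dataclass
class SearchCrawler:
    index: dict = field(default_factory=dict)

    def process(self, page):
        # The page has already been fetched; noindex only gates the indexing step.
        if not page.noindex:
            self.index[page.url] = page.text


@dataclass
class TrainingCrawler:
    corpus: list = field(default_factory=list)

    def process(self, page):
        # Assumption: there is no indexing step for noindex to gate,
        # so the fetched text goes straight into the corpus.
        self.corpus.append(page.text)


pages = [
    Page("https://example.com/v1", "deprecated docs", noindex=True),
    Page("https://example.com/v2", "current docs"),
]
search, trainer = SearchCrawler(), TrainingCrawler()
for page in pages:
    search.process(page)
    trainer.process(page)

print(sorted(search.index))  # ['https://example.com/v2']
print(trainer.corpus)        # ['deprecated docs', 'current docs']
```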

This is the named failure mode in the Redirects for AI Training launch post: "For search engines, noindex functions as a rich signal system, but there's no equivalent inline directive a page can carry that says 'don't train on this'."

Gap in the standards space¶

Between robots.txt (crawl-rule authoring, pre-fetch) and noindex (index-exclusion, post-fetch), there is no widely deployed per-page directive that declares training-use intent. Partial answers in the ecosystem:

Content Signals extends robots.txt with ai-train=no / ai-input=no / search=no, but adoption is about 4 % of the top 200 k domains (Cloudflare Radar, 2026-04).
Redirects for AI Training sidesteps the missing-directive problem by redirecting training crawlers to the current canonical URL (so they ingest better content rather than refusing to ingest).
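
The redirect approach can be sketched as a small decision function. This is a rough illustration under stated assumptions, not Cloudflare's implementation: the source does not say how training crawlers are identified, so the substring match against `TRAINING_CRAWLER_TOKENS` is a stand-in, and `CanonicalParser` and `respond` are hypothetical names.

```python
from html.parser import HTMLParser

# Illustrative user-agent tokens only; real classification is not described in the source.
TRAINING_CRAWLER_TOKENS = ("gptbot", "claudebot", "ccbot")


class CanonicalParser(HTMLParser):
    """Records the href of the last <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = attr.get("href")


def respond(url, html, user_agent):
    """Return (status, headers): 301 to the canonical URL for training crawlers, else 200."""
    parser = CanonicalParser()
    parser.feed(html)
    is_training = any(token in user_agent.lower() for token in TRAINING_CRAWLER_TOKENS)
    if is_training and parser.canonical and parser.canonical != url:
        return 301, {"Location": parser.canonical}
    return 200, {}
```

Ordinary visitors and pages that are already canonical are served unchanged. Only the combination of a training crawler and a page whose canonical tag points elsewhere produces a redirect.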

Caveats¶

The noindex semantics are an honour-system contract with search engines, not an enforcement primitive. Even for search engines, honouring it depends on the crawler's cooperation.
Major search engines (Google, Bing, DuckDuckGo) honour it; AI training crawlers make no such promise.
Using noindex on deprecated pages works for search results (the page drops out of the index) but leaves the training-data problem open, which is what the Redirects for AI Training post documents.

Seen in¶

sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content — canonical wiki instance of insufficiency for AI training crawlers. Cloudflare had noindex on legacy Workers docs; training crawlers ingested them anyway at the same rate as current docs.

Related¶

concepts/robots-txt — adjacent advisory primitive (pre-fetch crawl rules). robots.txt is site-wide / pattern-scoped; noindex is per-page.
concepts/canonical-tag — the primitive that does point at the right answer; Redirects for AI Training uses it to fill the gap noindex leaves.
concepts/agent-training-crawler-redirect — the concept that operationalises redirecting rather than signalling-no-index.
systems/redirects-for-ai-training — Cloudflare feature that enforces the canonical pointer as a 301 redirect.
concepts/content-signals — the robots.txt extension that adds AI-use-declaration dimensions.
