CONCEPT Cited by 1 source
noindex meta tag¶
Definition¶
The noindex meta tag is an HTML directive, typically in
the page <head>, that tells search engines not to include
the page in their index: <meta name="robots" content="noindex">.
Optional per-crawler scoping is possible (e.g.
<meta name="googlebot" content="noindex">). The equivalent
header-based variant is the X-Robots-Tag: noindex HTTP
response header. Canonical reference: Google's
block-indexing docs.
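As a rough illustration of how a cooperating crawler might read both variants, here is a minimal Python sketch. The names `RobotsMetaParser` and `is_noindex` are hypothetical, and the logic is deliberately simplified: it ignores the `none` shorthand and the user-agent-prefixed form of the header, and it is not any search engine's actual implementation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and one per-crawler variant."""

    def __init__(self, crawler="googlebot"):
        super().__init__()
        # "robots" applies to every crawler; the second name scopes to one crawler.
        self.names = {"robots", crawler.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() in self.names:
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def is_noindex(html, headers=None, crawler="googlebot"):
    """Return True if the meta tag or the X-Robots-Tag header carries noindex."""
    parser = RobotsMetaParser(crawler)
    parser.feed(html)
    if "noindex" in parser.directives:
        return True
    # Header names are case-insensitive, so compare in lowercase.
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            tokens = {t.strip().lower() for t in value.split(",")}
            if "noindex" in tokens:
                return True
    return False
```

Note that the page body (or at least the response headers) has to be fetched before either check can run, which is why the directive cannot prevent crawling.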
What it does (and doesn't do)¶
- Does: cause search engines that respect it to drop the page from their index (so it won't appear in search results).
- Doesn't: prevent crawling — bots still fetch the page to read the meta tag.
- Doesn't: prevent AI-training-crawler ingest, as demonstrated by the 2026-04-17 Cloudflare post.
Insufficiency for AI training crawlers (2026-04-17)¶
Cloudflare's own documentation for deprecated Wrangler v1
carried the full advisory stack (deprecation banner, the
noindex meta tag, and canonical tags pointing to current
docs), and yet telemetry from AI Crawl Control showed that
AI training crawlers visited 4.8 M times in 30 days and
"consumed deprecated content at the same rate as current
content. The advisory signals made no measurable difference."
The noindex tag was designed for the search-engine contract
where "we will fetch the page but not include it in search
results." AI training crawlers don't have an "include in
results" step — they have an ingest-into-training-corpus
step, and noindex says nothing about that. Even respectful
crawlers have no defined semantics for "don't train on this
page."
This is the named failure mode in the Redirects for AI
Training launch post: "For search engines, noindex functions
as a rich signal system, but there's no equivalent inline
directive a page can carry that says 'don't train on this'."
Gap in the standards space¶
Between robots.txt (crawl-rule authoring, pre-fetch) and
noindex (index-exclusion, post-fetch), there is no
widely-deployed per-page directive that declares training-use
intent. Partial answers in the ecosystem:
- Content Signals extends robots.txt with ai-train=no / ai-input=no / search=no, but adoption is ~4 % across top 200 k domains (Cloudflare Radar, 2026-04).
- Redirects for AI Training sidesteps the missing-directive problem by redirecting training crawlers to the current canonical URL (so they ingest better content rather than refusing to ingest).
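To make the Content Signals option concrete, the following Python sketch reads such signals out of a robots.txt body. It assumes the signals appear on `Content-Signal:` lines as comma-separated `key=value` pairs, and it ignores user-agent grouping entirely; the function names `parse_content_signals` and `may_train` are hypothetical, and this is a simplification rather than a conforming parser.

```python
def parse_content_signals(robots_txt):
    """Return {signal: value} gathered from Content-Signal lines in robots.txt."""
    signals = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, val = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = val.strip().lower()
    return signals


def may_train(robots_txt):
    """False only when the site declares ai-train=no.

    An absent signal is treated here as "no stated preference", which is
    an assumption of this sketch.
    """
    return parse_content_signals(robots_txt).get("ai-train") != "no"


sample = "User-Agent: *\nContent-Signal: ai-train=no, search=yes, ai-input=no\nAllow: /\n"
print(parse_content_signals(sample))
print(may_train(sample))
```

Like noindex, this remains a declaration: nothing in the file forces a crawler to call anything like `may_train` before ingesting a page.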
Caveats¶
- The noindex semantics are a search-engine-honour-system contract, not an enforcement primitive. Even for search engines, honouring it depends on the crawler's cooperation.
- Major search engines (Google, Bing, DuckDuckGo) honour it; AI training crawlers make no such promise.
- Using noindex on deprecated pages works for search results (the page drops out of the index) but leaves the training-data problem open, which is what the Redirects for AI Training post documents.
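The contrast between an advisory signal and enforcement can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cloudflare's implementation: the `respond` function and the `TRAINING_CRAWLER_TOKENS` list are hypothetical, and matching on User-Agent substrings stands in for whatever bot identification a real deployment would use.

```python
from urllib.parse import urljoin

# Illustrative tokens only; a real deployment would rely on verified bot
# identification rather than User-Agent substring matching.
TRAINING_CRAWLER_TOKENS = ("gptbot", "claudebot", "ccbot")


def respond(user_agent, page_url, canonical_href):
    """Return (status, location) for a request to page_url.

    canonical_href is the value of the page's rel="canonical" link, or None.
    """
    is_training = any(t in user_agent.lower() for t in TRAINING_CRAWLER_TOKENS)
    if is_training and canonical_href:
        # Resolve relative canonical links against the requested URL.
        target = urljoin(page_url, canonical_href)
        if target != page_url:
            return 301, target  # the crawler is sent to the current content
    return 200, None  # everyone else gets the page as usual
```

Under these assumptions a training crawler requesting a deprecated page never receives its body, so no cooperation is needed, whereas noindex only takes effect if the crawler chooses to honour it.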
Seen in¶
- sources/2026-04-17-cloudflare-redirects-for-ai-training-enforces-canonical-content: canonical wiki instance of insufficiency for AI training crawlers. Cloudflare had noindex on legacy Workers docs; training crawlers ingested them anyway at the same rate as current docs.
Related¶
- concepts/robots-txt: adjacent advisory primitive (pre-fetch crawl rules). robots.txt is site-wide / pattern-scoped; noindex is per-page.
- concepts/canonical-tag: the primitive that does point at the right answer; Redirects for AI Training uses it to fill the gap noindex leaves.
- concepts/agent-training-crawler-redirect: the concept that operationalises redirecting rather than signalling-no-index.
- systems/redirects-for-ai-training: Cloudflare feature that enforces the canonical pointer as 301.
- concepts/content-signals: the robots.txt extension that adds AI-use-declaration dimensions.