
CONCEPT Cited by 1 source

noindex meta tag

Definition

The noindex meta tag is an HTML directive, typically in the page <head>, that tells search engines not to include the page in their index:

<meta name="robots" content="noindex">

Per-crawler scoping is possible by naming the crawler instead of robots (e.g. <meta name="googlebot" content="noindex">). The header-based equivalent is the X-Robots-Tag: noindex HTTP response header, which also covers non-HTML resources such as PDFs. Canonical reference: Google's block-indexing docs.
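As a rough illustration of how a compliant crawler might read these two signals, here is a minimal sketch using only the Python standard library. The names RobotsMetaParser and is_noindex are hypothetical helpers, not part of any crawler's real code, and the sketch deliberately ignores details such as the none directive and user-agent-prefixed X-Robots-Tag values.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and per-crawler meta tags."""

    def __init__(self, crawler="googlebot"):
        super().__init__()
        # Generic "robots" applies to everyone; the crawler's own name scopes to it.
        self.names = {"robots", crawler.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() in self.names:
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def is_noindex(html, headers=None, crawler="googlebot"):
    """True if the meta tag or the X-Robots-Tag header carries noindex.

    Simplification (assumption): header values are treated as a plain
    comma-separated directive list.
    """
    parser = RobotsMetaParser(crawler)
    parser.feed(html)
    header = ""
    for key, value in (headers or {}).items():
        if key.lower() == "x-robots-tag":
            header = value
    header_tokens = {t.strip().lower() for t in header.split(",")}
    return "noindex" in parser.directives or "noindex" in header_tokens
```

For example, is_noindex('<meta name="robots" content="noindex">') is true, as is is_noindex('<p>hi</p>', headers={"X-Robots-Tag": "noindex"}). In both cases the page has already been fetched by the time the check runs.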

What it does (and doesn't do)

  • Does: cause search engines that respect it to drop the page from their index (so it won't appear in search results).
  • Doesn't: prevent crawling. Bots must still fetch the page to read the meta tag, so a page that is also disallowed in robots.txt never has its noindex seen.
  • Doesn't: prevent AI-training-crawler ingest, as demonstrated by the 2026-04-17 Cloudflare post.
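The split between crawl permission and index permission can be made concrete with a small sketch. It uses Python's standard urllib.robotparser; the function name can_see_noindex, the ExampleBot user agent, and the URLs are all hypothetical. The point is only that noindex is readable when, and only when, the crawler is allowed to fetch the page.

```python
from urllib.robotparser import RobotFileParser


def can_see_noindex(robots_txt, user_agent, url):
    """A crawler can only read a page's noindex tag if it may fetch the page."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks an old docs tree from all crawlers.
ROBOTS_TXT = "User-agent: *\nDisallow: /v1/\n"

# Blocked from crawling, so any noindex tag on the page is never read.
blocked = can_see_noindex(ROBOTS_TXT, "ExampleBot", "https://example.com/v1/old")

# Crawlable, so the page is fetched and its noindex tag (if any) is seen.
allowed = can_see_noindex(ROBOTS_TXT, "ExampleBot", "https://example.com/v2/new")
```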

Insufficiency for AI training crawlers (2026-04-17)

Cloudflare's own documentation for deprecated Wrangler v1 carried the full advisory stack (deprecation banner, the noindex meta tag, and canonical tags pointing to current docs), and yet telemetry from AI Crawl Control showed that AI training crawlers made 4.8 M visits in 30 days and "consumed deprecated content at the same rate as current content. The advisory signals made no measurable difference."

The noindex tag was designed for the search-engine contract: "we will fetch the page but not include it in search results." AI training crawlers have no "include in results" step. They have an ingest-into-training-corpus step, and noindex says nothing about that. Even well-behaved crawlers have no defined semantics for "don't train on this page."

This is the failure mode named in the Redirects for AI Training launch post: "For search engines, noindex functions as a rich signal system, but there's no equivalent inline directive a page can carry that says 'don't train on this'."

Gap in the standards space

Between robots.txt (crawl-rule authoring, pre-fetch) and noindex (index-exclusion, post-fetch), there is no widely deployed per-page directive that declares training-use intent. Partial answers in the ecosystem:

  • Content Signals extends robots.txt with ai-train=no / ai-input=no / search=no, but adoption is only ~4 % of the top 200 k domains (Cloudflare Radar, 2026-04).
  • Redirects for AI Training sidesteps the missing-directive problem by redirecting training crawlers to the current canonical URL (so they ingest better content rather than refusing to ingest).
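Both partial answers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cloudflare's implementation. The Content-Signal: line format, the crawler list, the example paths, and the 301 status code are assumptions, and parse_content_signals and respond are hypothetical helpers. The parser also ignores User-agent grouping for brevity.

```python
# Hypothetical values for illustration only; none of these come from the source.
TRAINING_CRAWLERS = ("gptbot", "ccbot")
CURRENT_URL = {"/wrangler/v1/install": "/wrangler/install"}


def parse_content_signals(robots_txt):
    """Return {signal: bool} from Content-Signal lines in a robots.txt body."""
    signals = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, val = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = val.strip().lower() == "yes"
    return signals


def respond(user_agent, path):
    """Return (status, location): redirect training crawlers off deprecated paths."""
    ua = user_agent.lower()
    target = CURRENT_URL.get(path)
    if target and any(bot in ua for bot in TRAINING_CRAWLERS):
        return 301, target  # assumed status code; the source does not specify one
    return 200, path
```

The first function only reads a declared preference, which a crawler remains free to ignore. The second changes what a training crawler actually receives, which is why it works even without a "don't train" directive.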

Caveats

  • The noindex semantics are an honour-system contract with search engines, not an enforcement primitive. Even for search engines, it works only if the crawler cooperates.
  • Major search engines (Google, Bing, DuckDuckGo) honour it; AI training crawlers make no such promise.
  • Using noindex on deprecated pages works for SERPs (the page drops out of the index) but leaves the training-data problem open, which is what the Redirects for AI Training post documents.
