
Googlebot

Googlebot is Google's production web crawler — the system that fetches public web pages and feeds them into Google Search indexing. It is one of the canonical declared crawlers of the modern web: published user-agent strings, published IP ranges, verification via reverse DNS, and a public purpose statement in the Google Search Central docs.

Architecture (post-2018)

Today's Googlebot has three visible stages:

  1. Crawl — an HTTP fetcher issues a GET for each candidate URL. Status-code triage happens here: 200 → enqueue for render; 304 → render with cached 200 body; 3xx / 4xx / 5xx → do not enqueue. noindex in the initial HTML body (<meta name="robots" content="noindex">) is detected pre-render and the URL is dropped from the render queue.
  2. Render — the Web Rendering Service takes each enqueued URL, spins up a fresh headless Chromium session, executes all JavaScript, and emits the fully-rendered DOM. Stateless: no cookies, no state carried across renders, no click interactions with the page. Google runs "the latest stable version of Chrome/Chromium" so modern JS features (async/await, ?., top-level await, modules) work.
  3. Index — the rendered DOM + initial HTML body are fed into Google's search index. Link discovery runs over the body text via regex (URL-shaped strings); link value assessment runs after render.
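The stage-1 triage above can be sketched as a small decision function. This is an illustration of the described behaviour, not Google's implementation; the result names and the `noindex` regex are assumptions:

```typescript
// What crawl stage 1 does with a fetched URL.
// "render" = enqueue for the Web Rendering Service; "drop" = do not enqueue.
type TriageResult = "render" | "render-cached-body" | "drop";

function triage(status: number, body: string): TriageResult {
  // A noindex robots meta tag in the initial HTML is honoured pre-render,
  // dropping the URL from the render queue.
  const noindex =
    /<meta[^>]+name=["']robots["'][^>]+content=["'][^"']*noindex/i.test(body);
  if (noindex) return "drop";

  if (status === 200) return "render";             // fresh body → render queue
  if (status === 304) return "render-cached-body"; // render with cached 200 body
  return "drop";                                   // 3xx / 4xx / 5xx → not enqueued
}
```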

This is the current shape; the pipeline evolved through:

Period        Rendering capability
Pre-2009      Static HTML only; JS content invisible
2009–2015     AJAX crawling scheme (HTML-snapshot opt-in)
2015–2018     Early headless Chrome rendering; incomplete modern-JS support
2018–present  Latest stable Chrome, universal rendering, stateless, asset-cached

(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)

Empirical behaviour (Vercel + MERJ, April 2024)

Measured on nextjs.org (with supplemental data from monogram.io and basement.io) via the edge-middleware bot-beacon-injection pattern over April 2024 (100,000+ fetches, 37,000+ server-beacon pairs):

  • 100 % of indexable HTML pages fully rendered — including CSR, SPA, RSC-streaming pages. concepts/universal-rendering holds in practice, not just in Google's docs.
  • Rendering-delay distribution: p25 ≤ 4 s, p50 = 10 s, p75 = 26 s, p90 ≈ 3 h, p95 ≈ 6 h, p99 ≈ 18 h. Long tail real; median is tens of seconds. See concepts/rendering-delay-distribution.
  • Query-string URLs render slower: p75 ≈ 31 min vs 22 s path-only — suggests Google de-prioritises parameterised URLs that likely re-serve canonical content.
  • /docs (high-update-frequency) renders faster than /showcase (low-update-frequency) — freshness signal feeds into rendering priority.
  • Streamed RSCs fully rendered — React 18's streaming SSR does not impair indexing.
  • JS complexity doesn't change rendering success rate, though it does raise per-page rendering cost — which impacts crawl budget on sites with 10,000+ pages.
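For context on how a delay summary like the one above is produced — this is illustrative only, not the Vercel/MERJ analysis code — a nearest-rank percentile over the per-page crawl-to-render delays yields the pXX figures:

```typescript
// Nearest-rank percentile over a sample of render delays (seconds).
// Illustrative sketch; the sample values below are hypothetical.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: smallest value with at least p% of the sample at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const delays = [2, 4, 8, 10, 12, 26, 40, 3600, 10800, 21600]; // hypothetical
percentile(delays, 50); // → 12
```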

Key constraints Google publishes

  • Stateless rendering — no cookies or session state retained; every render is a fresh browser. Any content that matters for SEO must therefore be served on the stateless path; personalisation cannot be a prerequisite for it.
  • No click / no tab / no cookie-banner interaction — hidden-behind-click content is invisible to the index.
  • Cloaking prohibited — serving different content to users vs. Googlebot based on User-Agent is an explicit SEO violation (concepts/cloaking). Implication for builders: optimise the stateless render path for the page's actual content, do personalisation stateful-side-only.
  • Asset caching via internal heuristics, not HTTP Cache-Control — the WRS runs its own cache-freshness logic; Cache-Control headers don't bypass it. See concepts/google-asset-caching-internal-heuristics.
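One way to honour the stateless-rendering and anti-cloaking constraints together is to make personalisation purely additive. The handler below is a hypothetical sketch, not from the source:

```typescript
// Hypothetical page handler: the full core content is produced without reading
// cookies or the User-Agent, so Googlebot's stateless render sees everything
// and every requester gets the same core markup (no UA branching → no cloaking).
interface Session {
  name?: string;
}

function renderPage(session: Session): string {
  // Core content: identical for every requester, state or not.
  const core = "<main><h1>Product docs</h1><p>Full article body…</p></main>";
  // Additive personalisation: present only when session state exists; its
  // absence must never remove or gate core content.
  const greeting = session.name ? `<p>Welcome back, ${session.name}</p>` : "";
  return greeting + core;
}
```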

Verification

Googlebot verification is the canonical verified-bot flow. Google publishes an IP-range JSON at developers.google.com/search/apis/ipranges/googlebot.json; an origin checks that (a) the request's source IP is in that range and (b) reverse DNS of the IP resolves to *.googlebot.com / *.google.com and forward DNS of that hostname resolves back to the same IP. Vercel's Edge Middleware in the 2024-08-01 study uses this verification to identify Googlebot without trusting the bare User-Agent string.
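The forward-confirmed reverse-DNS check can be sketched in Node.js as follows. This is a simplified illustration: production code should also check the published IP-range JSON and handle lookup failures and timeouts, which are omitted here:

```typescript
import { reverse, resolve4 } from "node:dns/promises";

// Reverse-DNS hostname suffixes Google documents for Googlebot.
// The leading dot matters: "evil-googlebot.com" must not pass.
function isGooglebotHostname(host: string): boolean {
  return host.endsWith(".googlebot.com") || host.endsWith(".google.com");
}

// Forward-confirmed reverse DNS: IP → hostname → back to the same IP.
async function verifyGooglebotIp(ip: string): Promise<boolean> {
  const hostnames = await reverse(ip); // (a') reverse lookup of the source IP
  for (const host of hostnames) {
    if (!isGooglebotHostname(host)) continue;
    const forward = await resolve4(host); // (b) forward lookup of that hostname
    if (forward.includes(ip)) return true; // must resolve back to the same IP
  }
  return false;
}
```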
