SYSTEM
Googlebot¶
Googlebot is Google's production web crawler — the system that fetches public web pages and feeds them into Google Search indexing. It is one of the canonical declared crawlers of the modern web: it publishes its user-agent strings and IP ranges, supports verification via reverse DNS, and states its purpose publicly in the Google Search Central docs.
Architecture (post-2018)¶
Today's Googlebot has three visible stages:
- Crawl — an HTTP fetcher issues a `GET` for each candidate URL. Status-code triage happens here: 200 → enqueue for render; 304 → render with the cached 200 body; 3xx / 4xx / 5xx → do not enqueue. A `noindex` in the initial HTML body (`<meta name="robots" content="noindex">`) is detected pre-render, and the URL is dropped from the render queue.
- Render — the Web Rendering Service takes each enqueued URL, spins up a fresh headless Chromium session, executes all JavaScript, and emits the fully rendered DOM. Rendering is stateless: no cookies, no state carried across renders, no click interactions with the page. Google runs "the latest stable version of Chrome/Chromium", so modern JS features (`async/await`, `?.`, top-level await, modules) work.
- Index — the rendered DOM and the initial HTML body are fed into Google's search index. Link discovery runs over the body text via regex (URL-shaped strings); link value assessment runs after render.
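The crawl-stage triage described above can be expressed as a small decision function. This is a minimal sketch under the rules the source states — the function name, return labels, and the `noindex` regex are illustrative, not Google's implementation:

```python
import re

# Matches a robots meta noindex directive in the raw HTML body.
# Illustrative pattern; a real parser would handle more variants.
NOINDEX_META = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\'][^"\']*noindex',
    re.IGNORECASE,
)

def triage(status: int, body: str) -> str:
    """Decide whether a fetched URL enters the render queue."""
    if status == 200:
        if NOINDEX_META.search(body):
            return "drop"            # noindex detected pre-render
        return "enqueue"             # render with the fetched body
    if status == 304:
        return "enqueue-cached"      # render with the cached 200 body
    return "drop"                    # 3xx / 4xx / 5xx: do not enqueue

print(triage(200, "<html><body>hi</body></html>"))            # enqueue
print(triage(200, '<meta name="robots" content="noindex">'))  # drop
print(triage(304, ""))                                        # enqueue-cached
print(triage(404, ""))                                        # drop
```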
This is the current shape; the pipeline evolved through:
| Period | Rendering capability |
|---|---|
| Pre-2009 | Static HTML only; JS content invisible |
| 2009–2015 | AJAX crawling scheme (HTML-snapshot opt-in) |
| 2015–2018 | Early headless Chrome rendering; modern-JS-incomplete |
| 2018–present | Latest-stable-Chrome, universal rendering, stateless, asset-cached |
(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)
Empirical behaviour (Vercel + MERJ, April 2024)¶
Measured on nextjs.org (supplemental: monogram.io, basement.io) via the edge-middleware bot-beacon-injection pattern over April 2024 (100,000+ fetches, 37,000+ server-beacon pairs):
- 100% of indexable HTML pages were fully rendered — including CSR, SPA, and RSC-streaming pages. concepts/universal-rendering holds in practice, not just in Google's docs.
- Rendering-delay distribution: p25 ≤ 4 s, p50 = 10 s, p75 = 26 s, p90 ≈ 3 h, p95 ≈ 6 h, p99 ≈ 18 h. The long tail is real, but the median is tens of seconds. See concepts/rendering-delay-distribution.
- Query-string URLs render slower: p75 ≈ 31 min vs 22 s for path-only URLs — suggesting Google de-prioritises parameterised URLs that likely re-serve canonical content.
- `/docs` (high update frequency) renders faster than `/showcase` (low update frequency) — a freshness signal feeds into rendering priority.
- Streamed RSCs are fully rendered — React 18's streaming SSR does not impair indexing.
- JS complexity doesn't change rendering success rate, though it does raise per-page rendering cost — which impacts crawl budget on sites with 10,000+ pages.
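Given per-URL crawl-to-render delays from matched server-log / beacon timestamp pairs, a percentile table like the one above is a straightforward quantile computation. A minimal sketch — the sample delays here are invented, not the study's data:

```python
from statistics import quantiles

# Invented sample of crawl-to-render delays in seconds; real input
# would be one delay per matched server-beacon pair.
delays = [2, 4, 6, 10, 12, 20, 26, 40, 300, 10_800, 21_600, 64_800]

# quantiles(..., n=100) returns 99 cut points:
# index 24 is p25, 49 is p50, 74 is p75, 89 is p90.
cuts = quantiles(delays, n=100)
for label, idx in [("p25", 24), ("p50", 49), ("p75", 74), ("p90", 89)]:
    print(label, round(cuts[idx], 1), "s")
```

Note how a handful of multi-hour outliers drags p90 far from the median — the same shape the study reports.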
Key constraints Google publishes¶
- Stateless rendering — no cookies or session state retained; every render is a fresh browser. Personalisation has to work from the stateless path for SEO purposes.
- No click / no tab / no cookie-banner interaction — hidden-behind-click content is invisible to the index.
- Cloaking prohibited — serving different content to users vs. Googlebot based on `User-Agent` is an explicit SEO violation (concepts/cloaking). Implication for builders: optimise the stateless render path for the page's actual content; do personalisation stateful-side only.
- Asset caching via internal heuristics, not HTTP `Cache-Control` — the WRS runs its own cache-freshness logic; `Cache-Control` headers don't bypass it. See concepts/google-asset-caching-internal-heuristics.
Verification¶
Googlebot verification is the canonical verified-bot flow: Google publishes an IP-range JSON at developers.google.com/search/apis/ipranges/googlebot.json; an origin checks (a) that the request's source IP is in that range, and (b) that reverse DNS of the IP resolves to *.googlebot.com / *.google.com and forward DNS of that hostname resolves back to the same IP. Vercel's Edge Middleware in the 2024-08-01 study uses this verification to identify Googlebot without trusting the bare User-Agent string.
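The two checks can be sketched as a pure function with injected resolver callbacks, so no network is required. A sketch under assumptions: the resolver signatures and the sample range/hostname data below are illustrative, and a production check would fetch the live IP-range JSON and perform real PTR/A lookups:

```python
import ipaddress
from typing import Callable, Iterable

def verify_googlebot(
    ip: str,
    published_ranges: Iterable[str],
    reverse_dns: Callable[[str], str],
    forward_dns: Callable[[str], list[str]],
) -> bool:
    """Verified-bot check: IP-range membership plus
    forward-confirmed reverse DNS, as described above."""
    addr = ipaddress.ip_address(ip)
    # (a) source IP must fall inside a published Googlebot range
    if not any(addr in ipaddress.ip_network(r) for r in published_ranges):
        return False
    # (b) reverse DNS must land in googlebot.com / google.com, and
    # forward DNS of that hostname must resolve back to the same IP
    host = reverse_dns(ip)
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    return ip in forward_dns(host)

# Toy stand-ins for real PTR / A lookups and the published JSON:
ranges = ["66.249.64.0/19"]
rdns = lambda ip: "crawl-66-249-66-1.googlebot.com"
fdns = lambda host: ["66.249.66.1"]
print(verify_googlebot("66.249.66.1", ranges, rdns, fdns))  # True
print(verify_googlebot("203.0.113.9", ranges, rdns, fdns))  # False
```

Injecting the resolvers keeps the policy logic testable; a caller would wire in `socket`-based lookups (or a DNS library) at the edge.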
Seen in¶
- sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process — canonical wiki instance. Vercel + MERJ's empirical study of Googlebot's rendering behaviour on nextjs.org (April 2024), with the full rendering-delay distribution, universal-rendering confirmation, streaming-RSC rendering confirmation, and the status-code triage + `noindex`-pre-render enforcement disclosure.
- sources/2025-08-04-cloudflare-perplexity-stealth-undeclared-crawlers — Googlebot as the canonical declared-crawler exemplar, contrasting with Perplexity's stealth crawler.
Related¶
- systems/google-web-rendering-service — Googlebot's rendering stage implementation.
- concepts/declared-crawler — Googlebot is the canonical instance.
- concepts/verified-bots — Googlebot's IP-range + reverse-DNS verification flow.
- concepts/universal-rendering — Googlebot renders all pages.
- concepts/stateless-rendering — Googlebot's render sessions.
- concepts/rendering-queue — the queue between crawl and render.
- concepts/rendering-delay-distribution — the canonical p50 / p99 table.
- concepts/cloaking — what Googlebot prohibits.