
Googlebot

Googlebot is Google's production web crawler — the system that fetches public web pages and feeds them into Google Search indexing. It is one of the canonical declared crawlers of the modern web: published user-agent strings, published IP ranges, verification via reverse DNS, and a public purpose statement in the Google Search Central docs.

Architecture (post-2018)

Today's Googlebot has three visible stages:

  1. Crawl — an HTTP fetcher issues a GET for each candidate URL. Status-code triage happens here: 200 → enqueue for render; 304 → render with cached 200 body; 3xx / 4xx / 5xx → do not enqueue. noindex in the initial HTML body (<meta name="robots" content="noindex">) is detected pre-render and the URL is dropped from the render queue.
  2. Render — the Web Rendering Service takes each enqueued URL, spins up a fresh headless Chromium session, executes all JavaScript, and emits the fully-rendered DOM. Stateless: no cookies, no state carried across renders, no click interactions with the page. Google runs "the latest stable version of Chrome/Chromium" so modern JS features (async/await, ?., top-level await, modules) work.
  3. Index — the rendered DOM + initial HTML body are fed into Google's search index. Link discovery runs over the body text via regex (URL-shaped strings); link value assessment runs after render.
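The stage-1 triage above can be sketched as a small decision function. This is an illustration of the described behaviour, not Google's implementation; the result names and the `noindex` regex are assumptions:

```typescript
// What crawl stage 1 does with a fetched URL.
// "render" = enqueue for the Web Rendering Service; "drop" = do not enqueue.
type TriageResult = "render" | "render-cached-body" | "drop";

function triage(status: number, body: string): TriageResult {
  // A noindex robots meta tag in the initial HTML is honoured pre-render,
  // dropping the URL from the render queue.
  const noindex =
    /<meta[^>]+name=["']robots["'][^>]+content=["'][^"']*noindex/i.test(body);
  if (noindex) return "drop";

  if (status === 200) return "render";             // fresh body → render queue
  if (status === 304) return "render-cached-body"; // render with cached 200 body
  return "drop";                                   // 3xx / 4xx / 5xx → not enqueued
}
```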

This is the current shape; the pipeline evolved through:

Period        Rendering capability
Pre-2009      Static HTML only; JS content invisible
2009–2015     AJAX crawling scheme (HTML-snapshot opt-in)
2015–2018     Early headless Chrome rendering; incomplete modern-JS support
2018–present  Latest stable Chrome, universal rendering, stateless, asset-cached

(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)

Empirical behaviour (Vercel + MERJ, April 2024)

Measured on nextjs.org (with supplemental data from monogram.io and basement.io) via the edge-middleware bot-beacon-injection pattern over April 2024 (100,000+ fetches, 37,000+ server-beacon pairs):

  • 100 % of indexable HTML pages fully rendered — including CSR, SPA, RSC-streaming pages. concepts/universal-rendering holds in practice, not just in Google's docs.
  • Rendering-delay distribution: p25 ≤ 4 s, p50 = 10 s, p75 = 26 s, p90 ≈ 3 h, p95 ≈ 6 h, p99 ≈ 18 h. Long tail real; median is tens of seconds. See concepts/rendering-delay-distribution.
  • Query-string URLs render slower: p75 ≈ 31 min vs 22 s path-only — suggests Google de-prioritises parameterised URLs that likely re-serve canonical content.
  • /docs (high-update-frequency) renders faster than /showcase (low-update-frequency) — freshness signal feeds into rendering priority.
  • Streamed RSCs fully rendered — React 18's streaming SSR does not impair indexing.
  • JS complexity doesn't change rendering success rate, though it does raise per-page rendering cost — which impacts crawl budget on sites with 10,000+ pages.
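For context on how a delay summary like the one above is produced — this is illustrative only, not the Vercel/MERJ analysis code — a nearest-rank percentile over the per-page crawl-to-render delays yields the pXX figures:

```typescript
// Nearest-rank percentile over a sample of render delays (seconds).
// Illustrative sketch; the sample values below are hypothetical.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: smallest value with at least p% of the sample at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const delays = [2, 4, 8, 10, 12, 26, 40, 3600, 10800, 21600]; // hypothetical
percentile(delays, 50); // → 12
```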

Key constraints Google publishes

  • Stateless rendering — no cookies or session state retained; every render is a fresh browser. Any content that matters for SEO must therefore be served on the stateless path; personalisation cannot be a prerequisite for it.
  • No click / no tab / no cookie-banner interaction — hidden-behind-click content is invisible to the index.
  • Cloaking prohibited — serving different content to users vs. Googlebot based on User-Agent is an explicit SEO violation (concepts/cloaking). Implication for builders: optimise the stateless render path for the page's actual content, do personalisation stateful-side-only.
  • Asset caching via internal heuristics, not HTTP Cache-Control — the WRS runs its own cache-freshness logic; Cache-Control headers don't bypass it. See concepts/google-asset-caching-internal-heuristics.
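One way to honour the stateless-rendering and anti-cloaking constraints together is to make personalisation purely additive. The handler below is a hypothetical sketch, not from the source:

```typescript
// Hypothetical page handler: the full core content is produced without reading
// cookies or the User-Agent, so Googlebot's stateless render sees everything
// and every requester gets the same core markup (no UA branching → no cloaking).
interface Session {
  name?: string;
}

function renderPage(session: Session): string {
  // Core content: identical for every requester, state or not.
  const core = "<main><h1>Product docs</h1><p>Full article body…</p></main>";
  // Additive personalisation: present only when session state exists; its
  // absence must never remove or gate core content.
  const greeting = session.name ? `<p>Welcome back, ${session.name}</p>` : "";
  return greeting + core;
}
```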

Verification

Googlebot verification is the canonical verified-bot flow. Google publishes an IP-range JSON at developers.google.com/search/apis/ipranges/googlebot.json; an origin checks that (a) the request's source IP is in that range and (b) reverse DNS of the IP resolves to *.googlebot.com / *.google.com and forward DNS of that hostname resolves back to the same IP. Vercel's Edge Middleware in the 2024-08-01 study uses this verification to identify Googlebot without trusting the bare User-Agent string.
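The forward-confirmed reverse-DNS check can be sketched in Node.js as follows. This is a simplified illustration: production code should also check the published IP-range JSON and handle lookup failures and timeouts, which are omitted here:

```typescript
import { reverse, resolve4 } from "node:dns/promises";

// Reverse-DNS hostname suffixes Google documents for Googlebot.
// The leading dot matters: "evil-googlebot.com" must not pass.
function isGooglebotHostname(host: string): boolean {
  return host.endsWith(".googlebot.com") || host.endsWith(".google.com");
}

// Forward-confirmed reverse DNS: IP → hostname → back to the same IP.
async function verifyGooglebotIp(ip: string): Promise<boolean> {
  const hostnames = await reverse(ip); // (a') reverse lookup of the source IP
  for (const host of hostnames) {
    if (!isGooglebotHostname(host)) continue;
    const forward = await resolve4(host); // (b) forward lookup of that hostname
    if (forward.includes(ip)) return true; // must resolve back to the same IP
  }
  return false;
}
```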
