
CONCEPT

Link discovery vs link value assessment

Definition

Googlebot's crawl-to-index pipeline splits link-handling into two distinct stages that happen at different points and on different inputs:

  1. Link discovery — finding new URLs to add to the crawl frontier. Happens at crawl time on the initial HTML body, via a regex over URL-shaped strings. No rendering required.
  2. Link value assessment — deciding how important a discovered link is for site architecture / PageRank-weight propagation / crawl-priority decisions. Happens after full-page rendering, on the rendered DOM.

Canonical verbatim: "Google differentiates between link discovery and link value assessment. The evaluation of a link's value for site architecture and crawl prioritization occurs after full-page rendering."

(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)

The pipeline

  [Crawl]    fetch URL → inspect body
                ├── regex over body → new URLs found
                │    └── enqueue for crawl (DISCOVERY)
                └── enqueue for render
  [Render]             ├── WRS executes JS
                       └── emit rendered DOM
  [Index]                     ├── index rendered content
                              └── weight links for PageRank / crawl priority (VALUE ASSESSMENT)

Discovery happens on the initial HTML body, which means:

  • It's fast. No queue wait. URLs found → crawl frontier updated on that request.
  • It's shape-strict. Google looks for URL-shaped strings (strings that start with http(s):// or a relative path and end in path characters), not URL semantics. The literal bytes https%3A%2F%2F... are not discovered; URL-decoding is not a pre-pass.
  • It's source-indifferent. <a href="/foo">, a URL in a <script type="application/json"> block, a URL string in a React Server Component payload, and a URL embedded in client-side JavaScript all count for discovery if they match the regex.
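
The shape-strict, source-indifferent behaviour can be sketched as a toy regex pass over a raw HTML body. The pattern below is a hypothetical stand-in (Google's actual discovery regex is not public); what matters is that it matches URL shapes wherever they appear and never decodes anything:

```python
import re

# Hypothetical stand-in for the discovery regex: absolute http(s) URLs,
# plus quoted root-relative paths. Shape-matching only -- no decoding,
# no JS evaluation.
URL_SHAPED = re.compile(r'https?://[\w.-]+(?:/[\w./-]*)?|(?<=")/[\w./-]+(?=")')

body = """
<a href="/foo">anchor</a>
<script type="application/json">{"next": "https://example.com/bar"}</script>
<script>var u = "https%3A%2F%2Fexample.com%2Fbaz";</script>
"""

# The <a href> and the JSON-embedded URL are both discovered; the
# percent-encoded string matches neither branch, so it is not.
found = URL_SHAPED.findall(body)
print(found)  # → ['/foo', 'https://example.com/bar']
```

Note that the `<a>` tag gets no special treatment here: the JSON blob's URL is discovered by exactly the same mechanism.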

What the two-stage split implies for CSR

  • CSR pages do not lose link discovery. A CSR page that embeds its link targets in a JSON payload (an RSC payload, a preloaded Redux state blob, an API-response blob) has those links discovered at crawl time from the initial body. They're in the crawl frontier almost immediately.
  • CSR pages do lose link value assessment latency. Because value assessment is post-render and post-queue, a CSR page's outbound links don't get weighted until the page is rendered, which can trail the crawl by ~10 s at p50 and ~18 h at p99. Site-architecture decisions ride on that delay.
  • A sitemap short-circuits discovery latency altogether. See concepts/sitemap: "Having an updated sitemap.xml significantly reduces, if not eliminates, the time-to-discovery differences between different rendering patterns."
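
For reference, a minimal sitemap.xml of the kind that quote refers to (the URL and date are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget</loc>
    <lastmod>2024-08-01</lastmod>
  </url>
</urlset>
```

Because the sitemap is fetched directly, every listed URL enters the crawl frontier without depending on discovery from any page body, rendered or not.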

What discovery does not do

  • Does not decode encoded URLs. https%3A%2F%2Fwebsite.com is not discovered.
  • Does not follow URL joins. this.baseURL + this.segment in a JS expression is not evaluated.
  • Does not run the JS. Link discovery is a static body pass, not an executed-program pass.
  • Does not assign link weight. That's value assessment, later.
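
The first three negative rules all fall out of the same static pass. Against a simplified URL-shape regex (again a hypothetical stand-in), the encoded and concatenated forms simply never match:

```python
import re

# Simplified URL-shape check (hypothetical stand-in for the real pattern).
URL_SHAPED = re.compile(r'https?://[\w./-]+')

snippets = {
    "plain":   'fetch("https://example.com/page")',
    "encoded": 'var u = "https%3A%2F%2Fexample.com%2Fpage";',  # never decoded
    "joined":  'var u = this.baseURL + this.segment;',         # never evaluated
}

results = {name: URL_SHAPED.findall(s) for name, s in snippets.items()}
print(results)
# → {'plain': ['https://example.com/page'], 'encoded': [], 'joined': []}
```

Only the plain literal is URL-shaped as written in the bytes of the body; the other two would require decoding or execution, which discovery does not do.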

Experimental confirmation

Vercel added a JSON object similar to an RSC payload to /showcase on nextjs.org, containing links to previously undiscovered pages. Googlebot discovered and crawled them — confirming the regex-over-body discovery semantics. The encoded URL in the same payload was not discovered — confirming the shape-strict rule.

Corollary patterns

  • patterns/link-in-non-rendered-json-payload-discovery — deliberate use of the discovery-via-regex behaviour to seed URL discovery from non-rendered payloads. Useful for CSR / SPA shells that have no rendered link targets but can expose a URL list in a JSON blob in the initial HTML.
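
A minimal sketch of that pattern (the `__SEED__` id and URL list are illustrative, not any framework's API): the shell renders no `<a>` elements at all, but the initial HTML carries a JSON blob whose URL strings are regex-discoverable at crawl time:

```python
import json

# Hypothetical CSR shell: no anchor tags anywhere, but the URL strings in
# the inline JSON blob are present in the initial HTML body, so they match
# a URL-shape regex at crawl time.
links = ["/docs/getting-started", "/docs/routing", "/docs/deploy"]

shell = f'''<!doctype html>
<html>
  <body>
    <div id="root"></div>
    <script id="__SEED__" type="application/json">{json.dumps({"links": links})}</script>
    <script src="/app.js"></script>
  </body>
</html>'''

print(shell)
```

The client-side app can hydrate from the same blob, so the seed list serves both the router and the crawler.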
