
CONCEPT

Link discovery vs link value assessment

Definition

Googlebot's crawl-to-index pipeline splits link-handling into two distinct stages that happen at different points and on different inputs:

  1. Link discovery — finding new URLs to add to the crawl frontier. Happens at crawl time on the initial HTML body, via a regex over URL-shaped strings. No rendering required.
  2. Link value assessment — deciding how important a discovered link is for site architecture / PageRank-weight propagation / crawl-priority decisions. Happens after full-page rendering, on the rendered DOM.

Canonical verbatim: "Google differentiates between link discovery and link value assessment. The evaluation of a link's value for site architecture and crawl prioritization occurs after full-page rendering."

(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)

The pipeline

  [Crawl]    fetch URL → inspect body
                ├── regex over body → new URLs found
                │    └── enqueue for crawl (DISCOVERY)
                └── enqueue for render
  [Render]             ├── WRS executes JS
                       └── emit rendered DOM
  [Index]                     ├── index rendered content
                              └── weight links for PageRank / crawl priority (VALUE ASSESSMENT)

Discovery happens on the initial HTML body, which means:

  • It's fast. No queue wait. URLs found → crawl frontier updated on that request.
  • It's shape-strict. Google looks for URL-shaped strings (strings that start with http(s):// or a relative path and end in path characters), not URL semantics. The literal bytes https%3A%2F%2F... are not discovered; URL-decoding is not a pre-pass.
  • It's source-indifferent. <a href="/foo">, a URL in a <script type="application/json"> block, a URL string in a React Server Component payload, and a URL embedded in client-side JavaScript all count for discovery if they match the regex.
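
The shape-strict, source-indifferent behaviour can be sketched as a toy regex pass over a raw HTML body. The pattern below is a hypothetical stand-in (Google's actual discovery regex is not public); what matters is that it matches URL shapes wherever they appear and never decodes anything:

```python
import re

# Hypothetical stand-in for the discovery regex: absolute http(s) URLs,
# plus quoted root-relative paths. Shape-matching only -- no decoding,
# no JS evaluation.
URL_SHAPED = re.compile(r'https?://[\w.-]+(?:/[\w./-]*)?|(?<=")/[\w./-]+(?=")')

body = """
<a href="/foo">anchor</a>
<script type="application/json">{"next": "https://example.com/bar"}</script>
<script>var u = "https%3A%2F%2Fexample.com%2Fbaz";</script>
"""

# The <a href> and the JSON-embedded URL are both discovered; the
# percent-encoded string matches neither branch, so it is not.
found = URL_SHAPED.findall(body)
print(found)  # → ['/foo', 'https://example.com/bar']
```

Note that the `<a>` tag gets no special treatment here: the JSON blob's URL is discovered by exactly the same mechanism.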

What the two-stage split implies for CSR

  • CSR pages do not lose link discovery. A CSR page that embeds its link targets in a JSON payload (an RSC payload, a preloaded Redux state blob, an API-response blob) has those links discovered at crawl time from the initial body. They're in the crawl frontier almost immediately.
  • CSR pages do lose link value assessment latency. Because value assessment is post-render and post-queue, a CSR page's outbound links don't get weighted until the page is rendered, which can trail the crawl by ~10 s at p50 and ~18 h at p99. Site-architecture decisions ride on that delay.
  • A sitemap short-circuits discovery latency altogether. See concepts/sitemap: "Having an updated sitemap.xml significantly reduces, if not eliminates, the time-to-discovery differences between different rendering patterns."
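
For reference, a minimal sitemap.xml of the kind that quote refers to (the URL and date are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget</loc>
    <lastmod>2024-08-01</lastmod>
  </url>
</urlset>
```

Because the sitemap is fetched directly, every listed URL enters the crawl frontier without depending on discovery from any page body, rendered or not.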

What discovery does not do

  • Does not decode encoded URLs. https%3A%2F%2Fwebsite.com is not discovered.
  • Does not follow URL joins. this.baseURL + this.segment in a JS expression is not evaluated.
  • Does not run the JS. Link discovery is a static body pass, not an executed-program pass.
  • Does not assign link weight. That's value assessment, later.
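
The first three negative rules all fall out of the same static pass. Against a simplified URL-shape regex (again a hypothetical stand-in), the encoded and concatenated forms simply never match:

```python
import re

# Simplified URL-shape check (hypothetical stand-in for the real pattern).
URL_SHAPED = re.compile(r'https?://[\w./-]+')

snippets = {
    "plain":   'fetch("https://example.com/page")',
    "encoded": 'var u = "https%3A%2F%2Fexample.com%2Fpage";',  # never decoded
    "joined":  'var u = this.baseURL + this.segment;',         # never evaluated
}

results = {name: URL_SHAPED.findall(s) for name, s in snippets.items()}
print(results)
# → {'plain': ['https://example.com/page'], 'encoded': [], 'joined': []}
```

Only the plain literal is URL-shaped as written in the bytes of the body; the other two would require decoding or execution, which discovery does not do.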

Experimental confirmation

Vercel added a JSON object similar to an RSC payload to /showcase on nextjs.org, containing links to previously undiscovered pages. Googlebot discovered and crawled them — confirming the regex-over-body discovery semantics. The encoded URL in the same payload was not discovered — confirming the shape-strict rule.

Corollary patterns

  • patterns/link-in-non-rendered-json-payload-discovery — deliberate use of the discovery-via-regex behaviour to seed URL discovery from non-rendered payloads. Useful for CSR / SPA shells that have no rendered link targets but can expose a URL list in a JSON blob in the initial HTML.
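
A minimal sketch of that pattern (the `__SEED__` id and URL list are illustrative, not any framework's API): the shell renders no `<a>` elements at all, but the initial HTML carries a JSON blob whose URL strings are regex-discoverable at crawl time:

```python
import json

# Hypothetical CSR shell: no anchor tags anywhere, but the URL strings in
# the inline JSON blob are present in the initial HTML body, so they match
# a URL-shape regex at crawl time.
links = ["/docs/getting-started", "/docs/routing", "/docs/deploy"]

shell = f'''<!doctype html>
<html>
  <body>
    <div id="root"></div>
    <script id="__SEED__" type="application/json">{json.dumps({"links": links})}</script>
    <script src="/app.js"></script>
  </body>
</html>'''

print(shell)
```

The client-side app can hydrate from the same blob, so the seed list serves both the router and the crawler.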
