CONCEPT Cited by 1 source
Link discovery vs link value assessment¶
Definition¶
Googlebot's crawl-to-index pipeline splits link-handling into two distinct stages that happen at different points and on different inputs:
- Link discovery — finding new URLs to add to the crawl frontier. Happens at crawl time on the initial HTML body, via a regex over URL-shaped strings. No rendering required.
- Link value assessment — deciding how important a discovered link is for site architecture / PageRank-weight propagation / crawl-priority decisions. Happens after full-page rendering, on the rendered DOM.
Canonical verbatim: "Google differentiates between link discovery and link value assessment. The evaluation of a link's value for site architecture and crawl prioritization occurs after full-page rendering."
(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)
The pipeline¶
[Crawl] fetch URL → inspect body
├── regex over body → new URLs found
│ └── enqueue for crawl (DISCOVERY)
└── enqueue for render
│
[Render] ├── WRS executes JS
└── emit rendered DOM
│
[Index] ├── index rendered content
└── weight links for PageRank / crawl priority (VALUE ASSESSMENT)
Discovery happens on the initial HTML body, which means:
- It's fast. No queue wait. URLs found → crawl frontier updated on that request.
- It's shape-strict. Google looks for URL-shaped strings
(start with
http(s)://or a relative path, ending in path characters), not URL-semantics. The literal byteshttps%3A%2F%2F...are not discovered — URL-decoding is not a pre-pass. - It's source-indifferent.
<a href="/foo">, a URL in a<script type="application/json">block, a URL string in a React Server Component payload, and a URL embedded in client-side JavaScript all count for discovery if they match the regex.
What the two-stage split implies for CSR¶
- CSR pages do not lose link discovery. A CSR page that embeds its link targets in a JSON payload (an RSC payload, a preloaded Redux state blob, an API-response blob) has those links discovered at crawl time from the initial body. They're in the crawl frontier almost immediately.
- CSR pages do lose link value assessment latency. Because value assessment is post-render and post-queue, a CSR page's outbound links don't get weighted until the page is rendered — which can be p50 = 10 s or p99 ≈ 18 h out. Site-architecture decisions ride on that delay.
- A sitemap short-circuits discovery latency altogether. See
concepts/sitemap — "Having an updated
sitemap.xmlsignificantly reduces, if not eliminates, the time-to-discovery differences between different rendering patterns."
What discovery does not do¶
- Does not decode encoded URLs.
https%3A%2F%2Fwebsite.comis not discovered. - Does not follow URL joins.
this.baseURL + this.segmentin a JS expression is not evaluated. - Does not run the JS. Link discovery is a static body pass, not an executed-program pass.
- Does not assign link weight. That's value assessment, later.
Experimental confirmation¶
Vercel added a JSON object similar to an RSC payload to
/showcase on nextjs.org, containing links to previously-
undiscovered pages. Googlebot discovered and crawled them —
confirming the regex-over-body discovery semantics. The encoded-
URL in the same payload was not discovered — confirming the
shape-strict rule.
Corollary patterns¶
- patterns/link-in-non-rendered-json-payload-discovery — deliberate use of the discovery-via-regex behaviour to seed URL discovery from non-rendered payloads. Useful for CSR / SPA shells that have no rendered link targets but can expose a URL list in a JSON blob in the initial HTML.
Seen in¶
- sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process — canonical wiki instance. Vercel + MERJ's empirical study describes the split explicitly as part of debunking the "JS- heavy sites have slower page discovery" myth.
Related¶
- systems/googlebot — the pipeline operator.
- systems/google-web-rendering-service — where value assessment happens.
- concepts/rendering-queue — why value assessment is delayed.
- patterns/link-in-non-rendered-json-payload-discovery — the deliberate-exploitation pattern.
- concepts/universal-rendering — the upstream property.