Server-beacon pairing for render measurement

Pattern

To measure the end-to-end rendering delay of a client (browser, crawler) that you don't control, inject a unique request identifier into both:

  • the server access log when the initial HTML response is emitted, and
  • a beacon JavaScript library injected into that HTML response, which POSTs the identifier to a beacon server after render completion.

Later, join the two data streams on the request identifier to recover per-request (crawl_time, render_complete_time) pairs. Compute render_delay = render_complete_time - crawl_time.
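
A minimal sketch of that join, assuming both streams have already been parsed into arrays of plain objects. The function name `pairEvents` and the field names (`reqId`, `crawlTimeMs`, `receivedAtMs`) are hypothetical and not taken from the source.

```javascript
// Join access-log records with beacon records on the request identifier.
// All names and field shapes here are illustrative assumptions.
function pairEvents(accessLog, beacons) {
  // Index beacons by request ID; keep the earliest arrival if duplicates exist.
  const firstBeacon = new Map();
  for (const b of beacons) {
    const seen = firstBeacon.get(b.reqId);
    if (seen === undefined || b.receivedAtMs < seen.receivedAtMs) {
      firstBeacon.set(b.reqId, b);
    }
  }

  const pairs = [];
  for (const entry of accessLog) {
    const b = firstBeacon.get(entry.reqId);
    if (b === undefined) continue; // crawl with no matching beacon
    pairs.push({
      reqId: entry.reqId,
      crawlTimeMs: entry.crawlTimeMs,
      renderCompleteTimeMs: b.receivedAtMs,
      renderDelayMs: b.receivedAtMs - entry.crawlTimeMs,
    });
  }
  return pairs;
}
```

Beacons whose request ID never appears in the access log are dropped silently here; a real pipeline would probably want to count them.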

Aggregate across many paired events to recover the full rendering-delay distribution (p50, p75, p99), slice by URL shape / URL prefix / response size, and compute the rendering success rate as "fraction of access-log entries with a matched beacon within a reasonable window."
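
The aggregation can be sketched as follows. The nearest-rank percentile definition is an assumption (the source does not say which method was used), and the names `percentile` and `renderSuccessRate` are hypothetical.

```javascript
// Nearest-rank percentile over an array of numbers, for p in (0, 100].
// This is one of several common percentile definitions, chosen for simplicity.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Fraction of access-log entries whose beacon arrived within windowMs.
// delaysMs holds one render delay per matched entry; unmatched entries
// contribute to totalEntries only.
function renderSuccessRate(totalEntries, delaysMs, windowMs) {
  if (totalEntries === 0) return NaN;
  const matched = delaysMs.filter((d) => d <= windowMs).length;
  return matched / totalEntries;
}
```

Slicing by URL prefix or response size amounts to grouping the delays first and calling `percentile` once per group.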

Why this architecture

The measurement problem: you know when your server answered a crawl request, but the crawler's rendering happens inside the crawler's own infrastructure, outside your control. You need some signal from inside the crawler's sandbox to correlate with the server's crawl-time record.

The solution: inject JS that runs inside the crawler's headless browser and fires a POST back to a server you do control, carrying the correlation key. The injected JS borrows the outbound HTTP capability of the crawler's own browser.

Correlation-key design

edge middleware injects into HTML response:

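  <script>
    (function () {
      // Unique per request; generated at the edge and also written to the access log.
      var REQUEST_ID = "crawl-abc123def";
      // "load" fires once the document and its subresources have loaded.
      window.addEventListener('load', function () {
        fetch('https://beacon.example.com/wrm', {
          method: 'POST',
          body: JSON.stringify({
            req_id: REQUEST_ID,
            url: window.location.href,
            // Client-side clock, diagnostic only; the beacon server's
            // arrival time is the safer render-complete timestamp.
            t: Date.now()
          }),
          // Let the request outlive the page if the render session ends.
          keepalive: true
        });
      });
    })();
  </script>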

server access log line for the same request:

  2024-04-17T10:32:55Z GET /docs/foo  200  req_id=crawl-abc123def
                                             user_agent=Googlebot/2.1  ...

Request ID is the join key. Each log holds at most one record per request, so every access-log entry pairs with at most one beacon.
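
A sketch of the injection step, written as pure string functions so it is independent of any particular edge runtime. The names `beaconScript` and `injectBeacon` are hypothetical, generating the request ID (for example a UUID) is left to the caller, and the search for `</body>` is case-sensitive, which is a simplification.

```javascript
// Build the inline beacon script for one request. JSON.stringify quotes the
// values; escaping "<" keeps a stray "</script>" inside a value from ending
// the tag early.
function beaconScript(requestId, beaconUrl) {
  const lit = (s) => JSON.stringify(s).replace(/</g, '\\u003c');
  return (
    '<script>(function(){' +
    'var id=' + lit(requestId) + ';' +
    'window.addEventListener("load",function(){' +
    'fetch(' + lit(beaconUrl) + ',{method:"POST",keepalive:true,' +
    'body:JSON.stringify({req_id:id,url:window.location.href})});' +
    '});})();</script>'
  );
}

// Insert the script just before the last </body>, or append it if none exists.
function injectBeacon(html, requestId, beaconUrl) {
  const script = beaconScript(requestId, beaconUrl);
  const idx = html.lastIndexOf('</body>');
  return idx === -1
    ? html + script
    : html.slice(0, idx) + script + html.slice(idx);
}
```

The same `requestId` value must be written to the access-log line for the request, otherwise the later join has nothing to match on.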

Canonical instance

Vercel + MERJ (published 2024-08-01), measuring nextjs.org during April 2024:

  • 100,000+ access-log entries for Googlebot requests.
  • 37,000+ beacon-matched pairs recovered. Fewer pairs than access-log entries is expected: non-indexable URLs, errored responses, and post-render beacon failures don't complete the loop, and each of these losses is explainable.
  • 100 % of indexable HTML pages (200/304, no noindex) paired — the claimed rendering-success rate.
  • Distribution recovered: p25 ≤ 4 s, p50 = 10 s, p75 = 26 s, p90 ≈ 3 h, p95 ≈ 6 h, p99 ≈ 18 h. See concepts/rendering-delay-distribution.

(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)

Where the timestamp is measured

Preferred: server-side at beacon-POST arrival. Avoids clock skew between the crawler's rendering sandbox and the measurer's server-side infrastructure. Delay is measured as a local time-difference on equipment the measurer controls.
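
A minimal sketch of server-side timestamping, assuming the beacon body is the JSON payload shown in the script above. The function name `recordBeacon` and the output field names are hypothetical; wiring it into an actual HTTP server is omitted.

```javascript
// Turn a raw beacon POST body into a stored record. The render-complete time
// is the server's own arrival time; a client-supplied "t" is kept only as a
// diagnostic and is never used for the delay calculation.
function recordBeacon(rawBody, receivedAtMs = Date.now()) {
  let payload;
  try {
    payload = JSON.parse(rawBody);
  } catch {
    return null; // malformed body
  }
  if (payload === null || typeof payload.req_id !== 'string' || payload.req_id === '') {
    return null; // no usable correlation key
  }
  return {
    reqId: payload.req_id,
    url: typeof payload.url === 'string' ? payload.url : null,
    renderCompleteTimeMs: receivedAtMs,
    clientReportedMs: typeof payload.t === 'number' ? payload.t : null,
  };
}
```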

"The timestamp of the rendering completion (this is calculated using the JavaScript Library request reception time on the server)." (Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)

Pairing failure modes

Every measurement loses some pairs; the loss shape tells you about the crawler:

  • Access-log entry with no beacon: the crawl happened, but the render didn't complete or the beacon didn't fire. Possible causes:
      ◦ Non-indexable status (3xx other than 304, 4xx, 5xx → no render).
      ◦ Render failed (the fraction of pages the crawler couldn't render).
      ◦ Bot's JS sandbox blocks outbound HTTP (AI-training crawlers may).
      ◦ Beacon POST was lost, timed out, or was rate-limited.
  • Beacon entry with no access-log match: should not happen under normal operation, because the beacon JS is only injected when edge middleware tagged the request. If it does happen, suspect access-log sampling or retention gaps.

The ratio of matched:unmatched pairs is itself a signal — at nextjs.org it told Vercel the rendering-success rate was effectively 100 % for indexable pages.
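
The loss shape can be tallied with a sketch like the one below. It can only separate what the logs themselves show: a failed render, blocked outbound HTTP, and a lost beacon all land in the same bucket. The function name, field names, and bucket labels are hypothetical; the 200/304 rule follows the indexable-page definition above, and noindex detection is not modelled.

```javascript
// Bucket access-log entries by pairing outcome and count orphan beacons.
// accessLog: array of { reqId, status }; beaconIds: Set of request IDs seen
// by the beacon server. Shapes are illustrative assumptions.
function pairingBreakdown(accessLog, beaconIds) {
  const counts = { paired: 0, nonIndexableStatus: 0, noBeacon: 0, orphanBeacons: 0 };
  const logIds = new Set();
  for (const entry of accessLog) {
    logIds.add(entry.reqId);
    if (beaconIds.has(entry.reqId)) {
      counts.paired += 1;
    } else if (entry.status !== 200 && entry.status !== 304) {
      counts.nonIndexableStatus += 1; // no render expected
    } else {
      counts.noBeacon += 1; // render failure, blocked HTTP, or lost beacon
    }
  }
  for (const id of beaconIds) {
    if (!logIds.has(id)) counts.orphanBeacons += 1; // log sampling or retention gap
  }
  return counts;
}
```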

Trade-offs

  • Post-render-event timing precision. The load event fires once the document and its dependent resources (stylesheets, scripts, images) have loaded, which is not the same as render completion, and the definition of "render complete" is itself fuzzy (after main-thread idle? after all async work has settled? after all network activity is done?). Fine for a distribution; less so for millisecond-precision latency.
  • Long-tail renders complete long after a short request-ID lookup window. At p99 ≈ 18 h, the access-log index must retain at least that much history for pairing to succeed, which is a pipeline design cost.
  • Beacon server must be high-availability. A beacon server outage looks like a rendering-success drop until you disentangle the failure mode.
  • Request-ID must be unique but opaque. Don't encode PII or sensitive routing info; it's in the HTML response body and the client-side JS, both visible to the crawler.

Adjacent patterns

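  • CSP / content-security-policy compatibility. Injected beacon JS needs to be permitted by the site's CSP: a nonce (or hash) for the inline script, plus a connect-src allow-list entry for the beacon endpoint if connect-src is restricted. Edge middleware can set these on the same response it injects the script into.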
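  • Keepalive / sendBeacon. fetch(..., {keepalive: true}) or navigator.sendBeacon(): both ask the browser to deliver the POST even if the rendering session is torn down immediately afterwards. Delivery is best-effort rather than guaranteed, and both paths cap the request body size, so keep the payload small.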
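
A sketch combining the two delivery options, trying `navigator.sendBeacon()` first and falling back to `fetch` with `keepalive`. The function name `sendRenderBeacon` and the `env` parameter are hypothetical; `env` exists only so the function can be exercised outside a browser.

```javascript
// Deliver the beacon with navigator.sendBeacon when available, otherwise
// with fetch + keepalive. Returns which path was used.
function sendRenderBeacon(url, payload, env = globalThis) {
  const body = JSON.stringify(payload);
  const nav = env.navigator;
  if (nav && typeof nav.sendBeacon === 'function') {
    // A string body is sent as text/plain; sendBeacon returns false when
    // the browser refuses to queue the request.
    if (nav.sendBeacon(url, body)) return 'sendBeacon';
  }
  if (typeof env.fetch === 'function') {
    // Fire and forget; swallow rejections so a failed beacon never
    // surfaces as an unhandled error in the page.
    env.fetch(url, { method: 'POST', body, keepalive: true }).catch(() => {});
    return 'fetch';
  }
  return 'none';
}
```

Because a string body arrives as text/plain, the beacon server should parse the body as JSON regardless of the Content-Type header.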
