
Edge-middleware bot-beacon injection

Pattern

To measure how search-engine crawlers actually render your pages (not what the crawler vendor says they do), use programmable edge middleware to:

  1. Intercept every request on the CDN edge, before it reaches origin.
  2. Identify crawler requests via User-Agent + source IP range + reverse-DNS verification (the verified-bot flow).
  3. Inject a lightweight JavaScript beacon library into the HTML response for matched crawler requests — the library fires on window.onload / render completion and POSTs the render-complete timestamp to your beacon server.
  4. Pass human traffic through unchanged so performance for real users is not affected.
  5. Tag the HTML response with a unique request ID that also appears in the server access log, so beacon payloads can be paired with original crawl requests. See patterns/server-beacon-pairing-for-render-measurement.
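The inject-and-tag steps above can be sketched as a pure HTML transform, independent of any particular edge runtime. This is a minimal illustration, not the study's actual library: `BEACON_URL` and the inline script body are placeholders.

```typescript
// Hypothetical beacon endpoint; stands in for your beacon server.
const BEACON_URL = "https://beacon.example.com/collect";

// Generate a request ID that also goes into the access log, so beacon
// payloads can later be paired with the original crawl request.
// A real edge runtime would use crypto.randomUUID(); random hex keeps
// this sketch dependency-free.
function newRequestId(): string {
  return Math.random().toString(16).slice(2) + Date.now().toString(16);
}

// Insert a beacon <script> just before </body>. It fires on window load
// (render completion) and POSTs the request ID plus a timestamp.
// HTML without a closing body tag passes through unchanged.
function injectBeacon(html: string, reqId: string): string {
  const script =
    `<script data-req-id="${reqId}">` +
    `window.addEventListener("load",()=>{` +
    `fetch("${BEACON_URL}",{method:"POST",body:JSON.stringify(` +
    `{reqId:"${reqId}",renderedAt:Date.now()})})});` +
    `</script>`;
  const idx = html.lastIndexOf("</body>");
  return idx === -1 ? html : html.slice(0, idx) + script + html.slice(idx);
}
```

In the middleware itself, `injectBeacon` runs only for requests that pass the verified-bot check; human traffic gets the origin response untouched.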

Architecture

              ┌──────────────────────────┐
  user ──────▶│                          │─────▶ origin (unmodified)
              │   Edge Middleware        │
  Googlebot ─▶│   - UA + IP + rDNS check │─────▶ origin
              │   - inject beacon JS     │
              │     + unique req-id      │
              │   - tag access log       │
              └──────────────────────────┘
                         │  crawler's headless browser
                         ▼  runs the injected JS after render
              ┌──────────────────────────┐
              │  beacon server           │
              │  - receives POST with    │
              │    req-id + timestamp    │
              └──────────────────────────┘
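The pairing step the diagram implies — joining beacon POSTs to access-log entries on the request ID and taking the delay — can be sketched as a plain join. Field names here are assumptions, not the study's schema:

```typescript
interface CrawlLogEntry { reqId: string; fetchedAt: number }   // ms epoch, from access log
interface BeaconPayload { reqId: string; renderedAt: number }  // ms epoch, from beacon POST

// Join beacon payloads to crawl-log entries on reqId and return
// render delays in seconds; beacons with no matching log entry
// are dropped.
function pairRenderDelays(
  log: CrawlLogEntry[],
  beacons: BeaconPayload[],
): number[] {
  const fetchedBy = new Map(log.map((e) => [e.reqId, e.fetchedAt]));
  const delays: number[] = [];
  for (const b of beacons) {
    const fetchedAt = fetchedBy.get(b.reqId);
    if (fetchedAt !== undefined) delays.push((b.renderedAt - fetchedAt) / 1000);
  }
  return delays;
}
```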

Why edge middleware is the right layer

  • Every request traverses it. No sampling gap.
  • It can rewrite the response body. Not every CDN layer exposes response-rewrite capability; the Workers / Edge Functions / Edge Middleware tier does.
  • It runs before origin. Bot detection / beacon injection doesn't require origin changes or cooperation.
  • It has access to verified-bot signals. Vercel / Cloudflare / Netlify edge layers know which IPs belong to which published crawler operators.
  • It has low incremental cost per request — V8-isolate cold-start is sub-millisecond on modern edge platforms.
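The verified-bot flow (UA match, then forward-confirmed reverse DNS) can be sketched as pure logic. The DNS lookups are passed in as functions so the check stays testable; `lookupPtr` / `lookupA` are illustrative stand-ins for `dns.promises.reverse` / `dns.promises.resolve4`:

```typescript
// Forward-confirmed reverse DNS: the PTR record of the client IP must
// end in an operator-owned domain, and that hostname must resolve back
// to the same IP. UA matching alone is trivially spoofable.
const GOOGLEBOT_DOMAINS = [".googlebot.com", ".google.com"];

function hasGooglebotUA(userAgent: string): boolean {
  return userAgent.includes("Googlebot");
}

async function isVerifiedGooglebot(
  ip: string,
  userAgent: string,
  lookupPtr: (ip: string) => Promise<string[]>,   // reverse DNS
  lookupA: (host: string) => Promise<string[]>,   // forward confirmation
): Promise<boolean> {
  if (!hasGooglebotUA(userAgent)) return false;
  for (const host of await lookupPtr(ip)) {
    if (!GOOGLEBOT_DOMAINS.some((d) => host.endsWith(d))) continue;
    const addrs = await lookupA(host);
    if (addrs.includes(ip)) return true;
  }
  return false;
}
```

A published IP-range list (Google ships one as JSON) can short-circuit this check and avoid the two DNS round-trips on the hot path.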

Canonical instance

The Vercel + MERJ study (published 2024-08-01) ran this pattern on nextjs.org during April 2024:

  • Edge middleware: Vercel's Edge Middleware (V8-isolate).
  • Bot detection: User-Agent + Google's published IP-range JSON + reverse-DNS verification.
  • Beacon library: MERJ's Web Rendering Monitor (systems/merj-web-rendering-monitor).
  • Beacon server: a long-running server capturing POSTs with request-ID + page URL + render-complete timestamp.
  • Dataset produced: 100,000+ Googlebot fetches, 37,000+ server-beacon-matched render-delay measurements.

Output: the canonical rendering-delay distribution (p50 = 10 s; p99 ≈ 18 h) and the 100%-rendering-success claim.

(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)
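The headline numbers are quantiles over the paired delay samples. A minimal nearest-rank percentile helper (an assumption — the study doesn't specify its quantile method):

```typescript
// Nearest-rank percentile over render-delay samples (seconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```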

Trade-offs

  • Injected JS must not trigger cloaking suspicions. The JS is a research beacon, not content; Google's cloaking policy doesn't forbid beacon injection (it's observability, not content manipulation). But the injected script should be minimal and unambiguously non-content.
  • Doesn't work for crawlers whose sandbox blocks outbound HTTP. OpenAI, Anthropic, and some AI-training crawlers may run stricter JS sandboxes; the measurement only recovers data from bots whose render environment permits a fetch() back to the beacon server.
  • Measures rendering success, not rendering quality. You find out when a page was rendered; you don't find out what DOM the bot built — that requires the crawler vendor's tools (e.g. Google's URL Inspection Tool).
  • Non-trivial ongoing cost. Beacon-server fleet, access-log storage, pair-join pipeline. Worth it for research; overkill for day-to-day SEO monitoring.
  • Skews toward declared, well-behaved crawlers. A stealth crawler won't show up as a matched crawler at all; an undeclared crawler with a browser-like UA lands in the "human" bucket.

Variants

  • Single-crawler measurement. The Vercel study focuses on Googlebot for a clean, large sample. Multi-crawler measurement requires per-crawler verification flows.
  • Measurement-as-product. A SaaS could offer "how Google actually renders your site" as a productised report. MERJ's WRM is open-source because the study is research-intent, but the pattern supports commercialisation.
  • Sampling strategies for very high traffic. At Googlebot traffic rates on a large site, you don't need 100 % injection — a deterministic sample keyed on request-ID-hash gives unbiased distribution estimates at lower cost.
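The deterministic-sampling variant hashes the request ID into [0, 1) and compares against the sampling rate; FNV-1a is an illustrative hash choice, not prescribed by the source:

```typescript
// FNV-1a 32-bit hash of the request ID. The same ID always lands in
// the same bucket, so the sample is deterministic and join-stable:
// a sampled request's beacon and log entry are sampled together.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Inject the beacon only for IDs whose hash falls below the rate.
function shouldInject(reqId: string, rate: number): boolean {
  return fnv1a(reqId) / 0x100000000 < rate;
}
```

Because the decision is a pure function of the request ID, the pair-join pipeline needs no extra bookkeeping about which requests were sampled.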

Seen in
