Edge-middleware bot-beacon injection¶
Pattern¶
To measure how search-engine crawlers actually render your pages (not what the crawler vendor says they do), use programmable edge middleware to:
- Intercept every request on the CDN edge, before it reaches origin.
- Identify crawler requests via User-Agent + source IP range + reverse-DNS verification (the verified-bot flow).
- Inject a lightweight JavaScript beacon library into the HTML response for matched crawler requests; the library fires on window.onload / render completion and POSTs the render-complete timestamp to your beacon server.
- Pass human traffic through unchanged, so performance for real users is not affected.
- Tag the HTML response with a unique request ID that also appears in the server access log, so beacon payloads can be paired with original crawl requests. See patterns/server-beacon-pairing-for-render-measurement.
Architecture¶
┌──────────────────────────┐
user ──────▶│ │─────▶ origin (unmodified)
│ Edge Middleware │
Googlebot ─▶│ - UA + IP + rDNS check │─────▶ origin
│ - inject beacon JS │
│ + unique req-id │
│ - tag access log │
└──────────────────────────┘
│
│ crawler's headless browser
▼ runs the injected JS after render
┌──────────────────────────┐
│ beacon server │
│ - receives POST with │
│ req-id + timestamp │
└──────────────────────────┘
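The injected library's job is deliberately small: wait for render completion, then POST the request ID and a timestamp to the beacon server. A minimal sketch follows; the `/collect` endpoint and payload field names are assumptions, not the actual API of MERJ's Web Rendering Monitor.

```javascript
// Minimal beacon sketch: build the payload, then POST it once the page has
// finished loading inside the crawler's headless browser. Endpoint and
// field names are illustrative only.
function buildBeaconPayload(reqId, renderedAtMs) {
  return JSON.stringify({
    reqId,                                // pairs with the access-log entry
    renderedAt: renderedAtMs,             // render-complete timestamp (ms)
    url: typeof location !== "undefined" ? location.href : null,
  });
}

function installBeacon(reqId, endpoint = "https://beacons.example.com/collect") {
  // Only meaningful inside a browser-like render environment.
  if (typeof window === "undefined") return;
  window.addEventListener("load", () => {
    const body = buildBeaconPayload(reqId, Date.now());
    // sendBeacon survives page teardown; fall back to fetch with keepalive.
    if (navigator.sendBeacon) navigator.sendBeacon(endpoint, body);
    else fetch(endpoint, { method: "POST", body, keepalive: true });
  });
}

const payload = buildBeaconPayload("req-123", 1722470400000);
```

Keeping the payload to an ID and a timestamp is what makes the beacon unambiguously observability rather than content.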
Why edge middleware is the right layer¶
- Every request traverses it. No sampling gap.
- It can rewrite the response body. Not every CDN layer has response-rewrite capability; the Workers / Edge Functions / Edge Middleware tier does.
- It runs before origin. Bot detection / beacon injection doesn't require origin changes or cooperation.
- It has access to verified-bot signals. Vercel / Cloudflare / Netlify edge layers know which IPs belong to which published crawler operators.
- It has low incremental cost per request — V8-isolate cold-start is sub-millisecond on modern edge platforms.
Canonical instance¶
The Vercel + MERJ study (published 2024-08-01) ran this pattern on nextjs.org during April 2024:
- Edge middleware: Vercel's Edge Middleware (V8-isolate).
- Bot detection: User-Agent + Google's published IP-range JSON + reverse-DNS verification.
- Beacon library: MERJ's Web Rendering Monitor (systems/merj-web-rendering-monitor).
- Beacon server: a long-running server capturing POSTs with request-ID + page URL + render-complete timestamp.
- Dataset produced: 100,000+ Googlebot fetches, 37,000+ server-beacon-matched render-delay measurements.
Output: the canonical rendering-delay distribution (p50 = 10 s; p99 ≈ 18 h) and the 100%-rendering-success claim.
(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)
Trade-offs¶
- Injected JS must not trigger cloaking suspicions. The JS is a research beacon, not content; Google's cloaking policy doesn't forbid beacon injection (it's observability, not content manipulation). But the injected script should be minimal and unambiguously non-content.
- Doesn't work for crawlers whose sandbox blocks outbound HTTP. OpenAI / Anthropic / some AI-training crawlers may have stricter JS sandboxes. Measurement only recovers data from bots whose render environment permits fetch() back.
- Measures rendering success, not rendering quality. You find out when a page was rendered; you don't find out what DOM the bot built. That requires the crawler vendor's tools (e.g. Google's URL Inspection Tool).
- Non-trivial ongoing cost. Beacon-server fleet, access-log storage, pair-join pipeline. Worth it for research; overkill for day-to-day SEO monitoring.
- Skews toward declared, well-behaved crawlers. A stealth crawler won't show up as a matched crawler at all; an undeclared crawler with a browser-like UA lands in the "human" bucket.
Variants¶
- Single-crawler measurement. The Vercel study focuses on Googlebot for a clean, large sample. Multi-crawler measurement requires per-crawler verification flows.
- Measurement-as-product. A SaaS could offer "how Google actually renders your site" as a productised report. MERJ's WRM is open-source because the study is research-intent, but the pattern supports commercialisation.
- Sampling strategies for very high traffic. At Googlebot traffic rates on a large site, you don't need 100% injection — a deterministic sample keyed on a request-ID hash gives unbiased distribution estimates at lower cost.
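The deterministic-sampling variant can be sketched with a stable hash of the request ID, so the same request always lands in or out of the sample. The FNV-1a hash and the sample rate below are illustrative choices, not from the study:

```javascript
// Deterministic sampling keyed on the request ID: hashing the ID and
// comparing against a threshold gives a stable, unbiased subset without
// any shared sampling state across edge isolates.
function fnv1a(str) {
  // 32-bit FNV-1a; any reasonably uniform hash would do here.
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function shouldInject(reqId, samplePercent = 1) {
  // Inject the beacon for roughly samplePercent% of request IDs.
  return fnv1a(reqId) % 100 < samplePercent;
}
```

Because the decision is a pure function of the request ID, the sampled subset is reproducible after the fact from access logs alone.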
Seen in¶
- sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process — canonical wiki instance. The 100,000-fetch / 37,000-pair dataset powering the first public empirical rendering-delay distribution for Googlebot on a Next.js site at scale.
Related¶
- patterns/server-beacon-pairing-for-render-measurement — the downstream pairing pattern that joins beacon POSTs with server access logs.
- systems/vercel-edge-functions — the edge-middleware substrate.
- systems/googlebot — the measured crawler.
- systems/merj-web-rendering-monitor — the beacon library.
- concepts/rendering-delay-distribution — the output.
- concepts/declared-crawler — the verification flow that identifies bots without trusting the bare User-Agent.