
Edge-middleware bot-beacon injection

Pattern

To measure how search-engine crawlers actually render your pages (not what the crawler vendor says they do), use programmable edge middleware to:

  1. Intercept every request on the CDN edge, before it reaches origin.
  2. Identify crawler requests via User-Agent + source IP range + reverse-DNS verification (the verified-bot flow).
  3. Inject a lightweight JavaScript beacon library into the HTML response for matched crawler requests — the library fires on window.onload / render completion and POSTs the render-complete timestamp to your beacon server.
  4. Pass human traffic through unchanged so performance for real users is not affected.
  5. Tag the HTML response with a unique request ID that also appears in the server access log, so beacon payloads can be paired with original crawl requests. See patterns/server-beacon-pairing-for-render-measurement.
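The inject-and-tag steps above can be sketched as a pure HTML transform, independent of any particular edge runtime. This is a minimal illustration, not the study's actual library: `BEACON_URL` and the inline script body are placeholders.

```typescript
// Hypothetical beacon endpoint; stands in for your beacon server.
const BEACON_URL = "https://beacon.example.com/collect";

// Generate a request ID that also goes into the access log, so beacon
// payloads can later be paired with the original crawl request.
// A real edge runtime would use crypto.randomUUID(); random hex keeps
// this sketch dependency-free.
function newRequestId(): string {
  return Math.random().toString(16).slice(2) + Date.now().toString(16);
}

// Insert a beacon <script> just before </body>. It fires on window load
// (render completion) and POSTs the request ID plus a timestamp.
// HTML without a closing body tag passes through unchanged.
function injectBeacon(html: string, reqId: string): string {
  const script =
    `<script data-req-id="${reqId}">` +
    `window.addEventListener("load",()=>{` +
    `fetch("${BEACON_URL}",{method:"POST",body:JSON.stringify(` +
    `{reqId:"${reqId}",renderedAt:Date.now()})})});` +
    `</script>`;
  const idx = html.lastIndexOf("</body>");
  return idx === -1 ? html : html.slice(0, idx) + script + html.slice(idx);
}
```

In the middleware itself, `injectBeacon` runs only for requests that pass the verified-bot check; human traffic gets the origin response untouched.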

Architecture

              ┌──────────────────────────┐
  user ──────▶│                          │─────▶ origin (unmodified)
              │   Edge Middleware        │
  Googlebot ─▶│   - UA + IP + rDNS check │─────▶ origin
              │   - inject beacon JS     │
              │     + unique req-id      │
              │   - tag access log       │
              └──────────────────────────┘
                         │  crawler's headless browser
                         ▼  runs the injected JS after render
              ┌──────────────────────────┐
              │  beacon server           │
              │  - receives POST with    │
              │    req-id + timestamp    │
              └──────────────────────────┘
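The pairing step the diagram implies — joining beacon POSTs to access-log entries on the request ID and taking the delay — can be sketched as a plain join. Field names here are assumptions, not the study's schema:

```typescript
interface CrawlLogEntry { reqId: string; fetchedAt: number }   // ms epoch, from access log
interface BeaconPayload { reqId: string; renderedAt: number }  // ms epoch, from beacon POST

// Join beacon payloads to crawl-log entries on reqId and return
// render delays in seconds; beacons with no matching log entry
// are dropped.
function pairRenderDelays(
  log: CrawlLogEntry[],
  beacons: BeaconPayload[],
): number[] {
  const fetchedBy = new Map(log.map((e) => [e.reqId, e.fetchedAt]));
  const delays: number[] = [];
  for (const b of beacons) {
    const fetchedAt = fetchedBy.get(b.reqId);
    if (fetchedAt !== undefined) delays.push((b.renderedAt - fetchedAt) / 1000);
  }
  return delays;
}
```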

Why edge middleware is the right layer

  • Every request traverses it. No sampling gap.
  • It can rewrite the response body. Not every CDN layer exposes response-rewrite capability; the Workers / Edge Functions / Edge Middleware tier does.
  • It runs before origin. Bot detection / beacon injection doesn't require origin changes or cooperation.
  • It has access to verified-bot signals. Vercel / Cloudflare / Netlify edge layers know which IPs belong to which published crawler operators.
  • It has low incremental cost per request — V8-isolate cold-start is sub-millisecond on modern edge platforms.
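The verified-bot flow (UA match, then forward-confirmed reverse DNS) can be sketched as pure logic. The DNS lookups are passed in as functions so the check stays testable; `lookupPtr` / `lookupA` are illustrative stand-ins for `dns.promises.reverse` / `dns.promises.resolve4`:

```typescript
// Forward-confirmed reverse DNS: the PTR record of the client IP must
// end in an operator-owned domain, and that hostname must resolve back
// to the same IP. UA matching alone is trivially spoofable.
const GOOGLEBOT_DOMAINS = [".googlebot.com", ".google.com"];

function hasGooglebotUA(userAgent: string): boolean {
  return userAgent.includes("Googlebot");
}

async function isVerifiedGooglebot(
  ip: string,
  userAgent: string,
  lookupPtr: (ip: string) => Promise<string[]>,   // reverse DNS
  lookupA: (host: string) => Promise<string[]>,   // forward confirmation
): Promise<boolean> {
  if (!hasGooglebotUA(userAgent)) return false;
  for (const host of await lookupPtr(ip)) {
    if (!GOOGLEBOT_DOMAINS.some((d) => host.endsWith(d))) continue;
    const addrs = await lookupA(host);
    if (addrs.includes(ip)) return true;
  }
  return false;
}
```

A published IP-range list (Google ships one as JSON) can short-circuit this check and avoid the two DNS round-trips on the hot path.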

Canonical instance

The Vercel + MERJ study (published 2024-08-01) ran this pattern on nextjs.org during April 2024:

  • Edge middleware: Vercel's Edge Middleware (V8-isolate).
  • Bot detection: User-Agent + Google's published IP-range JSON + reverse-DNS verification.
  • Beacon library: MERJ's Web Rendering Monitor (systems/merj-web-rendering-monitor).
  • Beacon server: a long-running server capturing POSTs with request-ID + page URL + render-complete timestamp.
  • Dataset produced: 100,000+ Googlebot fetches, 37,000+ server-beacon-matched render-delay measurements.

Output: the canonical rendering-delay distribution (p50 = 10 s; p99 ≈ 18 h) and the 100%-rendering-success claim.

(Source: sources/2024-08-01-vercel-how-google-handles-javascript-throughout-the-indexing-process.)
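The headline numbers are quantiles over the paired delay samples. A minimal nearest-rank percentile helper (an assumption — the study doesn't specify its quantile method):

```typescript
// Nearest-rank percentile over render-delay samples (seconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```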

Trade-offs

  • Injected JS must not trigger cloaking suspicions. The JS is a research beacon, not content; Google's cloaking policy doesn't forbid beacon injection (it's observability, not content manipulation). But the injected script should be minimal and unambiguously non-content.
  • Doesn't work for crawlers whose sandbox blocks outbound HTTP. OpenAI, Anthropic, and some AI-training crawlers may run stricter JS sandboxes; the measurement only recovers data from bots whose render environment permits a fetch() back to the beacon server.
  • Measures rendering success, not rendering quality. You find out when a page was rendered; you don't find out what DOM the bot built — that requires the crawler vendor's tools (e.g. Google's URL Inspection Tool).
  • Non-trivial ongoing cost. Beacon-server fleet, access-log storage, pair-join pipeline. Worth it for research; overkill for day-to-day SEO monitoring.
  • Skews toward declared, well-behaved crawlers. A stealth crawler won't show up as a matched crawler at all; an undeclared crawler with a browser-like UA lands in the "human" bucket.

Variants

  • Single-crawler measurement. The Vercel study focuses on Googlebot for a clean, large sample. Multi-crawler measurement requires per-crawler verification flows.
  • Measurement-as-product. A SaaS could offer "how Google actually renders your site" as a productised report. MERJ's WRM is open-source because the study is research-intent, but the pattern supports commercialisation.
  • Sampling strategies for very high traffic. At Googlebot traffic rates on a large site, you don't need 100 % injection — a deterministic sample keyed on request-ID-hash gives unbiased distribution estimates at lower cost.
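The deterministic-sampling variant hashes the request ID into [0, 1) and compares against the sampling rate; FNV-1a is an illustrative hash choice, not prescribed by the source:

```typescript
// FNV-1a 32-bit hash of the request ID. The same ID always lands in
// the same bucket, so the sample is deterministic and join-stable:
// a sampled request's beacon and log entry are sampled together.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Inject the beacon only for IDs whose hash falls below the rate.
function shouldInject(reqId: string, rate: number): boolean {
  return fnv1a(reqId) / 0x100000000 < rate;
}
```

Because the decision is a pure function of the request ID, the pair-join pipeline needs no extra bookkeeping about which requests were sampled.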

Seen in
