CONCEPT

Stealth crawler

Definition

A stealth crawler is an automated web client that deliberately misrepresents its identity to evade origin-side crawl controls — robots.txt, user-agent allow/deny rules, WAF blocks, bot-management scoring. Typical tactics, in rough order of increasing effort:

  1. User-agent spoofing — send a generic browser UA rather than the organization's declared crawler UA. (See concepts/user-agent-rotation.)
  2. IP sourcing outside the published range — use cloud / residential / scraping-proxy IPs rather than the operator's documented crawler range.
  3. ASN rotation — distribute traffic across multiple autonomous systems to defeat ASN-level blocks.
  4. robots.txt non-fetching or ignoring — deliberately skip the file, or fetch and disregard.
  5. Escalation on block — engage stealth tactics only in response to explicit origin-side enforcement, so the behavior is invisible to origins that don't block the declared crawler. (See patterns/stealth-on-block-fallback.)
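For contrast with tactic 4, the robots.txt check that a well-behaved crawler performs — and that a stealth crawler skips or disregards — is a few lines of stdlib Python. A minimal sketch; the robots.txt body and URLs are illustrative:

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True iff robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# An origin that blocks the declared crawler but allows browsers:
robots = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

print(may_fetch(robots, "PerplexityBot", "https://example.com/page"))  # False
print(may_fetch(robots, "Mozilla/5.0", "https://example.com/page"))    # True
```

The asymmetry in the output — declared UA denied, generic browser UA allowed — is exactly what user-agent spoofing (tactic 1) exploits.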

A stealth crawler is not the same as an undeclared crawler: the stealth label implies active evasion, while "undeclared" just means "not named in the operator's published crawler list" (could be benign, could be internal, could be new). Every stealth crawler is undeclared; not every undeclared crawler is stealth.
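The stealth/undeclared distinction reduces to a two-bit classification. A sketch with hypothetical field names (an origin rarely observes either bit this cleanly):

```python
from dataclasses import dataclass

@dataclass
class CrawlerObservation:
    """What an origin can infer about one crawler (hypothetical fields)."""
    in_published_list: bool   # named in the operator's published crawler list
    evades_controls: bool     # spoofed UA, off-range IPs, ignored robots.txt, ...

def classify(obs: CrawlerObservation) -> str:
    # "Every stealth crawler is undeclared; not every undeclared crawler is stealth."
    if obs.evades_controls:
        return "stealth"      # active evasion implies it won't self-identify
    if not obs.in_published_list:
        return "undeclared"   # could be benign, internal, or new
    return "declared"
```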

Canonical instance

Perplexity AI's third crawler (Cloudflare, 2025-08-04):

  • User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 — generic Chrome-on-macOS.
  • IPs outside Perplexity's published range.
  • Rotates across multiple ASNs.
  • Doesn't fetch / doesn't respect robots.txt.
  • 3–6M requests/day across tens of thousands of domains.
  • Activates when PerplexityBot and Perplexity-User (the declared crawlers) are blocked — the stealth-on-block pattern.
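The stealth-on-block pattern leaves a log signature an origin can look for: a 403 to the declared crawler, followed shortly by generic-browser traffic from outside the published IP range. A heuristic sketch over a hypothetical log schema (field layout, window size, and example values are all illustrative, not any real log format):

```python
from collections import defaultdict

def stealth_on_block_candidates(log, declared_uas, published_ips, window=3600):
    """Flag domains where a blocked declared crawler is shortly followed by
    browser-UA traffic from off-range IPs.

    `log` is an iterable of (timestamp, domain, user_agent, ip, status) tuples.
    """
    last_block = {}             # domain -> timestamp of last 403 to a declared UA
    flagged = defaultdict(int)  # domain -> count of suspicious follow-up requests
    for ts, domain, ua, ip, status in sorted(log):
        if ua in declared_uas and status == 403:
            last_block[domain] = ts
        elif (ua.startswith("Mozilla/5.0")
              and ip not in published_ips
              and domain in last_block
              and ts - last_block[domain] <= window):
            flagged[domain] += 1
    return dict(flagged)

log = [
    (0,   "a.com", "PerplexityBot", "1.2.3.4", 403),
    (60,  "a.com", "Mozilla/5.0 (Macintosh) Chrome/124.0.0.0", "9.9.9.9", 200),
    (120, "b.com", "Mozilla/5.0 (Windows NT 10.0)", "9.9.9.8", 200),
]
flags = stealth_on_block_candidates(log, {"PerplexityBot"}, {"1.2.3.4"})
# a.com is flagged; b.com saw browser traffic but no prior block of the
# declared crawler, so it stays clean.
```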

Cloudflare's response: ML plus network-signal fingerprinting to produce a bot-management signature that survives the rotation, then delisting Perplexity from Verified Bots and shipping block signatures into the managed AI-bots ruleset for all customers, including the free tier.

Why the cryptographic-identity answer matters

Stealth crawling is the failure mode that Web Bot Auth (Ed25519 keypair + JWK directory + per-request RFC 9421 signatures) is designed to eliminate for cooperating crawlers. ChatGPT Agent — the positive control in the 2025-08-04 post — signs via Web Bot Auth; if OpenAI's crawler tried to stealth-crawl, the absence of the signature would be the signal.
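The mechanics can be sketched at the signature-base level. RFC 9421 defines the canonical string a signer and verifier both derive: one line per covered component, then the `@signature-params` pseudo-component. A minimal builder; the covered components, `keyid`, and `tag` values below are illustrative, and the actual Web Bot Auth profile may cover different components:

```python
def signature_base(components: dict[str, str], params: str) -> str:
    """Build an RFC 9421 signature base: one line per covered component,
    ending with the @signature-params line carrying the parameters."""
    lines = [f'"{name}": {value}' for name, value in components.items()]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines)

# Illustrative parameters (keyid and created values are made up).
params = '("@authority");created=1722902400;keyid="test-key";tag="web-bot-auth"'
base = signature_base({"@authority": "example.com"}, params)
# A cooperating crawler signs `base` with its Ed25519 private key and sends
# the result in the Signature header; the origin re-derives the same base,
# fetches the public key from the crawler's JWK directory, and verifies.
# A stealth request carries no such signature -- that absence is the signal.
```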

Non-cooperating crawlers are the harder problem. The structural answer there is ML-based fingerprinting over features the stealth crawler can't cheaply spoof — TLS fingerprints, HTTP/2 frame ordering, request timing shape, traffic-graph signals — combined with gossip-propagation so a fingerprint learned in one POP defends all POPs.
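The fingerprint-plus-gossip idea above can be sketched in a few lines. The feature names are illustrative stand-ins (not real JA3/JA4 or HTTP/2 fields), and real bot-management scoring is far richer than a hash of a feature dict:

```python
import hashlib

def fingerprint(features: dict[str, str]) -> str:
    """Collapse hard-to-spoof transport features into a stable signature.
    IP and ASN are deliberately excluded, so rotation doesn't change it."""
    canonical = "|".join(f"{k}={features[k]}" for k in sorted(features))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

class POP:
    """One point of presence: a local set of blocked fingerprints."""
    def __init__(self) -> None:
        self.blocked: set[str] = set()

    def learn(self, fp: str) -> None:
        self.blocked.add(fp)

    def gossip(self, peer: "POP") -> None:
        # Symmetric merge: a fingerprint learned anywhere defends everywhere.
        merged = self.blocked | peer.blocked
        self.blocked = peer.blocked = merged

pop_sfo, pop_ams = POP(), POP()
fp = fingerprint({"tls": "x25519+chacha20", "h2_frames": "SETTINGS,WINDOW_UPDATE",
                  "timing": "burst"})
pop_sfo.learn(fp)       # one POP identifies the stealth crawler...
pop_sfo.gossip(pop_ams) # ...and propagation defends the other POP too
```

Because the fingerprint ignores IP and ASN, the crawler's rotation (tactics 2 and 3) buys nothing: every rotated request hashes to the same signature.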
