
CLOUDFLARE 2025-08-04 Tier 1


Cloudflare — Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives

Summary

Cloudflare reports on Perplexity AI fetching content from origins that have explicitly disallowed Perplexity's declared crawlers — both via robots.txt and via Cloudflare WAF rules naming PerplexityBot and Perplexity-User. When blocked, Perplexity falls back to an undeclared stealth crawler that impersonates a generic Chrome-on-macOS user-agent, rotates across IPs outside Perplexity's published ranges, and rotates across multiple ASNs. Cross-domain volume observed: 3–6 M stealth requests/day vs 20–25 M declared requests/day, across tens of thousands of domains. Cloudflare fingerprinted the stealth crawler via ML + network signals, de-listed Perplexity from its Verified Bots program, and shipped block signatures into its managed AI-crawler ruleset. The post contrasts Perplexity's behavior against ChatGPT, which fetched robots.txt, honored Disallow, and stopped on network block — canonical example of the "respect website preferences" posture Cloudflare is arguing for.

Key takeaways

  1. Controlled empirical test via brand-new unindexed domains. Cloudflare purchased multiple never-before-registered domains (similar to testexample.com / secretexample.com), published a robots.txt disallowing all automated access, and made no public announcement. They then asked Perplexity AI questions about content on these domains and received detailed answers — demonstrating retrieval had to have occurred through Perplexity's crawler infrastructure, not via search-engine indexing or public discovery. Canonical wiki instance of patterns/brand-new-domain-experiment.

  2. Two-tier crawler topology: declared + stealth. Perplexity has two distinct fetch paths:

  3. Declared: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user), 20-25 M daily requests, published IP range, robots.txt-aware.
  4. Stealth: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36, 3-6 M daily requests, IPs not in Perplexity's published range, rotating across multiple ASNs, does not fetch robots.txt or ignores it. (See concepts/declared-crawler / concepts/stealth-crawler / concepts/undeclared-crawler.)

  5. IP + ASN rotation triggered by block. "This undeclared crawler utilized multiple IPs not listed in Perplexity's official IP range, and would rotate through these IPs in response to the restrictive robots.txt policy and block from Cloudflare. In addition to rotating IPs, we observed requests coming from different ASNs in attempts to further evade website blocks." Canonical wiki instances of IP-rotation / concepts/asn-rotation — evasion escalates in response to enforcement, not a static alternative channel. (patterns/stealth-on-block-fallback.)

  6. ML + network-signal fingerprinting closes the loop on rotation. Cloudflare fingerprinted the stealth crawler "using a combination of machine learning and network signals", then shipped signature matches into the managed AI-crawler ruleset available to all customers including free tier. Rotation defeats naive-IP-based blocks; ML-learned request-shape features survive the rotation. Canonical instance of concepts/ml-bot-fingerprinting as the structural answer to concepts/user-agent-rotation + concepts/asn-rotation.

  7. Verified-bot delisting as enforcement. Cloudflare de-listed Perplexity from its Verified Bots directory. Delisting removes the bot from the trusted-identity corpus that downstream features (pay-per-crawl, Web Bot Auth) rely on, and flips the default posture from "known-good, allow by default" to "unknown, subject to bot-management scoring." Canonical wiki instance of patterns/verified-bot-delisting — the program's incentive lever is membership itself.

  8. Counter-example: ChatGPT. Cloudflare ran the same controlled experiment with ChatGPT. ChatGPT-User fetched robots.txt, honored the Disallow, and stopped crawling. When the Disallow was removed but a block page was returned, ChatGPT again stopped — "we saw no additional crawl attempts from other user agents." Additionally ChatGPT Agent signs requests using Web Bot Auth. Demonstrates robots.txt + block-page + signed-bot-request is a sufficient protocol when crawlers comply. (See systems/chatgpt-user / patterns/signed-bot-request.)

  9. Graceful-degradation-on-block is the non-stealth baseline. "When the stealth crawler was successfully blocked, we observed that Perplexity uses other data sources — including other websites — to try to create an answer. However, these answers were less specific and lacked details from the original content, reflecting the fact that the block had been successful." Proves the block works end-to-end: the user-facing answer is measurably degraded, not silently served from a different evasion path.

  10. Five operator norms named explicitly. Cloudflare lays out the crawler-etiquette framework it's arguing for: (a) be transparent (unique UA + declared IP ranges or Web Bot Auth + contact); (b) be well-behaved netizens (no flooding, no sensitive-data scraping, no stealth); (c) serve a clear purpose (publicly documented per bot); (d) separate bots for separate activities (so site owners can allow a search bot but deny a training bot); (e) follow the rules (robots.txt, rate limits, no bypassing security). This is the first wiki consolidation of Cloudflare's crawler-etiquette doctrine as a named five-point framework.
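The two-tier topology implies a simple server-side triage rule: a Perplexity UA from outside the published IP range is a spoof or misconfiguration, and the generic-Chrome UA the stealth crawler impersonates cannot be resolved by string matching at all. A minimal sketch of that triage, with a hypothetical documentation-range IP list and helper names (not Cloudflare's or Perplexity's actual implementation):

```python
import ipaddress

# Hypothetical published range for the declared crawler (RFC 5737
# documentation prefix, illustrative only; consult Perplexity's
# published IP list for real values).
DECLARED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

DECLARED_UA_TOKEN = "Perplexity-User"
GENERIC_CHROME_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                     "AppleWebKit/537.36 (KHTML, like Gecko) "
                     "Chrome/124.0.0.0 Safari/537.36")

def classify(user_agent: str, ip: str) -> str:
    """Triage a request into declared / stealth-suspect / unknown.

    Declared: Perplexity UA token AND source IP in the published range.
    Stealth-suspect: Perplexity UA from outside the range.
    The exact generic-Chrome UA is indistinguishable from a real
    browser by UA alone, so it is flagged for deeper fingerprinting.
    """
    addr = ipaddress.ip_address(ip)
    in_range = any(addr in net for net in DECLARED_RANGES)
    if DECLARED_UA_TOKEN in user_agent:
        return "declared" if in_range else "stealth-suspect"
    if user_agent == GENERIC_CHROME_UA:
        # UA string alone cannot separate the stealth crawler from a
        # real Chrome user -- exactly why Cloudflare needed ML +
        # network signals rather than string matching.
        return "needs-fingerprinting"
    return "unknown"

print(classify("Mozilla/5.0 ... Perplexity-User/1.0 ...", "192.0.2.10"))    # declared
print(classify("Mozilla/5.0 ... Perplexity-User/1.0 ...", "198.51.100.7"))  # stealth-suspect
```

The "needs-fingerprinting" branch is the whole story of the post: IP and UA checks bound the declared path, but the stealth path only falls out of request-shape analysis.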

Operational numbers

  • Declared crawler volume: 20-25 M requests/day.
  • Stealth crawler volume: 3-6 M requests/day.
  • Scope: tens of thousands of domains, millions of requests per day.
  • Bot-management score: all stealth-UA traffic scored as bot, unable to pass managed challenges.
  • Content Independence Day (2025-07-01) adoption: "over two and a half million websites" disallow AI training via Cloudflare's managed robots.txt or AI-crawler block rule by 2025-08-04 (about a month later).
  • RFC: RFC 9309 (Robots Exclusion Protocol).
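The post does not publish the exact robots.txt used on the test domains, but the disallow-all policy it describes corresponds to the minimal RFC 9309 form, optionally with per-bot groups naming Perplexity's declared crawlers:

```
User-agent: *
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```

Under RFC 9309 the wildcard group already covers both declared crawlers; the named groups make the operator's intent explicit and match the WAF rules Cloudflare describes.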

Methodology notes

  • Controlled experiment design — the load-bearing move. Any claim of "this crawler ignored robots.txt" on already-indexed domains is confounded by prior indexing, third-party embeds, cached links, user shares. Brand-new-never-published domains remove every confounder — the only way content reaches the crawler is through the crawler's own fetch path.
  • Attribution via ML + network signals, not just UA parsing — UA string alone is meaningless because the stealth crawler is deliberately UA-spoofing a generic Chrome. Cloudflare cites "machine learning and network signals" as the fingerprint basis; specifics not disclosed (likely deliberate — publishing features makes evasion easier).
  • Anti-publishing posture explicitly acknowledged: "Once this post is live the behavior we saw will almost certainly change, and the methods we use to stop them will keep evolving as well." Post is a point-in-time snapshot, not a detection contract.
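Cloudflare discloses no features, but the general shape of rotation-resistant fingerprinting can be sketched: derive a stable hash from request properties a crawler fleet shares and does not rotate (header set, ordering, casing; in real systems also TLS ClientHello attributes and timing), so traffic clusters by fingerprint even as IPs and ASNs churn. A toy illustration with a hypothetical feature set, not Cloudflare's:

```python
import hashlib

def request_fingerprint(headers: list[tuple[str, str]]) -> str:
    """Hash the *shape* of a request, not its source address.

    Uses header names in arrival order, including casing -- properties
    a fleet of identical crawler workers shares even while rotating
    IPs and ASNs.  Values are deliberately ignored.
    """
    shape = "|".join(name for name, _value in headers)
    return hashlib.sha256(shape.encode()).hexdigest()[:16]

# Two requests from different IPs/ASNs but the same automation stack:
req_a = [("Host", "example.com"), ("user-agent", "Mozilla/5.0 ..."),
         ("accept", "*/*"), ("accept-encoding", "gzip")]
req_b = [("Host", "example.net"), ("user-agent", "Mozilla/5.0 ..."),
         ("accept", "*/*"), ("accept-encoding", "gzip")]

# A real browser sends a different header set, order, and casing:
browser = [("Host", "example.com"), ("Accept", "text/html,..."),
           ("Accept-Encoding", "gzip, deflate, br"),
           ("User-Agent", "Mozilla/5.0 ...")]

assert request_fingerprint(req_a) == request_fingerprint(req_b)
assert request_fingerprint(req_a) != request_fingerprint(browser)
```

This is why rotation defeats naive IP blocks but not the managed ruleset: the signature travels with the software, not the address.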

Architectural framing

Sibling to three prior Cloudflare posts on the wiki that together form a coherent 2025 framework:

  1. Pay Per Crawl (2025-07-01) — establishes Web Bot Auth as the cryptographic identity substrate for wanted crawlers. Requires cooperative crawlers. Pay-per-crawl cannot exist at scale unless stealth crawling is also blocked — otherwise crawlers route around billing.
  2. This post (2025-08-04) — shows the enforcement half against unwanted stealth crawling, via ML fingerprinting + verified-bot delisting + managed AI-crawler ruleset.
  3. Moving past bots vs. humans (2026-04-21) — structural-framing post reframing the whole problem around the rate-limit trilemma; positions Web Bot Auth as the identity branch and Privacy Pass / ARC / ACT as the anonymous branch.

The 2025-08-04 post is the enforcement precedent that makes the 2026-04-21 framing post's "cooperative crawlers self-identify, everyone else gets fingerprinted" architecture concrete.
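Web Bot Auth builds on HTTP Message Signatures (RFC 9421): the crawler signs selected request components with a key whose verification half is publicly discoverable, so identity survives IP changes and cannot be spoofed by UA strings. A much-simplified sketch of the sign/verify flow, using a symmetric HMAC stand-in where the real protocol uses asymmetric keys (Ed25519 in practice); helper names are hypothetical:

```python
import hmac
import hashlib

# Stand-in shared secret.  In Web Bot Auth this is an asymmetric
# keypair: the crawler holds the signing key, sites fetch the public
# verification key from a well-known directory.
KEY = b"demo-key"

def sign_request(method: str, authority: str, path: str) -> str:
    """Sign the covered components of a request (simplified signature base)."""
    base = f'"@method": {method}\n"@authority": {authority}\n"@path": {path}'
    return hmac.new(KEY, base.encode(), hashlib.sha256).hexdigest()

def verify_request(method: str, authority: str, path: str, sig: str) -> bool:
    """Recompute the signature over the received components and compare."""
    expected = sign_request(method, authority, path)
    return hmac.compare_digest(expected, sig)

sig = sign_request("GET", "example.com", "/robots.txt")
assert verify_request("GET", "example.com", "/robots.txt", sig)
assert not verify_request("GET", "example.com", "/secret", sig)
```

Binding the signature to the request components means a stealth crawler cannot replay a declared crawler's credentials against a different resource, which is what makes pay-per-crawl billable per fetch.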

Caveats

  • Point-in-time snapshot. Post explicitly notes behavior will change once the writeup publishes; fingerprints documented are not a stable detection contract.
  • ML fingerprint details not disclosed. "Machine learning and network signals" is a framework statement; no feature list, no false-positive rate, no confidence distribution. By design — disclosure aids evasion.
  • One-sided attribution. Perplexity's side of the story is not in the post; the framing is Cloudflare's investigation + customer complaints.
  • Volume per bot not disaggregated across ASNs. "Multiple ASNs" is stated qualitatively; per-ASN breakdown not published.
  • ChatGPT is the only positive control. Post doesn't survey the broader AI-crawler landscape (Anthropic Claude, Google Gemini, Meta, open-weight deployments) for comparison.
  • Stealth crawler does not fetch — or fails to fetch — robots.txt. The post notes both behaviors. Whether the stealth crawler deliberately skips robots.txt or simply inherits a broken fetch path is not resolved.
  • Cloudflare is the merchant-of-record for pay-per-crawl and the authority over Verified Bots. Delisting is a commercial / policy decision inside Cloudflare's own program, not an industry-wide determination.
