PATTERN Cited by 1 source
Stealth crawler detection fingerprint¶
Shape¶
When a crawler operator evades identity-layer enforcement (UA spoofing, IP rotation, ASN rotation), build an ML classifier over content-independent request features that produces a stable detection signature the attacker cannot cheaply rotate around. Ship the signature as a customer-deployable block rule.
Structural steps¶
- Collect labels. Use controlled-experiment traces (patterns/brand-new-domain-experiment), cooperative-crawler declarations, honeypot domains, and retrospective analysis of customer complaints to produce positive and negative examples of the stealth operator's traffic.
- Feature engineering over network-level signals. TLS fingerprints (JA3 / JA4), HTTP/2 frame ordering, request timing distributions, IP reputation, ASN shape across a session, cross-domain request-graph patterns. Avoid features the attacker controls cheaply (UA, referer, cookie values).
- Train a classifier. Typical choice: gradient-boosted trees on request-level features producing a bot score; the inference is fast enough to run in the edge request path.
- Validate retention. Confirm that detection still holds under known evasion tactics, and retrain when the attacker rotates features.
- Ship as a managed rule (Cloudflare-managed rule) so customers get the protection without building their own ML stack.
- Propagate fingerprints via patterns/gossip-fingerprint-propagation so a newly learned signature defends the whole POP fleet, not just the POP that observed the attack.
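The feature-engineering step can be illustrated with a minimal sketch. It assumes requests have already been grouped into sessions. The `Request` fields and the `session_features` helper are hypothetical names chosen for illustration, not the feature set of any real deployment (the canonical source does not publish one). The point it shows is that the User-Agent is carried on the record but never read when building features.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Request:
    timestamp: float   # seconds; any monotonic clock works for gap computation
    ja4: str           # TLS fingerprint string (hypothetical field name)
    asn: int           # origin ASN of the client IP
    user_agent: str    # attacker-controlled, deliberately ignored below


def session_features(requests):
    """Build a content-independent feature dict for one session.

    Only timing, TLS, and network-origin signals are used; the
    User-Agent is never consulted because it is cheap to spoof.
    """
    ordered = sorted(requests, key=lambda r: r.timestamp)
    # Inter-arrival gaps between consecutive requests in the session.
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    return {
        "request_count": len(ordered),
        "mean_gap_s": mean(gaps) if gaps else 0.0,
        "gap_stdev_s": pstdev(gaps) if gaps else 0.0,
        "distinct_asns": len({r.asn for r in ordered}),
        "distinct_ja4": len({r.ja4 for r in ordered}),
    }
```

A session whose ASN count climbs while its TLS fingerprint and request cadence stay constant is the kind of shape such features are meant to expose; the resulting dict would then be fed to whatever classifier the training step selects.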
Complements¶
- patterns/verified-bot-delisting — the policy-layer enforcement. Delisting flips the default posture from "allow known bot" to "run bot-management scoring"; the ML fingerprint is what makes scoring produce a non-zero signal.
- patterns/brand-new-domain-experiment — the labeling methodology that produces high-confidence positives.
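How delisting and the ML fingerprint interact can be sketched as a small decision function. Everything here is an illustrative assumption: the `decide` name, the `bot_probability` scale (0.0 to 1.0, higher meaning more bot-like), and the threshold are hypothetical and do not describe any vendor's actual rule engine.

```python
def decide(is_verified_bot, is_delisted, bot_probability, block_at=0.8):
    """Return 'allow' or 'block' for one request.

    bot_probability is a hypothetical classifier output in [0.0, 1.0],
    where higher values mean the request looks more automated.
    """
    if is_verified_bot and not is_delisted:
        # Default posture for a known, still-listed bot: skip scoring.
        return "allow"
    # Delisted or unknown clients fall through to score-based enforcement.
    return "block" if bot_probability >= block_at else "allow"
```

Without a fingerprint that pushes the score up, the second branch would still return "allow" for the delisted operator, which is why the two patterns are deployed together.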
Canonical instance¶
Cloudflare's Perplexity stealth-crawler signature (2025-08-04). The post discloses:
- "We were able to fingerprint this crawler using a combination of machine learning and network signals."
- All stealth-UA traffic was scored as bot and failed managed challenges.
- Block signatures added to the managed AI-bots ruleset, available to all customers including free tier.
- Survives Perplexity's IP + ASN rotation.
The post does not publish the feature list — deliberate, because publication accelerates evasion iteration.
The adversarial feedback loop¶
The pattern is point-in-time, not terminal:
"Once this post is live the behavior we saw will almost certainly change, and the methods we use to stop them will keep evolving as well."
Each fingerprint buys a window; the operator iterates; the defender iterates. The steady-state posture is continuous retraining + propagation, not a one-shot detection.
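One way to operationalize the retraining trigger is to track how often the current signature still catches traffic known to come from the stealth operator (for example, hits on honeypot domains). The sketch below is a minimal illustration under that assumption. `RetentionMonitor`, the window size, and the recall floor are hypothetical choices, not values from the source.

```python
from collections import deque


class RetentionMonitor:
    """Sliding-window detection rate on labeled stealth-crawler traffic."""

    def __init__(self, window=1000, min_recall=0.9):
        # Each entry is True if the signature flagged a known-positive request.
        self.outcomes = deque(maxlen=window)
        self.min_recall = min_recall

    def record(self, detected):
        self.outcomes.append(bool(detected))

    def recall(self):
        """Fraction of recent known positives that were detected, or None."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retrain(self):
        """True once recall over the window drops below the floor."""
        r = self.recall()
        return r is not None and r < self.min_recall
```

A falling recall on labeled positives is the signal that the operator has rotated a feature the classifier depended on, at which point the retrain and propagate cycle starts again.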
Seen in¶
- sources/2025-08-04-cloudflare-perplexity-stealth-undeclared-crawlers — canonical wiki instance.