CONCEPT Cited by 2 sources

ML bot fingerprinting¶

Definition¶

ML bot fingerprinting is the use of machine-learning classifiers over content-independent request features — TLS handshake fingerprints (JA3 / JA4), HTTP/2 frame ordering, request timing patterns, IP reputation, ASN distribution shape, cross-domain request graph — to classify an incoming request as human, verified bot, or unverified bot, producing a bot score that downstream policy engines act on.

The defining property: features come from how the client connects, not what the client claims. A request header like User-Agent: Googlebot/2.1 is self-reported and easily forged; a JA4 fingerprint of the TLS ClientHello is a property of the TLS stack the client actually uses, which makes it harder (though not impossible) to spoof.
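To make the contrast concrete, here is a minimal sketch of a JA3-style digest. It assumes the ClientHello has already been parsed into integer lists; the specific values in `hello` are hypothetical and chosen only for illustration. It follows the published JA3 recipe (decimal fields joined by `-`, groups joined by `,`, GREASE values dropped, MD5 over the result). JA4 uses a different format that is not shown here.

```python
import hashlib


def is_grease(value: int) -> bool:
    """True for RFC 8701 GREASE placeholders (0x0a0a, 0x1a1a, ..., 0xfafa)."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_digest(version, ciphers, extensions, curves, point_formats):
    """MD5 over the comma-joined, dash-separated decimal ClientHello fields."""
    fields = [
        [version],
        [c for c in ciphers if not is_grease(c)],
        [e for e in extensions if not is_grease(e)],
        [g for g in curves if not is_grease(g)],
        list(point_formats),
    ]
    ja3_string = ",".join("-".join(str(v) for v in field) for field in fields)
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, chosen only for illustration.
hello = dict(
    version=771,
    ciphers=[0x1A1A, 4865, 4866, 4867],
    extensions=[0x2A2A, 0, 23, 65281, 10, 11],
    curves=[0x3A3A, 29, 23, 24],
    point_formats=[0],
)

# Two requests that differ only in their claimed User-Agent share one digest,
# because the header is not an input to the fingerprint.
googlebot_claim = {"User-Agent": "Googlebot/2.1", "ja3": ja3_digest(**hello)}
browser_claim = {"User-Agent": "Mozilla/5.0", "ja3": ja3_digest(**hello)}
```

Changing the header costs the client nothing, while changing the digest requires changing the cipher, extension, or curve lists the TLS library actually sends.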

Why ML, not static rules¶

Feature space is high-dimensional — dozens of TLS, HTTP/2, and timing features, and their interactions matter.
Ground truth is noisy — labels come from a mix of cooperative-crawler declarations, honeypots, customer reports, and retrospective analysis.
Adversaries iterate — stealth operators rotate IPs, change TLS libraries, tweak timing; a static rule set degrades weekly.
False-positive cost is asymmetric — blocking a real user is worse than letting a bot through, so calibrated probabilistic scoring beats binary rules.
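The asymmetric-cost point can be worked through with a short sketch. It assumes the classifier outputs a calibrated probability `p_bot`, and that the operator can put relative costs on the two error types; the function names and the 9:1 cost ratio are illustrative assumptions, not values from any vendor.

```python
def block_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Bot probability above which blocking has the lower expected cost.

    Expected cost of blocking: (1 - p) * cost_false_positive
    Expected cost of allowing: p * cost_false_negative
    Blocking wins when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def decide(p_bot: float, cost_fp: float, cost_fn: float) -> str:
    """Pick the action with the lower expected cost for a calibrated score."""
    return "block" if p_bot > block_threshold(cost_fp, cost_fn) else "allow"


# Assumption for illustration: blocking a human is 9x as costly as missing a bot.
threshold = block_threshold(cost_false_positive=9.0, cost_false_negative=1.0)
borderline = decide(0.80, 9.0, 1.0)   # below the threshold
confident = decide(0.95, 9.0, 1.0)    # above the threshold
```

A binary rule has no equivalent knob: it cannot express that an 80%-likely bot should pass while a 95%-likely bot should not.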

Canonical instance¶

Cloudflare's stealth-crawler detection (2025-08-04): "We were able to fingerprint this crawler using a combination of machine learning and network signals." The post doesn't publish the feature list — deliberate; disclosure makes evasion easier. What it does disclose:

All of Perplexity's stealth traffic was scored as bot and failed managed challenges.
The resulting signature shipped as a block rule in the managed AI-bots ruleset, available to all Cloudflare customers including free tier.
The signature survives Perplexity's IP rotation + ASN rotation + UA spoofing — features the attacker cheaply controls don't appear in the fingerprint.
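The rotation-survival property can be illustrated with a toy signature that simply ignores the fields an operator can change cheaply. Cloudflare does not publish its feature list, so every field name and value below (`tls_fingerprint`, `h2_settings_order`, the blocked entry) is a hypothetical placeholder, not the real signature.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Request:
    ip: str
    asn: int
    user_agent: str
    tls_fingerprint: str        # hypothetical stand-in for a TLS-stack feature
    h2_settings_order: tuple    # hypothetical stand-in for an HTTP/2 feature


def signature_key(req: Request) -> tuple:
    """Build the match key from connection-level features only."""
    return (req.tls_fingerprint, req.h2_settings_order)


# Hypothetical blocked signature, for illustration only.
BLOCKED = {("example-tls-fp", ("HEADER_TABLE_SIZE", "INITIAL_WINDOW_SIZE"))}


def matches(req: Request) -> bool:
    return signature_key(req) in BLOCKED


first = Request("198.51.100.7", 64500, "Mozilla/5.0 (Macintosh)",
                "example-tls-fp", ("HEADER_TABLE_SIZE", "INITIAL_WINDOW_SIZE"))
# The operator rotates IP, ASN, and User-Agent but keeps the same client stack.
rotated = replace(first, ip="203.0.113.42", asn=64501, user_agent="Mozilla/5.0 (X11)")
```

Both `first` and `rotated` match, because none of the rotated fields feed into `signature_key`.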

The fingerprinting ambiguity¶

ML bot fingerprinting inherits the mitigation ↔ tracking duality Cloudflare flags in the 2026-04-21 bots-vs-humans post: the same signals that distinguish a legitimate browser from a stealth crawler also re-identify users across sites. See concepts/fingerprinting-vector for the failure mode. In the bot-mitigation context the trade-off is accepted because the target is clearly abusive; in privacy-sensitive contexts (first-party analytics, ad targeting) it's more contested.

Cross-product integration¶

Gossip propagation — fingerprints learned in one POP are multicast within that POP and then globally, so an attacker who rotates across geographies gains seconds, not days.
Bot-management score feeds pay-per-crawl (systems/pay-per-crawl) — a crawler scored as bot-but-unverified doesn't reach the billing layer.
Managed rules (systems/cloudflare-managed-ruleset) — ML-derived signatures shipped as customer-deployable rules.
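How these integrations might compose at request time can be sketched as a small routing function. The verdict classes come from the definition above; the action names, the precedence of the managed rule, and the choice to challenge unverified bots are assumptions made for illustration, not documented Cloudflare behaviour.

```python
from enum import Enum


class Verdict(Enum):
    HUMAN = "human"
    VERIFIED_BOT = "verified_bot"
    UNVERIFIED_BOT = "unverified_bot"


def route(verdict: Verdict, matches_managed_rule: bool) -> str:
    """Map a classification to a downstream action (hypothetical policy)."""
    if matches_managed_rule:
        return "block"        # an ML-derived signature shipped as a rule
    if verdict is Verdict.UNVERIFIED_BOT:
        return "challenge"    # never reaches the billing layer
    if verdict is Verdict.VERIFIED_BOT:
        return "billing"      # eligible for pay-per-crawl handling
    return "serve"
```

The point of the sketch is the ordering: the score gates access to the billing layer, so an unverified crawler cannot reach it by any path.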

Seen in¶

sources/2025-08-04-cloudflare-perplexity-stealth-undeclared-crawlers — canonical wiki instance; ML fingerprinting closes the loop on IP + ASN + UA rotation by attacking features the stealth operator cannot cheaply change.
sources/2026-04-21-vercel-botid-deep-analysis-catches-a-sophisticated-bot-network-in-real-time — second independent wiki instance, at the browser-telemetry layer rather than TLS / HTTP network layer. Vercel BotID / Kasada apply the same ML-over-content-independent-features approach to browser fingerprints + behavioural patterns, with the post disclosing cross-session correlation (concepts/proxy-node-correlation-signal) as the trigger for adaptive reclassification. Reinforces the deliberate-opacity norm (neither Cloudflare nor Vercel / Kasada publish feature lists).

concepts/fingerprinting-vector — the mitigation-vs-tracking duality.
concepts/stealth-crawler / concepts/verified-bots.
patterns/stealth-crawler-detection-fingerprint / patterns/gossip-fingerprint-propagation.
systems/cloudflare-bot-management / systems/cloudflare-managed-ruleset.