PATTERN

Brand-new domain experiment

Shape

A controlled empirical methodology for testing whether a specific crawler or bot operator respects origin-side directives:

  1. Purchase brand-new domains that have never been registered before (e.g., testexample.com, secretexample.com — the exact names are arbitrary).
  2. Host content on them but make no public announcement — no links, no social mentions, not indexed by any search engine, not in the Common Crawl corpus.
  3. Configure the target directive — typically a robots.txt with a blanket Disallow: / for User-agent: *, plus any additional WAF / bot-management rules being tested.
  4. Query the suspected crawler's product surface about content on the domain — for an AI answer engine, ask it a question whose answer is only on the test domain.
  5. If the product returns the answer, the crawler fetched the content in violation of the directive. By construction, there is no other path for the content to have reached the product — no public discovery, no training-set contamination, no third-party indexing.
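The step-3 directive and the compliance check it implies can be sketched with Python's stdlib robots.txt parser. This is a minimal sketch: "testexample.com" stands in for a real test domain and "AnyBot/1.0" is a placeholder user agent, not values from the actual experiment.

```python
# Blanket directive from step 3, checked with the stdlib robots.txt parser.
# "testexample.com" and "AnyBot/1.0" are placeholders.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Under this policy a compliant crawler must not fetch any path.
for path in ("/", "/secret-page.html", "/anything/else"):
    assert not parser.can_fetch("AnyBot/1.0", f"https://testexample.com{path}")
```

Any crawler that still retrieves content under this policy has, by construction, ignored the directive.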

The load-bearing property is zero confounders. On a pre-existing site, a crawler that fetches content you've explicitly disallowed can plausibly claim the content came from a prior crawl, third-party embed, cached link, user share, or training-set redistribution. A brand-new unindexed domain eliminates all of those vectors — the only way the crawler can know about the content is to have fetched it directly in violation of the directive.
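The zero-confounder reasoning can be made mechanically checkable by seeding the hosted page with an unguessable nonce. Everything below — the function names, the page markup — is an illustrative sketch, not part of the original investigation:

```python
# Illustrative sketch: embed a per-experiment nonce in the hosted page so
# any appearance of the nonce in a product's answer attributes the fetch
# unambiguously. All names here are hypothetical.
import secrets

def make_probe_page(nonce: str) -> str:
    """Generate page content whose answer exists nowhere else on the web."""
    return (
        "<html><body>"
        f"<p>The secret launch code is {nonce}.</p>"
        "</body></html>"
    )

def answer_implicates_fetch(product_answer: str, nonce: str) -> bool:
    """True only if the product could have learned the nonce from the page."""
    return nonce in product_answer

nonce = secrets.token_hex(16)   # unguessable; cannot be in any prior corpus
page = make_probe_page(nonce)
assert answer_implicates_fetch(page, nonce)
assert not answer_implicates_fetch("I don't know.", nonce)
```

Because the nonce is generated at experiment time, no training set, cache, or third-party index can contain it — echoing it back proves a direct fetch.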

Canonical instance

Cloudflare's 2025-08-04 Perplexity investigation. The team purchased multiple never-before-registered domains ("similar to testexample.com and secretexample.com"), implemented robots.txt disallowing all automated access, and made no public announcement. They then queried Perplexity AI about content on these domains:

"This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers."

Perplexity's ability to answer the questions in detail was unambiguous evidence of a direct fetch in violation of robots.txt. Combined with network-level analysis of customer traffic, this let Cloudflare disambiguate the declared crawlers' robots.txt-respecting behavior from the undeclared stealth crawler's evasion.

The same experimental apparatus run against ChatGPT produced the positive control: ChatGPT-User fetched robots.txt, honored the Disallow, and stopped — with no alternate-UA follow-up attempts visible on the test domains.
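The positive-control behavior — a fetcher that reads robots.txt, honors the Disallow, and stops — can be sketched as follows. Network I/O is stubbed out and the function name is hypothetical; this is not the actual ChatGPT-User implementation:

```python
# Sketch of a compliant fetcher: consult robots.txt first, honor a
# Disallow, and stop. The robots.txt body is passed in directly rather
# than fetched over the network.
from urllib.robotparser import RobotFileParser

def compliant_fetch(robots_body: str, user_agent: str, url: str):
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    if not parser.can_fetch(user_agent, url):
        return None      # honor the Disallow and stop; no alternate-UA retry
    return "fetched"     # placeholder for the real HTTP request

blanket = "User-agent: *\nDisallow: /\n"
assert compliant_fetch(blanket, "ChatGPT-User", "https://testexample.com/") is None
```

On the test domains, this is exactly the observable trace a compliant operator leaves: one robots.txt fetch, then silence.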

When the pattern applies

Use brand-new domain experiments when:

  • You need to attribute content retrieval to a specific crawler operator and rule out other channels.
  • Plausible deniability ("maybe our training set already had it") is the most likely defense against a violation claim.
  • A controlled comparison between operators is valuable — run the same test against multiple crawlers and compare compliance behavior.
  • You can afford multiple domain purchases + operational overhead — the experiment cost scales linearly with the number of operators being tested.

Non-instances

  • Compliance audits on existing sites — can't use this pattern because pre-existing content is already at risk of indirect discovery.
  • Broad population surveys — the pattern is point-measurement per operator, not fleet-wide.
