robots.txt compliance

Definition

robots.txt compliance is the practice, on the crawler operator's side, of fetching /robots.txt from every origin before crawling it and honoring the Disallow, Allow, and Crawl-delay directives declared there.

RFC 9309 (Robots Exclusion Protocol, 2022) formalizes the file format and the Allow/Disallow semantics; Crawl-delay is a de facto extension honored by many crawlers but not part of the RFC. The protocol is advisory (nothing in the HTTP stack enforces it), so compliance is a matter of operator posture, not protocol mechanism.
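As a concrete sketch of what "honoring the directives" means, Python's standard-library urllib.robotparser implements these checks. The robots.txt body and user agent below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a compliant crawler fetches this from
# https://example.com/robots.txt before requesting anything else on the origin.
# (Allow is listed before Disallow because Python's parser uses first-match
# rather than RFC 9309's longest-match; in this order both rules agree.)
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-note.html
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("AnyBot", "https://example.com/private/secret.html"))       # False
print(parser.can_fetch("AnyBot", "https://example.com/private/public-note.html"))  # True
print(parser.crawl_delay("AnyBot"))                                                # 10
```

A compliant operator runs exactly this check (plus the delay) before every request; a non-compliant one skips it, or fetches the file and ignores the answer.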

The 2025-08-04 Cloudflare post operationalizes the concept with a controlled test: it set up brand-new, unindexed domains that published a blanket Disallow, then observed which AI crawlers respected it. Two outcomes on record:

  • ChatGPT (ChatGPT-User) — fetched robots.txt, honored the Disallow, stopped crawling, no follow-up from alternate UAs. Full compliance. See systems/chatgpt-user.
  • Perplexity — the declared crawlers (systems/perplexitybot / systems/perplexity-user) appeared to comply, but once blocked, Perplexity fell back to an undeclared stealth crawler that either never fetched robots.txt or ignored it. Non-compliance at the operator level, even though each declared crawler complied on its own.
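For reference, the blanket Disallow published on such a test domain is a two-line file that withdraws the entire origin from all crawlers:

```
User-agent: *
Disallow: /
```

A compliant crawler that fetches this must stop at the root and request nothing else, under any of the operator's user agents.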

The five Cloudflare operator norms

The post names five cooperative-crawler norms; robots.txt compliance is norm #5 ("Follow the rules"):

  1. Be transparent.
  2. Be well-behaved netizens.
  3. Serve a clear purpose.
  4. Separate bots for separate activities.
  5. Follow the rules — check and respect robots.txt, stay within rate limits, never bypass security protections.

See the source page for the full framework.
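Norm #5 can be condensed into a small decision function. This is a hedged sketch, not Cloudflare's implementation; the user agent, function name, and default delay are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot/1.0"  # hypothetical crawler user agent

def crawl_decision(robots_txt, url, ua=USER_AGENT):
    """Return (allowed, delay_seconds) for url under the given robots.txt text.

    Norm #5 in miniature: check robots.txt and honor its crawl delay before
    issuing any request. A real crawler would also cache the parsed file per
    origin, enforce a global rate limit, and never retry under another UA.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(ua, url)
    delay = parser.crawl_delay(ua) or 1   # assumed 1s default gap if none declared
    return allowed, delay

allowed, delay = crawl_decision(
    "User-agent: *\nDisallow: /\n",
    "https://example.com/anything",
)
# A blanket Disallow means: do not fetch, full stop.
```

The point of the sketch is that the decision happens before the request, not after, and that a Disallow answer terminates the attempt rather than triggering a fallback.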

Enforcement, not compliance

Because the protocol is advisory, non-compliant crawlers can only be stopped by network-layer enforcement: blocking by user agent, IP range, or behavioral fingerprint at the edge, rather than relying on the crawler to honor robots.txt.
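A minimal illustration of that enforcement layer, assuming a hypothetical blocklist of user-agent substrings. Real deployments also match on IP ranges and TLS/behavioral fingerprints, precisely because stealth crawlers spoof their User-Agent:

```python
# Hypothetical blocklist entries; user-agent matching alone is the weakest
# signal, since a stealth crawler can send any UA string it likes.
BLOCKED_UA_SUBSTRINGS = ("BadBot", "StealthCrawler")

def enforce(request_headers):
    """Return an HTTP status code: 403 for blocked crawlers, 200 otherwise."""
    ua = request_headers.get("User-Agent", "")
    if any(s.lower() in ua.lower() for s in BLOCKED_UA_SUBSTRINGS):
        return 403
    return 200
```

Unlike robots.txt, this check runs on the server's side of the connection, so it binds regardless of the crawler's posture.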
