CONCEPT
robots.txt compliance¶
Definition¶
robots.txt compliance is the practice — by a crawler
operator — of fetching /robots.txt from every origin before
crawling, and honoring the Disallow, Allow, and crawl-delay
directives declared in it.
RFC 9309 (Robots Exclusion Protocol, 2022) formalizes the file format and semantics. The protocol is advisory — nothing in the HTTP stack enforces it — so compliance is a matter of operator posture, not protocol mechanism.
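The check-before-fetch half of the definition can be sketched with Python's stdlib `urllib.robotparser`. One caveat: this parser applies first-match rather than RFC 9309 longest-match semantics, so rule order matters. The file contents, bot names, and URLs below are illustrative, not from the source:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. Allow is listed before Disallow because
# urllib.robotparser returns the first matching rule, not the longest.
rules = """\
User-agent: ExampleBot
Allow: /private/press/
Disallow: /private/

User-agent: *
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler consults these answers before every request:
print(rp.can_fetch("ExampleBot", "https://example.com/private/data"))    # False
print(rp.can_fetch("ExampleBot", "https://example.com/private/press/"))  # True
print(rp.crawl_delay("OtherBot"))                                        # 10
```

In a real crawler the file would be fetched once per origin (via `rp.set_url(...)` and `rp.read()`) and cached, with every subsequent request gated on `can_fetch` and paced by `crawl_delay`.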
The 2025-08-04 Cloudflare post operationalizes the concept by
running a controlled test: brand-new unindexed domains
publishing a blanket Disallow and then observing which AI
crawlers respected it. Two outcomes on record:
- ChatGPT (ChatGPT-User) — fetched robots.txt, honored the Disallow, stopped crawling, and made no follow-up requests from alternate UAs. Full compliance. See systems/chatgpt-user.
- Perplexity — the declared crawlers (systems/perplexitybot / systems/perplexity-user) appeared to comply, but when blocked, Perplexity fell back to a stealth crawler that either didn't fetch robots.txt or ignored it. Non-compliance at the operator level, even though the declared crawlers complied individually.
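The Cloudflare test setup can be reproduced in miniature: a blanket Disallow file, parsed the way a compliant crawler would before its first request. The exact file contents and the UA string are assumptions for illustration; the post describes only the disallow-all behavior:

```python
from urllib.robotparser import RobotFileParser

# Blanket Disallow: every path, every user agent (assumed file contents).
blanket = ["User-agent: *", "Disallow: /"]

rp = RobotFileParser()
rp.parse(blanket)

# The compliant outcome: the crawler checks, gets False, and stops.
print(rp.can_fetch("ChatGPT-User", "https://test-domain.example/page"))  # False
```

The non-compliant outcome has no code path here by design: a stealth crawler simply never runs this check, which is why it is invisible to the protocol and only detectable at the network layer.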
The five Cloudflare operator norms¶
The post names five cooperative-crawler norms; robots.txt
compliance is norm #5 ("Follow the rules"):
- Be transparent.
- Be well-behaved netizens.
- Serve a clear purpose.
- Separate bots for separate activities.
- Follow the rules — check and respect
robots.txt, stay within rate limits, never bypass security protections.
See the source page for the full framework.
Enforcement, not compliance¶
Because the protocol is advisory, non-compliant crawlers require network-layer enforcement:
- WAF rules blocking declared UAs (systems/cloudflare-waf).
- Bot-management scoring (systems/cloudflare-bot-management).
- ML fingerprinting of stealth crawlers (concepts/ml-bot-fingerprinting).
- Cryptographic bot identity (systems/web-bot-auth) — raises the cost of being undeclared.
- Monetization via pay-per-crawl — makes compliance economically attractive.
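A minimal sketch of the first mechanism, UA-substring matching as a WAF rule might apply it. The substrings are illustrative placeholders, not an authoritative UA list:

```python
# Hypothetical deny-list of declared-crawler UA substrings.
BLOCKED_UA_SUBSTRINGS = ("perplexitybot", "gptbot")

def should_block(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a blocked crawler."""
    ua = user_agent.lower()
    return any(s in ua for s in BLOCKED_UA_SUBSTRINGS)

print(should_block("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))  # True
print(should_block("Mozilla/5.0 (Windows NT 10.0)"))                # False
```

UA matching only catches declared crawlers; a stealth crawler spoofing a browser UA passes this check, which is why the fingerprinting and cryptographic-identity entries in the list above exist.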
Seen in¶
- sources/2025-08-04-cloudflare-perplexity-stealth-undeclared-crawlers — canonical wiki instance; binary split between compliant (ChatGPT) and non-compliant (Perplexity stealth) crawlers from a single controlled test.
Related¶
- concepts/robots-txt — the file / protocol itself.
- concepts/declared-crawler / concepts/stealth-crawler.
- concepts/verified-bots.
- systems/chatgpt-user / systems/perplexitybot / systems/perplexity-user.
- patterns/signed-bot-request.