robots.txt compliance

Definition

robots.txt compliance is the practice, on the crawler operator's side, of fetching /robots.txt from every origin before crawling it and honoring the Disallow, Allow, and Crawl-delay directives declared there.

RFC 9309 (Robots Exclusion Protocol, 2022) formalizes the file format and the Allow/Disallow semantics; Crawl-delay is a de facto extension honored by many crawlers but not part of the RFC. The protocol is advisory (nothing in the HTTP stack enforces it), so compliance is a matter of operator posture, not protocol mechanism.
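As a concrete sketch of what "honoring the directives" means, Python's standard-library urllib.robotparser implements these checks. The robots.txt body and user agent below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a compliant crawler fetches this from
# https://example.com/robots.txt before requesting anything else on the origin.
# (Allow is listed before Disallow because Python's parser uses first-match
# rather than RFC 9309's longest-match; in this order both rules agree.)
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-note.html
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("AnyBot", "https://example.com/private/secret.html"))       # False
print(parser.can_fetch("AnyBot", "https://example.com/private/public-note.html"))  # True
print(parser.crawl_delay("AnyBot"))                                                # 10
```

A compliant operator runs exactly this check (plus the delay) before every request; a non-compliant one skips it, or fetches the file and ignores the answer.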

The 2025-08-04 Cloudflare post operationalizes the concept with a controlled test: it set up brand-new, unindexed domains that published a blanket Disallow, then observed which AI crawlers respected it. Two outcomes on record:

  • ChatGPT (ChatGPT-User) — fetched robots.txt, honored the Disallow, stopped crawling, no follow-up from alternate UAs. Full compliance. See systems/chatgpt-user.
  • Perplexity — the declared crawlers (systems/perplexitybot / systems/perplexity-user) appeared to comply, but once blocked, Perplexity fell back to an undeclared stealth crawler that either never fetched robots.txt or ignored it. Non-compliance at the operator level, even though each declared crawler complied on its own.
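For reference, the blanket Disallow published on such a test domain is a two-line file that withdraws the entire origin from all crawlers:

```
User-agent: *
Disallow: /
```

A compliant crawler that fetches this must stop at the root and request nothing else, under any of the operator's user agents.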

The five Cloudflare operator norms

The post names five cooperative-crawler norms; robots.txt compliance is norm #5 ("Follow the rules"):

  1. Be transparent.
  2. Be well-behaved netizens.
  3. Serve a clear purpose.
  4. Separate bots for separate activities.
  5. Follow the rules — check and respect robots.txt, stay within rate limits, never bypass security protections.

See the source page for the full framework.
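Norm #5 can be condensed into a small decision function. This is a hedged sketch, not Cloudflare's implementation; the user agent, function name, and default delay are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot/1.0"  # hypothetical crawler user agent

def crawl_decision(robots_txt, url, ua=USER_AGENT):
    """Return (allowed, delay_seconds) for url under the given robots.txt text.

    Norm #5 in miniature: check robots.txt and honor its crawl delay before
    issuing any request. A real crawler would also cache the parsed file per
    origin, enforce a global rate limit, and never retry under another UA.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(ua, url)
    delay = parser.crawl_delay(ua) or 1   # assumed 1s default gap if none declared
    return allowed, delay

allowed, delay = crawl_decision(
    "User-agent: *\nDisallow: /\n",
    "https://example.com/anything",
)
# A blanket Disallow means: do not fetch, full stop.
```

The point of the sketch is that the decision happens before the request, not after, and that a Disallow answer terminates the attempt rather than triggering a fallback.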

Enforcement, not compliance

Because the protocol is advisory, non-compliant crawlers can only be stopped by network-layer enforcement: blocking by user agent, IP range, or behavioral fingerprint at the edge, rather than relying on the crawler to honor robots.txt.
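A minimal illustration of that enforcement layer, assuming a hypothetical blocklist of user-agent substrings. Real deployments also match on IP ranges and TLS/behavioral fingerprints, precisely because stealth crawlers spoof their User-Agent:

```python
# Hypothetical blocklist entries; user-agent matching alone is the weakest
# signal, since a stealth crawler can send any UA string it likes.
BLOCKED_UA_SUBSTRINGS = ("BadBot", "StealthCrawler")

def enforce(request_headers):
    """Return an HTTP status code: 403 for blocked crawlers, 200 otherwise."""
    ua = request_headers.get("User-Agent", "")
    if any(s.lower() in ua.lower() for s in BLOCKED_UA_SUBSTRINGS):
        return 403
    return 200
```

Unlike robots.txt, this check runs on the server's side of the connection, so it binds regardless of the crawler's posture.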
