PATTERN
Comparative documentation benchmark¶
Pattern¶
Benchmark your documentation site's agent-consumption cost against competitor / industry-average documentation sites using a fixed agent model + client and a fixed set of technical questions. Report relative improvement (tokens + wall-clock) — ideally with published methodology that lets third parties reproduce the ranking.
Canonical instance¶
Cloudflare's 2026-04-17 dogfood of developers.cloudflare.com:
- Agent model: Kimi-k2.5 (kimi-k2-5).
- Client harness: OpenCode.
- Input: the `llms.txt` of Cloudflare docs vs. the `llms.txt` of other large technical documentation sites.
- Task: answer "highly specific technical questions."
- Measured: tokens consumed + time-to-correct-answer per site.
- Reported: 31% fewer tokens and 66% faster to the correct answer on average vs. the non-refined baseline.
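The headline numbers are straightforward relative-improvement arithmetic over per-site averages. A minimal sketch; the raw figures below are illustrative placeholders chosen to be consistent with the reported percentages, not Cloudflare's actual measurements:

```python
def relative_improvement(baseline: float, measured: float) -> float:
    """Fractional improvement of `measured` over `baseline` (positive = better)."""
    return (baseline - measured) / baseline

# Hypothetical raw per-site averages, consistent with the headline figures:
baseline_tokens, refined_tokens = 100_000, 69_000
baseline_seconds, refined_seconds = 300.0, 102.0

print(f"{relative_improvement(baseline_tokens, refined_tokens):.0%} fewer tokens")  # 31% fewer tokens
print(f"{relative_improvement(baseline_seconds, refined_seconds):.0%} faster")      # 66% faster
```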
Why the pattern matters¶
Before the agent era, documentation quality was measured by human readability, search relevance in the site's own index, and the occasional usability study. For AI agents the metric is different: tokens consumed and time-to-answer against a fixed question set. Optimising for the old metrics leaves the new metrics unimproved. Publishing a comparative benchmark makes the new metric visible + actionable.
The pattern is sibling to Cloudflare's 2026-04-17 comparative RUM benchmarking at the network-performance layer — same posture (measure yourself in a common framework alongside peers, publish ranking + methodology), applied at the documentation layer instead.
Recommended recipe¶
- Fix the agent model + harness. Use a single agent/model/client combination for reproducibility; changing any dimension invalidates the comparison.
- Fix the question set. N technical questions known to have correct answers discoverable in the docs. Publish them.
- Point at `llms.txt`. The `llms.txt` URL is the agent's entry point; baseline sites that don't have one can be rated separately or via a reasonable fallback (sitemap + homepage).
- Measure tokens + time to correct answer. A fast wrong answer is worse than a slow right answer.
- Report relative improvement, not absolute numbers. Comparing agents' absolute costs across hardware and dates is noisy; relative rankings are robust.
- Publish methodology + questions so third parties can verify the result.
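The recipe above can be sketched as a tiny harness. `ask_agent` and `is_correct` are hypothetical adapters, standing in for whatever fixed agent client (OpenCode in the canonical instance) and known-good-answer check you choose; only the structure is from the recipe:

```python
import time
from dataclasses import dataclass

@dataclass
class Result:
    site: str
    question: str
    tokens: int
    seconds: float
    correct: bool

def run_benchmark(sites, questions, ask_agent, is_correct):
    """Run a fixed question set against each site's llms.txt entry point.

    sites: {site_name: llms_txt_url}
    ask_agent(entry_url, question) -> (answer: str, tokens: int)  # hypothetical adapter
    is_correct(question, answer) -> bool                          # known-good answer check
    """
    results = []
    for site, entry_url in sites.items():
        for q in questions:
            start = time.monotonic()
            answer, tokens = ask_agent(entry_url, q)
            elapsed = time.monotonic() - start
            results.append(Result(site, q, tokens, elapsed, is_correct(q, answer)))
    return results

def relative_report(results, baseline_site, subject_site):
    """Relative improvement on correct answers only: a fast wrong answer doesn't count."""
    def avg(site, field):
        vals = [getattr(r, field) for r in results if r.site == site and r.correct]
        return sum(vals) / len(vals)
    return {
        "tokens_saved": 1 - avg(subject_site, "tokens") / avg(baseline_site, "tokens"),
        "time_saved": 1 - avg(subject_site, "seconds") / avg(baseline_site, "seconds"),
    }
```

Publishing `sites`, `questions`, and the correctness criteria alongside the numbers is what makes the ranking reproducible by third parties.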
Cloudflare's 2026-04-17 post reports the relative numbers; absolute latencies, per-site breakdowns, and the specific question set are not published in the post itself.
Relationship to biases¶
Same bias classes as any agent benchmark (concepts/benchmark-methodology-bias):
- Correlated noise — same agent against same site on the same day sees correlated failures (e.g. network blips). Cross-site comparisons should sample different days.
- Hardware-gen lottery — which accelerator is serving Kimi-k2.5 on which date matters to absolute timing, not to relative ranking if held constant.
- Client-side latency — agent-to-docs-site fetch latency skews slower-to-answer numbers; measure from a fixed vantage point.
- Selection bias in question set — questions that are easy to answer from Cloudflare's structured docs may not reflect general queries.
None of these invalidate the pattern; they argue for honest caveats, repeated measurement, and third-party verification.
Seen in¶
- sources/2026-04-17-cloudflare-introducing-the-agent-readiness-score-is-your-site-agent-ready — canonical wiki instance; Kimi-k2.5 via OpenCode, 31% / 66% headline numbers.
Related¶
- patterns/comparative-rum-benchmarking — sibling pattern one layer down (network-performance); the same Cloudflare 2026-04-17 Agents Week launch introduced both.
- concepts/benchmark-methodology-bias — the bias classes authors should disclose.
- concepts/llms-txt — the usual input surface.
- concepts/agent-readiness-score — complementary grading vehicle; this pattern measures the outcome, the Readiness Score grades the input substrate.
- systems/cloudflare-developer-documentation — the subject of the canonical measurement.