
CONCEPT Cited by 1 source

Byte-size pagination

Byte-size pagination is the API-design choice of bounding a paged response by total byte size rather than by number of items. The motivation is predictable latency SLOs: when items are variable in size, capping by count makes page latency a function of what the caller happens to land — capping by bytes makes it a property of the page itself.

The Netflix anchor

Netflix's KV DAL uses bytes-per-page as the pagination limit and cites the SLO reason directly:

"We chose payload size in bytes as the limit per response page rather than the number of items because it allows us to provide predictable operation SLOs. For instance, we can provide a single-digit millisecond SLO on a 2 MiB page read. Conversely, using the number of items per page as the limit would result in unpredictable latencies due to significant variations in item size." (Source: sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer)

A 10-item page of 1 KiB items and a 10-item page of 1 MiB items are the same API-surface request but ~1000× different in deserialization / network / downstream work — the caller has no way to bound their page-handling time without knowing row sizes.
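The serving-side mechanics can be sketched in a few lines. This is a hypothetical illustration, not KV DAL's actual code: accumulate rows until the next one would exceed the byte budget, then stop and return a cursor for the remainder.

```python
# Sketch of a byte-capped page read. All names are illustrative.
def read_page(rows, start, byte_budget):
    """Return (page, next_cursor); next_cursor is None when rows are exhausted."""
    page, used = [], 0
    i = start
    while i < len(rows):
        size = len(rows[i])            # serialized size of this row in bytes
        if page and used + size > byte_budget:
            break                      # budget full: stop before this row
        page.append(rows[i])
        used += size
        i += 1
    return page, (i if i < len(rows) else None)

# Variable-size rows: 300 B, 300 B, 900 B, 100 B against a 1 KiB budget.
rows = [b"x" * n for n in (300, 300, 900, 100)]
page, cursor = read_page(rows, 0, byte_budget=1024)    # 2 rows, cursor = 2
```

Note the `if page and ...` guard: a single row larger than the whole budget is still returned alone rather than stalling pagination forever, so the byte cap is a soft bound in that one edge case.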

The implementation friction

Most backing stores paginate by row count, not bytes: Cassandra's driver paging uses a rows-per-page fetch size, and DynamoDB's Limit is an item count (DynamoDB does cap pages at 1 MB, but that ceiling isn't a caller-tunable byte budget). KV DAL translates a caller's byte budget into backing-store queries using a static row-count limit, then either:

  • Items turn out smaller than expected → each query under-fills the byte budget → multiple backing-store round trips to reach it → read amplification.
  • Items turn out larger than expected → the query overshoots the byte budget → rows already pulled past the cap get discarded → the same read amplification, as wasted reads.

This is the precise problem concepts/adaptive-pagination addresses: use observed item size to tune the underlying-store limit adaptively.
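A minimal sketch of that adaptive tuning, assuming a simple exponentially weighted average of observed item sizes (the smoothing factor and all names here are assumptions, not KV DAL internals):

```python
# Derive the backing-store row limit from a running item-size estimate.
class AdaptiveLimit:
    def __init__(self, initial_avg_bytes=1024, alpha=0.2):
        self.avg = float(initial_avg_bytes)   # running mean item size (bytes)
        self.alpha = alpha                    # EWMA smoothing factor

    def row_limit(self, byte_budget):
        # How many average-sized rows fit in the caller's byte budget.
        return max(1, int(byte_budget // self.avg))

    def observe(self, total_bytes, n_rows):
        # Fold the mean item size of an actual page into the estimate.
        observed = total_bytes / n_rows
        self.avg += self.alpha * (observed - self.avg)

est = AdaptiveLimit()
limit = est.row_limit(2 * 1024 * 1024)            # 2 MiB budget -> 2048 rows
est.observe(total_bytes=4_000_000, n_rows=500)    # rows averaged ~8 KB
smaller = est.row_limit(2 * 1024 * 1024)          # estimate grew, limit shrank
```

After one page of larger-than-assumed rows, the next query asks for fewer rows, cutting both the extra round trips and the discarded overshoot.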

Trade-offs

  • Predictable response-size SLO. The page is bounded in bytes, so downstream code can plan buffers, network budgets, and GC.
  • Implementation complexity. Adds a server-side layer of size-estimation + loop / backfill logic. (Adaptive pagination is the standard partner.)
  • Partial-page pagination becomes normal. Callers can't pre-compute "total pages" — and shouldn't try to.
  • Server has to serialize enough to know the size. There's no way to byte-count cheaply without doing most of the work. But if the server stops early (see patterns/slo-aware-early-response), this is fine — work done so far gets returned.
  • Doesn't solve agent-context token budgets directly. Agent tools whose callers are LLMs need to bound by tokens, not bytes — see patterns/token-budget-pagination. Byte-size pagination and token-budget pagination are siblings: both reject row-count limits, but differ on the scarce-resource axis (bytes vs tokens, which are not linearly related).
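The sibling relationship in the last bullet can be made concrete: one paginator parameterized by a cost function covers both limit units. Everything below is illustrative (the ~4-chars-per-token cost is a crude proxy, not a real tokenizer).

```python
from typing import Callable, List, Optional, Tuple

# Budget-bounded pagination generic over the scarce resource.
def paginate(items: List[str], start: int, budget: int,
             cost: Callable[[str], int]) -> Tuple[List[str], Optional[int]]:
    page, used, i = [], 0, start
    while i < len(items):
        c = cost(items[i])
        if page and used + c > budget:
            break
        page.append(items[i])
        used += c
        i += 1
    return page, (i if i < len(items) else None)

byte_cost = lambda s: len(s.encode("utf-8"))
token_cost = lambda s: max(1, len(s) // 4)   # crude ~4 chars/token proxy

items = ["alpha" * 10, "beta" * 50, "gamma" * 5]   # 50 B, 200 B, 25 B
byte_page, _ = paginate(items, 0, 300, byte_cost)    # all 3 fit in 300 B
token_page, _ = paginate(items, 0, 60, token_cost)   # only 1 fits in 60 tokens
```

The same items paginate differently under the two measures, which is the point: the pattern is shared, but the budget unit is a design axis.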

Cross-pattern placement

| Scarce resource | Limit unit | Canonical instance |
| --- | --- | --- |
| End-to-end request latency at variable row size | bytes | Netflix KV DAL (this concept) |
| Agent token-context budget | tokens | Datadog MCP server (patterns/token-budget-pagination) |
| Linked-list / log traversal | records | Classic ?limit=N pagination (what Netflix + Datadog both reject) |
