

Count over byte-size batch trigger

Pattern

When a batching layer supports multiple batch-close triggers (accumulate until count messages OR byte_size bytes OR period ms elapsed — whichever comes first), prefer the count-based trigger over the byte-size-based trigger on hot produce/sink paths where per-message size information isn't already computed. The count trigger is a single integer increment-and-compare; the byte-size trigger requires knowing each message's serialised size before the batch-close decision can be made.

Canonical production instance: Redpanda's 14.5 GB/s Snowflake benchmark.

"Using counts as the batching factor resulted in higher performance over byte_size due to less calculation overhead." (Source: sources/2025-10-02-redpanda-real-time-analytics-redpanda-snowflake-streaming)

Why it works

The cost model is trigger-evaluation CPU:

  • Count-based trigger. On each new message: `n += 1; if n >= N: close_batch()`. One integer increment and compare per message. No knowledge of the message payload or its serialised size required.

  • Byte-size-based trigger. On each new message: `size = serialised_size(msg); bytes += size; if bytes >= B: close_batch()`. The size computation is the expensive part — it requires either serialising eagerly (wasteful if the path doesn't need the bytes yet) or inspecting a pre-serialised buffer (acceptable only when the batching layer sits after serialisation).
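The cost asymmetry between the two bullets above can be made concrete with a minimal sketch. The class names and the eager `json.dumps` serialisation are illustrative assumptions, not the connector's actual implementation; they stand in for a batching layer that collects logical rows and would otherwise serialise only at commit time.

```python
import json

class CountTrigger:
    """Count-based: one integer increment-and-compare per message."""
    def __init__(self, n_max):
        self.n = 0
        self.n_max = n_max

    def offer(self, msg):
        self.n += 1
        if self.n >= self.n_max:
            self.n = 0
            return True  # close the batch
        return False

class ByteSizeTrigger:
    """Byte-size-based: must learn each message's serialised size before
    deciding -- here by serialising eagerly, the per-message cost the
    pattern avoids (hypothetical sketch, assuming JSON rows)."""
    def __init__(self, b_max):
        self.bytes = 0
        self.b_max = b_max

    def offer(self, msg):
        # Premature serialisation on every message, purely for its size.
        self.bytes += len(json.dumps(msg).encode())
        if self.bytes >= self.b_max:
            self.bytes = 0
            return True
        return False
```

`CountTrigger.offer` touches nothing but an integer; `ByteSizeTrigger.offer` pays a full encode per message when the batching layer sits before serialisation.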

For connectors where the batching layer sits before serialisation (common in ingest connectors that collect logical rows and serialise only at commit time), byte-size triggers force premature size computation on every message. On a hot path at 15 GB/s wire throughput and ~1 KB payloads, that's ~15 million size-computations per second — a non-trivial CPU budget item.

Composes with period timeout

Batch triggers are typically composed as "whichever threshold is reached first". The canonical three-trigger shape:

  • count — cheap, drives most closes under high volume.
  • byte_size — cap on memory use per batch; drives closes when messages are large but arrive slowly.
  • period — cap on latency; drives closes when volume is low.

The pattern is to use count as the primary trigger and period as the latency floor, avoiding byte_size entirely when batch size can be bounded adequately by count × average message size.
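The "whichever threshold is reached first" composition can be sketched as follows. This is a hypothetical shape, not any particular connector's API: the class name, parameter names, and defaults are assumptions; count is checked first as the cheap primary trigger, byte_size is optional, and period acts as the latency floor.

```python
import time

class Batcher:
    """Composed batch-close policy: count OR byte_size OR period,
    whichever is reached first (illustrative sketch)."""
    def __init__(self, count=10_000, byte_size=None, period_s=1.0):
        self.count, self.byte_size, self.period_s = count, byte_size, period_s
        self._msgs, self._bytes = [], 0
        self._opened = time.monotonic()

    def add(self, msg, size=None):
        """Add a message; return True when the batch should close."""
        self._msgs.append(msg)
        if self.byte_size is not None and size is not None:
            self._bytes += size
        return self._should_close()

    def _should_close(self):
        if len(self._msgs) >= self.count:        # primary: cheap count check
            return True
        if self.byte_size is not None and self._bytes >= self.byte_size:
            return True                          # memory cap, if configured
        return time.monotonic() - self._opened >= self.period_s  # latency floor

    def drain(self):
        """Hand off the closed batch and reset all three thresholds."""
        batch, self._msgs, self._bytes = self._msgs, [], 0
        self._opened = time.monotonic()
        return batch
```

Under high volume the count branch fires before the other two are ever consulted; under low volume the period branch guarantees a bounded flush latency.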

When byte-size triggers are the right call

Byte-size triggers are not universally worse. Prefer them when:

  • Memory pressure is the constraint. Batches must stay under a hard memory ceiling regardless of message count.
  • Message sizes vary by orders of magnitude. Count-based batching with variable-size messages produces wildly variable batch memory footprints; byte-size is predictable.
  • Destination API charges per byte. Some ingest APIs have per-batch byte ceilings; matching the trigger to the API constraint avoids rejected batches.
  • The batching layer already has size information. If messages are already serialised buffers (e.g., Kafka records post-encode), size is a cheap property lookup and the CPU-overhead argument disappears.
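The last bullet is worth illustrating: once messages arrive as already-serialised buffers, the size computation collapses to a length lookup and the CPU-overhead argument against byte-size triggers disappears. A minimal sketch (the class name is hypothetical):

```python
class PostEncodeByteTrigger:
    """Byte-size trigger over pre-serialised records: size is a cheap
    O(1) length lookup, no per-message serialisation needed."""
    def __init__(self, b_max):
        self.bytes, self.b_max = 0, b_max

    def offer(self, record: bytes):
        self.bytes += len(record)  # property lookup, not an encode
        if self.bytes >= self.b_max:
            self.bytes = 0
            return True  # close the batch
        return False
```

In this position the byte-size trigger is roughly as cheap as the count trigger, so the choice can be driven by the memory-ceiling or per-byte-API constraints above instead.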

Instance in the Redpanda → Snowflake benchmark

In the 14.5 GB/s / P99 7.49 s benchmark, the snowflake_streaming connector's batching policy fed Snowpipe Streaming channels at ~15 million messages per second aggregate. Switching from the byte_size trigger to the count-based trigger was one of the four disclosed tuning findings that achieved the 45%-over-documented-ceiling result. The other three:
