

Atomic conditional batch claim

Pattern

When a scheduler needs to pop a variable-size batch bounded by a budget from a shared queue — pop as many items as fit, but no more — three primitives must travel together in one atomic operation:

  1. Peek across pending items to see each one's contribution to the budget (token count, byte size, priority weight, …).
  2. Compute cumulative budget across the peeked prefix.
  3. Claim the subset whose cumulative budget fits, removing them from the queue visibly to all other consumers at once.

None of these can happen in separate steps without a race: between peek and claim, another worker could see the same items and double-dispatch; between partial-claim and claim-the-rest, a worker could pick up the remainder and overrun the budget.
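The three primitives can be sketched in-process, with a `threading.Lock` standing in for Redis's single-threaded script execution (the names, the `deque` queue, and the `cost` callback are illustrative, not the source's implementation):

```python
import threading
from collections import deque

def claim_batch(queue: deque, lock: threading.Lock, budget: int, cost) -> list:
    """Peek, accumulate cost, and claim -- all under one lock, so no other
    worker can interleave between the peek and the pop."""
    batch, used = [], 0
    with lock:
        while queue:
            item = queue[0]                 # 1. peek at head
            c = cost(item)                  # 2. cumulative budget
            if used + c > budget and batch:
                break                       # next item would overrun
            batch.append(queue.popleft())   # 3. claim, visible to all at once
            used += c
    return batch
```

Without the lock spanning all three steps, two workers peeking the same head item could both decide it fits and double-dispatch it, which is exactly the race the pattern forbids.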

Canonical implementation: a Redis Lua script popping items from a list until a running budget is reached, returning the batch as the script's result, with per-item TTLs set inside the same script call. Canonical application: token-count-based batching for GPU embedding inference (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

Motivation

General-purpose brokers like RabbitMQ and Kafka don't natively provide this primitive:

  • RabbitMQ uses a push model with a request-count-based prefetch window. Consumers don't peek; messages are pushed to whichever consumer has budget. There's no efficient way to ask "give me as many as fit under budget X" with X computed from message payloads.
  • Kafka batches by bytes / messages within a partition — broker-side, not consumer-driven, and batched against a byte budget, not a caller-supplied token budget. Token counts vary with text and tokeniser; Kafka can't know them.

Both are typically worked around with an in-process aggregator in front of the broker: a consumer drains messages into its own batcher before dispatch. Either way the workaround costs an extra tier.

A store that natively supports atomic peek-compute-claim collapses that tier. Voyage AI's choice — Redis with a Lua script — is the canonical instance:

"In our implementation, we use Redis because it lets us atomically 'pop up to the optimal batch size' and set per-item TTLs within a single lua script call."

Mechanism

The Redis-Lua version, in outline:

-- KEYS[1] = request queue (list)
-- ARGV[1] = optimal_batch_size (e.g. 600 tokens)
-- ARGV[2] = per-item TTL seconds
local budget = tonumber(ARGV[1])
local used   = 0
local batch  = {}

while used < budget do
    local item = redis.call('LINDEX', KEYS[1], 0)
    if not item then break end
    local cost = tonumber(cjson.decode(item).token_count)
    if used + cost > budget and #batch > 0 then break end
    table.insert(batch, redis.call('LPOP', KEYS[1]))
    used = used + cost
end

-- optional: set per-item TTL / ack keys / dead-letter metadata here
-- (e.g. using ARGV[2] as the lease TTL), all in one atomic script execution
return batch

Redis's single-threaded script execution makes the whole block atomic — no other worker can interleave. Either the script returns a batch or it returns empty; there is no partially visible state.

Properties this guarantees

  • No double-dispatch. Two model-server workers calling the script concurrently see disjoint batches. The second call blocks on Redis's command queue, then sees the list with the first call's items already gone.
  • Budget never overrun. The script's per-item pre-check ensures Σ cost ≤ budget, with one exception: a single item larger than the budget is still dispatched alone, so the queue never deadlocks on an oversized item.
  • No partial-drain races. Other workers never observe a half-drained list: a script's writes become visible only as a unit. If the client disconnects mid-call, the script still runs to completion on the server; at worst the returned batch is lost to that caller, which is part of the durability trade-off below.
  • Co-located metadata. Per-item TTLs, ack keys, dead-letter references are set inside the same atomic script. One round-trip per batch regardless of batch size.
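The no-double-dispatch and budget properties can be exercised with a small in-process model of the script's loop, a lock again standing in for Redis's single-threaded execution (a sketch, not the Redis path; queue contents and budget are made up):

```python
import threading
from collections import deque

queue = deque({"id": i, "tok": 100} for i in range(50))
lock = threading.Lock()
claimed = []  # one list of claimed ids per worker

def worker():
    got = []
    while True:
        with lock:  # atomic peek-compute-claim, like the Lua script
            batch, used = [], 0
            while queue:
                c = queue[0]["tok"]
                if used + c > 600 and batch:
                    break  # next item would overrun the 600-token budget
                batch.append(queue.popleft())
                used += c
        if not batch:
            break
        got.extend(item["id"] for item in batch)
    claimed.append(got)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the run, every queued id appears in exactly one worker's list: the batches are disjoint and together cover the queue, the property the script relies on in production.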

Trade-offs

  • Durability. Redis is not a durable queue by default. Voyage AI accepts this and surfaces rare data loss as HTTP 503 to callers for retry: "The probability of Redis losing data is very low. In the rare case that it does happen, users may receive 503 Service Unavailable errors and can simply retry." Production callers must be idempotent.
  • Single-shard throughput ceiling. Redis's single-threaded execution is a throughput bottleneck at extreme scale. Multi-shard designs need either client-side routing with a per-shard budget or a move to a heavier queue broker.
  • Observability. Batch composition is decided in the script — standard queue monitoring doesn't see the batching logic. Teams instrument the script (or the caller) to emit batch-size / batch-cost / partial-fill reasons.
  • Correctness under variable cost estimates. If token_count is an estimate (cheap char-count / ratio heuristic), the true GPU work can drift. Either tokenise on enqueue (pay it twice — the model will tokenise inside the engine too) or accept small over/under-fills.
  • Not a fit for message-broker semantics (at-least-once delivery with ack, DLQ, fan-out). Redis + Lua is a compute-serving queue, not a message bus. Hybrid: broker in front of a Redis batcher keeps both semantics.

Generalises beyond token-count batching

The pattern applies whenever:

  • the batch budget is a summed attribute of queued items (byte size, priority weight, computed cost, memory footprint),
  • items must be claimed atomically to avoid double-dispatch,
  • per-item bookkeeping (TTL, lease, DLQ pointer) must be set in the same atomic step.

Other candidate applications: rate-limit token-bucket refill (claim up to refill capacity atomically), work-stealing in a scheduler with variable-size tasks, database connection-pool claim up to a memory budget.
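The token-bucket case follows the same shape: peek remaining capacity, compute the grant, and deduct it in one atomic step. A minimal in-process sketch (class and parameter names hypothetical; a lock plays the atomic-boundary role):

```python
import threading
import time

class TokenBucket:
    """Rate limiter whose claim is an atomic peek-compute-claim:
    refill, compute the grant that fits, and deduct it under one lock."""

    def __init__(self, capacity: float, refill_per_s: float):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def claim(self, want: float) -> float:
        with self.lock:
            now = time.monotonic()
            # refill since last claim, capped at capacity (the "peek")
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.refill_per_s)
            self.last = now
            grant = min(want, self.tokens)  # as much as fits, no more
            self.tokens -= grant            # claimed atomically
            return grant
```

Two concurrent callers asking for more than the bucket holds can never jointly overdraw it, for the same reason two workers can never jointly overrun the batch budget above.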
