Skip to content

Adaptive write request scheduling in Redpanda's Cloud Topics

Summary

Redpanda's Cloud Topics write data directly to S3 as "Level Zero" (L0) objects before acknowledging produce requests. The write-request scheduler dynamically adjusts upload parallelism across CPU shards using a buddy-allocator-inspired algorithm to balance batching efficiency (fewer S3 PUTs, lower cost) against latency (faster acknowledgement). The system auto-tunes without operator intervention: under light load it converges to a single upload stream (maximizing batch size), under heavy load groups split to increase parallelism — all using only local state checks, no global coordinator.

Key takeaways

  1. Per-shard batching is wasteful: A 32-core machine issuing 32 concurrent PUTs per cycle produces tiny payloads, wastes S3 requests, and increases latency when individual shards can't fill their size threshold before the 200 ms time threshold fires.

  2. Cost scales linearly with shard count under per-shard batching: PUT cost at 200 ms interval ≈ $65/month/broker. With 32 shards: $2,000/broker/month; a 5-broker cluster: $120,000/year in PUT requests alone.

  3. Single-shard funneling minimizes cost but caps throughput: One upload stream for the same 5-broker cluster costs ~$3,750/year in PUTs, but creates a bottleneck under heavy load.

  4. Centralized coordinator doesn't scale: One shard deciding for all requires cross-shard locks on every cycle, becoming a bottleneck regardless of load.

  5. Coordinator-with-groups is better but complex: A global coordinator assigns shards to groups; groups upload independently. But splits/merges require global coordination and complex hand-off protocols.

  6. Buddy allocator eliminates the coordinator entirely: Borrowed from OS memory management, the algorithm uses power-of-two-sized groups where split/merge decisions are local — a group leader only checks its own backlog and its buddy's.

  7. Shared state is minimal: Only per-group mutex (held briefly during request-pulling, never during S3 upload) and cache-line-padded atomic counters for backlog reads.

  8. Hysteresis prevents oscillation: A short window between consecutive split/merge decisions prevents thrashing near threshold boundaries.

  9. Pipeline decouples "what" from "which shard": The scheduler is a small local-only algorithm (atomic counter reads + buddy state machine) that can evolve independently from the batcher or other pipeline stages.

Operational numbers

Metric Per-shard batching (32 cores) Single-shard Adaptive scheduler
S3 PUT cost (5 brokers/year) ~$120,000 ~$3,750 Approaches single-shard at low load, scales up under heavy load
Upload parallelism 32× (fixed) 1× (fixed) 1× to N× (dynamic)
Time-threshold latency adder Up to 200 ms Up to 200 ms (but less likely to hit) Minimized by size-threshold hits at higher batch fill
Cross-shard coordination None (each shard independent) Every shard funnels to one Buddy-local only (atomic reads)

Architecture

The scheduler maps shards to groups via an array:

Initial (1 group):       [0, 0, 0, 0, 0, 0, 0, 0]   parallelism: 1x
After first split:       [0, 0, 0, 0, 4, 4, 4, 4]   parallelism: 2x
After further splits:    [0, 0, 2, 2, 4, 4, 4, 4]   parallelism: 3x
Maximum:                 [0, 1, 2, 3, 4, 5, 6, 7]   parallelism: 8x

  • Split: Group leader detects backlog exceeds threshold proportional to group size → splits in half. Each half is an independent upload stream.
  • Merge: Both a group and its buddy have empty backlogs → lower-numbered group absorbs the buddy. Only the lower group can initiate merge (prevents race).
  • Group ID: Shard ID of the first member (no allocation needed).
  • Round-robin within group: Shards within a group take turns uploading.

Caveats

  • The post does not disclose the exact backlog threshold formula or the hysteresis window duration.
  • No benchmarks or latency numbers are given for the scheduler itself (the post is architectural, not a performance write-up).
  • The approach is specific to Seastar's shared-nothing, thread-per-core model — it may not translate directly to systems with shared memory or shared schedulers.

Source

Last updated · 542 distilled / 1,571 read