

Intra-node parallelism via input/output scaling

Pattern

For data-pipeline runtimes that expose a broker-style input/output multiplexing primitive — Redpanda Connect, Benthos, NiFi, and similar — the decisive throughput-scaling dimension is often not adding more nodes but running more parallel input/output pipelines within a single node, fully saturating each node's CPU, network, and storage.

"We found that what mattered most was batch settings and increasing parallelism as much as possible. Whether that be through partition count, snowflake_streaming options, or — as it turned out — unlocking massive throughput by scaling the number of inputs and outputs within a single node." (Source: sources/2025-10-02-redpanda-real-time-analytics-redpanda-snowflake-streaming)

Why it works

A single input/output pipeline in Connect-style runtimes typically saturates a few cores and a fraction of the NIC. The machine has more of both. Rather than scaling horizontally by node count (adding machines, with their orchestration and network-fanout costs), the broker primitive lets the operator spin up N parallel pipelines inside the Connect process. Each reads from its own share of the source partitions and writes to its own share of destination channels.

Canonical form in Redpanda Connect:

input:
  broker:
    inputs:
      - kafka_franz: { seed_brokers: [...], topics: [...] }
      - kafka_franz: { seed_brokers: [...], topics: [...] }
      # ... N of these

output:
  broker:
    outputs:
      - snowflake_streaming: { channel_prefix: prefix-1, ... }
      - snowflake_streaming: { channel_prefix: prefix-2, ... }
      # ... N of these

Each (input, output) pair runs its own goroutine pool, batching policy, and channel pool; they share the same process's TCP stack, memory, and kernel scheduler but not application-level synchronisation.
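Hand-listing N near-identical entries gets unwieldy as N grows. The Benthos-lineage broker also exposes a copies field that replicates a child input N times, a pattern field that controls how messages are spread across child outputs, and a per-output batching block so each pipeline flushes independently. A hedged sketch — field names are from the Benthos/Redpanda Connect broker and batching configuration, but all values (broker addresses, topic, prefixes, batch sizes) are illustrative, not from the benchmark:

input:
  broker:
    copies: 8                # replicate this child input 8x in one process
    inputs:
      - kafka_franz:
          seed_brokers: [ "redpanda-0:9092" ]
          topics: [ "events" ]
          consumer_group: "connect-bench"   # all copies join the same group

output:
  broker:
    pattern: round_robin     # spread messages across the child outputs
    outputs:
      - snowflake_streaming:
          channel_prefix: prefix-1
          batching:          # per-output policy: pipelines flush independently
            count: 10000
            period: 1s
      - snowflake_streaming:
          channel_prefix: prefix-2
          batching:
            count: 10000
            period: 1s

Note that copies clones a child verbatim, so outputs that need a distinct channel_prefix per pipeline (as the Snowflake sink does here) still have to be listed explicitly.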

The scaling-dimension hierarchy

The Redpanda benchmark names three composable scaling dimensions in order of effectiveness:

  1. Partition count — upstream. The source-topic partition count is the coarsest parallelism unit; each partition can be consumed by exactly one consumer in a consumer group.
  2. Destination-side options — e.g., Snowpipe channel count via channel_prefix × max_in_flight for the Snowflake sink.
  3. Intra-node input/output fan-out — the broker primitive spawning N pipelines per node.
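Dimension 1 caps the product of the other two on the consume side: because each partition is owned by exactly one consumer in a group, any consumers beyond the partition count sit idle. A minimal sanity-check sketch (an illustrative helper, not part of any Redpanda tooling):

```python
def consume_parallelism(partitions: int, nodes: int, inputs_per_node: int) -> dict:
    """Effective consume-side parallelism for a consumer group.

    Each partition is consumed by exactly one consumer in the group, so
    effective parallelism is min(partitions, total consumers); the rest idle.
    """
    consumers = nodes * inputs_per_node
    return {
        "consumers": consumers,
        "effective": min(partitions, consumers),
        "idle": max(0, consumers - partitions),
    }

# 96 partitions, 12 nodes, 8 broker inputs each: fully utilised, none idle.
print(consume_parallelism(96, 12, 8))   # {'consumers': 96, 'effective': 96, 'idle': 0}
# Doubling inputs per node past the partition count just parks consumers.
print(consume_parallelism(96, 12, 16))  # {'consumers': 192, 'effective': 96, 'idle': 96}
```

This is why the upstream partition count is listed first: it bounds how far dimensions 2 and 3 can usefully go.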

Dimension 3 is the one the benchmark flags as the surprise ("as it turned out") finding. Dimensions 1 and 2 were already well known; it was the number of pipelines per Connect node that moved the winning configuration from a saturated ceiling to 14.5 GB/s, 45% above the documented per-table ceiling.

When intra-node scaling beats adding nodes

  • Per-node resource headroom. A Connect node running a single pipeline on a 48-core machine uses a small fraction of its cores. More nodes just distribute the same under-utilisation across more machines.
  • Cost economics. Fewer larger nodes with intra-node parallelism is typically cheaper than more smaller nodes at the same aggregate throughput — fewer instances to pay fixed per-instance costs for (monitoring sidecars, logging, metadata overhead).
  • Network locality. Intra-process pipelines share a single TCP socket pool to the upstream broker and downstream sink. Node-count scaling spawns N × pools, amplifying broker-side connection overhead.

When to prefer node-count scaling instead

  • Fault-isolation. A process-level crash takes down all intra-node pipelines at once; spreading pipelines across nodes reduces the blast radius.
  • Upgrade rollouts. Rolling upgrades are safer when each restart removes only a small share of capacity; restarting a node that carries many pipelines halts all of them at once.
  • Resource quotas. Some orchestrators enforce per-container CPU/memory caps that limit how much intra-node parallelism one node can usefully run.
  • Heterogeneous workloads. Mixing hot and cold pipelines on the same node risks noisy-neighbour contention.

The production sweet spot is usually both dimensions tuned: run enough nodes for fault-isolation and upgrade-safety, then maximise intra-node parallelism on each node to hit the per-node resource ceiling.
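That trade can be made concrete: fix a node-count floor for blast radius and rollout safety, then add intra-node pipelines to cover the throughput target, spilling into extra nodes only once a node's pipeline budget is exhausted. A toy sizing helper — all numbers and the per-pipeline rate are illustrative assumptions, not figures from the benchmark:

```python
import math

def plan(target_gb_s: float, per_pipeline_gb_s: float,
         min_nodes: int, max_pipelines_per_node: int) -> tuple[int, int]:
    """Return (nodes, pipelines_per_node) covering the target throughput.

    Honour the fault-isolation floor first, then pack pipelines onto the
    existing nodes; only add nodes once the per-node pipeline budget is full.
    """
    pipelines = math.ceil(target_gb_s / per_pipeline_gb_s)
    nodes = max(min_nodes, math.ceil(pipelines / max_pipelines_per_node))
    return nodes, math.ceil(pipelines / nodes)

# 14.5 GB/s target, assumed 0.2 GB/s per pipeline, >= 12 nodes, <= 8 pipelines/node
print(plan(14.5, 0.2, 12, 8))   # (12, 7): the node floor dominates
# Halve the assumed per-pipeline rate and the pipeline budget forces more nodes
print(plan(14.5, 0.1, 12, 8))   # (19, 8)
```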

Instance in the Redpanda → Snowflake benchmark

  • 12 Connect nodes on AWS EC2 m7gd.12xlarge
  • Intra-node broker input/output fan-out — factor not disclosed in the post, but named as the decisive scaling dimension
  • Sustained 14.5 GB/s aggregate to a single Snowflake table (≈1.2 GB/s per node)
  • Hit Snowpipe-API-level errors ("the Snowpipe API screaming at us") on tests that pushed channel count too aggressively, indicating intra-node parallelism was scaled until it saturated the destination's per-table channel ceiling
