
tc-latency injection for geo-distributed simulation

Pattern

To benchmark a multi-region cluster without the expense of actually deploying across regions, deploy all brokers in a single AZ and use Linux tc (traffic control, typically tc qdisc ... netem delay) to selectively inject network latency between broker pairs — simulating cross-region RTT on the replication-side links while leaving other links untouched. Run the usual benchmark (e.g., OpenMessaging Benchmark) against the cluster and measure produce/consume latency as if the cluster were geographically stretched.

The technique's load-bearing property is the ability to isolate inter-broker latency from client-to-broker latency, so you can study the cross-region-quorum-RTT dimension without paying real cross-region cloud bandwidth or operating a real multi-region deployment during testing.

Canonical framing — Redpanda

"To simulate a multi-region Redpanda cluster, we set up a 3-node Redpanda cluster with i3en.xlarge VMs. These VMs have four cores per node with 32 GiB of memory each, and simulate a Tier-2 Redpanda Cloud cluster. We used tc to only add network latency between Redpanda brokers. No network latency was added between the OMB worker nodes and Redpanda broker nodes to simulate leader pinning."

(Source: sources/2025-02-11-redpanda-high-availability-deployment-multi-region-stretch-clusters)

Why isolate broker-to-broker from client-to-broker

On a real multi-region stretch cluster, client-to-broker and broker-to-broker latencies are different variables:

  • With leader pinning applied, the client-to-leader hop is intra-region — effectively zero relative to cross-region latency.
  • The broker-to-broker replication links are cross-region — the expensive dimension.

If a simulation injects latency uniformly on every network path, it measures the wrong configuration: producers and consumers pay cross-region RTT to the leader they're talking to, as well as replication-side cross-region RTT. That's a no-leader-pinning + no-follower-fetching worst-case baseline, not the optimised shape operators actually run.

By restricting tc netem delay to inter-broker interfaces, the simulation matches the leader-pinned deployment profile — the configuration operators actually use in production.

Mechanism sketch

On Linux, tc qdisc add dev <iface> root netem delay <time> inserts a latency queue on the egress of <iface>. To restrict the latency to a specific peer (other broker), combine with tc filter matching on the peer IP — or apply the qdisc on a dedicated interface/VLAN used only for inter-broker traffic. The Redpanda post does not publish the exact tc invocation; the substance of the pattern is the topology decision (inter-broker only), not the exact command.
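Since the post does not publish its commands, here is a minimal sketch of the filter-based variant. The interface name eth0 and the peer address 10.0.0.12 are hypothetical; run as root on each broker:

```shell
# 1. Root prio qdisc gives us bands we can steer traffic into.
tc qdisc add dev eth0 root handle 1: prio

# 2. Attach netem delay to band 3; only traffic filtered into that
#    band pays the delay.
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 30ms

# 3. Steer packets destined for the peer broker into the delayed band.
tc filter add dev eth0 parent 1: protocol ip prio 1 \
    u32 match ip dst 10.0.0.12/32 flowid 1:3

# Everything else (e.g., OMB client traffic) follows the default prio
# bands and sees no added delay.
```

Because netem acts on egress, 30 ms injected on each broker of a pair yields roughly a 60 ms RTT between them; inject on one side only to simulate a 30 ms RTT.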

Sample approximation of the topology:

                       [tc delay +δ]     [tc delay +δ]
  Broker A <─────────> Broker B <─────────> Broker C
      ↑                    ↑                    ↑
      │ no delay           │ no delay           │ no delay
      ▼                    ▼                    ▼
  OMB worker 1         OMB worker 2         OMB worker 3

Where δ is the simulated cross-region RTT (e.g., a few ms for cross-AZ, 60 ms for us-east ↔ us-west, 150 ms transoceanic). Since netem injects delay on egress, applying δ/2 on each side of a link — or δ on one side only — yields an RTT of δ.

Calibration

The technique's main degree of freedom is δ — the injected inter-broker delay. The Redpanda post does not publish the δ values used for the different "stretch configurations" charted in the published publish-latency graph. Practitioners calibrate δ from canonical denominators:

Topology                          Typical RTT    Reference
Same-AZ                           < 1 ms         baseline
Cross-AZ in region                1-10 ms        AWS single-digit-ms guidance
us-east-1 ↔ us-west-1             60+ ms         cloudping.co
Transoceanic (US ↔ EU / Asia)     100-200 ms+    cloudping.co

Running the benchmark at multiple δ values produces the latency-vs-stretch-configuration curve the Redpanda post references.
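A sweep over δ can be scripted as below. This assumes the netem qdisc already exists (handle 30: on eth0, as in the per-peer setup sketched earlier) and a local OpenMessaging Benchmark checkout; the driver and workload file paths are illustrative, not taken from the Redpanda post:

```shell
# Re-run the benchmark at several injected inter-broker delays,
# updating the existing netem qdisc in place between runs.
for delta_ms in 1 5 15 30 60 150; do
  tc qdisc change dev eth0 parent 1:3 handle 30: netem delay "${delta_ms}ms"
  bin/benchmark \
      --drivers driver-redpanda/redpanda.yaml \
      workloads/1-topic-16-partitions-1kb.yaml
done
```

Each iteration's OMB result JSON becomes one point on the latency-vs-stretch curve.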

What the technique doesn't capture

  • Cross-region bandwidth constraints: tc netem delay injects latency but not a bandwidth cap. Real cross-region links have both. Add tc tbf/tc htb for a bandwidth cap to simulate both dimensions.
  • Packet loss / jitter: tc netem supports loss (e.g., loss 0.1%) and jitter (a second time argument to delay), but the Redpanda setup used pure delay. Real WAN links have tail-latency jitter and occasional loss; latency tails benchmarked on pure delay will understate real-world variability.
  • Asymmetric topology: the baseline 3-broker star-to-OMB topology doesn't capture asymmetric replica placement (e.g., RF=5 across 3 regions where one region has 2 replicas). Realistic multi-region deployments often have this asymmetry.
  • Control-plane latency: metadata operations (rpk admin commands, leader-election RPCs) pay the same injected latency as data-plane replication. Real control planes are often separately latency-sensitive.
  • Cost dimension: tc-simulated clusters are all intra-AZ in cloud billing — the real cross-region bandwidth cost of a production deployment is invisible to the benchmark.
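The first two gaps can be closed with standard tc building blocks. A minimal sketch, assuming a dedicated inter-broker interface eth1 (hypothetical) so no per-peer filter is needed:

```shell
# Delay + jitter + loss on the inter-broker egress via netem.
tc qdisc add dev eth1 root handle 1: netem delay 30ms 3ms loss 0.1%

# tbf attached as netem's child caps bandwidth on the same path,
# approximating a constrained cross-region link.
tc qdisc add dev eth1 parent 1:1 handle 10: tbf \
    rate 500mbit burst 64kb latency 50ms
```

The rate, burst, and loss figures here are placeholders; calibrate them against the real cross-region link being simulated.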

When to use

  • Pre-production performance evaluation for a new region topology before committing cloud spend.
  • Comparing producer config (acks, linger.ms, batch.size) latency behaviour at different simulated stretch configurations.
  • Regression-testing broker changes against a standard set of cross-region latency points.

When not to use

  • Go-live validation: real cross-region cloud networks behave differently (asymmetric bandwidth, BGP path changes, CSP network-maintenance events). Always do some real-region canary before full production rollout.
  • Cost estimation: tc tells you nothing about cross-region bandwidth cost.
