
Binary format for broker throughput

Pattern

When raw streaming-broker throughput is the constraint, prefer binary encodings (AVRO, Protobuf) over text encodings (JSON) for on-wire payloads. The wire-size reduction translates directly to broker-throughput uplift because the broker's per-record cost is partly fixed and partly proportional to bytes-on-wire; smaller records move the variable component down.

Canonical production quantification: ~20% throughput uplift from AVRO vs JSON at 1 KB payload, randomised content (designed to neutralise compression effects). Disclosed in Redpanda's 14.5 GB/s Redpanda → Snowflake benchmark.

"Using a binary format like AVRO showed ~20% throughput improvement over a textual format like JSON." (Source: sources/2025-10-02-redpanda-real-time-analytics-redpanda-snowflake-streaming)
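The structural-overhead gap can be seen without any broker at all. A minimal sketch, using Python's stdlib `struct` as a stand-in for a schema-based binary encoding (the record and its fields are illustrative, not from the benchmark):

```python
import json
import struct

# Hypothetical sensor record; field names are illustrative.
record = {"device_id": 1042, "temp_c": 21.5, "ts": 1727860000}

# JSON repeats field names, quotes, and delimiters in every record.
json_bytes = json.dumps(record).encode("utf-8")

# A schema-based binary encoding ships no inline field names: the layout
# (int32, float64, int64 here) lives in the schema, distributed out-of-band,
# so each record on the wire is just the packed values.
binary_bytes = struct.pack("<idq", record["device_id"], record["temp_c"], record["ts"])

print(len(json_bytes), len(binary_bytes))  # binary is 20 bytes; JSON is far larger
```

Real AVRO adds varint encoding and schema-registry framing, but the shape of the saving is the same: the per-record field-name overhead moves out of every record and into the schema.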

Why it works

Two layered effects compound:

  1. Smaller uncompressed wire size. JSON has structural overhead — field names repeated per record, whitespace, delimiter characters — that binary encodings lack. AVRO records carry no inline field names; schema is shipped separately via a schema registry.
  2. Higher effective compression ratio. Even when compression is enabled, binary encodings typically compress to smaller final sizes than the equivalent text payload. On randomised content (as in the benchmark — designed to defeat compression effects), the uncompressed-size delta dominates.

The benchmark's randomised-content methodology isolates the encoding-format effect from compression-effectiveness variation that would otherwise confound the comparison on real-world payloads.
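The neutralisation effect is easy to reproduce with stdlib `zlib`: repetitive structured payloads compress dramatically (masking format overhead), while random bytes do not. A small sketch with illustrative payloads:

```python
import json
import os
import zlib

# Structured, repetitive JSON compresses very well, which would mask
# the encoding-format effect in a throughput comparison...
repetitive = json.dumps([{"status": "ok", "code": 200}] * 100).encode("utf-8")

# ...while randomised content (as in the benchmark) defeats compression,
# so the uncompressed wire-size delta is what the broker actually sees.
random_payload = os.urandom(1024)

print(len(zlib.compress(repetitive)) / len(repetitive))          # well below 1.0
print(len(zlib.compress(random_payload)) / len(random_payload))  # ~1.0 (incompressible)
```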

Composes with client-side compression

The 20% uplift is a standalone effect of the encoding format. Client-side compression — where the producer compresses the batch before wire transit, so the broker sees already-compressed bytes — composes with it and delivers further throughput gains.

In practice, production Kafka/Redpanda deployments typically pick both: binary encoding (AVRO or Protobuf) and client-side compression (lz4, zstd).
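A sketch of what "both" looks like on the producer side. The keys follow librdkafka naming (as used by the confluent-kafka Python client); the broker address is a placeholder:

```python
# Producer settings combining both levers: client-side compression here,
# binary encoding in the serializer applied before produce().
producer_conf = {
    "bootstrap.servers": "redpanda:9092",  # placeholder address
    "compression.type": "zstd",            # producer compresses batches before wire transit
    "linger.ms": 5,                        # small batching window gives the codec more to work with
}

# The binary encoding itself is the serializer's job: e.g. an Avro serializer
# that registers the schema with a registry out-of-band and writes only the
# packed values per record.
```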

Trade-offs

  • Schema registry dependency. AVRO records don't carry field names inline, so the consumer needs the schema to decode. This requires a schema registry or equivalent out-of-band schema-distribution mechanism.
  • Debuggability. A JSON record is human-readable on the console; an AVRO record is not, so debug-time instrumentation (decoder tooling, schema lookups) is required.
  • Schema-evolution discipline. Binary formats have stricter rules for schema changes (AVRO: name-preserving field adds with defaults; Protobuf: tag-number preservation). Text formats are more permissive, but schema drift fails silently at read time instead of at the schema boundary.
  • Codec compatibility with destination. Not all sinks natively consume the broker's binary format — e.g., the Snowpipe-Streaming destination may decode AVRO in the ingest connector rather than pass-through. The throughput gain is on the producer → broker → consumer hops; destination-side decoding is additional work.
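The schema-evolution rule for AVRO in the list above can be made concrete. A sketch of an evolved schema (record and field names are illustrative), expressed as the Python dict form of an Avro schema document:

```python
# Avro-style schema evolution: a field added in v2 must carry a default
# so records written under the v1 schema still decode under v2.
schema_v2 = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "device_id", "type": "int"},
        {"name": "temp_c", "type": "double"},
        # Added in v2: the default makes the change backward-compatible.
        {"name": "unit", "type": "string", "default": "celsius"},
    ],
}
```

Dropping the `default` would make v1 records undecodable under v2 — the kind of silent break the discipline exists to prevent.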

The AVRO-over-JSON finding was one of four tuning insights from the 14.5 GB/s Redpanda → Snowflake run.

Seen in
