Skip to content

CONCEPT Cited by 1 source

Read-side detection of storage pathology

Definition

Read-side detection of storage pathology is the design choice to detect storage-engine pathologies (wide partitions, hot keys, oversized blobs, fragmenting tables) on the read path rather than the write path, then emit an asynchronous event for downstream remediation. The choice is principally about economising detection cost: when most stored data does not exhibit the pathology, paying detection cost on every write is waste. Detecting on reads pays the cost only on the data that callers actually touch.

Canonicalised on the wiki by Netflix's TimeSeries Abstraction in the 2026-06-03 dynamic partition splitting disclosure (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads).

Mechanism

The Netflix TimeSeries instantiation makes the trade explicit:

"Our decision to detect wide partitions on reads, as opposed to writes, is based on our observation that the majority of the data in the wild doesn't need this treatment. The slight downside is that some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up."

Each read tracks per-partition byte count. If bytes-read exceeds a configured threshold, the server emits a detection event to Kafka with the partition identifiers, and an immutable flag computed by the server (TimeSeries servers can determine when a partition is no longer receiving writes). A separate worker consumes the event stream and triggers remediation (in this case, dynamic partition splitting).

Critical: detection is decoupled from remediation. The read path emits one Kafka event per detected wide partition; remediation runs asynchronously and is allowed to take seconds, minutes, or hours. The read path remains fast.

Why not detect on writes?

Three reasons make read-side detection structurally preferred for pathologies that are mostly cold:

  1. Skewed access distribution. Most stored data is cold. Most pathologies (wide partition, oversized row, fragmented index) only matter when something reads the affected data — cold pathological data is a storage-cost issue, not a latency issue. Detection on writes pays cost on every write; detection on reads pays cost only on the ~5% of data that's actually hot.
  2. Detection signal is read-shaped. Wide partition is operationally defined by the bytes a read has to scan. The natural signal is bytes_read_for_this_partition, which is only available on reads.
  3. Write amplification of detection itself. Adding write-time detection to a high-throughput write path costs more than read-time detection because writes are typically larger in volume on log-structured stores like Cassandra.

Trade-off: detection lag

The cost is real and explicit:

"The slight downside is that some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up."

The first few reads after a partition crosses the threshold still pay the wide-partition tax. Only after the async pipeline catches up do subsequent reads benefit from the remediation. The trade is acceptable when:

  • Most reads on the pathological data have a single client (so the lag affects that client briefly), OR
  • The pathology is amortised quickly (seconds to minutes), OR
  • The pathology is rare, so absolute number of slow reads is small even on first encounter.

Sibling design choices

Choice When it wins When it loses
Read-side detection (this concept) Most data is cold; pathology is read-defined Pathology has consequences before any read happens (e.g. storage-cost-only failures)
Write-side detection Pathology emerges on writes (e.g. row-count-based hot-row detection) Most writes don't produce the pathology — wasted detection cost
Periodic offline scan Pathology is statically defined and table-wide Lag time scales with full-table scan cost
Histogram-based control loop (patterns/auto-tuning-control-loop-on-storage-histograms) Pathology is table-wide and statistically detectable Per-key outliers don't show up in aggregate stats

In practice these compose: Netflix uses both the histogram-based control loop (table-wide) and read-side detection (per-key outliers).

What makes the pattern reusable

The structural shape is reusable beyond wide partitions:

  • Read-side detect → emit Kafka event → async remediation worker → routed reads via metadata table + Bloom-filter gate.

This pattern can apply to:

  • Detecting fragmenting LSM tables during reads, triggering targeted compaction.
  • Detecting hot-key contention during reads, triggering replica-fan-out or cache promotion.
  • Detecting stale cache entries during reads, triggering background refresh (patterns/async-refresh-cache-loader).
  • Detecting oversized blobs during reads, triggering archive-tier migration.

The pattern's natural pair is idempotent detection — emitting the same Kafka event multiple times for the same partition must be safe, because reads on a wide partition will keep emitting until the remediation lands.

Seen in

Last updated · 542 distilled / 1,571 read