CONCEPT Cited by 1 source
Read-side detection of storage pathology¶
Definition¶
Read-side detection of storage pathology is the design choice to detect storage-engine pathologies (wide partitions, hot keys, oversized blobs, fragmenting tables) on the read path rather than the write path, then emit an asynchronous event for downstream remediation. The choice is principally about economising detection cost: when most stored data does not exhibit the pathology, paying detection cost on every write is waste. Detecting on reads pays the cost only on the data that callers actually touch.
Canonicalised on the wiki by Netflix's TimeSeries Abstraction in the 2026-06-03 dynamic partition splitting disclosure (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads).
Mechanism¶
The Netflix TimeSeries instantiation makes the trade explicit:
"Our decision to detect wide partitions on reads, as opposed to writes, is based on our observation that the majority of the data in the wild doesn't need this treatment. The slight downside is that some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up."
Each read tracks per-partition byte count. If bytes-read exceeds a configured threshold, the server emits a detection event to Kafka with the partition identifiers, and an immutable flag computed by the server (TimeSeries servers can determine when a partition is no longer receiving writes). A separate worker consumes the event stream and triggers remediation (in this case, dynamic partition splitting).
Critical: detection is decoupled from remediation. The read path emits one Kafka event per detected wide partition; remediation runs asynchronously and is allowed to take seconds, minutes, or hours. The read path remains fast.
Why not detect on writes?¶
Three reasons make read-side detection structurally preferred for pathologies that are mostly cold:
- Skewed access distribution. Most stored data is cold. Most pathologies (wide partition, oversized row, fragmented index) only matter when something reads the affected data — cold pathological data is a storage-cost issue, not a latency issue. Detection on writes pays cost on every write; detection on reads pays cost only on the ~5% of data that's actually hot.
- Detection signal is read-shaped. Wide partition is operationally defined by the bytes a read has to scan. The natural signal is
bytes_read_for_this_partition, which is only available on reads. - Write amplification of detection itself. Adding write-time detection to a high-throughput write path costs more than read-time detection because writes are typically larger in volume on log-structured stores like Cassandra.
Trade-off: detection lag¶
The cost is real and explicit:
"The slight downside is that some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up."
The first few reads after a partition crosses the threshold still pay the wide-partition tax. Only after the async pipeline catches up do subsequent reads benefit from the remediation. The trade is acceptable when:
- Most reads on the pathological data have a single client (so the lag affects that client briefly), OR
- The pathology is amortised quickly (seconds to minutes), OR
- The pathology is rare, so absolute number of slow reads is small even on first encounter.
Sibling design choices¶
| Choice | When it wins | When it loses |
|---|---|---|
| Read-side detection (this concept) | Most data is cold; pathology is read-defined | Pathology has consequences before any read happens (e.g. storage-cost-only failures) |
| Write-side detection | Pathology emerges on writes (e.g. row-count-based hot-row detection) | Most writes don't produce the pathology — wasted detection cost |
| Periodic offline scan | Pathology is statically defined and table-wide | Lag time scales with full-table scan cost |
| Histogram-based control loop (patterns/auto-tuning-control-loop-on-storage-histograms) | Pathology is table-wide and statistically detectable | Per-key outliers don't show up in aggregate stats |
In practice these compose: Netflix uses both the histogram-based control loop (table-wide) and read-side detection (per-key outliers).
What makes the pattern reusable¶
The structural shape is reusable beyond wide partitions:
- Read-side detect → emit Kafka event → async remediation worker → routed reads via metadata table + Bloom-filter gate.
This pattern can apply to:
- Detecting fragmenting LSM tables during reads, triggering targeted compaction.
- Detecting hot-key contention during reads, triggering replica-fan-out or cache promotion.
- Detecting stale cache entries during reads, triggering background refresh (patterns/async-refresh-cache-loader).
- Detecting oversized blobs during reads, triggering archive-tier migration.
The pattern's natural pair is idempotent detection — emitting the same Kafka event multiple times for the same partition must be safe, because reads on a wide partition will keep emitting until the remediation lands.
Seen in¶
- sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads —
Canonical wiki disclosure as the detection-stage architectural choice in Netflix
TimeSeries' dynamic partition splitting
pipeline. The post explicitly justifies the choice with "the majority of the data in
the wild doesn't need this treatment" and acknowledges the detection-lag cost.
Detection event is emitted to Kafka with
time_slice,time_series_id,time_bucket,event_bucket, and animmutableflag computed by the server.
Related¶
- concepts/wide-partition-problem — the canonical pathology this detection variant targets.
- concepts/dynamic-partition-splitting — the remediation triggered by this detection.
- concepts/read-amplification — sibling read-path concept.
- patterns/dynamic-partition-split-async-pipeline — the canonical pipeline this detection feeds into.
- systems/netflix-timeseries-abstraction — the canonical instance.
- systems/apache-cassandra — the substrate.