
PATTERN Cited by 1 source

Partition marking stops CDC bleeding

Definition

The "partition marking stops CDC bleeding" pattern limits how far bad data can propagate through a change-data-capture (CDC) pipeline. Instead of treating all landed data as authoritative, it annotates bad partitions in metadata. The ingestion system interprets the mark at runtime and changes downstream behaviour according to the partition's role:

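Bad-marked partition | Ingestion-system behaviour
---------------------|------------------------------------------------------------------------
Delta partition      | Stop new data landing; alert operator
Target partition     | Substitute with older known-good partition and merge with more deltas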

The data in the bad partition is left in place — the marking is a metadata-level signal, not in-place correction or deletion.
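The role-dependent reaction can be sketched in a few lines of Python. This is a minimal illustration, not the source system's implementation: `Partition`, `Role`, `mark_bad`, `ingestion_action`, and the action strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    DELTA = "delta"
    TARGET = "target"


@dataclass
class Partition:
    name: str
    role: Role
    bad_quality: bool = False  # metadata flag only; the partition data is untouched


def mark_bad(partition: Partition) -> None:
    """Record the data-quality verdict in metadata without changing the data."""
    partition.bad_quality = True


def ingestion_action(partition: Partition) -> str:
    """Return the (hypothetical) action the ingestion system takes at runtime."""
    if not partition.bad_quality:
        return "proceed"
    if partition.role is Role.DELTA:
        # A bad delta must not be applied forward: halt landing and page someone.
        return "stop_landing_and_alert"
    # A bad target snapshot is bypassed rather than repaired in place.
    return "substitute_older_target_and_merge_deltas"
```

Under these assumptions, marking is a single metadata write and every behavioural change lives in the reader of that metadata, which is what keeps the bad data itself intact.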

"During the reverse shadow phase, if any data quality issues were detected in a specific partition, that partition would be marked in its metadata as having bad data quality. If this partition was a delta partition, then new data would stop landing, and an alert would be sent to a team member. If this partition was a target partition, the system would instead select an older partition and merge it with more deltas. In this way we could stop bad data propagation quickly. For rollback, we could quickly query the metadata to find all partitions that were marked with bad data quality and fix them with backfill." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale

Why these two behaviours

The two behaviours are not symmetric because the partition roles aren't symmetric in the tri-layer CDC schema:

  • Delta partitions are changes to apply forward — a bad delta would corrupt every subsequent target-table state if consumed. Halting consumption (and alerting) is the correct containment.
  • Target partitions are snapshots of state at a moment — a bad target partition can be substituted with an older known-good target partition merged forward through additional deltas, producing a fresh target partition that bypasses the corruption without consumer disruption.

The substitute-and-merge primitive on the target side is what makes this pattern a particularly good fit for CDC pipelines: target-table consumers keep seeing a correct state while the corruption is contained behind the scenes.
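As a rough sketch of substitute-and-merge, the code below rebuilds a target snapshot from the newest target partition that is not marked bad, then replays the deltas that landed after it. The data model is an assumption made for illustration: `Target`, `Delta`, the integer `seq` ordering, and the upsert/delete representation do not come from the source, and the sketch assumes the replayed deltas are themselves good.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    seq: int                      # last delta sequence number reflected in this snapshot
    rows: dict[str, str]
    bad_quality: bool = False     # metadata mark


@dataclass
class Delta:
    seq: int
    upserts: dict[str, str] = field(default_factory=dict)
    deletes: set[str] = field(default_factory=set)


def apply_delta(rows: dict[str, str], delta: Delta) -> dict[str, str]:
    """Return a new snapshot with the delta's upserts, then its deletes, applied."""
    merged = dict(rows)
    merged.update(delta.upserts)
    for key in delta.deletes:
        merged.pop(key, None)
    return merged


def rebuild_target(targets: list[Target], deltas: list[Delta], upto_seq: int) -> Target:
    """Derive the target state at upto_seq without reading any bad-marked target."""
    good = [t for t in targets if not t.bad_quality and t.seq <= upto_seq]
    if not good:
        raise LookupError("no known-good target partition to fall back to")
    base = max(good, key=lambda t: t.seq)  # newest known-good snapshot
    rows = dict(base.rows)
    for delta in sorted(deltas, key=lambda d: d.seq):
        if base.seq < delta.seq <= upto_seq:
            rows = apply_delta(rows, delta)
    return Target(seq=upto_seq, rows=rows)
```

The bad-marked target is never read or modified here; it is simply skipped when choosing the base, so the older the fallback snapshot, the more deltas have to be replayed.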

Why mark, not delete

Three reasons:

  1. Reversibility. A bad-quality determination might be a false positive. Clearing a mark is a metadata update, whereas restoring deleted data means re-running the computation that produced it.
  2. Rollback substrate. "For rollback, we could quickly query the metadata to find all partitions that were marked with bad data quality and fix them with backfill." The marks index the partitions needing remediation — finding them post-hoc without the marks would require comparing against an external reference.
  3. Forensic value. The original bad data is the evidence for understanding why the bug existed; deleting it loses that evidence.
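The rollback use of the marks can be sketched with an in-memory SQLite table standing in for the metadata store. The table name `partition_meta`, its columns, and the `backfill` callback are assumptions for illustration; clearing the mark after a successful backfill is also an assumption, since the source only says that marked partitions are fixed with backfill.

```python
import sqlite3


def open_metadata() -> sqlite3.Connection:
    """Create a hypothetical partition-metadata table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE partition_meta ("
        " name TEXT PRIMARY KEY,"
        " role TEXT NOT NULL,"
        " bad_quality INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def mark_bad(conn: sqlite3.Connection, name: str) -> None:
    """Flag a partition as bad; the partition data is not touched."""
    conn.execute("UPDATE partition_meta SET bad_quality = 1 WHERE name = ?", (name,))


def partitions_needing_backfill(conn: sqlite3.Connection) -> list[str]:
    """The marks act as the index of everything that needs remediation."""
    rows = conn.execute(
        "SELECT name FROM partition_meta WHERE bad_quality = 1 ORDER BY name"
    )
    return [name for (name,) in rows]


def backfill_all(conn: sqlite3.Connection, backfill) -> list[str]:
    """Backfill every marked partition, then clear its mark (assumed behaviour)."""
    repaired = []
    for name in partitions_needing_backfill(conn):
        backfill(name)  # hypothetical callback that recomputes the partition
        conn.execute(
            "UPDATE partition_meta SET bad_quality = 0 WHERE name = ?", (name,)
        )
        repaired.append(name)
    return repaired
```

Because the set of affected partitions is one indexed query away, rollback does not require comparing landed data against an external reference.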

Compared with

  • vs DLQ (dead-letter queue): DLQs hold messages that failed processing; partition marking holds partitions that succeeded processing but produced suspect output.
  • vs soft-delete (e.g. is_deleted=true): soft-delete is at the row grain; partition marking is at the partition grain.
  • vs error events: error events are emitted on detection but don't change downstream behaviour; this pattern's marks drive downstream behaviour (stop landing / substitute partition).
  • vs automatic rollback to last known good binary (in CI/CD): both share "keep the last good thing as a fallback" but at completely different layers — code-deployment vs data-state-at-rest.

When to use

  • Any CDC pipeline where target state is computed from prior target state plus deltas — the bad-data-propagation hazard is structural to that schema.
  • Continuous data-quality detection is in place — the marking primitive needs detection events to be useful.
  • Consumers can transparently tolerate a target partition that was re-derived from an older known-good snapshot: they read the latest target partition and do not depend on whether it was freshly produced or rebuilt.

When NOT to use

  • Pipelines where every produced output is immediately final — e.g. event-sourced systems where each event is the canonical artefact, not a partition-of-state. Partition marking doesn't apply when there are no partitions to mark.
  • Consumers care about exact production timestamp — the substitute-and-merge primitive may produce a target partition at a different timestamp than the original; if that's externally visible and matters, this pattern breaks the contract.

Seen in
