PATTERN Cited by 1 source
Partition marking stops CDC bleeding
Definition
The partition marking stops CDC bleeding pattern bounds the CDC bad-data propagation hazard by annotating bad partitions in metadata rather than treating all landed data as authoritative. The marking is interpreted by the ingestion system at runtime and changes downstream behaviour by partition role:
| Bad-marked partition | Ingestion-system behaviour |
|---|---|
| Delta partition | Stop new data landing; alert operator |
| Target partition | Substitute with older known-good partition + merge with more deltas |
The data in the bad partition is left in place — the marking is a metadata-level signal, not in-place correction or deletion.
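The role-dependent behaviour in the table above can be sketched as a small dispatch function. This is a minimal illustration, not the ingestion system's actual code: the `Role` enum, the `Partition` record, the `bad_quality` field name, and the returned action strings are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    DELTA = "delta"
    TARGET = "target"


@dataclass
class Partition:
    name: str
    role: Role
    # Hypothetical metadata flag: the mark lives beside the data,
    # and the data itself is never modified or deleted.
    bad_quality: bool = False


def action_for(partition: Partition) -> str:
    """Return the behaviour the ingestion system applies to one partition."""
    if not partition.bad_quality:
        return "proceed"
    if partition.role is Role.DELTA:
        # A bad delta must not be applied forward: halt landing, page a human.
        return "stop_landing_and_alert"
    # A bad target is bypassed by re-deriving state from an older good one.
    return "substitute_older_target_and_merge_deltas"
```

Because the decision reads only metadata, clearing the flag restores normal behaviour without touching the stored partition.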
"During the reverse shadow phase, if any data quality issues were detected in a specific partition, that partition would be marked in its metadata as having bad data quality. If this partition was a delta partition, then new data would stop landing, and an alert would be sent to a team member. If this partition was a target partition, the system would instead select an older partition and merge it with more deltas. In this way we could stop bad data propagation quickly. For rollback, we could quickly query the metadata to find all partitions that were marked with bad data quality and fix them with backfill." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale
Why these two behaviours
The two behaviours are not symmetric because the partition roles aren't symmetric in the tri-layer CDC schema:
- Delta partitions are changes to apply forward — a bad delta would corrupt every subsequent target-table state if consumed. Halting consumption (and alerting) is the correct containment.
- Target partitions are snapshots of state at a moment — a bad target partition can be substituted with an older known-good target partition merged forward through additional deltas, producing a fresh target partition that bypasses the corruption without consumer disruption.
The substitute-and-merge primitive on the target side is what makes this pattern a natural fit for CDC pipelines: target-table consumers keep reading a correct state while the corruption is contained behind the scenes.
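A minimal sketch of substitute-and-merge, assuming each partition carries a sequence number and each delta holds upserts and deletes keyed by row id. The `Target` and `Delta` shapes, the `seq` field, and `rebuild_target` are hypothetical names chosen for illustration; the source does not describe the merge at this level of detail.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    seq: int    # position in the delta stream that this snapshot reflects
    rows: dict  # key -> value snapshot of state
    bad_quality: bool = False


@dataclass
class Delta:
    seq: int    # applying this delta moves state from seq - 1 to seq
    upserts: dict = field(default_factory=dict)
    deletes: set = field(default_factory=set)
    bad_quality: bool = False


def rebuild_target(targets, deltas, upto):
    """Rebuild target state at `upto` from the newest unmarked target."""
    good = [t for t in targets if not t.bad_quality and t.seq <= upto]
    if not good:
        raise LookupError("no known-good target partition to start from")
    base = max(good, key=lambda t: t.seq)
    rows = dict(base.rows)  # copy, so the base partition stays untouched
    by_seq = {d.seq: d for d in deltas}
    for seq in range(base.seq + 1, upto + 1):
        delta = by_seq.get(seq)
        if delta is None or delta.bad_quality:
            # Mirrors the delta-side rule: never apply a bad delta forward.
            raise RuntimeError(f"delta {seq} is missing or marked bad")
        rows.update(delta.upserts)
        for key in delta.deletes:
            rows.pop(key, None)
    return Target(seq=upto, rows=rows)
```

The marked target is skipped rather than repaired, so the older the last good snapshot, the more deltas have to be merged to catch up.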
Why mark, not delete
Three reasons:
- Reversibility. A bad-quality determination might be a false positive; clearing a mark is cheaper than regenerating a partition that was deleted by mistake.
- Rollback substrate. "For rollback, we could quickly query the metadata to find all partitions that were marked with bad data quality and fix them with backfill." The marks index the partitions needing remediation — finding them post-hoc without the marks would require comparing against an external reference.
- Forensic value. The original bad data is the evidence for understanding why the bug existed; deleting it loses that evidence.
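The reversibility and rollback-index points can be illustrated with a toy metadata store. The sketch below uses an in-memory SQLite table as a stand-in; the `partition_meta` table, its columns, and the helper names are assumptions made for the example, not the metadata schema described in the source.

```python
import sqlite3


def make_store():
    """Create an in-memory stand-in for the partition metadata store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE partition_meta ("
        " name TEXT PRIMARY KEY,"
        " role TEXT NOT NULL,"
        " bad_quality INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def register(conn, name, role):
    """Record a newly landed partition, unmarked by default."""
    conn.execute(
        "INSERT INTO partition_meta (name, role) VALUES (?, ?)", (name, role)
    )


def set_mark(conn, name, bad):
    """Set or clear the mark; the partition's data is never touched."""
    conn.execute(
        "UPDATE partition_meta SET bad_quality = ? WHERE name = ?",
        (int(bad), name),
    )


def partitions_needing_backfill(conn):
    """The rollback index: every partition currently marked bad."""
    cur = conn.execute(
        "SELECT name FROM partition_meta WHERE bad_quality = 1 ORDER BY name"
    )
    return [row[0] for row in cur]
```

Un-marking a false positive is a single metadata update, and the backfill worklist is a single query, which is the cheapness the list above relies on.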
Composes with
- concepts/data-quality-checksum-comparison — the detection primitive that triggers marking. Continuous row-count + checksum comparison between two parallel sources of the same data flags partitions whose marks should be set to bad.
- patterns/shadow-then-reverse-shadow-migration — Meta's marking-during-reverse-shadow rollout shape that gives the marking system its first production deployment. After the migration, the marking remains in production as part of release validation.
- patterns/data-quality-analysis-tool-with-edge-case-logging — the debugging substrate that turns aggregate mismatch detection into actionable example rows.
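As a rough sketch of the detection primitive in the first item, the comparison below computes a row count plus an order-independent checksum for the same partition as produced by two parallel systems. The hashing scheme (SHA-256 of each row's `repr`, summed modulo 2^64) and the function names are assumptions for illustration; the source does not specify which checksum is used.

```python
import hashlib

_MASK = (1 << 64) - 1


def partition_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Per-row digests are summed modulo 2**64, so the result does not
    depend on the order in which the rows are read.
    """
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest[:8], "big")) & _MASK
        count += 1
    return count, checksum


def should_mark_bad(reference_rows, candidate_rows):
    """True when the two copies of a partition disagree."""
    return partition_fingerprint(reference_rows) != partition_fingerprint(candidate_rows)
```

A mismatch only says that the two copies differ, not which rows differ, which is why the edge-case logging tool in the last item is needed to turn it into actionable examples.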
Distinguishing from related primitives
- vs DLQ (dead-letter queue): DLQs hold messages that failed processing; partition marking holds partitions that succeeded processing but produced suspect output.
- vs soft-delete (e.g. `is_deleted=true`): soft-delete is at the row grain; partition marking is at the partition grain.
- vs error events: error events are emitted on detection but don't change downstream behaviour; this pattern's marks drive downstream behaviour (stop landing / substitute partition).
- vs automatic rollback to last known good binary (in CI/CD): both share "keep the last good thing as a fallback" but at completely different layers — code-deployment vs data-state-at-rest.
When to use
- Any CDC pipeline where target state is computed from prior target state plus deltas — the bad-data-propagation hazard is structural to that schema.
- Continuous data-quality detection is in place — the marking primitive needs detection events to be useful.
- Consumers can tolerate "older known-good substitute" target partitions transparently — i.e. they read the latest target partition but don't care whether it was just-produced or re-derived from an older snapshot.
When NOT to use
- Pipelines where every produced output is immediately final — e.g. event-sourced systems where each event is the canonical artefact, not a partition-of-state. Partition marking doesn't apply when there are no partitions to mark.
- Consumers care about exact production timestamp — the substitute-and-merge primitive may produce a target partition at a different timestamp than the original; if that's externally visible and matters, this pattern breaks the contract.
Seen in
- sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale — Meta's data-ingestion-system migration; canonical wiki instance with both delta-partition + target-partition behaviours specified. Used during reverse shadow phase to contain bad-data propagation; metadata-as-rollback-index for bulk backfill.
Related
- concepts/cdc-bad-data-propagation — the hazard this contains
- concepts/partition-quality-marking — the metadata mechanism
- concepts/data-quality-checksum-comparison — the detection trigger
- concepts/full-dump-vs-delta-vs-target — the schema role-distinctions matter
- concepts/blast-radius — the broader containment concept
- patterns/shadow-then-reverse-shadow-migration — the migration shape this lives inside
- systems/meta-data-ingestion-system — canonical wiki instance
- companies/meta — company hub