
PATTERN Cited by 1 source

Known-issue exclusion in batch selection

Definition

The known-issue exclusion in batch selection pattern is a batch-planning discipline for capacity-constrained migrations: when an underlying issue affects multiple jobs (or changes), the affected jobs are excluded from the next batch until the issue is resolved, then migrated as a group once the fix is in place.

"Engineering teams worked to ensure that the environment was properly prepared before creating a batch. For instance, they established selection criteria to exclude jobs with known issues that were still being resolved, thereby reducing noise caused by duplicate issues."

"We avoided creating new shadow jobs with known issues until those issues were resolved. When an issue was detected we removed any potentially affected jobs from the migration list and held them until a fix was in place."

(Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale)

Why this is structural, not bureaucratic

A known issue affecting N jobs generates N copies of the same alert through the data-quality detection tooling. The first occurrence is signal; the remaining N-1 occurrences are noise. Operators triaging the data-quality stream then face an alert-fatigue problem: novel bugs are buried under duplicate firings of bugs that are already known and being worked on.
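The signal-versus-noise split can be illustrated with a minimal sketch. It assumes each alert already carries a root-cause label, which in practice is the hard part; the function name, the job ids and the issue ids are hypothetical and do not come from the source.

```python
def split_signal_and_noise(alerts):
    """Split (job_id, root_cause) alerts into first sightings and repeats.

    Assumption: every alert is already labeled with its root cause.
    """
    seen_causes = set()
    novel, duplicates = [], []
    for job_id, root_cause in alerts:
        if root_cause in seen_causes:
            duplicates.append((job_id, root_cause))  # same bug, another job
        else:
            seen_causes.add(root_cause)
            novel.append((job_id, root_cause))       # first sighting of this bug
    return novel, duplicates


# Hypothetical stream: one known issue hits 50 jobs, one new issue hits 1 job.
alerts = [(f"job-{i}", "ISSUE-1") for i in range(50)] + [("job-99", "ISSUE-2")]
novel, duplicates = split_signal_and_noise(alerts)
print(len(novel), len(duplicates))
```

Here 49 of the 51 alerts repeat a root cause that has already been seen, and the single alert for the new issue sits at the end of the stream.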

The mitigation: defer affected jobs' migration until the root issue is fixed, then migrate them as a batch. This:

  1. Keeps the data-quality signal stream clean — only novel bugs surface, not duplicates of known ones.
  2. Avoids wasted full-dumps. In CDC pipelines like Meta's, "creating shadow jobs while known issues were still present would therefore trigger a lot of unnecessary full dumps, both at job creation time and again during data-quality remediation." See the tri-layer CDC schema for why each remediation triggers a fresh full-dump.
  3. Enables "fix-once-deploy-many" remediation. Once the root fix is in place, every job in the deferred batch benefits from it at once, rather than each affected job being debugged individually.
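The defer-then-release cycle described above can be sketched as a small planner. This is an illustrative sketch only, not Meta's implementation: the class `MigrationPlanner`, its methods, and the job and issue ids are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationPlanner:
    """Hypothetical batch planner that holds back jobs tied to open issues."""
    pending: list                                     # job ids, in priority order
    open_issues: dict = field(default_factory=dict)   # issue id -> affected job ids

    def report_issue(self, issue_id, affected_jobs):
        self.open_issues.setdefault(issue_id, set()).update(affected_jobs)

    def resolve_issue(self, issue_id):
        # Once the fix is in place, the held jobs become eligible again.
        self.open_issues.pop(issue_id, None)

    def held_jobs(self):
        held = set()
        for jobs in self.open_issues.values():
            held |= jobs
        return held

    def next_batch(self, capacity):
        held = self.held_jobs()
        batch = [j for j in self.pending if j not in held][:capacity]
        chosen = set(batch)
        self.pending = [j for j in self.pending if j not in chosen]
        return batch


planner = MigrationPlanner(pending=["a", "b", "c", "d", "e"])
planner.report_issue("ISSUE-1", {"b", "c"})
first = planner.next_batch(capacity=2)    # skips b and c while ISSUE-1 is open
planner.resolve_issue("ISSUE-1")
second = planner.next_batch(capacity=3)   # b and c now migrate together
print(first, second)
```

Held jobs stay in `pending` rather than being dropped, so they keep their priority order and are picked up by the first batch planned after the issue is resolved.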

Why this matters more under capacity constraint

Meta's post explicitly names capacity as the framing constraint:

"Because migration capacity was limited, we could not run all shadow jobs at once. Instead, we migrated the jobs in batches. Migration efficiency depends heavily on how jobs are selected for each migration batch."

When capacity bounds parallelism, every batch slot is precious — filling a batch slot with a job that will fail because of a known unresolved issue wastes:

  • The batch slot itself (capacity that could have moved a clean job).
  • The full-dump cost (in CDC migrations specifically).
  • Operator triage time when the failure surfaces.
  • Potentially another full-dump after the issue is fixed.
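The waste listed above can be put into rough numbers with a back-of-the-envelope model. The function name and the default counts are assumptions made for illustration: two full dumps per job when it is migrated early (one at creation, one at remediation, following the quoted description) against one when it is deferred. Real costs depend on the pipeline.

```python
def wasted_work(affected_jobs, dumps_if_migrated_early=2, dumps_if_deferred=1):
    """Illustrative estimate of the waste from migrating known-broken jobs early.

    Assumption: every affected job fails for the same known root cause.
    """
    extra_dumps_per_job = dumps_if_migrated_early - dumps_if_deferred
    return {
        "wasted_slots": affected_jobs,                    # slots a clean job could have used
        "extra_full_dumps": affected_jobs * extra_dumps_per_job,
        "duplicate_triages": max(affected_jobs - 1, 0),   # all but the first are repeats
    }


estimate = wasted_work(affected_jobs=40)
print(estimate)
```

Under these assumptions, every term grows linearly with the number of affected jobs.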

Composes with

  • vs canary: canary intentionally exposes a small sample to catch issues; this pattern intentionally defers units to avoid exposing them to known-broken paths.
  • vs rolling deployment: rolling deployment moves units through phases serially without filtering; this pattern filters the input set into batches based on known incidents.
  • vs error-budget pause: error-budget pause halts all changes when too many errors accumulate; this pattern halts only the subset of changes that share a root cause with open issues.

When to use

  • Capacity-constrained batch migration where parallelism is bounded.
  • A known-issue tracker is in place that lets you query "which jobs are affected by this issue?" cheaply.
  • Issues are root-caused and tracked, not just observed — i.e. you can identify the affected job set, not just the alert count.
  • Each migration unit is expensive (full-dump cost, validation cost, operator-triage cost) — the savings from exclusion compound.
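The "which jobs are affected by this issue?" query can be sketched by modeling a root-caused issue as a predicate over job attributes. The attribute names (`source_type`, `has_schema_evolution`) and the example issue are hypothetical; a real tracker would use whatever metadata identifies the affected set.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Job:
    job_id: str
    source_type: str            # hypothetical attribute
    has_schema_evolution: bool  # hypothetical attribute


@dataclass(frozen=True)
class KnownIssue:
    issue_id: str
    affects: Callable[[Job], bool]  # a root-caused issue can say which jobs it hits


def affected_jobs(issue, jobs):
    """Return the ids of jobs matched by the issue's predicate."""
    return [job.job_id for job in jobs if issue.affects(job)]


jobs = [
    Job("j1", "mysql", True),
    Job("j2", "mysql", False),
    Job("j3", "postgres", True),
]
issue = KnownIssue(
    "ISSUE-7",
    affects=lambda job: job.source_type == "mysql" and job.has_schema_evolution,
)
print(affected_jobs(issue, jobs))
```

An issue that has only been observed, not root-caused, has no such predicate, which is why the affected set cannot be computed for it.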

When NOT to use

  • Migration units are very cheap — the discipline overhead may exceed the benefit.
  • Issues are not root-caused — without a clear "affected set" it's impossible to decide what to exclude.
  • Strict deadline-driven migration — deferring batches may miss a fixed cutoff date that simple parallelism would meet.

Seen in
