Skip to content

PATTERN Cited by 1 source

Data-quality analysis tool with edge-case logging

Definition

The data-quality analysis tool with edge-case logging pattern is a debugging primitive on top of a warm-query telemetry store: log mismatches at first detection in aggregate form (just the fact that partition P had a row-count or checksum mismatch), then periodically read those mismatch logs and run targeted queries against the source data to find example offending rows, and log the debugging information back to the same store so operators query the augmented log stream rather than the source data directly.

"For each landed shadow table partition, the system would read the corresponding production table partition and compare both the row count and checksum. Any mismatches were logged to Scuba, Meta's data management system for real-time analysis. Every hour, the data quality analysis tool read the logs from Scuba, ran queries to identify example rows causing mismatches, and logged detailed debugging information back to Scuba. This process enabled team members to quickly determine the root cause of issues and assess whether they were already known and being addressed." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale

The two-tier loop

Tier 1 — continuous, cheap aggregate detection: row-count + checksum mismatches are computed continuously, per-partition. Cost is O(partitions), not O(rows). Mismatch events are logged to a warm-query store like Scuba.

Tier 2 — periodic, targeted, expensive enrichment: hourly, an analysis tool reads the mismatch events from tier 1, picks specific partitions to drill into, runs targeted queries against the source data to find example rows that caused the mismatch, and logs the example-row analysis back to the same warm-query store as augmented entries.

The pattern's identity move: enrich the log stream itself with analysis output, rather than running fresh analysis on every operator query. The augmented log stream becomes the operator-facing primary surface.

Why this works

Three load-bearing properties:

  1. Cheap aggregate at high frequency, expensive analysis at low frequency matches the cost structure of the data — most partitions are correct most of the time, and the rare mismatches warrant the deeper analysis.
  2. The augmented log stream is queryable by name. "Team members [could] quickly determine the root cause of issues and assess whether they were already known and being addressed." New mismatches can be quickly classified as "known issue X" or "novel" by querying for a recent occurrence of similar debugging output.
  3. The tool persists past the migration. "This same data quality analysis tool is still in use after the migration as part of the release validation process." The migration is the forcing function for building the tool; the tool's amortised value extends past the migration deadline.

Composes with

  • vs on-demand log query: on-demand queries are run when an operator asks; this pattern's analysis runs continuously on a schedule and writes its output back, so the operator's query hits already-enriched data.
  • vs alerting + paging: alerts are signals to act; this pattern's output is context for understanding. Both can exist; this pattern is upstream of alerting in the workflow.
  • vs streaming enrichment (e.g. concepts/sql-tag-propagation primitive): streaming enrichment adds context at the source event; this pattern adds context after detection by re-querying the source data.

When to use

  • Aggregate-level detection is cheap, row-level analysis is expensive — the cost asymmetry justifies the two-tier structure.
  • A warm-query telemetry store is in place that supports both write-many and query-many access patterns (Scuba in Meta's case).
  • Operators need to triage a high-volume mismatch stream — manual per-mismatch investigation doesn't scale.
  • The mismatch surface is rich — i.e. mismatches have semantic structure (different bug categories, different affected systems) that benefits from periodic analysis to classify them.

When NOT to use

  • Detection itself is expensive enough to do row-level analysis at detection time — collapse the two tiers.
  • Mismatches are rare and operators can investigate individually — the tooling overhead doesn't pay back.
  • There is no warm-query store — this pattern is shape- dependent on the underlying telemetry substrate.

Seen in

Last updated · 542 distilled / 1,571 read