
PATTERN

Accept unattributed flows

Design posture: a small percentage of unattributed records is acceptable; any misattribution is not. Systems embracing this tradeoff return a "don't know" signal for records they can't confidently resolve, rather than guessing. Downstream consumers treat unattributed records as censored data, not as noise to be absorbed into correct-looking answers.

Quote

"For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable." — Netflix, 2025

Why this is a meaningful design posture

Many observability systems implicitly treat coverage as a proxy for quality: "we attributed 99.9% of flows." But if 5% of those attributions are wrong, the downstream consumer — service dependency auditing, security analysis, incident triage — silently eats the incorrect answers. A single misattributed flow in a dependency graph creates a non-existent dependency that can't be disproved without external ground truth.

By contrast, an unattributed record signals "we don't know" explicitly. Downstream consumers can:

  • Filter them out (know you're looking at partial data).
  • Bound the uncertainty ("0.5% unattributed" is a quality metric rather than a silent source of wrong answers).
  • Retry or escalate (some consumers may re-query after more heartbeats arrive).
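A minimal sketch of the consumer side of this contract: attribution is an explicit `Optional`, and the unattributed fraction is surfaced as a quality metric instead of being absorbed into the graph. The record shape and names here are hypothetical, not Netflix's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowRecord:
    src_ip: str
    dst_ip: str
    # None is a first-class "don't know" verdict, not a default value.
    attribution: Optional[str]

def dependency_edges(records: list[FlowRecord]) -> tuple[set[tuple[str, str]], float]:
    """Build dependency edges from attributed flows only, and report the
    unattributed fraction as an explicit quality metric (censored data)."""
    attributed = [r for r in records if r.attribution is not None]
    edges = {(r.src_ip, r.attribution) for r in attributed}
    unattributed_frac = 1 - len(attributed) / len(records) if records else 0.0
    return edges, unattributed_frac
```

The key design choice is the return type: the caller gets both the partial answer and a bound on how partial it is, so "0.5% unattributed" can be alerted on directly.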

When this posture is available

  • The downstream consumer can tolerate a small censored window in the data (dependency graphs, billing summaries, usage metrics).
  • The service has a way to detect uncertainty at query time. Canonical mechanism: a heartbeat time-range map has natural "no range covers this timestamp" gaps that are observable without reference to external state.
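The gap-detection mechanism can be sketched as a lookup against sorted ownership ranges, where a timestamp falling outside every range yields `None` rather than the nearest guess. The data layout and names are illustrative assumptions, not the real FlowCollector structures.

```python
import bisect
from typing import Optional

# Hypothetical ownership map for one remote IP: sorted, non-overlapping
# (range_start, range_end, workload) tuples built from heartbeats.
OwnershipRanges = list[tuple[float, float, str]]

def attribute(ranges: OwnershipRanges, t_start: float) -> Optional[str]:
    """Return the owning workload if some heartbeat range covers t_start,
    else None -- the gap itself is the observable uncertainty signal."""
    starts = [r[0] for r in ranges]
    # Last range starting at or before t_start is the only candidate.
    i = bisect.bisect_right(starts, t_start) - 1
    if i >= 0 and ranges[i][0] <= t_start <= ranges[i][1]:
        return ranges[i][2]
    return None  # no range covers this timestamp: don't guess
```

Note that the "don't know" outcome needs no external state: it falls out of the map's own coverage gaps.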

When it's the wrong posture

  • Downstream needs a positive answer for every record (e.g. billing: a flow can only be billed to the customer it's attributed to, so unattributed records can't be billed at all).
  • The system's SLO is coverage, not correctness (e.g. fraud detection, where false negatives are more expensive than false positives).

Canonical instance

systems/netflix-flowcollector — if a flow's t_start timestamp falls outside any ownership time range for the remote IP, retry after a delay and eventually give up, delivering the flow unattributed. Netflix makes this explicit: "Such failures may occur when flows are lost or broadcast messages are delayed. For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable."
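The retry-then-give-up step might look like the following sketch: attribution is retried a bounded number of times (heartbeats may simply not have arrived yet), then the flow is delivered unattributed. Function names, retry counts, and delays are hypothetical.

```python
import time
from typing import Callable, Optional

def attribute_with_retry(lookup: Callable[[], Optional[str]],
                         retries: int = 3,
                         delay_s: float = 1.0) -> Optional[str]:
    """Retry attribution a few times, waiting for late heartbeats, then
    give up and return None so the flow is delivered unattributed."""
    for attempt in range(retries):
        owner = lookup()
        if owner is not None:
            return owner
        if attempt < retries - 1:
            time.sleep(delay_s)  # give delayed broadcasts time to land
    return None  # unattributed: acceptable; a guess would not be
```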

The prior event-based system didn't have this option — it would return a stale attribution rather than unknown. Making the "unknown" verdict a first-class return value is a direct benefit of the heartbeat architecture.

Trade-offs

  • Coverage percentage is now a real quality metric. The operator must track and SLO it.
  • Consumers must handle unattributed records. Previously they might have assumed every record carried an attribution.
  • Failure modes are visible, not invisible. Missing heartbeats produce attribution gaps rather than silently wrong answers — arguably the main payoff.
