
PATTERN

Accept unattributed flows

Design posture: a small percentage of unattributed records is acceptable; any misattribution is not. Systems embracing this tradeoff return a "don't know" signal for records they can't confidently resolve, rather than guessing. Downstream consumers treat unattributed records as censored data, not as noise to be absorbed into correct-looking answers.

Quote

"For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable." — Netflix, 2025

Why this is a meaningful design posture

Many observability systems implicitly treat coverage as a proxy for quality: "we attributed 99.9% of flows." But if 5% of those attributions are wrong, the downstream consumer — service dependency auditing, security analysis, incident triage — silently eats the incorrect answers. A single misattributed flow in a dependency graph creates a non-existent dependency that can't be disproved without external ground truth.

By contrast, an unattributed record signals "we don't know" explicitly. Downstream consumers can:

  • Filter them out (know you're looking at partial data).
  • Bound the uncertainty ("0.5% unattributed" is a quality metric rather than a silent source of wrong answers).
  • Retry or escalate (some consumers may re-query after more heartbeats arrive).
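A minimal sketch of the consumer side of this contract: attribution is an explicit `Optional`, and the unattributed fraction is surfaced as a quality metric instead of being absorbed into the graph. The record shape and names here are hypothetical, not Netflix's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowRecord:
    src_ip: str
    dst_ip: str
    # None is a first-class "don't know" verdict, not a default value.
    attribution: Optional[str]

def dependency_edges(records: list[FlowRecord]) -> tuple[set[tuple[str, str]], float]:
    """Build dependency edges from attributed flows only, and report the
    unattributed fraction as an explicit quality metric (censored data)."""
    attributed = [r for r in records if r.attribution is not None]
    edges = {(r.src_ip, r.attribution) for r in attributed}
    unattributed_frac = 1 - len(attributed) / len(records) if records else 0.0
    return edges, unattributed_frac
```

The key design choice is the return type: the caller gets both the partial answer and a bound on how partial it is, so "0.5% unattributed" can be alerted on directly.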

When this posture is available

  • The downstream consumer can tolerate a small censored window in the data (dependency graphs, billing summaries, usage metrics).
  • The service has a way to detect uncertainty at query time. Canonical mechanism: a heartbeat time-range map has natural "no range covers this timestamp" gaps that are observable without reference to external state.
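The gap-detection mechanism can be sketched as a lookup against sorted ownership ranges, where a timestamp falling outside every range yields `None` rather than the nearest guess. The data layout and names are illustrative assumptions, not the real FlowCollector structures.

```python
import bisect
from typing import Optional

# Hypothetical ownership map for one remote IP: sorted, non-overlapping
# (range_start, range_end, workload) tuples built from heartbeats.
OwnershipRanges = list[tuple[float, float, str]]

def attribute(ranges: OwnershipRanges, t_start: float) -> Optional[str]:
    """Return the owning workload if some heartbeat range covers t_start,
    else None -- the gap itself is the observable uncertainty signal."""
    starts = [r[0] for r in ranges]
    # Last range starting at or before t_start is the only candidate.
    i = bisect.bisect_right(starts, t_start) - 1
    if i >= 0 and ranges[i][0] <= t_start <= ranges[i][1]:
        return ranges[i][2]
    return None  # no range covers this timestamp: don't guess
```

Note that the "don't know" outcome needs no external state: it falls out of the map's own coverage gaps.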

When it's the wrong posture

  • Downstream needs a positive answer for every record (e.g. billing: a flow can only be billed to the customer it's attributed to, so unattributed records can't be billed at all).
  • The system's SLO is coverage, not correctness (e.g. fraud detection, where false negatives are more expensive than false positives).

Canonical instance

systems/netflix-flowcollector — if a flow's t_start timestamp falls outside any ownership time range for the remote IP, retry after a delay and eventually give up, delivering the flow unattributed. Netflix makes this explicit: "Such failures may occur when flows are lost or broadcast messages are delayed. For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable."
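The retry-then-give-up step might look like the following sketch: attribution is retried a bounded number of times (heartbeats may simply not have arrived yet), then the flow is delivered unattributed. Function names, retry counts, and delays are hypothetical.

```python
import time
from typing import Callable, Optional

def attribute_with_retry(lookup: Callable[[], Optional[str]],
                         retries: int = 3,
                         delay_s: float = 1.0) -> Optional[str]:
    """Retry attribution a few times, waiting for late heartbeats, then
    give up and return None so the flow is delivered unattributed."""
    for attempt in range(retries):
        owner = lookup()
        if owner is not None:
            return owner
        if attempt < retries - 1:
            time.sleep(delay_s)  # give delayed broadcasts time to land
    return None  # unattributed: acceptable; a guess would not be
```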

The prior event-based system didn't have this option — it would return a stale attribution rather than unknown. Making the "unknown" verdict a first-class return value is a direct benefit of the heartbeat architecture.

Trade-offs

  • Coverage percentage is now a real quality metric. The operator must track and SLO it.
  • Consumers must handle unattributed records. Previously they might have assumed every record carried an attribution.
  • Failure modes are visible, not invisible. Missing heartbeats produce attribution gaps rather than silently wrong answers — arguably the main payoff.
