PATTERN
Accept unattributed flows¶
Design posture: a small percentage of unattributed records is acceptable; any misattribution is not. Systems embracing this tradeoff return a "don't know" signal for records they can't confidently resolve, rather than guessing. Downstream consumers treat unattributed records as censored data, not as noise to be absorbed into correct-looking answers.
Quote¶
"For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable." — Netflix, 2025
Why this is a meaningful design posture¶
Many observability systems implicitly treat coverage as a proxy for quality: "we attributed 99.9% of flows." But if 5% of those attributions are wrong, the downstream consumer — service dependency auditing, security analysis, incident triage — silently eats the incorrect answers. A single misattributed flow in a dependency graph creates a non-existent dependency that can't be disproved without external ground truth.
By contrast, an unattributed record signals "we don't know" explicitly. Downstream consumers can:
- Filter them out (know you're looking at partial data).
- Bound the uncertainty ("0.5% unattributed" is a quality metric rather than a silent source of wrong answers).
- Retry or escalate (some consumers may re-query after more heartbeats arrive).
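A minimal sketch of the consumer side of this posture. The record shape and function names (`FlowRecord`, `dependency_edges`) are illustrative, not Netflix's schema; the point is that `None` is an explicit "don't know" verdict that consumers filter out and count, never fold into the graph:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowRecord:
    src_ip: str
    dst_ip: str
    owner: Optional[str]  # None is an explicit "don't know" verdict, not noise

def dependency_edges(records):
    """Build dependency edges from confidently attributed flows only,
    and surface the unattributed fraction as a quality metric."""
    attributed = [r for r in records if r.owner is not None]
    rate = 1 - len(attributed) / len(records) if records else 0.0
    edges = {(r.src_ip, r.owner) for r in attributed}
    return edges, rate

records = [
    FlowRecord("10.0.0.1", "10.0.0.9", "svc-a"),
    FlowRecord("10.0.0.1", "10.0.0.7", "svc-b"),
    FlowRecord("10.0.0.1", "10.0.0.5", None),  # censored, never guessed
]
edges, rate = dependency_edges(records)  # rate is ~0.33 here, a bounded metric
```

The unattributed rate becomes a number an operator can alert on, while the edge set contains no edge that could later need to be disproved.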
When this posture is available¶
- The downstream consumer can tolerate a small censored window in the data (dependency graphs, billing summaries, usage metrics).
- The service has a way to detect uncertainty at query time. Canonical mechanism: a heartbeat time-range map has natural "no range covers this timestamp" gaps that are observable without reference to external state.
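The heartbeat time-range map can be sketched as a per-IP sorted list of ownership ranges; the class and field names below are assumptions for illustration. A lookup that lands in a gap returns `None` without consulting any external state:

```python
import bisect

class OwnershipTimeline:
    """Per-IP, non-overlapping (t_start, t_end, owner) ranges built from
    heartbeats. Names and representation are illustrative, not Netflix's."""

    def __init__(self):
        self.ranges = []  # kept sorted by t_start

    def add(self, t_start, t_end, owner):
        bisect.insort(self.ranges, (t_start, t_end, owner))

    def resolve(self, ts):
        """Return the owner whose range covers ts, or None if no range does."""
        i = bisect.bisect_right(self.ranges, (ts, float("inf"), "")) - 1
        if i >= 0:
            t_start, t_end, owner = self.ranges[i]
            if t_start <= ts <= t_end:
                return owner
        return None  # natural gap: "no range covers this timestamp"

tl = OwnershipTimeline()
tl.add(100, 200, "svc-a")
tl.add(300, 400, "svc-b")
```

Here `tl.resolve(150)` yields `"svc-a"`, while `tl.resolve(250)` falls in the gap between heartbeat ranges and yields `None`.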
When it's the wrong posture¶
- Downstream consumers need a positive attribution on every record to act at all (e.g. billing: an unattributed flow can't be billed to anyone, so every "don't know" is lost revenue rather than bounded uncertainty).
- The system's SLO is coverage, not correctness (e.g. fraud detection; false negatives are more expensive than false positives).
Canonical instance¶
systems/netflix-flowcollector — if a flow's t_start timestamp falls outside any ownership time range for the remote IP, retry after a delay and eventually give up, delivering the flow unattributed. Netflix makes this explicit: "Such failures may occur when flows are lost or broadcast messages are delayed. For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable."
The prior event-based system didn't have this option — it would return a stale attribution rather than unknown. Making the "unknown" verdict a first-class return value is a direct benefit of the heartbeat architecture.
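The retry-then-give-up loop described above can be sketched as follows. The function name, retry count, and delay are hypothetical values, not Netflix's actual parameters; `resolve` stands in for a lookup against the heartbeat time-range map:

```python
import time

def attribute_flow(resolve, t_start, retries=3, delay_s=5.0, sleep=time.sleep):
    """Retry while a late heartbeat or delayed broadcast may still close the
    gap; once the budget is exhausted, deliver the flow unattributed rather
    than guess. Parameters here are illustrative assumptions."""
    for attempt in range(retries):
        owner = resolve(t_start)
        if owner is not None:
            return owner  # confident attribution
        if attempt < retries - 1:
            sleep(delay_s)  # wait for more heartbeats to arrive
    return None  # the first-class "unknown" verdict
```

The key design choice is the final `return None`: the prior event-based system had no such branch and would emit a stale attribution instead.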
Trade-offs¶
- Coverage percentage is now a real quality metric. The operator must track and SLO it.
- Consumers must handle unattributed records. Previously they might have assumed every record carried an attribution.
- Failure modes are visible, not invisible. Missing heartbeats produce attribution gaps rather than silently wrong answers — arguably the main payoff.
Seen in¶
- sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs — canonical wiki instance; explicit quote makes the posture structural rather than accidental.