
CONCEPT Cited by 1 source

Best-effort log delivery

Best-effort log delivery names the reliability contract at the loosest end of the streaming-delivery spectrum: a log record may be missed, arrive late, or be duplicated, with no framework-level retry or ordering guarantees. It is cheaper and simpler than at-least-once or exactly-once delivery, and it is often the only contract available from managed-cloud logging primitives (notably S3 Server Access Logs, abbreviated SAL below).

Definition (AWS SAL canonical)

"Delivery of access logs is best-effort, meaning a log may occasionally be missed, arrive late, or have duplicates." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale, quoting the AWS docs)

Designing around best-effort

Three disciplines make best-effort acceptable for a given use case:

  1. Measure the straggler tail. Yelp's fleet-scale measurement on SAL:
    • Fewer than 0.001% of logs arrive more than 2 days late.
    • Longest observed straggler: ~9 days.
    • Implication: any pipeline with a multi-week retention window absorbs the tail naturally.
  2. Pick a retention window longer than the tail. Yelp: "Our retention periods are much longer than the maximum log delay." If your downstream decision (e.g. deletion) is gated on "has this object been accessed recently?", ensure the window of "recently" comfortably exceeds your measured straggler tail.
  3. Aggregate to a coarser grain where the logs that do arrive still add up to a signal. Yelp's access-based retention works at prefix granularity, not object granularity: "deletions are based on prefixes—so missing all logs for a given prefix would only occur for truly inactive data." Even if one object's log line is dropped, the other objects under an active prefix will generate enough log lines to signal access.
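
The three disciplines above can be sketched in a few lines of Python. This is a minimal illustration, not Yelp's implementation: the function names, the `(event_time, delivery_time)` record shape, the `safety_factor` margin, and the two-segment prefix depth are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def straggler_stats(records, late_threshold=timedelta(days=2)):
    """Measure the tail from (event_time, delivery_time) pairs.

    Returns (fraction delivered later than late_threshold, longest delay).
    """
    delays = [delivered - event for event, delivered in records]
    if not delays:
        return 0.0, timedelta(0)
    late = sum(1 for d in delays if d > late_threshold)
    return late / len(delays), max(delays)


def window_covers_tail(retention_window, max_delay, safety_factor=2.0):
    """True if the retention window exceeds the measured tail by a margin.

    safety_factor is a hypothetical knob, not a value from the source.
    """
    return retention_window >= max_delay * safety_factor


def active_prefixes(accessed_keys, depth=2):
    """Collapse accessed object keys to their first `depth` path segments.

    A prefix counts as active if any log line for any object under it
    arrived, so one dropped line rarely hides an active prefix.
    """
    return {"/".join(key.split("/")[:depth]) for key in accessed_keys}
```

With a measured maximum delay of about 9 days, `window_covers_tail(timedelta(days=30), timedelta(days=9))` holds under the assumed 2x margin, while a 10-day window does not.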

When best-effort is unacceptable

  • Billing events / financial transactions: missed records are missed revenue. Use at-least-once + idempotency tokens or a durable changelog (CDC).
  • Incident / compromise traceability: if the log is the sole source of truth for a security audit, a missed or late log line can hide the attack. (Yelp's SAL-for-incident-response mitigates this implicitly by not needing every log line for pattern recognition. The access fingerprint shows up in many log lines, so losing some does not hide the attack.)
  • Exact-count analytics (deduplication by row key): if you need exact counts rather than order-of-magnitude figures, best-effort's duplicate risk leads to over-counting. Deduplicating by row key removes the duplicates but cannot recover missed records.
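
To make the over-counting point concrete, here is a minimal sketch of counting accesses per key with and without deduplication. The record shape (dicts with `request_id` and `key` fields) and the function name are assumptions for illustration; the sketch also shows the limit of the approach, since deduplication can only remove repeats, not restore records that were never delivered.

```python
def count_requests(records, dedupe=True):
    """Count accesses per object key.

    records: iterable of dicts with hypothetical 'request_id' and 'key'
    fields. With dedupe=True, a repeated request_id is counted once.
    """
    seen = set()
    counts = {}
    for rec in records:
        if dedupe:
            if rec["request_id"] in seen:
                continue  # duplicate delivery of a record already counted
            seen.add(rec["request_id"])
        counts[rec["key"]] = counts.get(rec["key"], 0) + 1
    return counts
```

The `seen` set grows with the number of distinct request IDs, so at real log volumes this would more likely be a windowed or database-backed dedupe; that cost is the "dedupe cost downstream" that stricter tiers also pay.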

Comparison on the delivery-semantics ladder

| Tier | Miss | Late | Duplicate | Typical cost |
|---|---|---|---|---|
| Best-effort | yes | yes | yes | lowest; often free with the source service |
| At-least-once | no | yes | yes | +idempotency or dedupe cost downstream |
| Exactly-once | no | yes | no | +transactional outbox / 2PC / Kafka EoS overhead |
| In-order at-least-once | no | bounded | yes | +single-writer-per-key or total-order broker |

Straggler policy as a design axis

Given best-effort delivery, the system owner must decide what to do with stragglers that arrive after the downstream job has already processed the window. Yelp's choice:

"We decided that the straggler logs can be ignored to deliver business value in a timely fashion. The straggler logs can be inserted at a later time after tagged objects have expired."

Alternatives:

  • Late-arriving-record re-insertion — pipeline re-opens the window and inserts stragglers. Higher correctness, more orchestration complexity.
  • Separate late-arrival sink — stragglers land in a distinct object / table; used for one-off audits.
  • Drop (Yelp's choice on SAL) — when downstream decisions are at aggregate / prefix granularity.
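
The three policies can be expressed as a small routing decision made when a record arrives. The sketch below is illustrative only: `StragglerPolicy`, `route`, the `closed_through` watermark, and the destination labels are hypothetical names, not part of any framework or of Yelp's pipeline.

```python
from datetime import datetime
from enum import Enum


class StragglerPolicy(Enum):
    REINSERT = "reinsert"    # re-open the window and insert the record
    LATE_SINK = "late_sink"  # park the record in a separate table/object
    DROP = "drop"            # ignore the record


def route(event_time, closed_through, policy):
    """Decide where a record goes.

    closed_through: end of the last window the downstream job has
    already processed. Records at or after it are on time.
    """
    if event_time >= closed_through:
        return "main"
    if policy is StragglerPolicy.REINSERT:
        return "reopen_window"
    if policy is StragglerPolicy.LATE_SINK:
        return "late_sink"
    return "dropped"
```

Keeping the policy as an explicit parameter makes the trade-off visible in configuration: a pipeline that decides at prefix granularity can run with `DROP`, while an audit-oriented consumer of the same logs might choose `LATE_SINK`.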

Seen in

  • sources/2025-09-26-yelp-s3-server-access-logs-at-scale: canonical first-party disclosure at fleet scale. Fewer than 0.001% of logs arrive more than 2 days late; the longest observed delay is ~9 days. Yelp's access-based retention explicitly depends on two things: the measured tail of best-effort delivery sitting well inside the retention windows, and deletion operating at prefix granularity.