PATTERN Cited by 1 source

Raw-to-columnar log compaction

Problem

A continuous log producer (S3 Server Access Logs, CloudFront access logs, Lambda request logs, ALB access logs, application request logs) writes many small raw-text objects to a destination store over the course of a day. At scale this is unqueryable:

  • S3 API rate limits throttle a query engine trying to scan millions of tiny objects.
  • Per-request overhead dominates useful payload.
  • Storage-class transitions (Glacier / Deep Archive) cost more than necessary: the per-object transition fee is the same for a 1-KB object as for a 128-MB object.
  • Text format has no column-level predicate pushdown — every query reads every byte.
  • No row-group statistics — no pruning of sub-file ranges.

The pain is load-bearing. Yelp (2025-09-26):

"Inevitably, the accumulated volume of logs for these buckets with SimplePrefix became impossible to query in Athena due to S3 API rate limits." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Pattern

On a daily (or hourly, weekly) batch, read all raw-text log objects for a window, parse them, and rewrite as a small number of columnar objects (typically Parquet) partitioned for the expected query shape.

Canonical shape

  1. Raw sink: source logs land at the destination in small text objects, in whatever partition layout the producer supports. For S3 SAL: PartitionedPrefix + [account]/[region]/[bucket]/[date]/ per the concepts/s3-server-access-logs spec.
  2. Compaction job runs on a batch scheduler (systems/tron at Yelp) once per window:
     a. Enumerate the window's raw objects.
     b. Run an idempotent INSERT into the columnar target table (patterns/idempotent-athena-insertion-via-left-join).
     c. Verify row counts (Yelp: GetQueryRuntimeStatistics Rows on first insertion; distinct-count on re-insertion).
     d. Tag the source raw objects for lifecycle expiration (patterns/object-tagging-for-lifecycle-expiration).
  3. Query the columnar target, not the raw, for all downstream analytics and debugging.
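The window steps above can be sketched as a small orchestration skeleton. Everything here is illustrative, not Yelp's implementation: in a real job the injected callables would wrap boto3's Athena (StartQueryExecution, GetQueryRuntimeStatistics) and S3 (PutObjectTagging) APIs.

```python
# Skeleton of the per-window compaction job (step 2 above).
# The callables are hypothetical stand-ins for boto3 calls; the point
# is the ordering: tagging for expiration happens only after the
# insert's row counts have been verified.

def compact_window(window, enumerate_raw, run_insert, verify_counts,
                   tag_for_expiration):
    raw_keys = enumerate_raw(window)      # 2a: list the window's raw objects
    query_id = run_insert(window)         # 2b: idempotent INSERT (self-LEFT-JOIN)
    verify_counts(query_id)               # 2c: row-count check; raises on mismatch
    tag_for_expiration(raw_keys)          # 2d: only after verification succeeds
    return len(raw_keys)
```

Injecting the four steps as callables keeps the retry logic testable without touching AWS; a scheduler such as tron would simply re-invoke `compact_window` on failure.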

Reported outcomes (Yelp on SAL at fleet scale)

  • 85% storage reduction — Parquet's per-column compression and row-group encoding dominate.
  • 99.99% object-count reduction — small text objects collapsed into a small number of per-partition Parquet objects.

Both deltas improve downstream storage-class transitions, query-engine latency, and API-request cost.

Partitioning discipline

The columnar target's partition layout should match the common query shape. Yelp's SAL compaction partitions on (bucket_name, yyyy/MM/dd), matching "a day's worth of logs" as the canonical query window. Finer granularity would over-partition the table (recreating the small-object problem); coarser would force each query to scan more data than it needs.

See partition projection for the enumeration mechanism.
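As an illustration of that enumeration mechanism, Athena partition projection is declared as table properties rather than metastore entries. The property names below are Athena's; the table location and the `dt` / `bucket_name` column names are assumptions for this sketch.

```python
# Hypothetical TBLPROPERTIES for an Athena table partitioned on
# (bucket_name, dt). With projection enabled, Athena computes the
# partition list from these rules instead of enumerating S3 prefixes
# or metastore partitions.
projection_properties = {
    "projection.enabled": "true",
    "projection.bucket_name.type": "injected",   # value supplied by the query's WHERE clause
    "projection.dt.type": "date",
    "projection.dt.format": "yyyy/MM/dd",        # matches the partition key format above
    "projection.dt.range": "2020/01/01,NOW",     # assumed retention window
    "storage.location.template":
        "s3://compacted-logs/${bucket_name}/${dt}/",  # assumed target location
}
```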

Idempotence requirements

The compaction job will be retried — cluster evictions (concepts/athena-shared-resource-contention), quota hits, transient S3 errors. Retry-safety is mandatory:

  • Use self-LEFT-JOIN on a unique column to avoid duplicate rows on retry.
  • Delete raw objects only after the insertion and count verification succeed, via tag-based expiration rather than immediate delete, so that a bug discovered later can be re-run against the still-present raw objects.
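A minimal sketch of the retry-safe insert, assuming a hypothetical target table `sal_parquet`, raw table `raw_sal`, and unique `request_id` column (the actual statement lives in patterns/idempotent-athena-insertion-via-left-join):

```python
# Builds the idempotent Athena INSERT. Rows already present in the
# target (matched on the assumed-unique request_id) are filtered out
# by LEFT JOIN + IS NULL, so a retried run inserts nothing twice.
# Table and column names are hypothetical.

def build_idempotent_insert(window_date: str) -> str:
    return f"""
INSERT INTO sal_parquet
SELECT raw.*
FROM raw_sal AS raw
LEFT JOIN sal_parquet AS tgt
  ON tgt.request_id = raw.request_id
 AND tgt.dt = '{window_date}'
WHERE raw.dt = '{window_date}'
  AND tgt.request_id IS NULL
""".strip()
```

Restricting the join's target side to the same `dt` partition keeps the self-join from scanning the whole columnar table on every retry.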

Trade-offs vs alternatives

  • Raw text only — cost: highest storage (no compression), per-query API cost scales with tiny-object count; query perf: very bad at scale; complexity: trivial.
  • CloudTrail Data Events (AWS alternative) — cost: $1/M events (orders of magnitude higher at fleet scale); query perf: good; complexity: low.
  • Raw-to-columnar compaction (this pattern) — cost: lowest storage, batch-job CPU, one-shot API cost per window; query perf: very good; complexity: moderate (idempotence, retention, partitioning).
  • Streaming columnar (Firehose → Parquet) — cost: streaming infra; query perf: good; complexity: high (streaming guarantees + schema evolution).

Seen in
