PATTERN

Precomputed sketch column in Delta table

Context

Analytical dashboards need percentiles, distinct counts, top-K, and distinct-plus-metric aggregations over time windows that users select at query time. Typical windows are hourly, daily, weekly, or custom ranges, and users expect sub-second refresh.

The exact-query approach is to run PERCENTILE, COUNT(DISTINCT), GROUP BY ... ORDER BY ... LIMIT K, or composed COUNT(DISTINCT) + SUM over raw events on every dashboard refresh. At scale (billions of rows), these queries force global sorts, full shuffles of user IDs, or cluster-wide reductions, costing minutes of latency or gigabytes of memory per refresh.
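
For concreteness, one such exact query might look like the sketch below; the events table and column names match the ETL example later on this page, and percentile here is the exact, sort-based aggregate:

-- Exact P99 latency and distinct users for one region over one week
SELECT
  percentile(latency_ms, 0.99)  AS p99_latency_ms,   -- global ranking over every row
  COUNT(DISTINCT user_id)       AS distinct_users    -- full shuffle of user IDs
FROM events
WHERE region = 'us-east'
  AND event_time BETWEEN :week_start AND :week_end;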

For decision-support queries where 1–2% error is acceptable, the raw-scan cost is pure waste — the same dashboard could be served from precomputed summaries in milliseconds.

Pattern

Build a mergeable sketch per partition per time bucket during ETL, store it as a BLOB column in a Delta Lake table, and merge sketches on read.

Storage shape

CREATE TABLE hourly_rollup (
  event_hour       TIMESTAMP,
  dimension        STRING,      -- e.g. region, campaign_id
  event_count      BIGINT,
  latency_sketch   BINARY,      -- KLL sketch
  distinct_users   BINARY,      -- Theta sketch
  top_queries      BINARY,      -- approx-top-K sketch
  user_revenue     BINARY,      -- Tuple sketch
  event_date       DATE GENERATED ALWAYS AS (CAST(event_hour AS DATE))
) USING DELTA
PARTITIONED BY (event_date);   -- Delta cannot partition by an expression; partition on the generated date column

Each column is a serialised sketch — kilobyte-scale per row.

Write path: ETL builds sketches once

INSERT INTO hourly_rollup
  (event_hour, dimension, event_count, latency_sketch,
   distinct_users, top_queries, user_revenue)   -- event_date is generated, so it is omitted
SELECT
  date_trunc('hour', event_time)       AS event_hour,
  region                               AS dimension,
  COUNT(*)                             AS event_count,
  kll_sketch_agg_double(latency_ms)    AS latency_sketch,
  theta_sketch_agg(user_id)            AS distinct_users,
  approx_top_k_accumulate(search_term) AS top_queries,
  tuple_sketch_agg_integer(user_id, revenue_cents) AS user_revenue
FROM events
WHERE event_time BETWEEN :from AND :to
GROUP BY date_trunc('hour', event_time), region;

The heavy compute — the per-partition sort, shuffle, dedup, group — happens once per hour, in the ETL pipeline, not on every dashboard refresh.
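
The same mergeability pays off again for coarser grains: a daily or weekly rollup can be derived from the hourly sketches instead of from raw events. A minimal sketch, assuming an illustrative daily_rollup table and the merge verbs used in the read path below (the tuple-sketch column merges analogously with the engine's tuple-merge aggregate):

INSERT INTO daily_rollup
SELECT
  date_trunc('day', event_hour)      AS event_day,
  dimension,
  SUM(event_count)                   AS event_count,
  kll_merge(latency_sketch)          AS latency_sketch,
  theta_union(distinct_users)        AS distinct_users,
  approx_top_k_combine(top_queries)  AS top_queries
FROM hourly_rollup
GROUP BY date_trunc('day', event_hour), dimension;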

Read path: dashboard merges sketches

-- P99 latency for US-East this week
SELECT kll_get_quantile_double(   -- double getter to match kll_sketch_agg_double on write
  kll_merge(latency_sketch),  -- merge 168 hourly sketches
  0.99
)
FROM hourly_rollup
WHERE dimension = 'us-east'
  AND event_hour BETWEEN :week_start AND :week_end;

-- Audience overlap: customers who saw ad A but not ad B
-- (campaign_rollup follows the same pattern as hourly_rollup, keyed by campaign)
SELECT theta_sketch_estimate(
  theta_difference(
    theta_union(distinct_users) FILTER (WHERE campaign = 'A'),
    theta_union(distinct_users) FILTER (WHERE campaign = 'B')
  )
)
FROM campaign_rollup
WHERE event_hour >= :campaign_start;
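
The other sketch columns merge the same way. As one more example, a top-K read over an arbitrary window, assuming the engine pairs the write path's accumulate call with combine / estimate functions (as Databricks SQL's approx_top_k_combine and approx_top_k_estimate do):

-- Top 10 search terms for US-East over an arbitrary window
SELECT approx_top_k_estimate(
  approx_top_k_combine(top_queries),  -- merge the hourly top-K sketches
  10
)
FROM hourly_rollup
WHERE dimension = 'us-east'
  AND event_hour BETWEEN :from AND :to;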

Because the sketches are mergeable, an arbitrary time window is a reduce over the relevant sketches in the table — not a scan of raw events.

Consequences

Good:

  • Dashboard refreshes drop from minutes of raw-event scanning to sub-second sketch merges.
  • The expensive sort, shuffle, and dedup work runs once per hour in ETL, not on every refresh.
  • Any user-selected window (hourly, daily, weekly, custom) is answered by merging stored sketches; raw events are never rescanned at query time.
  • Sketch columns are kilobyte-scale per row, small next to the raw-event footprint they replace.

Tradeoffs:

  • Relative error is real. The 1–2% envelope must be acceptable to the consumer; regulated or billing-adjacent dashboards need exact queries instead.
  • Sketch bytes take storage. Per-row kilobyte overhead adds up over billions of rollup rows; this is typically small relative to the raw-event footprint it replaces, but not zero.
  • ETL schema gets wider. Each sketch family is a distinct BLOB column; tables accumulate sketch columns as the dashboard's query classes grow.
  • Read queries need sketch-aware SQL. The BI tool must speak the kll_merge / theta_union / approx_top_k_combine / tuple-merge verbs rather than stock SQL; a thin view layer can hide them (see the sketch after this list).
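
One mitigation, as a minimal sketch: wrap the merge verbs in a view so the BI tool only sees ordinary columns. The view name weekly_latency is illustrative; the functions follow the naming used in the read-path examples above.

CREATE OR REPLACE VIEW weekly_latency AS
SELECT
  dimension,
  date_trunc('week', event_hour)                            AS week,
  kll_get_quantile_double(kll_merge(latency_sketch), 0.99)  AS p99_latency_ms,
  theta_sketch_estimate(theta_union(distinct_users))        AS distinct_users
FROM hourly_rollup
GROUP BY dimension, date_trunc('week', event_hour);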

Relationship to adjacent patterns

Seen in
