PATTERN

Precomputed sketch column in Delta table

Context

Analytical dashboards need percentiles, distinct counts, top-K, and distinct-plus-metric aggregations over time windows that users select at query time. Typical windows are hourly, daily, weekly, or custom ranges, and users expect sub-second refresh.

The exact-query approach is to run PERCENTILE, COUNT(DISTINCT), GROUP BY ... ORDER BY ... LIMIT K, or composed COUNT(DISTINCT) + SUM over raw events on every dashboard refresh. At scale (billions of rows), these queries force global sorts, full shuffles of user IDs, or cluster-wide reductions, costing minutes of latency or gigabytes of memory per refresh.
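
For concreteness, one such exact query might look like the sketch below; the events table and column names match the ETL example later on this page, and percentile here is the exact, sort-based aggregate:

-- Exact P99 latency and distinct users for one region over one week
SELECT
  percentile(latency_ms, 0.99)  AS p99_latency_ms,   -- global ranking over every row
  COUNT(DISTINCT user_id)       AS distinct_users    -- full shuffle of user IDs
FROM events
WHERE region = 'us-east'
  AND event_time BETWEEN :week_start AND :week_end;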

For decision-support queries where 1–2% error is acceptable, the raw-scan cost is pure waste — the same dashboard could be served from precomputed summaries in milliseconds.

Pattern

Build a mergeable sketch per partition per time bucket during ETL, store it as a BLOB column in a Delta Lake table, and merge sketches on read.

Storage shape

CREATE TABLE hourly_rollup (
  event_hour       TIMESTAMP,
  dimension        STRING,      -- e.g. region, campaign_id
  event_count      BIGINT,
  latency_sketch   BINARY,      -- KLL sketch
  distinct_users   BINARY,      -- Theta sketch
  top_queries      BINARY,      -- approx-top-K sketch
  user_revenue     BINARY,      -- Tuple sketch
  event_date       DATE GENERATED ALWAYS AS (CAST(event_hour AS DATE))
) USING DELTA
PARTITIONED BY (event_date);   -- Delta cannot partition by an expression; partition on the generated date column

Each column is a serialised sketch — kilobyte-scale per row.

Write path: ETL builds sketches once

INSERT INTO hourly_rollup
  (event_hour, dimension, event_count, latency_sketch,
   distinct_users, top_queries, user_revenue)   -- event_date is generated, so it is omitted
SELECT
  date_trunc('hour', event_time)       AS event_hour,
  region                               AS dimension,
  COUNT(*)                             AS event_count,
  kll_sketch_agg_double(latency_ms)    AS latency_sketch,
  theta_sketch_agg(user_id)            AS distinct_users,
  approx_top_k_accumulate(search_term) AS top_queries,
  tuple_sketch_agg_integer(user_id, revenue_cents) AS user_revenue
FROM events
WHERE event_time BETWEEN :from AND :to
GROUP BY date_trunc('hour', event_time), region;

The heavy compute — the per-partition sort, shuffle, dedup, group — happens once per hour, in the ETL pipeline, not on every dashboard refresh.
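
The same mergeability pays off again for coarser grains: a daily or weekly rollup can be derived from the hourly sketches instead of from raw events. A minimal sketch, assuming an illustrative daily_rollup table and the merge verbs used in the read path below (the tuple-sketch column merges analogously with the engine's tuple-merge aggregate):

INSERT INTO daily_rollup
SELECT
  date_trunc('day', event_hour)      AS event_day,
  dimension,
  SUM(event_count)                   AS event_count,
  kll_merge(latency_sketch)          AS latency_sketch,
  theta_union(distinct_users)        AS distinct_users,
  approx_top_k_combine(top_queries)  AS top_queries
FROM hourly_rollup
GROUP BY date_trunc('day', event_hour), dimension;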

Read path: dashboard merges sketches

-- P99 latency for US-East this week
SELECT kll_get_quantile_double(   -- double getter to match kll_sketch_agg_double on write
  kll_merge(latency_sketch),  -- merge 168 hourly sketches
  0.99
)
FROM hourly_rollup
WHERE dimension = 'us-east'
  AND event_hour BETWEEN :week_start AND :week_end;

-- Audience overlap: customers who saw ad A but not ad B
-- (campaign_rollup follows the same pattern as hourly_rollup, keyed by campaign)
SELECT theta_sketch_estimate(
  theta_difference(
    theta_union(distinct_users) FILTER (WHERE campaign = 'A'),
    theta_union(distinct_users) FILTER (WHERE campaign = 'B')
  )
)
FROM campaign_rollup
WHERE event_hour >= :campaign_start;
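
The other sketch columns merge the same way. As one more example, a top-K read over an arbitrary window, assuming the engine pairs the write path's accumulate call with combine / estimate functions (as Databricks SQL's approx_top_k_combine and approx_top_k_estimate do):

-- Top 10 search terms for US-East over an arbitrary window
SELECT approx_top_k_estimate(
  approx_top_k_combine(top_queries),  -- merge the hourly top-K sketches
  10
)
FROM hourly_rollup
WHERE dimension = 'us-east'
  AND event_hour BETWEEN :from AND :to;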

Because the sketches are mergeable, an arbitrary time window is a reduce over the relevant sketches in the table — not a scan of raw events.

Consequences

Good:

  • Dashboard refreshes drop from minutes of raw-event scanning to sub-second sketch merges.
  • The expensive sort, shuffle, and dedup work runs once per hour in ETL, not on every refresh.
  • Any user-selected window (hourly, daily, weekly, custom) is answered by merging stored sketches; raw events are never rescanned at query time.
  • Sketch columns are kilobyte-scale per row, small next to the raw-event footprint they replace.

Tradeoffs:

  • Relative error is real. The 1–2% envelope must be acceptable to the consumer; regulated or billing-adjacent dashboards need exact queries instead.
  • Sketch bytes take storage. Per-row kilobyte overhead adds up over billions of rollup rows; this is typically small relative to the raw-event footprint it replaces, but not zero.
  • ETL schema gets wider. Each sketch family is a distinct BLOB column; tables accumulate sketch columns as the dashboard's query classes grow.
  • Read queries need sketch-aware SQL. The BI tool must speak the kll_merge / theta_union / approx_top_k_combine / tuple-merge verbs rather than stock SQL; a thin view layer can hide them (see the sketch after this list).
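
One mitigation, as a minimal sketch: wrap the merge verbs in a view so the BI tool only sees ordinary columns. The view name weekly_latency is illustrative; the functions follow the naming used in the read-path examples above.

CREATE OR REPLACE VIEW weekly_latency AS
SELECT
  dimension,
  date_trunc('week', event_hour)                            AS week,
  kll_get_quantile_double(kll_merge(latency_sketch), 0.99)  AS p99_latency_ms,
  theta_sketch_estimate(theta_union(distinct_users))        AS distinct_users
FROM hourly_rollup
GROUP BY dimension, date_trunc('week', event_hour);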

Relationship to adjacent patterns

Seen in
