PATTERN
Precomputed sketch column in Delta table¶
Context¶
Analytical dashboards need percentiles, distinct counts, top-K, and distinct-plus-metric aggregations over time windows that users select at query time. Typical time windows are hourly, daily, weekly, or custom ranges, and the user expects sub-second refresh.
The exact-query approach is to run PERCENTILE, COUNT(DISTINCT), GROUP BY ... ORDER BY ... LIMIT K, or composed COUNT(DISTINCT) + SUM over raw events every time the dashboard refreshes. At scale (billions of rows), these queries force global sorts, full shuffles of user IDs, or cluster-wide reductions that cost minutes of runtime or gigabytes of memory per refresh.
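For concreteness, the exact-query baseline for one dashboard tile looks roughly like this (the events table and columns match the write path below; percentile is Spark SQL's exact percentile aggregate):
-- Exact P99 latency and distinct users for US-East this week,
-- recomputed from raw events on every refresh
SELECT
  percentile(latency_ms, 0.99) AS p99_latency,    -- buffers every latency value in the window
  COUNT(DISTINCT user_id)      AS distinct_users  -- shuffles and dedups every user_id
FROM events
WHERE region = 'us-east'
  AND event_time BETWEEN :week_start AND :week_end;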
For decision-support queries where 1–2% error is acceptable, the raw-scan cost is pure waste — the same dashboard could be served from precomputed summaries in milliseconds.
Pattern¶
Build a mergeable sketch per partition per time bucket during ETL, store it as a BLOB column in a Delta Lake table, and merge sketches on read.
Storage shape¶
CREATE TABLE hourly_rollup (
  event_hour     TIMESTAMP,
  dimension      STRING,   -- e.g. region, campaign_id
  event_count    BIGINT,
  latency_sketch BINARY,   -- KLL sketch
  distinct_users BINARY,   -- Theta sketch
  top_queries    BINARY,   -- approx-top-K sketch
  user_revenue   BINARY,   -- Tuple sketch
  -- Delta cannot partition on an expression, so derive the date as a generated column
  event_date     DATE GENERATED ALWAYS AS (CAST(event_hour AS DATE))
) USING DELTA
PARTITIONED BY (event_date);
Each column is a serialised sketch — kilobyte-scale per row.
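A quick sanity check on that claim against real rollup rows (octet_length is standard Spark SQL and returns the byte length of a BINARY value):
-- Average serialised sketch size per rollup row
SELECT
  avg(octet_length(latency_sketch)) AS avg_kll_bytes,
  avg(octet_length(distinct_users)) AS avg_theta_bytes
FROM hourly_rollup;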
Write path: ETL builds sketches once¶
INSERT INTO hourly_rollup
  -- event_date is a generated column, so it is omitted from the column list
  (event_hour, dimension, event_count, latency_sketch,
   distinct_users, top_queries, user_revenue)
SELECT
  date_trunc('hour', event_time)                   AS event_hour,
  region                                           AS dimension,
  COUNT(*)                                         AS event_count,
  kll_sketch_agg_double(latency_ms)                AS latency_sketch,
  theta_sketch_agg(user_id)                        AS distinct_users,
  approx_top_k_accumulate(search_term)             AS top_queries,
  tuple_sketch_agg_integer(user_id, revenue_cents) AS user_revenue
FROM events
WHERE event_time BETWEEN :from AND :to
GROUP BY date_trunc('hour', event_time), region;
The heavy compute — the per-partition sort, shuffle, dedup, group — happens once per hour, in the ETL pipeline, not on every dashboard refresh.
Read path: dashboard merges sketches¶
-- P99 latency for US-East this week
SELECT kll_get_quantile_double(      -- double variant, matching kll_sketch_agg_double above
         kll_merge(latency_sketch),  -- merge 168 hourly sketches
         0.99
       )
FROM hourly_rollup
WHERE dimension = 'us-east'
  AND event_hour BETWEEN :week_start AND :week_end;
-- Audience overlap: customers who saw ad A but not ad B
SELECT theta_sketch_estimate(
         theta_difference(
           theta_union(distinct_users) FILTER (WHERE campaign = 'A'),
           theta_union(distinct_users) FILTER (WHERE campaign = 'B')
         )
       )
FROM campaign_rollup
WHERE event_hour >= :campaign_start;
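The remaining sketch columns follow the same merge-then-estimate shape on read. A weekly top-10 of search terms, for example, combines the hourly top-K sketches (approx_top_k_estimate is assumed here as the read-side counterpart of approx_top_k_accumulate and approx_top_k_combine; the tuple-sketch revenue column would be read the same way with its own merge and estimate functions):
-- Top 10 search terms for US-East this week
SELECT approx_top_k_estimate(
         approx_top_k_combine(top_queries),  -- merge 168 hourly top-K sketches
         10
       )
FROM hourly_rollup
WHERE dimension = 'us-east'
  AND event_hour BETWEEN :week_start AND :week_end;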
Because the sketches are mergeable, an arbitrary time window is a reduce over the relevant sketches in the table — not a scan of raw events.
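Mergeability also means coarser rollups never have to touch the raw events again. A daily table, for instance, can be derived entirely from the hourly sketches (daily_rollup is an assumed table with the analogous schema; the merge aggregates are the same ones used on the read path, and the tuple column would merge the same way with its tuple-merge aggregate):
INSERT INTO daily_rollup
SELECT
  date_trunc('day', event_hour)     AS event_day,
  dimension,
  SUM(event_count)                  AS event_count,
  kll_merge(latency_sketch)         AS latency_sketch,
  theta_union(distinct_users)       AS distinct_users,
  approx_top_k_combine(top_queries) AS top_queries
FROM hourly_rollup
GROUP BY date_trunc('day', event_hour), dimension;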
Consequences¶
Good:
- Dashboards become merges, not scans. Weekly-trending dashboard = merge of 168 hourly sketches, executed in milliseconds. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
- ETL cost is amortised. The heavy global-sort / global-dedup work runs once per period, not once per query.
- Streaming-friendly. For Structured Streaming, merge each micro-batch's sketch into a running total — a live leaderboard replaces a batch job.
- Raw data stays on disk for audit. "The raw data is still there when the auditors ask. For everything else, a 1% error margin and a 1000x speedup is a welcome trade-off." (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
- Cross-engine portability. A sketch written in Spark ETL can be read by any DataSketches-compatible consumer — C++, Java, Druid, Pinot.
Tradeoffs:
- Relative error is real. The 1–2% envelope must be acceptable to the consumer; regulated or billing-adjacent dashboards need exact queries instead.
- Sketch bytes take storage. Per-row kilobyte overhead adds up over billions of rollup rows; this is typically small relative to the raw-event footprint it replaces, but not zero.
- ETL schema gets wider. Each sketch family is a distinct BLOB column; tables accumulate sketch columns as the dashboard's query classes grow.
- Read queries need sketch-aware SQL. The BI tool must speak the kll_merge / theta_union / approx_top_k_combine / tuple-merge verbs rather than stock SQL; one mitigation is to hide them behind a view, as sketched after this list.
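A view that resolves the sketch verbs once can keep them out of the BI layer entirely, at the cost of fixing the grain (the view name is an assumption; the functions are the ones used above):
-- BI-friendly daily metrics; dashboards query plain columns
CREATE OR REPLACE VIEW daily_latency_metrics AS
SELECT
  CAST(event_hour AS DATE)                                  AS event_date,
  dimension,
  kll_get_quantile_double(kll_merge(latency_sketch), 0.99)  AS p99_latency_ms,
  theta_sketch_estimate(theta_union(distinct_users))        AS distinct_users
FROM hourly_rollup
GROUP BY CAST(event_hour AS DATE), dimension;
Because the view pins the window to one day, arbitrary user-selected ranges still need a sketch-aware merge; the view only covers fixed-grain dashboards.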
Relationship to adjacent patterns¶
- patterns/set-algebra-on-theta-sketches — the specific read-side operation for audience-overlap / incrementality queries. Uses Theta sketches stored under this pattern.
- patterns/local-global-aggregation-split — the general decomposition that this pattern specialises to mergeable probabilistic structures.
- patterns/dual-granularity-rollup-tables — pre-existing rollup pattern; this pattern extends it from count / sum aggregates to percentile / distinct / top-K / distinct-plus-metric.
- concepts/sketch-as-mysql-binary-column — same pattern applied to MySQL BLOB columns with loadable C++ functions (PlanetScale Insights). The Delta Lake version benefits from Spark's distributed merge; the MySQL version from the OLTP read path.
Seen in¶
- sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Databricks canonicalises the pattern as the intended workflow for its new sketch function families. The post's explicit framing: "Build them once during your daily ETL. Store them as columns in Delta tables. When a dashboard needs P50/P90/P99 for any time range, merge the precomputed sketches in milliseconds instead of rescanning raw data."