
CONCEPT Cited by 1 source

Partition access pattern

Definition

A partition access pattern is the signature shape that emerges when you plot the partition values accessed against access-event time for a single analytics table — typically with the accessing entity (IAM role, user, service) used as the colour axis. Different consumer types produce visually distinct signatures with no manual labelling.

The three canonical signatures (Yelp 2026-05-21)

"Given an analytics table partitioned by date (dt=yyyy-mm-dd), we can plot partitions accessed against event time."

1. Diagonal y=x line — daily batch consumer

"Daily batch consumers present a diagonal line showing y=x. Today's job reads today's partition, yesterday's job reads yesterday's partition, and so on."

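The y=x signature is the cleanest tell of a scheduled batch: the partition the job reads always sits at the same offset from "now" (usually 0 days back, sometimes 1 day for late-landing data). A fixed one-day lag shifts the line to y=x-1 without changing its slope, so with both axes on the same date scale the signature still appears as a 45° line.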
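The constant-offset idea can be made concrete with a minimal sketch. The function name `partition_lag_days` and the sample accesses are hypothetical, and the sketch assumes event times and `dt=` partition values refer to the same time zone.

```python
from datetime import date, datetime


def partition_lag_days(event_time: datetime, partition_value: str) -> int:
    """Days between the access event and the dt= partition it read."""
    partition_date = date.fromisoformat(partition_value)
    return (event_time.date() - partition_date).days


# Hypothetical accesses by one daily job: (event time, partition read).
accesses = [
    (datetime(2026, 5, 1, 3, 0), "2026-05-01"),
    (datetime(2026, 5, 2, 3, 0), "2026-05-02"),
    (datetime(2026, 5, 3, 3, 0), "2026-05-03"),
]

lags = [partition_lag_days(t, p) for t, p in accesses]

# A diagonal signature means every access has the same lag.
is_diagonal = len(set(lags)) == 1
```

A job that reads yesterday's partition would give a lag of 1 on every row, which is the same diagonal shifted down by one day.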

2. Vertical line — backfill

"Backfill events that scan many partitions present a vertical line."

A backfill — one job at one moment scanning the historical depth of a table — pins event time to a single x-coordinate while sweeping through many partition values on the y-axis. Visually: a vertical streak.

3. Scatter without pattern — ad hoc inspection

"Ad hoc queries appear as scatter points with no clear pattern. We commonly see this type of access from roles attributed to internal engineering teams — often inspecting data to confirm it meets expectations."

Random scatter signals a human-in-the-loop reading partitions in no predictable order — the third canonical Yelp shape, typically tied to IAM roles attributed to internal engineering teams.

Why the visualisation works

Three properties make signature recognition cheap and reliable:

  1. Time partition keys are already temporal coordinates. The y-axis is not just any partition value; it is a partition value that corresponds to a date, so the y-axis and the x-axis (event time) are both measured in calendar time. A diagonal y=x line therefore has a literal meaning: the partition read is the partition for "today".
  2. Three primitive consumer types map to three visually distinct shapes. Diagonal (batch) ≠ vertical (backfill) ≠ scatter (ad hoc) — no two consumer types produce the same chart shape, so the inverse mapping (chart shape → consumer type) is unambiguous.
  3. The IAM-role colour coding distinguishes which batch / which backfill / which engineer is reading. See patterns/iam-role-attribution-from-s3-access-logs.
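The shape-to-consumer mapping described above could be approximated in code roughly as follows. This is a heuristic sketch, not Yelp's method: the function name, the thresholds (`min_days`, `min_partitions`, `share`), and the role names other than search-indexer are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def classify_signature(points, min_days=3, min_partitions=5, share=0.8):
    """Classify one role's accesses; points is a list of
    (event_time: datetime, partition_date: date) pairs."""
    if not points:
        return "no data"

    # Diagonal: most accesses share one lag, spread over several event days.
    lags = Counter((t.date() - p).days for t, p in points)
    top_lag, top_lag_count = lags.most_common(1)[0]
    days_on_top_lag = {t.date() for t, p in points if (t.date() - p).days == top_lag}
    if top_lag_count / len(points) >= share and len(days_on_top_lag) >= min_days:
        return "daily batch"

    # Vertical: most accesses fall on one event day and span many partitions.
    by_day = Counter(t.date() for t, _ in points)
    top_day, top_day_count = by_day.most_common(1)[0]
    partitions_on_top_day = {p for t, p in points if t.date() == top_day}
    if top_day_count / len(points) >= share and len(partitions_on_top_day) >= min_partitions:
        return "backfill"

    # Neither shape dominates: treat as scatter.
    return "ad hoc"


# Hypothetical accesses, keyed by the role that would drive the colour axis.
by_role = {
    "search-indexer": [(datetime(2026, 5, d, 3), date(2026, 5, d)) for d in range(1, 8)],
    "reviews-backfill": [
        (datetime(2026, 5, 10, 12), date(2026, 4, 1) + timedelta(days=i)) for i in range(30)
    ],
    "eng-adhoc": [
        (datetime(2026, 5, 1, 10), date(2026, 4, 3)),
        (datetime(2026, 5, 4, 15), date(2026, 4, 28)),
        (datetime(2026, 5, 9, 9), date(2026, 3, 15)),
        (datetime(2026, 5, 9, 11), date(2026, 5, 8)),
        (datetime(2026, 5, 15, 16), date(2026, 4, 20)),
        (datetime(2026, 5, 20, 14), date(2026, 5, 2)),
    ],
}

labels = {role: classify_signature(points) for role, points in by_role.items()}
```

Classifying per role mirrors the colour axis: one table can carry a diagonal from one role and a vertical streak from another at the same time.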

Why this concept matters

It is the observability primitive that unblocks data-platform efficiency wins:

  • Stakeholder discovery: answers "who consumes this table" directly from a chart. The data owner no longer relies on stakeholder conversations, documentation, or word of mouth.
  • Storage-class routing: once you know access follows the daily-batch diagonal signature, you can choose the access window for Default Access Retention. Data more than N days back shows no consumer signature, so it is safe to gate behind IAM.
  • Migration prioritisation: the most-accessed tables and partitions, identified from the same data structure, can be migrated first to Apache Iceberg, delivering read-performance wins to the highest-value consumers. See patterns/usage-driven-migration-prioritization.
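For the storage-class routing point, one way to pick N is to take the deepest lag observed over the logging period and add a safety margin. The sketch below assumes exactly that; the function name `suggest_access_window` and the default `margin_days` are hypothetical, not taken from the source.

```python
from typing import Iterable, Optional


def suggest_access_window(lags_days: Iterable[int], margin_days: int = 7) -> Optional[int]:
    """Return N such that no observed access reached further than
    N - margin_days days back from its event date."""
    observed = [lag for lag in lags_days if lag >= 0]
    if not observed:
        return None  # no accesses observed, so there is nothing to base a window on
    return max(observed) + margin_days


# Hypothetical lags (event date minus partition date, in days) for one table.
observed_lags = [0, 0, 1, 0, 3, 0, 1]
window = suggest_access_window(observed_lags)
```

A single backfill would inflate the maximum, so in practice it may make sense to compute the window from the diagonal roles only.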

Generalisation beyond date-partitioned tables

The technique requires the partition key to be a temporal coordinate that admits ordering against access-event time. Hash-partitioned (e.g. customer_id_hash), category-partitioned (e.g. country), and bucketed partition schemes do not produce the same diagonal/vertical/scatter signatures because the y-axis is no longer comparable to the x-axis.

For non-temporal partition keys, alternative visualisations apply (per-partition access counts over a rolling window — Yelp's second chart, "Partitions Accessed") but the signature-recognition property is lost.
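A rolling-window count of this kind might look like the sketch below. The row layout `(event_time, partition_value, ct)`, the function name, and the `country=` partition values are assumptions for illustration; the source does not describe how the "Partitions Accessed" chart is computed.

```python
from collections import Counter
from datetime import datetime, timedelta


def partition_counts_in_window(rows, as_of, window_days=30):
    """Sum access counts per partition for events in (as_of - window_days, as_of]."""
    start = as_of - timedelta(days=window_days)
    counts = Counter()
    for event_time, partition_value, ct in rows:
        if start < event_time <= as_of:
            counts[partition_value] += ct
    return counts


# Hypothetical aggregated rows for a table partitioned by country.
rows = [
    (datetime(2026, 5, 1, 9), "country=US", 40),
    (datetime(2026, 5, 3, 9), "country=CA", 5),
    (datetime(2026, 5, 20, 9), "country=US", 25),
    (datetime(2026, 3, 1, 9), "country=DE", 70),  # older than the window used below
]

counts = partition_counts_in_window(rows, as_of=datetime(2026, 5, 21), window_days=30)
```

The result says how often each partition is read, but nothing about the shape of the access, which is the property the text notes is lost.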

The IAM-role coordinate

Yelp explicitly includes the entity dimension:

"We include the entity accessing the data in the visualization. Since Yelp runs on AWS, our entities are IAM roles, with role names linked to services or teams."

The colour-by-IAM-role choice is what turns "a daily batch reads this table" into "the search-indexer service's daily batch reads this table." See patterns/iam-role-attribution-from-s3-access-logs for the structural argument that IAM roles are the right entity coordinate for AWS-native data lakes.
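To colour by role, the role name has to be pulled out of the logged requester. The sketch below assumes the requester for an assumed role is an STS ARN of the form `arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>`; the function name and the sample account ID and session name are hypothetical.

```python
from typing import Optional


def role_from_requester(requester: str) -> Optional[str]:
    """Return the IAM role name from an assumed-role STS ARN, else None."""
    parts = requester.split(":", 5)
    if len(parts) == 6 and parts[2] == "sts" and parts[5].startswith("assumed-role/"):
        segments = parts[5].split("/")
        if len(segments) >= 2 and segments[1]:
            return segments[1]
    return None  # IAM users, anonymous requests ("-"), or unexpected formats


role = role_from_requester(
    "arn:aws:sts::123456789012:assumed-role/search-indexer/session-1"
)
```

Dropping the session name matters here: sessions change per run, while the role name stays stable enough to act as a colour key.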

Substrate

Computed via a SQL aggregation over compacted S3 server access logs:

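-- Count S3 GET requests per (IAM role, event time, table, partition).
INSERT INTO table_usage_aggregated
SELECT
    COUNT(1)                    AS ct,
    requester                   AS iam_role,
    "timestamp"                 AS event_time,       -- quoted: reserved word in many SQL dialects
    KEY_TO_TABLE_NAME(key)      AS table_name,       -- helper named in the source; definition not shown
    KEY_TO_PARTITION_VALUE(key) AS partition_value   -- helper named in the source; definition not shown
FROM
    s3_server_access_logs_compacted
WHERE
    operation = 'REST.GET.OBJECT'         -- object reads only
    AND key LIKE 'prefix_to_include%'     -- placeholder prefix for the tables of interest
GROUP BY
    2, 3, 4, 5                            -- ordinals of the non-aggregated SELECT columns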
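The source does not show how KEY_TO_TABLE_NAME and KEY_TO_PARTITION_VALUE are defined. As an illustration only, the sketch below assumes object keys follow a `<prefix>/<table_name>/dt=yyyy-mm-dd/<file>` layout; the function name, the regular expression, and the sample keys are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed layout: .../<table_name>/dt=yyyy-mm-dd/<file>
_KEY_PATTERN = re.compile(r"(?:^|/)(?P<table>[^/]+)/dt=(?P<dt>\d{4}-\d{2}-\d{2})/")


def key_to_table_and_partition(key: str) -> Optional[Tuple[str, str]]:
    """Return (table_name, partition_value) for a date-partitioned key, else None."""
    match = _KEY_PATTERN.search(key)
    if match is None:
        return None
    return match.group("table"), match.group("dt")


parsed = key_to_table_and_partition(
    "warehouse/business_reviews/dt=2026-05-01/part-0000.parquet"
)
```

Keys that do not match the assumed layout return None, so they can be filtered out before aggregation rather than being attributed to the wrong table.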

Seen in
