Skip to content

CONCEPT Cited by 1 source

Partition projection (Glue / Athena)

Partition projection is the Athena / Glue partitioning model where the set of partition values is declared in the table properties rather than registered as managed partitions in the metastore. The query engine projects partitions at query time by combining the declared type + template with the query's WHERE predicate.

Why it exists

With managed partitions, each new partition must be registered with Glue (MSCK REPAIR, ALTER TABLE ADD PARTITION, Glue Crawlers). At scale this has two costs:

  1. Operational churn — refreshes are cumbersome as the partition count grows.
  2. Query-planning latency — metastore lookups dominate planning time at high partition counts; must be addressed with partition indexing.

Partition projection sidesteps both by describing the partition layout as a declarative template the engine can enumerate on demand.

Canonical shape

TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.bucket_name.type'='enum',
  'projection.bucket_name.values'='bucket-a,bucket-b,bucket-c,...',
  'projection.timestamp.type'='date',
  'projection.timestamp.range'='2020/01/01,NOW',
  'projection.timestamp.format'='yyyy/MM/dd',
  'storage.location.template'='s3://<destination-bucket>/<account>/<region>/${bucket_name}/${timestamp}'
)

The storage.location.template tells Athena how to construct S3 prefixes from the partition values. The partition values come from the per-column projection type.

Projection types (Yelp's 2025-09-26 disclosure)

enum

Explicit finite set of values. "All possible bucket names are written to it." Good for low- to mid-cardinality partition keys that the caller might or might not constrain. Critical limit:

"If enum type partition is not constrained in the where-clause, the query may run into a cap of 1 million partitions over long time windows."

(AWS docs on the 1M-partition cap.)

Yelp keeps the enum value list up-to-date with a Lambda that reads from an SQS queue populated by EventBridge rules (queue to avoid concurrent reads and writes on the projection list).

injected

Caller must supply the value in the WHERE clause. "The alternative of injected type alleviates the need to enumerate all possible values but requires specifying a value in where-clause."

Good for cases where enumeration is impractical (very high cardinality; wouldn't-know-where-to-enumerate). Yelp uses injected for access-based tables where the caller always specifies bucket + date.

date / integer / other built-in types

Used for timestamp / integer partition keys with a declarative range. Common shape for time-series logs. Yelp's SAL compaction table uses timestamp with yyyy/MM/dd format — "encompasses an entire day, that is yyyy/MM/dd, which accelerates query pruning time because we typically query a day's worth of logs" — finer granularity would cause over-partitioning.

Over-partitioning consideration

The AWS guidance referenced by Yelp: "Top 10 performance tuning tips for Amazon Athena" warns against over-partitioning. Reducing partition granularity (e.g. day instead of minute) accelerates query pruning when the common query shape matches the coarser grain.

Comparison to managed partitions

Axis Partition projection Managed partitions
Registration declarative in table properties MSCK REPAIR / ALTER TABLE / Glue Crawler
New-partition visibility immediate on matching prefix manual refresh required
Query-planning cost template enumeration, fast metastore lookup scales with partition count
Scale failure mode enum 1M-partition cap metastore latency + index refresh
Good fit known prefix template, daily/weekly batches ad-hoc or exotic partition layouts

See patterns/projection-partitioning-over-managed-partitions for the canonical design choice.

Seen in

  • sources/2025-09-26-yelp-s3-server-access-logs-at-scale — Yelp's SAL compaction table uses enum for bucket_name (kept in sync by a Lambda) and date-granular timestamp; access- based tables use injected where the caller specifies bucket
  • date. Canonical wiki disclosure of the enum-vs-injected trade-off and the 1M-partition cap.
Last updated · 476 distilled / 1,218 read