CONCEPT
Partition projection (Glue / Athena)¶
Partition projection is the Athena /
Glue partitioning model where the set of
partition values is declared in the table properties rather
than registered as managed partitions in the metastore. The
query engine projects partitions at query time by combining the
declared type + template with the query's WHERE predicate.
Why it exists¶
With managed partitions, each new partition must be registered
with Glue (MSCK REPAIR, ALTER TABLE ADD PARTITION, Glue
Crawlers). At scale this has two costs:
- Operational churn — refreshes are cumbersome as the partition count grows.
- Query-planning latency — metastore lookups come to dominate planning time at high partition counts, and must otherwise be mitigated with partition indexes.
Partition projection sidesteps both by describing the partition layout as a declarative template the engine can enumerate on demand.
Canonical shape¶
    TBLPROPERTIES (
      'projection.enabled'='true',
      'projection.bucket_name.type'='enum',
      'projection.bucket_name.values'='bucket-a,bucket-b,bucket-c,...',
      'projection.timestamp.type'='date',
      'projection.timestamp.range'='2020/01/01,NOW',
      'projection.timestamp.format'='yyyy/MM/dd',
      'storage.location.template'='s3://<destination-bucket>/<account>/<region>/${bucket_name}/${timestamp}'
    )
The storage.location.template tells Athena how to construct
S3 prefixes from the partition values. The partition values come
from the per-column projection type.
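Putting the pieces together, a full table definition might look like the following sketch. The table name, columns, bucket values, and S3 location are hypothetical; only the projection property names and the template mechanism come from the source.

```sql
-- Hypothetical Athena DDL: an access-log table whose partitions are
-- declared via projection properties, never registered in Glue.
CREATE EXTERNAL TABLE access_logs (
  request_uri string,
  http_status int,
  bytes_sent  bigint
)
PARTITIONED BY (
  bucket_name string,     -- projected as enum
  `timestamp` string      -- projected as date, day granularity
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://my-log-destination/123456789012/us-east-1/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.bucket_name.type'='enum',
  'projection.bucket_name.values'='bucket-a,bucket-b,bucket-c',
  'projection.timestamp.type'='date',
  'projection.timestamp.range'='2020/01/01,NOW',
  'projection.timestamp.format'='yyyy/MM/dd',
  'storage.location.template'='s3://my-log-destination/123456789012/us-east-1/${bucket_name}/${timestamp}'
);
```

Note the partition columns are declared in PARTITIONED BY as usual; what changes is that no ALTER TABLE ADD PARTITION ever runs against this table.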
Projection types (Yelp's 2025-09-26 disclosure)¶
enum¶
Explicit finite set of values. "All possible bucket names are written to it." Good for low- to mid-cardinality partition keys that the caller might or might not constrain. Critical limit:
"If enum type partition is not constrained in the where-clause, the query may run into a cap of 1 million partitions over long time windows."
(AWS docs on the 1M-partition cap.)
Yelp keeps the enum value list up to date with a Lambda that
reads from an SQS queue populated by EventBridge rules; the queue
serializes updates to avoid concurrent reads and writes on the
projection list.
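To stay under that cap, a query against an enum-projected column should pin the enum value so Athena only projects matching partitions. An illustrative query, assuming a hypothetical access_logs table with enum bucket_name and a day-granular timestamp partition:

```sql
-- Pinning the enum column prunes projection to one bucket's
-- partitions instead of the full bucket x date cross product.
SELECT http_status, count(*) AS hits
FROM access_logs
WHERE bucket_name = 'bucket-a'                      -- enum column constrained
  AND "timestamp" BETWEEN '2025/09/01' AND '2025/09/07'
GROUP BY http_status;
```

Left unconstrained over a long time window, the same query would force Athena to enumerate every bucket for every day, which is where the 1M-partition cap bites.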
injected¶
Caller must supply the value in the WHERE clause. "The
alternative of injected type alleviates the need to enumerate
all possible values but requires specifying a value in
where-clause."
Good for cases where enumeration is impractical: very high
cardinality, or no way to know the value set in advance. Yelp uses
injected for access-based tables where the caller always
specifies bucket + date.
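With an injected column, Athena has no value list to enumerate, so the WHERE clause must supply the value explicitly; a query that omits it is rejected. A hypothetical example of that access-based shape:

```sql
-- bucket_name is declared 'projection.bucket_name.type'='injected':
-- the equality predicate below is mandatory, not an optimisation.
SELECT *
FROM access_logs_by_access   -- hypothetical access-based table
WHERE bucket_name = 'bucket-a'          -- required: injected column
  AND "timestamp" = '2025/09/26';
```

This is the trade-off the source describes: injected removes the need to maintain an enum list, at the cost of pushing the obligation onto every caller.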
date / integer / other built-in types¶
Used for timestamp / integer partition keys with a declarative
range. Common shape for time-series logs. Yelp's SAL compaction
table uses timestamp with yyyy/MM/dd format — "encompasses
an entire day, that is yyyy/MM/dd, which accelerates query
pruning time because we typically query a day's worth of logs"
— finer granularity would cause
over-partitioning.
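The common query shape the source describes, "a day's worth of logs", then prunes to exactly one partition per bucket. A sketch, assuming the hypothetical access_logs table with a yyyy/MM/dd date partition:

```sql
-- Day-granular partitions mean one S3 prefix per bucket per day;
-- this query touches a single projected partition.
SELECT request_uri, bytes_sent
FROM access_logs
WHERE bucket_name = 'bucket-a'
  AND "timestamp" = '2025/09/26';   -- matches the yyyy/MM/dd format exactly
```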
Over-partitioning consideration¶
The AWS guidance referenced by Yelp: "Top 10 performance tuning tips for Amazon Athena" warns against over-partitioning. Reducing partition granularity (e.g. day instead of minute) accelerates query pruning when the common query shape matches the coarser grain.
Comparison to managed partitions¶
| Axis | Partition projection | Managed partitions |
|---|---|---|
| Registration | declarative in table properties | MSCK REPAIR / ALTER TABLE / Glue Crawler |
| New-partition visibility | immediate on matching prefix | manual refresh required |
| Query-planning cost | template enumeration, fast | metastore lookup scales with partition count |
| Scale failure mode | enum 1M-partition cap | metastore latency + index refresh |
| Good fit | known prefix template, daily/weekly batches | ad-hoc or exotic partition layouts |
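For contrast, the managed-partition workflow in the right-hand column requires registration before Athena can see a new prefix. A hypothetical example of the two registration paths:

```sql
-- Managed partitions: explicit registration per new prefix...
ALTER TABLE access_logs_managed ADD IF NOT EXISTS
  PARTITION (bucket_name = 'bucket-a', `timestamp` = '2025/09/26')
  LOCATION 's3://my-log-destination/123456789012/us-east-1/bucket-a/2025/09/26/';

-- ...or a full prefix scan to discover them, which gets slower as
-- the partition count grows.
MSCK REPAIR TABLE access_logs_managed;
```

With projection, neither statement exists: a new day's prefix is queryable the moment objects land under it, because the template already describes it.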
See patterns/projection-partitioning-over-managed-partitions for the canonical design choice.
Seen in¶
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale —
Yelp's SAL compaction table uses enum for bucket_name (kept in
sync by a Lambda) and a date-granular timestamp; access-based
tables use injected, where the caller specifies bucket + date.
Canonical wiki disclosure of the enum-vs-injected trade-off and
the 1M-partition cap.
Related¶
- systems/amazon-athena / systems/aws-glue — implementations.
- concepts/partition-pruning — the optimisation the partitioning enables.
- concepts/over-partitioning — the failure mode at overly fine granularity.
- patterns/projection-partitioning-over-managed-partitions