PATTERN Cited by 1 source
Idempotent Athena insertion via self-LEFT-JOIN¶
Problem¶
You run an INSERT INTO ... SELECT in
Athena as part of a retry-safe batch
job (for example,
raw-to-Parquet log
compaction). Athena is a shared resource
(concepts/athena-shared-resource-contention); any given
query can be killed mid-flight by cluster eviction, S3 API rate
limits, or concurrency-quota exhaustion. The compaction job
must tolerate retries. But a naive re-run of INSERT INTO
target SELECT * FROM source duplicates rows every time it
retries — each prior partial success compounds the duplication.
Pattern¶
Self-LEFT-JOIN the target on the source, filter
target.<unique_col> IS NULL, and insert only the rows the
target doesn't already have.
Canonical shape¶
INSERT INTO "catalog_target"."database"."table_region"
SELECT *
FROM "catalog_source"."database"."table_region" source
LEFT JOIN "catalog_target"."database"."table_region" target
ON target.requestid = source.requestid
AND bucket_name = 'foo-bucket'
AND timestamp = '2024/01/01'
WHERE bucket_name = 'foo-bucket'
AND timestamp = '2024/01/01'
AND target.requestid IS NULL
(Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale, verbatim.)
Four load-bearing details:
LEFT JOIN ... ON target.<unique> = source.<unique>— match each source row to any existing target row with the same unique key.WHERE ... target.<unique> IS NULL— keep only source rows that have no match in target.- Partition filters in BOTH
ONandWHERE. Yelp's verbatim note: "Duplicating filters forbucket_nameandtimestampis prudent because queries start taking longer to run after each insertion." TheWHEREfilter is what the planner uses to prune partitions; theONfilter constrains which target partitions the join probes. - Choose a stable unique column. Yelp's SAL tables have
requestid— an AWS-generated opaque token, unique per request. "For tables without a unique column, we've used checksum on row values."
Why this works¶
- First run: target partition is empty → every source row's LEFT JOIN finds no match → every row is inserted.
- Retry after partial success: target partition has some
rows → source rows that match existing
requestids are filtered out bytarget.requestid IS NULL→ only missing rows are inserted. - Retry after full success: every source row matches → zero rows inserted → no-op. Job is safe to retry arbitrarily.
Property: Yelp's verbatim conclusion "our system is retry-safe, and any failure can be fixed by re-running the job over a day's worth of data."
Checksum variant (no natural unique column)¶
When the source doesn't have a stable unique column, compute a checksum over the row and treat that as the unique key:
INSERT INTO target
SELECT source.*
FROM source
LEFT JOIN target
ON checksum(source.*) = checksum(target.*)
AND <partition filters>
WHERE <partition filters>
AND target.<any_col> IS NULL
Trade-off: checksum-join is slower than column-equality-join; materialising the checksum as a column at compaction time can help.
Performance care¶
The cost of the LEFT JOIN grows with target partition size. Two discipline points:
- Keep target partitions small. One partition per day (or finer for very high-volume tables) bounds the join cost.
- Compose with partition projection. Partition-projected tables let Athena prune the join's target side to the one partition the source is targeting, preserving O(single-partition) cost regardless of the target's total size.
Count verification (Yelp's optimisation)¶
Rather than a post-insert COUNT(*) on the target (which
reads all rows of the target partition), Yelp uses
GetQueryRuntimeStatistics on the insert query itself:
- First insertion:
Rowsstatistic from the API is the number of rows inserted — compare to a source-sideCOUNT(*)for completeness. - Re-insertion:
Rowsis still correct, butGetQueryRuntimeStatistics.Rows"may not be populated at all" — fall back to distinct-count queries.
Other idempotent-INSERT shapes to consider¶
- Upsert / MERGE — not supported in Athena's standard INSERT; Iceberg and Delta tables do support MERGE, which is cleaner if your target is an Iceberg table.
- Partition-level atomic replace — overwrite the whole partition on every run. Simpler but requires re-reading all source data every run; wasteful if the source is large.
- Outbox-table idempotency — record the intention before inserting, mark done after. Adds a table and a round-trip; unnecessary if you have a stable unique column.
Seen in¶
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale —
canonical wiki instance. Yelp's SAL-to-Parquet compaction job
uses the self-LEFT-JOIN shape on
requestidto survive cluster evictions andTooManyRequestsExceptionretries. Reported property: "our system is retry-safe, and any failure can be fixed by re-running the job over a day's worth of data."
Related¶
- systems/amazon-athena — the engine.
- concepts/idempotent-job-design — the general discipline.
- concepts/athena-shared-resource-contention — the reason retries are a common operational event.
- patterns/raw-to-columnar-log-compaction — canonical composer.
- concepts/partition-projection — join-pruning aid.