PATTERN Cited by 1 source

Idempotent Athena insertion via self-LEFT-JOIN¶

Problem¶

You run an INSERT INTO ... SELECT in Athena as part of a retry-safe batch job (for example, raw-to-Parquet log compaction). Athena is a shared resource (concepts/athena-shared-resource-contention); any given query can be killed mid-flight by cluster eviction, S3 API rate limits, or concurrency-quota exhaustion. The compaction job must tolerate retries. But a naive re-run of INSERT INTO target SELECT * FROM source duplicates rows every time it retries — each prior partial success compounds the duplication.

Pattern¶

Self-LEFT-JOIN the target on the source, filter target.<unique_col> IS NULL, and insert only the rows the target doesn't already have.

Canonical shape¶

INSERT INTO "catalog_target"."database"."table_region"
SELECT *
FROM "catalog_source"."database"."table_region" source
LEFT JOIN "catalog_target"."database"."table_region" target
  ON  target.requestid = source.requestid
  AND bucket_name = 'foo-bucket'
  AND timestamp = '2024/01/01'
WHERE bucket_name = 'foo-bucket'
  AND timestamp = '2024/01/01'
  AND target.requestid IS NULL

(Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale, verbatim.)

Four load-bearing details:

LEFT JOIN ... ON target.<unique> = source.<unique> — match each source row to any existing target row with the same unique key.
WHERE ... target.<unique> IS NULL — keep only source rows that have no match in target.
Partition filters in BOTH ON and WHERE. Yelp's verbatim note: "Duplicating filters for bucket_name and timestamp is prudent because queries start taking longer to run after each insertion." The WHERE filter is what the planner uses to prune partitions; the ON filter constrains which target partitions the join probes.
Choose a stable unique column. Yelp's SAL tables have requestid — an AWS-generated opaque token, unique per request. "For tables without a unique column, we've used checksum on row values."

Why this works¶

First run: target partition is empty → every source row's LEFT JOIN finds no match → every row is inserted.
Retry after partial success: target partition has some rows → source rows that match existing requestids are filtered out by target.requestid IS NULL → only missing rows are inserted.
Retry after full success: every source row matches → zero rows inserted → no-op. Job is safe to retry arbitrarily.

Property: Yelp's verbatim conclusion "our system is retry-safe, and any failure can be fixed by re-running the job over a day's worth of data."

Checksum variant (no natural unique column)¶

When the source doesn't have a stable unique column, compute a checksum over the row and treat that as the unique key:

INSERT INTO target
SELECT source.*
FROM source
LEFT JOIN target
  ON  checksum(source.*) = checksum(target.*)
  AND <partition filters>
WHERE <partition filters>
  AND target.<any_col> IS NULL

Trade-off: checksum-join is slower than column-equality-join; materialising the checksum as a column at compaction time can help.

Performance care¶

The cost of the LEFT JOIN grows with target partition size. Two discipline points:

Keep target partitions small. One partition per day (or finer for very high-volume tables) bounds the join cost.
Compose with partition projection. Partition-projected tables let Athena prune the join's target side to the one partition the source is targeting, preserving O(single-partition) cost regardless of the target's total size.

Count verification (Yelp's optimisation)¶

Rather than a post-insert COUNT(*) on the target (which reads all rows of the target partition), Yelp uses GetQueryRuntimeStatistics on the insert query itself:

First insertion: Rows statistic from the API is the number of rows inserted — compare to a source-side COUNT(*) for completeness.
Re-insertion: Rows is still correct, but GetQueryRuntimeStatistics.Rows "may not be populated at all" — fall back to distinct-count queries.

Other idempotent-INSERT shapes to consider¶

Upsert / MERGE — not supported in Athena's standard INSERT; Iceberg and Delta tables do support MERGE, which is cleaner if your target is an Iceberg table.
Partition-level atomic replace — overwrite the whole partition on every run. Simpler but requires re-reading all source data every run; wasteful if the source is large.
Outbox-table idempotency — record the intention before inserting, mark done after. Adds a table and a round-trip; unnecessary if you have a stable unique column.

Seen in¶

sources/2025-09-26-yelp-s3-server-access-logs-at-scale — canonical wiki instance. Yelp's SAL-to-Parquet compaction job uses the self-LEFT-JOIN shape on requestid to survive cluster evictions and TooManyRequestsException retries. Reported property: "our system is retry-safe, and any failure can be fixed by re-running the job over a day's worth of data."

systems/amazon-athena — the engine.
concepts/idempotent-job-design — the general discipline.
concepts/athena-shared-resource-contention — the reason retries are a common operational event.
patterns/raw-to-columnar-log-compaction — canonical composer.
concepts/partition-projection — join-pruning aid.