Skip to content

PATTERN Cited by 1 source

Idempotent Athena insertion via self-LEFT-JOIN

Problem

You run an INSERT INTO ... SELECT in Athena as part of a retry-safe batch job (for example, raw-to-Parquet log compaction). Athena is a shared resource (concepts/athena-shared-resource-contention); any given query can be killed mid-flight by cluster eviction, S3 API rate limits, or concurrency-quota exhaustion. The compaction job must tolerate retries. But a naive re-run of INSERT INTO target SELECT * FROM source duplicates rows every time it retries — each prior partial success compounds the duplication.

Pattern

Self-LEFT-JOIN the target on the source, filter target.<unique_col> IS NULL, and insert only the rows the target doesn't already have.

Canonical shape

INSERT INTO "catalog_target"."database"."table_region"
SELECT *
FROM "catalog_source"."database"."table_region" source
LEFT JOIN "catalog_target"."database"."table_region" target
  ON  target.requestid = source.requestid
  AND bucket_name = 'foo-bucket'
  AND timestamp = '2024/01/01'
WHERE bucket_name = 'foo-bucket'
  AND timestamp = '2024/01/01'
  AND target.requestid IS NULL

(Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale, verbatim.)

Four load-bearing details:

  1. LEFT JOIN ... ON target.<unique> = source.<unique> — match each source row to any existing target row with the same unique key.
  2. WHERE ... target.<unique> IS NULL — keep only source rows that have no match in target.
  3. Partition filters in BOTH ON and WHERE. Yelp's verbatim note: "Duplicating filters for bucket_name and timestamp is prudent because queries start taking longer to run after each insertion." The WHERE filter is what the planner uses to prune partitions; the ON filter constrains which target partitions the join probes.
  4. Choose a stable unique column. Yelp's SAL tables have requestid — an AWS-generated opaque token, unique per request. "For tables without a unique column, we've used checksum on row values."

Why this works

  • First run: target partition is empty → every source row's LEFT JOIN finds no match → every row is inserted.
  • Retry after partial success: target partition has some rows → source rows that match existing requestids are filtered out by target.requestid IS NULL → only missing rows are inserted.
  • Retry after full success: every source row matches → zero rows inserted → no-op. Job is safe to retry arbitrarily.

Property: Yelp's verbatim conclusion "our system is retry-safe, and any failure can be fixed by re-running the job over a day's worth of data."

Checksum variant (no natural unique column)

When the source doesn't have a stable unique column, compute a checksum over the row and treat that as the unique key:

INSERT INTO target
SELECT source.*
FROM source
LEFT JOIN target
  ON  checksum(source.*) = checksum(target.*)
  AND <partition filters>
WHERE <partition filters>
  AND target.<any_col> IS NULL

Trade-off: checksum-join is slower than column-equality-join; materialising the checksum as a column at compaction time can help.

Performance care

The cost of the LEFT JOIN grows with target partition size. Two discipline points:

  • Keep target partitions small. One partition per day (or finer for very high-volume tables) bounds the join cost.
  • Compose with partition projection. Partition-projected tables let Athena prune the join's target side to the one partition the source is targeting, preserving O(single-partition) cost regardless of the target's total size.

Count verification (Yelp's optimisation)

Rather than a post-insert COUNT(*) on the target (which reads all rows of the target partition), Yelp uses GetQueryRuntimeStatistics on the insert query itself:

  • First insertion: Rows statistic from the API is the number of rows inserted — compare to a source-side COUNT(*) for completeness.
  • Re-insertion: Rows is still correct, but GetQueryRuntimeStatistics.Rows "may not be populated at all" — fall back to distinct-count queries.

Other idempotent-INSERT shapes to consider

  • Upsert / MERGE — not supported in Athena's standard INSERT; Iceberg and Delta tables do support MERGE, which is cleaner if your target is an Iceberg table.
  • Partition-level atomic replace — overwrite the whole partition on every run. Simpler but requires re-reading all source data every run; wasteful if the source is large.
  • Outbox-table idempotency — record the intention before inserting, mark done after. Adds a table and a round-trip; unnecessary if you have a stable unique column.

Seen in

  • sources/2025-09-26-yelp-s3-server-access-logs-at-scale — canonical wiki instance. Yelp's SAL-to-Parquet compaction job uses the self-LEFT-JOIN shape on requestid to survive cluster evictions and TooManyRequestsException retries. Reported property: "our system is retry-safe, and any failure can be fixed by re-running the job over a day's worth of data."
Last updated · 476 distilled / 1,218 read