CONCEPT Cited by 1 source
Athena shared-resource contention¶
Amazon Athena (systems/amazon-athena) is a shared
serverless query engine: queries run on a pooled Presto/Trino
cluster whose capacity is allocated across all AWS customers in
the region. At scale this shows up as TooManyRequestsException
errors — queries killed mid-flight or rejected at submission due
to cluster overload, occasional S3 API rate-limit hits, or the
account's active-DML-query quota being exceeded.
Canonical failure mode¶
"Athena is a shared resource so a query may be killed any time due to the cluster being overloaded, or occasionally hitting S3 API limits. Given our scale, we would encounter such errors on a regular basis. Therefore, we needed a way to retry queries that were limited by the quota on the number of active DML queries at a time." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
Two distinct contention axes¶
- Cluster overload / noisy-neighbor — regional Athena cluster is under load; individual queries may be queued or killed. Not a per-account quota; no AWS-side knob.
- DML query concurrency quota — a per-account, per-region service limit on the number of concurrently-active Data Manipulation Language queries. Supports an AWS Support quota-increase request.
Both axes surface as TooManyRequestsException. Yelp's
mitigations:
- Reduce concurrent queries on the client side — the compaction job models insertions as "asynchronous functions" per-bucket, limiting in-flight parallelism.
- Request DML-quota increases for each affected account + region.
- Retry on exception: the job is designed to be retry-safe (see patterns/idempotent-athena-insertion-via-left-join) precisely because TooManyRequestsException retries are a normal operational event, not an anomaly.
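The first mitigation, capping in-flight parallelism on the client, can be sketched as follows. This is a minimal illustration and not Yelp's implementation: `run_bounded`, `run_one`, and `max_in_flight` are hypothetical names, and `run_one` stands in for whatever coroutine submits one per-bucket insertion and waits for it to finish.

```python
import asyncio


async def run_bounded(work_items, run_one, max_in_flight=5):
    """Run run_one(item) for every item, with at most max_in_flight running at once.

    run_one is a hypothetical coroutine function that submits one query
    and awaits its completion. Results come back in input order.
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        # The semaphore blocks here once max_in_flight items are active,
        # so the client never exceeds its own concurrency budget.
        async with semaphore:
            return await run_one(item)

    return await asyncio.gather(*(guarded(item) for item in work_items))
```

Keeping `max_in_flight` at or below the account's DML concurrency quota stops the job from exhausting the quota on its own. Other workloads in the same account and region still draw on the same quota.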
Design implications¶
- Every long-running Athena job should be idempotent. If the job can be killed partway through at any time by a shared-resource eviction, it must tolerate re-running without duplication.
- Prefer many small queries over one large query when the unit-of-retry matters. A single giant INSERT is all-or-nothing on cluster-kill; per-bucket per-day INSERTs bound the retry cost.
- Exponential backoff with jitter is the standard retry strategy; it avoids a thundering herd when many clients retry at once after a cluster-wide disturbance.
- Treat the quota increase as a quarterly operational task — Yelp names it as part of their initial-run playbook; production scale changes often outrun initial quota.
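The retry guidance above can be sketched as exponential backoff with full jitter. This is a sketch under stated assumptions, not the source's implementation: `ThrottledError`, `backoff_delay`, and `run_with_retry` are hypothetical names, and `submit` stands in for an idempotent callable that runs one query. Retrying is only safe here because the query is assumed to be idempotent.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as TooManyRequestsException."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def run_with_retry(submit, max_attempts=6, base=1.0, cap=60.0,
                   sleep=time.sleep, retryable=(ThrottledError,)):
    """Call submit() until it succeeds or max_attempts is reached.

    Only exceptions listed in `retryable` trigger a retry; anything else
    propagates immediately. The last retryable failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return submit()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            # Randomized delay spreads retries out so that clients hit by
            # the same disturbance do not all resubmit at the same moment.
            sleep(backoff_delay(attempt, base, cap))
```

With boto3, the `retryable` tuple would presumably hold the Athena client's modeled throttling exception rather than the stand-in class; that wiring is an assumption and is left out so the sketch has no external dependencies.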
Not-Athena-specific pattern¶
The same pattern applies to any shared-serverless query service: BigQuery on-demand slots, Redshift Spectrum, Snowflake serverless tasks. The discipline is the same: idempotence + retry + quota management + many-small-queries > one-big-query.
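The many-small-queries half of that discipline comes down to choosing a small unit of retry. A minimal, service-agnostic sketch, assuming the work divides cleanly by bucket and by day as in the per-bucket per-day INSERTs described earlier (`work_units` is a hypothetical helper, and the bucket names in the usage lines are made up):

```python
from datetime import date, timedelta
from itertools import product


def work_units(buckets, start, end):
    """Return one (bucket, day) pair per bucket per day in [start, end].

    Each pair is meant to become one small, idempotent query, so a kill
    or throttle costs one unit of work rather than the whole job.
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return list(product(buckets, days))


# Hypothetical usage: two buckets over three days gives six retryable units.
units = work_units(["logs-a", "logs-b"], date(2025, 1, 1), date(2025, 1, 3))
```

Each unit can then be submitted and retried on its own, so a failure partway through the run leaves the completed units untouched.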
Seen in¶
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale: Yelp's SAL (S3 server access logs) compaction job runs "parallel Athena queries" and encounters TooManyRequestsException "on a regular basis". Mitigations disclosed: client-side concurrency reduction, DML-quota increase requests, and an idempotent query shape for retry safety.
Related¶
- systems/amazon-athena — the service.
- concepts/backpressure — the broader pattern (server tells client to slow down).
- concepts/retry-on-exception
- concepts/noisy-neighbor