
CONCEPT Cited by 1 source

Athena shared-resource contention

Amazon Athena (systems/amazon-athena) is a shared serverless query engine: queries run on pooled Presto/Trino capacity that is shared among AWS customers in the region. At scale this shows up as TooManyRequestsException errors: queries are killed mid-flight or rejected at submission because the cluster is overloaded, because of occasional S3 API rate-limit hits, or because the account's quota on active DML queries has been exceeded.

Canonical failure mode

"Athena is a shared resource so a query may be killed any time due to the cluster being overloaded, or occasionally hitting S3 API limits. Given our scale, we would encounter such errors on a regular basis. Therefore, we needed a way to retry queries that were limited by the quota on the number of active DML queries at a time." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Two distinct contention axes

  1. Cluster overload / noisy-neighbor: the regional Athena cluster is under load, so individual queries may be queued or killed. This is not a per-account quota, and there is no AWS-side knob to turn.
  2. DML query concurrency quota: a per-account, per-region service limit on the number of concurrently active Data Manipulation Language queries. It can be raised through an AWS Support quota-increase request.

Both axes surface as TooManyRequestsException. Yelp's mitigations:

  • Reduce concurrent queries on the client side — the compaction job models insertions as "asynchronous functions" per-bucket, limiting in-flight parallelism.
  • Request DML-quota increases for each affected account + region.
  • Retry on exception: the job is designed to be retry-safe (see patterns/idempotent-athena-insertion-via-left-join) precisely because TooManyRequestsException retries are a routine operational event, not a rare failure.

Design implications

  • Every long-running Athena job should be idempotent. If the job can be killed partway through at any time by a shared-resource eviction, it must tolerate re-running without duplication.
  • Prefer many small queries over one large query when the unit-of-retry matters. A single giant INSERT is all-or-nothing on cluster-kill; per-bucket per-day INSERTs bound the retry cost.
  • Exponential backoff with jitter is the standard retry strategy; the jitter spreads retries out and avoids a thundering herd after a cluster-wide disturbance.
  • Treat the quota increase as a quarterly operational task. Yelp names it as part of their initial-run playbook, and growth in production scale often outruns the initial quota.
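
The retry and unit-of-retry points above can be sketched as follows. This is an illustration under stated assumptions, not the source's implementation: `TooManyRequestsError` is a hypothetical local exception standing in for whatever throttling error the Athena client raises, and the `base` and `cap` values are arbitrary. The backoff uses "full jitter", meaning each delay is drawn uniformly between zero and the exponential ceiling.

```python
import random
import time
from datetime import date, timedelta


class TooManyRequestsError(Exception):
    """Hypothetical stand-in for the client's throttling error."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def run_with_retry(run_query, max_attempts=6, sleep=time.sleep, **backoff_kwargs):
    """Call run_query(), retrying only on throttling errors.

    run_query must be idempotent, because a killed attempt may have
    partially completed before the retry starts.
    """
    for attempt in range(max_attempts):
        try:
            return run_query()
        except TooManyRequestsError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, **backoff_kwargs))


def work_units(buckets, start, end):
    """One (bucket, day) pair per query, for an inclusive date range,
    so that a killed query costs at most one bucket-day of rework."""
    days = (end - start).days + 1
    return [(b, start + timedelta(days=i)) for b in buckets for i in range(days)]
```

A job would then loop over `work_units(...)` and wrap each per-bucket, per-day INSERT in `run_with_retry`. With a real client, the throttling case is typically detected by inspecting the error code on the raised exception rather than by a custom class.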

Not-Athena-specific pattern

The same pattern applies to any shared-serverless query service: BigQuery on-demand slots, Redshift Spectrum, Snowflake serverless tasks. The discipline is the same: idempotence + retry + quota management + many-small-queries > one-big-query.
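
Because the discipline is engine-agnostic, the idempotence half can be demonstrated on any SQL engine. The sketch below uses an in-memory SQLite database as a stand-in and an anti-join (LEFT JOIN ... IS NULL) so that re-running the same unit inserts nothing new. The table names, column names, and the choice of `request_id` as the deduplication key are assumptions made for illustration; the source does not show the actual query.

```python
import sqlite3

# Insert only the staging rows for one (bucket, day) that are not
# already present in the target table.
INSERT_MISSING = """INSERT INTO compacted (bucket, day, request_id)
SELECT s.bucket, s.day, s.request_id
FROM staging AS s
LEFT JOIN compacted AS c
  ON c.bucket = s.bucket AND c.day = s.day AND c.request_id = s.request_id
WHERE s.bucket = ? AND s.day = ? AND c.request_id IS NULL"""


def insert_unit(conn, bucket, day):
    """Insert one (bucket, day) unit; safe to call again after a failure."""
    conn.execute(INSERT_MISSING, (bucket, day))
    conn.commit()


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM compacted").fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (bucket TEXT, day TEXT, request_id TEXT)")
    conn.execute("CREATE TABLE compacted (bucket TEXT, day TEXT, request_id TEXT)")
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?, ?)",
        [
            ("b1", "2025-01-01", "r1"),
            ("b1", "2025-01-01", "r2"),
            ("b2", "2025-01-01", "r3"),
        ],
    )
    before = count_rows(conn)
    insert_unit(conn, "b1", "2025-01-01")  # first run inserts the unit
    after_first = count_rows(conn)
    insert_unit(conn, "b1", "2025-01-01")  # simulated retry of the same unit
    after_second = count_rows(conn)
    return before, after_first, after_second


if __name__ == "__main__":
    print(demo())
```

The second call leaves the row count unchanged, which is the property that makes blind retries safe. On a distributed engine the same shape applies, although the cost of the join against the target table should be weighed against the size of each unit.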

Seen in

  • sources/2025-09-26-yelp-s3-server-access-logs-at-scale — Yelp's SAL compaction job runs "parallel Athena queries" and encounters TooManyRequestsException "on a regular basis". Mitigations disclosed: client-side concurrency reduction + DML-quota increase requests + idempotent query shape for retry safety.