PATTERN
Parallel S3 download for bootstrap¶
Intent¶
Saturate available network bandwidth during a cold-start bootstrap by downloading many files from S3 concurrently rather than serially, landing them on local SSD for subsequent use. For index-shaped workloads (Lucene, Parquet, sstables, etc.) where state is a directory of many independent files, concurrent fetch is the right shape — serial download leaves most of the network idle per object.
Shape¶
on bootstrap (empty local state):
    remote_files = s3.list_objects(index_prefix)
    pool = BoundedThreadPool(N)   // N tuned to bandwidth / CPU
    for each file in remote_files:
        pool.submit(lambda: s3.get_object(file) → local_ssd)
    pool.join()
    // index is now usable; start serving
Tune N by the ratio of per-file latency to aggregate bandwidth — typically
dozens of concurrent GETs per host to saturate a 10-Gb NIC on modern instances.
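The shape above can be sketched in Python with a bounded thread pool. This is a minimal sketch, not Nrtsearch's implementation: `fetch` is a hypothetical stand-in for the S3 GET (e.g. a thin wrapper over boto3's `get_object`), so the concurrency shape stays independent of the client.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_download(keys, fetch, dest_dir, workers=32):
    """Fetch every key concurrently and land the bytes on local disk.

    `fetch(key) -> bytes` stands in for the S3 GET; `workers` is the N
    from the shape above, tuned to saturate the NIC without exhausting
    memory. Returns (key, byte_count) pairs for each downloaded object.
    """
    os.makedirs(dest_dir, exist_ok=True)

    def download_one(key):
        data = fetch(key)
        # Flatten the key into a filename; a real index layout would
        # preserve the directory structure instead.
        path = os.path.join(dest_dir, key.replace("/", "_"))
        with open(path, "wb") as f:
            f.write(data)
        return key, len(data)

    # The executor's worker count is the bound: at most `workers` GETs
    # (and their in-flight buffers) exist at any moment.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_one, keys))
```

A stub `fetch` (here, one that just echoes the key as bytes) makes the shape exercisable without S3; swapping in a real client changes only that one callable.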
Canonical instance: Yelp Nrtsearch 1.0.0¶
Pre-1.0 Nrtsearch replicas downloaded the index archive from S3 and then synced updates from the primary. Per the 1.0.0 release post:
"We started downloading multiple files from S3 in parallel to make full use of the available network bandwidth. Combined with a local SSD, this yielded a 5x increase in the download speed. With both blockers resolved, we were able to stop using EBS volumes in favor of local disks."
Two architectural moves composed:
- Parallelism at the S3 GET layer. Fetching many small segment files concurrently, rather than one large archive serially, saturates the NIC.
- Local SSD on the write side. EBS's IOPS budget would throttle the concurrent-write side of the pipeline; local SSD doesn't.
Together they deliver the 5× bootstrap speedup. The speedup is the precondition for Yelp's architectural move off EBS: previously, the slow bootstrap was one of two blockers keeping the primary on EBS (the other being incremental-backup-on-commit for commit durability).
Tradeoffs¶
- S3 request cost. Per-request pricing means many-small-GETs has a higher marginal cost than one-large-GET. For bootstrap workloads this is usually negligible (infrequent, one-shot) but worth being aware of for high-churn scenarios.
- Memory pressure. Running too many concurrent GETs holds too many buffers in memory. A bounded thread pool + streaming-to-disk is the discipline.
- Local SSD capacity. Local SSD on AWS instances has a fixed per-instance size — index + any working data must fit. If not, EBS or lazy-prefetch from S3 (content-addressed caching) is needed instead.
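The memory-pressure discipline above comes down to never holding a whole object in RAM: stream each GET to disk in fixed-size chunks, so peak memory is roughly workers × chunk size rather than workers × object size. A sketch, assuming `body` is any file-like response stream with a `read(n)` method (boto3's `get_object` returns one under its `Body` key):

```python
CHUNK = 1 << 20  # 1 MiB per read; peak memory ~ workers x CHUNK

def stream_to_disk(body, path, chunk_size=CHUNK):
    """Copy a file-like response stream to disk without buffering it whole."""
    with open(path, "wb") as f:
        while True:
            chunk = body.read(chunk_size)
            if not chunk:  # empty read signals end of stream
                break
            f.write(chunk)
```

Each worker in the bounded pool calls this instead of reading the full object into memory, which is what keeps dozens of concurrent GETs from becoming dozens of full segment files resident in RAM.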
Related patterns¶
- patterns/incremental-s3-backup-of-immutable-files — the producer side that creates the S3 prefix this pattern consumes.
- patterns/warm-pool-instances — orthogonal approach: keep pre-bootstrapped replacements ready instead of paying bootstrap cost on demand.
- patterns/checkpoint-backup-to-object-storage — broader pattern for using object storage as a checkpoint substrate.
Seen in¶
- sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10 — canonical wiki instance. Parallel multi-file S3 download + local SSD on write yielded a 5× bootstrap-speed improvement for Yelp Nrtsearch replicas, enabling the architectural move off EBS.