
PATTERN

Parallel S3 download for bootstrap

Intent

Saturate available network bandwidth during a cold-start bootstrap by downloading many files from S3 concurrently rather than serially, landing them on local SSD for subsequent use. For index-shaped workloads (Lucene, Parquet, sstables, etc.) where state is a directory of many independent files, concurrent fetch is the right shape — serial download leaves most of the network idle per object.

Shape

  on bootstrap (empty local state):
    remote_files = s3.list_objects(index_prefix)
    pool = BoundedThreadPool(N)              // N tuned to bandwidth / CPU
    for each file in remote_files:
      pool.submit(lambda: s3.get_object(file) → local_ssd)
    pool.join()
    // index is now usable; start serving

Tune N so that N concurrent streams together cover the NIC — roughly NIC bandwidth divided by per-connection S3 throughput. In practice that means dozens of concurrent GETs per host to saturate a 10-Gb NIC on modern instances.
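The shape above can be sketched with Python's bounded `ThreadPoolExecutor`. The `get_object` callable is injected so the sketch stays self-contained — in production it would wrap an S3 client (e.g. boto3's `get_object`); the injected version here is a hypothetical stand-in.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_bootstrap(remote_files, get_object, local_dir, n_workers=32):
    """Fetch every remote file concurrently and land it on local disk.

    remote_files: iterable of object keys (e.g. from a list_objects call).
    get_object:   callable key -> bytes; a hypothetical stand-in for an
                  S3 client's GET so the sketch runs without AWS access.
    n_workers:    the bounded pool size N -- dozens is a typical starting
                  point for a 10-Gb NIC.
    """
    os.makedirs(local_dir, exist_ok=True)

    def fetch(key):
        data = get_object(key)
        with open(os.path.join(local_dir, os.path.basename(key)), "wb") as f:
            f.write(data)
        return key

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Submit all GETs; leaving the with-block joins the pool,
        # so the index is usable once this function returns.
        list(pool.map(fetch, remote_files))
```

A usage sketch: `parallel_bootstrap(keys, lambda k: s3_get(k), "/mnt/ssd/index")`. The bounded pool is the point — unbounded `submit` of thousands of GETs would trade the serial bottleneck for a memory one.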

Canonical instance: Yelp Nrtsearch 1.0.0

Pre-1.0 Nrtsearch replicas downloaded the index archive from S3 and then synced updates from the primary. Per the 1.0.0 release post:

"We started downloading multiple files from S3 in parallel to make full use of the available network bandwidth. Combined with a local SSD, this yielded a 5x increase in the download speed. With both blockers resolved, we were able to stop using EBS volumes in favor of local disks."

Two architectural moves composed:

  1. Parallelism at the S3 GET layer. Instead of one large archive fetched serially, many small segment files are fetched concurrently, saturating the NIC.
  2. Local SSD on the write side. EBS's IOPS budget would throttle the concurrent-write side of the pipeline; local SSD doesn't.

Together they deliver the 5× bootstrap speedup. The speedup is the precondition for Yelp's architectural move off EBS — previously, the slow bootstrap was one of two blockers keeping the primary on EBS (the other being incremental-backup-on-commit for commit durability).

Tradeoffs

  • S3 request cost. Per-request pricing means many-small-GETs has a higher marginal cost than one-large-GET. For bootstrap workloads this is usually negligible (infrequent, one-shot) but worth being aware of for high-churn scenarios.
  • Memory pressure. Each in-flight GET holds response buffers, so excessive concurrency can exhaust memory. A bounded thread pool plus streaming each body to disk is the discipline.
  • Local SSD capacity. Local SSD on AWS instances has a fixed per-instance size — index + any working data must fit. If not, EBS or lazy-prefetch from S3 (content-addressed caching) is needed instead.
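The streaming half of the memory-pressure discipline can be sketched as follows. `open_object_stream` is a hypothetical stand-in for anything that returns a file-like response body (boto3's `get_object()['Body']` behaves this way); per-GET memory is then bounded by the chunk size, not the object size.

```python
import shutil

CHUNK = 1 << 20  # 1 MiB: caps the per-GET buffer regardless of object size

def stream_to_disk(open_object_stream, key, dest_path):
    """Copy one object to local disk in fixed-size chunks.

    open_object_stream: callable key -> file-like body (hypothetical;
    an S3 client's streaming response body is the real-world analogue).
    N concurrent GETs then hold roughly N * CHUNK of buffer memory,
    instead of N whole objects.
    """
    body = open_object_stream(key)
    with open(dest_path, "wb") as f:
        shutil.copyfileobj(body, f, length=CHUNK)
```

Combined with a bounded pool, this keeps the bootstrap's memory footprint fixed no matter how large the individual segment files are.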
