
Lazy container image loading

Lazy container image loading is the technique of starting a container before its filesystem has been fully fetched, then streaming filesystem content on demand as the container's processes actually read it. This contrasts with the traditional eager-pull model, in which the runtime downloads the entire image from the registry before it can start the container.

Why it matters

Container images in production have drifted into the multi-GB range, especially for deep-learning workloads that bundle CUDA, PyTorch, framework libraries, and model weights. Under eager-pull, cold-start latency scales linearly with image size — O(image_size). For on-demand serverless substrates (concepts/serverless-compute) where instances are provisioned per invocation, this is a user-visible tax on every start.

Lazy loading converts the startup cost from O(image_size) to O(working_set_at_startup) — typically a small fraction of the image, because most files in an image (docs, tests, unused libraries, model weights for untaken paths) aren't accessed in the first few seconds.
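The difference is easy to see in a back-of-envelope sketch. All numbers below are hypothetical (an 8 GB image, a 300 MB startup working set, a 1 Gbit/s link), chosen only to make the ratio concrete:

```python
def cold_start_seconds(image_bytes, working_set_bytes, bandwidth_bps, lazy):
    """Time spent fetching before the workload is usable: the whole image
    under eager pull, only the startup working set under lazy loading."""
    fetched = working_set_bytes if lazy else image_bytes
    return fetched / bandwidth_bps

GB = 10**9
image = 8 * GB           # hypothetical deep-learning image
working_set = 0.3 * GB   # hypothetical bytes actually read at startup
bw = 1 * GB / 8          # 1 Gbit/s link ~= 125 MB/s

eager = cold_start_seconds(image, working_set, bw, lazy=False)  # 64.0 s
lazy = cold_start_seconds(image, working_set, bw, lazy=True)    # 2.4 s
```

The lazy figure tracks the working set, so it stays flat as the image grows — which is the whole point.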

How it works

An index of the image's internal structure (filesystem offsets, file→layer mapping, chunk boundaries) is computed once at push time and stored alongside the image in the registry. SOCI (Seekable OCI) is the AWS-originated CNCF sandbox project that defines the index format; at runtime, a snapshotter consumes the index and serves the layers through FUSE-backed mounts.
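As a toy illustration of what such an index captures — not SOCI's actual zTOC format, which additionally records compression chunk boundaries so spans of gzipped layers can be range-requested — the sketch below maps each file in a layer tarball to its byte offset and size, using the offsets Python's `tarfile` exposes:

```python
import io
import tarfile

def build_layer_index(layer_tar_bytes):
    """Map each file path to (byte offset, size) inside the layer tar,
    so a reader can later range-request just that span."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(layer_tar_bytes)) as tf:
        for member in tf:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

# Build a two-file layer in memory at "push time" and index it.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, data in [("app/main.py", b"print('hi')"),
                       ("docs/README", b"x" * 4096)]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

index = build_layer_index(buf.getvalue())
offset, size = index["app/main.py"]
# Reading that byte range out of the blob yields the file's contents
# without touching anything else in the layer.
assert buf.getvalue()[offset:offset + size] == b"print('hi')"
```

The index is tiny relative to the layer, which is why fetching it up front (step 1 below) is cheap.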

When the runtime starts the container, it:

  1. Fetches the SOCI index (small).
  2. Starts the container against a virtual filesystem backed by the index.
  3. Streams individual filesystem chunks from the registry as processes read() them.

Files that are never read are never fetched.
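The runtime flow above can be sketched with a toy in-memory "registry", where `fetch_range` stands in for an HTTP Range request against a layer blob (all names here are hypothetical, not a real snapshotter API):

```python
class LazyLayerFS:
    """Toy lazy filesystem: file contents are fetched from the registry
    only on first read; files that are never read cost nothing."""

    def __init__(self, index, fetch_range):
        self.index = index            # path -> (offset, size), from the push-time index
        self.fetch_range = fetch_range
        self.cache = {}               # chunks already streamed
        self.bytes_fetched = 0

    def read(self, path):
        if path not in self.cache:
            offset, size = self.index[path]
            self.cache[path] = self.fetch_range(offset, size)  # on-demand stream
            self.bytes_fetched += size
        return self.cache[path]

# Pretend registry blob holding two files back to back.
blob = b"A" * 100 + b"B" * 900
index = {"/bin/app": (0, 100), "/usr/share/docs": (100, 900)}
fs = LazyLayerFS(index, lambda off, n: blob[off:off + n])

fs.read("/bin/app")     # streams 100 bytes on first access
fs.read("/bin/app")     # served from cache, no second fetch
# /usr/share/docs is never read, so its 900 bytes are never fetched.
```

A real implementation sits behind a FUSE mount and intercepts `read()` at the syscall layer, but the accounting is the same: bytes fetched tracks bytes read, not image size.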

Where it's adopted

Where it doesn't help

Workloads whose startup touches most of the image — common for ML training jobs that load the full model and all dependencies into memory before the first training step — see diminishing returns from lazy loading alone. Lyft's LyftLearn 2.0 observed this: SOCI "wasn't available" for SageMaker training and batch jobs at migration time, so Lyft fell back to image-size reduction plus SageMaker warm pools for its most latency-sensitive training jobs.
