CONCEPT Cited by 1 source

Pinned (page-locked) memory

Definition

Pinned memory (or page-locked memory) is host-side RAM that the operating system has been told must not be paged out to disk and must not be relocated to another physical address. It is allocated through CUDA host-side APIs such as cudaHostAlloc or cudaMallocHost, or through framework wrappers such as torch.empty(..., pin_memory=True), which call into the OS to page-lock the underlying physical pages.

The key property: a pinned host buffer has a stable host-physical address for its lifetime, which means the GPU's copy engine can DMA bytes directly to/from it without the driver staging through an internal pinned bounce buffer first.
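As a minimal sketch of the allocation step, assuming PyTorch with a CUDA build is installed, the snippet below requests a host tensor through the pin_memory=True path named above. The helper name alloc_host_buffer is hypothetical, and the fallback to pageable memory when no CUDA runtime is present is an assumption added so the sketch degrades gracefully.

```python
def alloc_host_buffer(num_bytes, pinned=True):
    """Return a 1-D uint8 host tensor, page-locked when possible."""
    import torch  # deferred so this sketch can be loaded without PyTorch

    # pin_memory=True routes the allocation through CUDA's page-locked host
    # allocator, which needs a working CUDA runtime. Assumption: fall back
    # to ordinary pageable memory when none is available.
    use_pin = pinned and torch.cuda.is_available()
    return torch.empty(num_bytes, dtype=torch.uint8, pin_memory=use_pin)
```

On a machine with a CUDA runtime, calling is_pinned() on the returned tensor should report whether the buffer is actually page-locked, which is a cheap check to run once at startup.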

Why pageable host memory breaks fully-async D2H

When the destination of a D2H copy is pageable (the default for malloc / regular Python tensors / torch.empty without pin_memory=True):

  1. The GPU's copy engine cannot safely DMA to a pageable address: the OS might page the destination out or move it mid-transfer.
  2. The driver therefore first copies the bytes into an internal pinned bounce buffer that it owns.
  3. The host CPU then copies them from the bounce buffer into the user's pageable destination.
  4. Because of this staging, the copy is not truly asynchronous: a cudaMemcpyAsync into pageable host memory returns only once the copy has completed, so the issuing host thread blocks and cannot enqueue further work. The transfer is effectively re-serialised with compute even if the user issued the D2H on a dedicated Copy Stream.

The result: with a pageable host buffer, dual CUDA streams provide little or no real overlap, and the GPU stalls anyway.

When the destination is pinned, the bounce-buffer step is skipped and the GPU's copy engine DMAs directly to the user-visible host memory. The D2H copy can now run truly asynchronously with respect to compute kernels on the Compute Stream, provided it is also issued on a separate Copy Stream.
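A single asynchronous D2H copy along these lines might look as follows in PyTorch. This is a sketch under stated assumptions, not the pipeline's actual code: the function name async_d2h and its parameters are hypothetical, and the kernels that produce gpu_tensor are assumed to have been queued on compute_stream before the call.

```python
def async_d2h(gpu_tensor, pinned_dst, compute_stream, copy_stream):
    """Enqueue a non-blocking D2H copy on copy_stream; return a 'done' event."""
    import torch  # deferred so this sketch can be loaded without PyTorch

    # The copy must not start before the kernels that produce gpu_tensor,
    # which are assumed to be queued on compute_stream, have finished.
    copy_stream.wait_stream(compute_stream)

    # Tell PyTorch's caching allocator that gpu_tensor is also used on
    # copy_stream, so its memory is not recycled while the copy is pending.
    gpu_tensor.record_stream(copy_stream)

    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        # non_blocking=True is only asynchronous for a pinned destination.
        pinned_dst.copy_(gpu_tensor, non_blocking=True)
        done.record(copy_stream)
    return done
```

The caller keeps queuing compute work and calls synchronize() on the returned event only at the point where it needs to read pinned_dst on the host.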

Compositional requirement

Pinned memory is a necessary but not sufficient condition for fully-async D2H. The complete recipe is:

  1. Pinned host buffer as the D2H destination (this concept).
  2. Dedicated Copy Stream so D2H doesn't serialise with compute on the default stream (see concepts/cuda-stream).
  3. Double buffering so adjacent chunks don't alias the same buffer (see patterns/double-buffer-cuda-events-pipeline-overlap).
  4. CUDA events as cross-stream barriers so the worker thread doesn't read a half-written buffer.

Without the pinned-memory step, the dual-stream and double-buffer machinery still runs, but the GPU stalls anyway because each D2H serialises through the bounce buffer.
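The four ingredients can be combined roughly as in the sketch below, again assuming PyTorch with CUDA. The names stream_chunks_to_host, produce_chunk and consume are hypothetical: produce_chunk(i, out) is assumed to queue kernels on the current stream that fill the device tensor out, and consume(i, host_buf) reads a finished host buffer. For brevity the host drains buffers on the calling thread, whereas the recipe above hands them to a worker thread.

```python
def stream_chunks_to_host(produce_chunk, consume, num_chunks, shape):
    """Overlap compute of chunk i with the D2H copy of chunk i-1."""
    import torch  # deferred so this sketch can be loaded without PyTorch

    compute = torch.cuda.current_stream()
    copy = torch.cuda.Stream()                                   # dedicated Copy Stream
    dev = [torch.empty(shape, device="cuda") for _ in range(2)]  # VRAM double buffer
    host = [torch.empty(shape, pin_memory=True) for _ in range(2)]  # pinned double buffer
    copied = [None, None]  # per slot: event marking the end of its last D2H

    for i in range(num_chunks):
        slot = i % 2
        if copied[slot] is not None:
            # The slot is about to be reused: wait for its previous D2H,
            # then drain the host buffer before it is overwritten.
            copied[slot].synchronize()
            consume(i - 2, host[slot])

        produce_chunk(i, dev[slot])        # kernels queued on the compute stream
        computed = torch.cuda.Event()
        computed.record(compute)

        copy.wait_event(computed)          # cross-stream barrier: copy after compute
        with torch.cuda.stream(copy):
            host[slot].copy_(dev[slot], non_blocking=True)
        copied[slot] = torch.cuda.Event()
        copied[slot].record(copy)

    # Drain the last one or two chunks still in flight.
    for i in range(max(0, num_chunks - 2), num_chunks):
        copied[i % 2].synchronize()
        consume(i, host[i % 2])
```

Because the loop waits only on the event of the slot it is about to reuse, the copy of the previous chunk in the other slot can still be in flight while the next chunk's kernels run.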

Trade-offs

  • Pinned memory is a scarce resource. Page-locked pages are physical RAM that the OS can no longer page out, so allocating too much of it shrinks the memory available to everything else and degrades whole-system performance. Pinned-memory allocations should be preallocated, reused, and bounded, not made per request.
  • Allocation is expensive. Page-locking pages requires per-page kernel work; a fresh cudaHostAlloc is much slower than a pageable malloc. Reuse, don't reallocate.
  • Application pattern: allocate a small fixed pool of pinned buffers at startup (one per stream, one per pipeline stage, etc.) and reuse them across requests. Double buffering with two pinned host buffers, one in flight and one draining, is the minimal viable pool size for the asynchronous frame-generation pipeline.
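One way to express the preallocate, reuse, and bound pattern is a small fixed-size pool. The sketch below is illustrative only: the class PinnedBufferPool and the helper pinned_alloc are hypothetical names, and the allocator is injected so the pool logic can be exercised without a GPU.

```python
import queue


class PinnedBufferPool:
    """Fixed-size pool of preallocated host buffers, reused across requests."""

    def __init__(self, num_buffers, alloc):
        # alloc() is called exactly num_buffers times, all at startup, so the
        # expensive page-locking work never happens on the request path.
        self._free = queue.Queue()
        for _ in range(num_buffers):
            self._free.put(alloc())

    def acquire(self, timeout=None):
        # Blocks (or raises queue.Empty after timeout) when every buffer is
        # in use, which caps the total amount of pinned memory.
        return self._free.get(timeout=timeout)

    def release(self, buf):
        # Hand the buffer back for reuse instead of freeing it.
        self._free.put(buf)


def pinned_alloc(num_bytes):
    """Build an allocator for real use; needs PyTorch with a CUDA runtime."""
    def _alloc():
        import torch  # deferred so the pool itself has no hard dependency
        return torch.empty(num_bytes, dtype=torch.uint8, pin_memory=True)
    return _alloc
```

With two buffers, PinnedBufferPool(2, pinned_alloc(frame_bytes)) matches the double-buffer case above, where frame_bytes stands for whatever size one D2H chunk needs.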

Wiki-attested usage

In the AWS / Synthesia Asynchronous Frame Generation Pipeline:

  • Two VRAM buffers on the GPU (double buffer for compute side).
  • Two pinned host RAM buffers for D2H destinations (double buffer for transfer side).
  • The post is explicit: "page-lock the required Host Memory buffers to make sure D2H copies are performed fully asynchronously."

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)

Adjacent concepts

  • Pinned vs unified memory. Unified memory (cudaMallocManaged) is a different abstraction: a virtual address valid on both host and device, with the runtime migrating pages on access. It has different performance characteristics from pinned memory and is generally slower for streaming D2H workloads. The wiki-attested pipeline uses pinned memory, not unified memory.
  • Pinned vs mapped memory. Mapped pinned memory (cudaHostAlloc with cudaHostAllocMapped) gives the GPU a pointer it can dereference directly into host memory across PCIe — useful for scatter-write patterns but typically slower than explicit DMA for streaming-read.
