CONCEPT Cited by 1 source
Pinned (page-locked) memory¶
Definition¶
Pinned memory (or page-locked memory) is host-side RAM
that the operating system has been told must not be paged out to
disk and must not be relocated to another physical address. The
allocation is done via CUDA host-side APIs such as cudaHostAlloc, or
through framework wrappers such as torch.empty(..., pin_memory=True),
which call into the OS to page-lock the underlying physical pages.
The key property: a pinned host buffer has a stable host-physical address for its lifetime, which means the GPU's copy engine can DMA bytes directly to/from it without the driver staging through an internal pinned bounce buffer first.
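A minimal sketch of allocating both kinds of host buffer from PyTorch and checking which one is page-locked. The function name `make_host_buffers` and the buffer shape are illustrative assumptions, and the import guard is only there so the snippet loads on a machine without PyTorch or a CUDA device.

```python
try:
    import torch  # assumed dependency; the guard lets the file load without it
except ImportError:
    torch = None


def make_host_buffers(shape=(4, 1024)):
    """Return (pageable, pinned) host tensors, or None when CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    # Default allocation: ordinary pageable host RAM.
    pageable = torch.empty(shape, dtype=torch.float32)
    # pin_memory=True asks PyTorch to allocate page-locked host RAM instead.
    pinned = torch.empty(shape, dtype=torch.float32, pin_memory=True)
    return pageable, pinned


buffers = make_host_buffers()
if buffers is not None:
    pageable, pinned = buffers
    print("pageable.is_pinned():", pageable.is_pinned())
    print("pinned.is_pinned():  ", pinned.is_pinned())
```

`Tensor.is_pinned()` is a cheap way to assert, at pipeline start-up, that every D2H destination really is page-locked.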
Why pageable host memory breaks fully-async D2H¶
When the destination of a D2H copy is pageable (the default
for malloc / regular Python tensors / torch.empty without
pin_memory=True):
- The CUDA runtime cannot DMA directly to a pageable address — the OS might page the destination out mid-transfer.
- The driver therefore copies bytes first to an internal pinned bounce buffer owned by the runtime.
- Then it copies from the bounce buffer to the user's pageable destination on the host CPU.
- This staged path makes the copy effectively synchronous: the copy call does not return until the transfer has completed, which re-serialises the operation with compute even if the user issued the D2H on a dedicated Copy Stream.
The result: dual CUDA streams provide no actual overlap when the host buffer is pageable. The GPU stalls anyway.
When the destination is pinned, the bounce-buffer step is skipped and the GPU's copy engine DMAs directly to the user-visible host memory. The D2H copy can then run truly asynchronously with respect to compute kernels on the Compute Stream, but only if it is also issued on a separate Copy Stream.
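The pinned-destination case can be sketched in PyTorch as follows. This is an illustration under assumptions, not the pipeline's actual code: `async_d2h`, `copy_stream`, and the tensor size are hypothetical names and values, and `torch.arange` stands in for whatever kernel produces the data.

```python
try:
    import torch  # assumed dependency; the guard lets the file load without it
except ImportError:
    torch = None


def async_d2h(num_elems=1 << 20):
    """Copy a device tensor into a pinned host buffer on a dedicated copy stream.

    Returns True if the host buffer matches the device data, or None when
    CUDA is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    compute_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()

    # Stand-in for real compute: the kernel runs on the compute stream.
    src = torch.arange(num_elems, dtype=torch.float32, device="cuda")
    # Pinned destination, so the copy engine can DMA straight into it.
    dst = torch.empty(num_elems, dtype=torch.float32, pin_memory=True)

    # The copy must not start before the producing kernel has finished.
    copy_stream.wait_stream(compute_stream)
    with torch.cuda.stream(copy_stream):
        dst.copy_(src, non_blocking=True)  # enqueued on copy_stream

    # The host must not read dst until the copy stream has drained.
    copy_stream.synchronize()
    return torch.equal(dst, src.cpu())


ok = async_d2h()
```

Swapping `pin_memory=True` for a default allocation keeps the code valid but moves the transfer back onto the staged, blocking path described above.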
Compositional requirement¶
Pinned memory is a necessary but not sufficient condition for fully-async D2H. The complete recipe is:
- Pinned host buffer as the D2H destination (this concept).
- Dedicated Copy Stream so D2H doesn't serialise with compute on the default stream (see concepts/cuda-stream).
- Double buffering so adjacent chunks don't alias the same buffer (see patterns/double-buffer-cuda-events-pipeline-overlap).
- CUDA events as cross-stream barriers so the worker thread doesn't read a half-written buffer.
If the pinned-memory step is skipped, the dual-stream and double-buffer machinery still runs, but the GPU stalls anyway because each D2H serialises through the bounce buffer.
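The four ingredients can be combined in a compact PyTorch sketch. Everything named here (`run_pipeline`, the chunk count and size, and `fill_` as a stand-in for the frame-generation kernel) is a hypothetical illustration of the recipe, not the implementation described in the source post.

```python
try:
    import torch  # assumed dependency; the guard lets the file load without it
except ImportError:
    torch = None


def run_pipeline(num_chunks=6, chunk_elems=1 << 16):
    """Double-buffered D2H pipeline; returns the first element of each chunk."""
    if torch is None or not torch.cuda.is_available():
        return None
    compute_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()                         # dedicated Copy Stream
    dev = [torch.empty(chunk_elems, device="cuda") for _ in range(2)]
    host = [torch.empty(chunk_elems, pin_memory=True) for _ in range(2)]  # pinned
    produced = [torch.cuda.Event() for _ in range(2)]
    copied = [torch.cuda.Event() for _ in range(2)]
    results = []

    for i in range(num_chunks):
        slot = i % 2                                          # double buffering
        if i >= 2:
            # Wait until chunk i-2 has fully landed in this slot, then consume
            # it before the slot is overwritten.
            copied[slot].synchronize()
            results.append(float(host[slot][0]))
        dev[slot].fill_(float(i))                             # stand-in for compute
        produced[slot].record(compute_stream)
        copy_stream.wait_event(produced[slot])                # cross-stream barrier
        with torch.cuda.stream(copy_stream):
            host[slot].copy_(dev[slot], non_blocking=True)
        copied[slot].record(copy_stream)

    # Drain the chunks still in flight, in submission order.
    for i in range(max(0, num_chunks - 2), num_chunks):
        slot = i % 2
        copied[slot].synchronize()
        results.append(float(host[slot][0]))
    return results


chunks = run_pipeline()
```

While the copy of chunk i is in flight on the copy stream, the compute stream is free to start chunk i+1 in the other slot; the `copied` events are what stop the host from reading a half-written buffer.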
Trade-offs¶
- Pinned memory is a scarce kernel resource. The OS must reserve physical RAM that cannot be paged. Allocating too much pinned memory degrades whole-system performance because it shrinks the pool of pageable RAM left for the OS and every other process. Pinned-memory allocations should be preallocated, reused, and bounded, not made per request.
- Allocation is expensive. Page-locking pages requires per-page kernel work; a fresh cudaHostAlloc is much slower than a pageable malloc. Reuse, don't reallocate.
- Application pattern: allocate a small fixed pool of pinned buffers at startup (one per stream, one per pipeline stage, etc.) and reuse them across requests. Double buffering with two pinned host buffers (one in flight, one draining) is the minimal viable pool size for the asynchronous frame-generation pipeline.
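One way to enforce "preallocate, reuse, bound" is a small fixed pool that is filled once at startup. The sketch below is an assumption about how such a pool might look; `PinnedBufferPool` and `alloc_host_buffer` are hypothetical names, and the allocator falls back to a pageable `bytearray` stand-in when PyTorch or CUDA is unavailable so the pool logic can be exercised anywhere.

```python
import queue

try:
    import torch  # assumed dependency; the guard lets the file load without it
except ImportError:
    torch = None


def alloc_host_buffer(nbytes):
    """Allocate a pinned byte buffer, or a pageable stand-in without CUDA."""
    if torch is not None and torch.cuda.is_available():
        return torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)
    return bytearray(nbytes)  # NOT pinned; only keeps the sketch runnable


class PinnedBufferPool:
    """Fixed-size pool: buffers are allocated once, then only reused."""

    def __init__(self, num_buffers, alloc):
        self._free = queue.Queue()
        self.allocations = 0
        for _ in range(num_buffers):
            self._free.put(alloc())      # the expensive page-locking happens here
            self.allocations += 1

    def acquire(self, timeout=None):
        # Blocks (or raises queue.Empty on timeout) when every buffer is in
        # flight, which is what keeps pinned-memory usage bounded.
        return self._free.get(block=True, timeout=timeout)

    def release(self, buf):
        self._free.put(buf)


# Two buffers: one in flight, one draining.
pool = PinnedBufferPool(2, lambda: alloc_host_buffer(1 << 20))
buf = pool.acquire()
try:
    pass  # use buf as the D2H destination for one chunk
finally:
    pool.release(buf)
```

Because `acquire` blocks rather than allocating a third buffer, back-pressure replaces unbounded pinned-memory growth when the consumer falls behind.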
Wiki-attested usage¶
In the AWS / Synthesia Asynchronous Frame Generation Pipeline:
- Two VRAM buffers on the GPU (double buffer for compute side).
- Two pinned host RAM buffers for D2H destinations (double buffer for transfer side).
- The post is explicit: "page-lock the required Host Memory buffers to make sure D2H copies are performed fully asynchronously."
Adjacent concepts¶
- Pinned vs unified memory. Unified memory (cudaMallocManaged) is a different abstraction: a virtual address valid on both host and device, with the runtime migrating pages on access. It has different performance characteristics from pinned memory and is generally slower for streaming D2H workloads. The wiki-attested pipeline uses pinned memory, not unified memory.
- Pinned vs mapped memory. Mapped pinned memory (cudaHostAlloc with cudaHostAllocMapped) gives the GPU a pointer it can dereference directly into host memory across PCIe; this is useful for scatter-write patterns but typically slower than explicit DMA for streaming-read.
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation. Two pinned host buffers paired with two VRAM buffers in the Asynchronous Frame Generation Pipeline; explicit reason given is "to make sure D2H copies are performed fully asynchronously."
Related¶
- concepts/cuda-stream — must be paired with pinned memory to deliver async D2H.
- concepts/device-to-host-transfer — the operation pinned memory is required for.
- concepts/gpu-kernel-utilization — saturation metric whose improvement depends on pinned memory's role in the pipeline.
- patterns/asynchronous-frame-generation-pipeline — full pattern this primitive participates in.
- patterns/double-buffer-cuda-events-pipeline-overlap — buffer-side composition that uses pinned host buffers.