CONCEPT Cited by 1 source

Row-major vs column-major layout

Two conventions for laying out a 2D matrix in linear memory:

  • Row-major stores each row contiguously. For an M × N matrix A, element A[i][j] lives at offset i*N + j in a flat buffer. Walking along a row is a sequential memory scan; walking down a column strides by N elements.
  • Column-major stores each column contiguously. Element A[i][j] lives at offset j*M + i. Walking down a column is sequential; walking along a row strides by M.
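
The two offset formulas can be checked with a minimal pure-Python sketch (helper names `rm_offset`/`cm_offset` are illustrative, not from any library):

```python
def rm_offset(i, j, M, N):
    """Row-major: rows are contiguous, so A[i][j] -> i*N + j."""
    return i * N + j

def cm_offset(i, j, M, N):
    """Column-major: columns are contiguous, so A[i][j] -> j*M + i."""
    return j * M + i

# The 2 x 3 matrix A = [[10, 11, 12],
#                       [20, 21, 22]] flattened both ways:
row_major = [10, 11, 12, 20, 21, 22]   # row 0, then row 1
col_major = [10, 20, 11, 21, 12, 22]   # col 0, col 1, col 2

M, N = 2, 3
# Both layouts hold the same matrix; only the index arithmetic differs.
assert all(row_major[rm_offset(i, j, M, N)] == col_major[cm_offset(i, j, M, N)]
           for i in range(M) for j in range(N))
```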

Which convention is "right" depends on which access direction the hot loop uses. Getting it wrong turns every load into a cache miss and can slow numerical kernels by an order of magnitude — see concepts/cache-locality.
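
The stride difference is visible even in a toy flat buffer: in row-major layout a row is a contiguous slice, while a column is a strided slice with step N, which is exactly the access pattern that defeats hardware prefetching on large matrices (a sketch, not a benchmark):

```python
M, N = 3, 4
# Flat row-major buffer where A[i][j] = i*N + j, so each element
# records its own offset.
buf = [i * N + j for i in range(M) for j in range(N)]

row_1 = buf[1 * N:2 * N]   # contiguous scan: offsets 4..7
col_2 = buf[2::N]          # strided scan: offsets 2, 6, 10 (step N)

assert row_1 == [4, 5, 6, 7]
assert col_2 == [2, 6, 10]
```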

Language / ecosystem conventions

  • Row-major: C, C++, Java, Rust, Python NumPy (default), PyTorch, TensorFlow
  • Column-major: Fortran, MATLAB, Julia, R, classical BLAS / LAPACK / LINPACK, and GPU cuBLAS (CUDA default; Fortran heritage)

The BLAS ecosystem is column-major because it was designed in Fortran. This creates a structural mismatch when calling BLAS from row-major languages: either the caller transposes matrices (O(M·N) extra memory traffic) or uses the BLAS "transpose flag" parameters (which work but force the kernel to do column-vs-row stride arithmetic internally, losing some optimised code paths).
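
A pure-Python sketch of why the boundary needs a copy or a flag: the same flat buffer decodes to different matrices under the two conventions (the "Fortran BLAS consumer" here is simulated, not a real BLAS call):

```python
M, N = 2, 3
A = [[1, 2, 3],
     [4, 5, 6]]

# Row-major caller flattens A the way C or Java would.
flat = [A[i][j] for i in range(M) for j in range(N)]   # [1, 2, 3, 4, 5, 6]

# A column-major consumer reading the same buffer as an M x N matrix
# decodes B[i][j] = flat[j*M + i]:
B = [[flat[j * M + i] for j in range(N)] for i in range(M)]

assert B == [[1, 3, 5], [2, 4, 6]]   # scrambled, not A
# (Read as an N x M matrix instead, the buffer is exactly A transposed,
# which is what the BLAS transpose flags exploit.)
```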

Why the mismatch bites in practice

  1. Transpose-before-call — allocates a temporary column-major copy per call. Fine for infrequent large calls; disastrous for hot loops on small tiles.
  2. Transpose flags — avoid the copy but force the BLAS kernel to use strided memory access internally, often triggering a slower code path than the contiguous one.
  3. Mixed pipelines — when the caller, the kernel, and the consumer all have different conventions, data "ping-pongs" through layout conversions that dominate total runtime.

How to sidestep the mismatch

  • Stay in one convention end-to-end. Pre-normalise inputs into the hot path's native layout and design the full pipeline around it (row-major for Java/C++, column-major for Fortran/cuBLAS).
  • Use a SIMD kernel in the caller's language. Pure-Java SIMD via the JDK Vector API operates natively on row-major Java arrays, avoiding the BLAS mismatch entirely.
  • Reshape the problem. A row × row dot-product can be expressed as A × Bᵀ without ever materialising a transposed B, by picking an access order the kernel is already optimised for.
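
The reshape trick in the last bullet can be sketched in a few lines: since (A × Bᵀ)[i][j] is the dot product of row i of A with row j of B, both operands are scanned sequentially in row-major layout and no transposed copy of B is ever built (`matmul_abt` is an illustrative name, not a library routine):

```python
def matmul_abt(A, B):
    """Compute A @ B^T where A is M x K and B is N x K.

    C[i][j] = dot(A[i], B[j]): every access walks a row, which is a
    contiguous scan in row-major storage.
    """
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B]
            for row_a in A]

A = [[1, 2], [3, 4]]   # 2 x 2
B = [[5, 6], [7, 8]]   # 2 x 2; we want A @ B^T
C = matmul_abt(A, B)
assert C == [[17, 23], [39, 53]]
```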

Seen in

  • sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — Netflix Ranker's BLAS evaluation flagged row-vs-column-major translation as one of the three load-bearing reasons BLAS lost the kernel competition to pure-Java SIMD: "Java's row-major layout doesn't match the column-major expectations of many BLAS routines, which can introduce conversion and temporary buffers." The JDK Vector API path kept the flat row-major Java buffers end-to-end, no translation.