Row-major vs column-major layout¶
Two conventions for laying out a 2D matrix in linear memory:
- **Row-major** stores each row contiguously. For an M × N matrix `A`, element `A[i][j]` lives at offset `i*N + j` in a flat buffer. Walking along a row is a sequential memory scan; walking down a column strides by `N` elements.
- **Column-major** stores each column contiguously. Element `A[i][j]` lives at offset `j*M + i`. Walking down a column is sequential; walking along a row strides by `M`.
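The two offset formulas can be checked directly. A minimal sketch (the class and method names here are illustrative, not from the source):

```java
public class Layout {
    // Row-major: rows are contiguous, so stepping one column right moves one element,
    // and stepping one row down strides by the row length (cols).
    static int rowMajorOffset(int i, int j, int rows, int cols) {
        return i * cols + j;
    }

    // Column-major: columns are contiguous, so stepping one row down moves one element,
    // and stepping one column right strides by the column length (rows).
    static int colMajorOffset(int i, int j, int rows, int cols) {
        return j * rows + i;
    }

    public static void main(String[] args) {
        int M = 3, N = 4; // a 3 × 4 matrix
        // A[1][2] sits at 1*4 + 2 = 6 in row-major, but 2*3 + 1 = 7 in column-major.
        System.out.println(rowMajorOffset(1, 2, M, N)); // 6
        System.out.println(colMajorOffset(1, 2, M, N)); // 7
    }
}
```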
Which convention is "right" depends on which access direction the hot loop uses. Getting it wrong turns every load into a cache miss and can slow numerical kernels by an order of magnitude — see concepts/cache-locality.
Language / ecosystem conventions¶
| Ecosystem | Convention |
|---|---|
| C, C++, Java, Rust, Python NumPy (default), PyTorch, TensorFlow | Row-major |
| Fortran, MATLAB, Julia, R, classical BLAS / LAPACK / LINPACK | Column-major |
| GPU (CUDA cuBLAS default) | Column-major (Fortran heritage) |
The BLAS ecosystem is column-major because it was designed in Fortran. This creates a structural mismatch when calling BLAS from row-major languages: either the caller transposes matrices (O(M·N) extra memory traffic) or uses the BLAS "transpose flag" parameters (which work but force the kernel to do column-vs-row stride arithmetic internally, losing some optimised code paths).
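There is a third option worth knowing: a row-major M × N buffer is bit-identical to the column-major N × M buffer of its transpose, so a column-major GEMM can compute a row-major C = A·B by evaluating Cᵀ = Bᵀ·Aᵀ with swapped operands and no copies at all. A sketch with a naive pure-Java column-major kernel standing in for a real BLAS routine (the kernel and its signature are illustrative assumptions, not a BLAS binding):

```java
public class GemmTrick {
    // Naive column-major GEMM stand-in for a BLAS sgemm (illustrative, not BLAS).
    // Computes C = A * B where all buffers are interpreted column-major:
    // A is m×k, B is k×n, C is m×n; element (i,j) lives at offset j*rows + i.
    static void gemmColMajor(int m, int n, int k,
                             float[] a, float[] b, float[] c) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++) {
                float sum = 0f;
                for (int p = 0; p < k; p++)
                    sum += a[p * m + i] * b[j * k + p];
                c[j * m + i] = sum;
            }
    }

    public static void main(String[] args) {
        // Row-major A (2×3) and B (3×2); we want row-major C = A·B (2×2).
        float[] a = {1, 2, 3,
                     4, 5, 6};
        float[] b = {7,  8,
                     9, 10,
                    11, 12};
        float[] c = new float[4];
        // Key identity: a row-major X is bit-identical to column-major Xᵀ.
        // So the column-major product Bᵀ·Aᵀ (here 2×2) writes column-major Cᵀ,
        // which is bit-identical to row-major C — no transpose is materialised.
        gemmColMajor(2, 2, 3, b, a, c); // operands swapped
        System.out.println(java.util.Arrays.toString(c)); // [58.0, 64.0, 139.0, 154.0]
    }
}
```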
Why the mismatch bites in practice¶
- Transpose-before-call — allocates a temporary column-major copy per call. Fine for infrequent large calls; disastrous for hot loops on small tiles.
- Transpose flags — avoid the copy but force the BLAS kernel to use strided memory access internally, often triggering a slower code path than the contiguous one.
- Mixed pipelines — when the caller, the kernel, and the consumer all have different conventions, data "ping-pongs" through layout conversions that dominate total runtime.
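The transpose-before-call cost in the first bullet is easy to see in code: every call allocates a fresh buffer and touches all M·N elements, with one side of the copy always striding. A minimal sketch (helper name is illustrative):

```java
public class TransposeCopy {
    // Transpose-before-call: copy a row-major rows×cols buffer into a fresh
    // column-major buffer. O(rows·cols) extra memory traffic plus a temporary
    // allocation per call — the writes stride by `rows`, so one side of the
    // copy is always cache-unfriendly.
    static float[] toColMajor(float[] rowMajor, int rows, int cols) {
        float[] out = new float[rows * cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                out[j * rows + i] = rowMajor[i * cols + j];
        return out;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3,
                     4, 5, 6};                 // row-major 2×3
        float[] cm = toColMajor(a, 2, 3);      // column-major copy of the same matrix
        System.out.println(java.util.Arrays.toString(cm)); // [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
    }
}
```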
How to sidestep the mismatch¶
- Stay in one convention end-to-end. Pre-normalise inputs into the hot path's native layout and design the full pipeline around it (row-major for Java/C++, column-major for Fortran/cuBLAS).
- Use a SIMD kernel in the caller's language. Pure-Java SIMD via the JDK Vector API operates natively on row-major Java arrays, avoiding the BLAS mismatch entirely.
- Reshape the problem. A row × row dot product can be expressed as `A × Bᵀ` without ever materialising a transposed `B`, by picking an access order the kernel is already optimised for.
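The last bullet can be sketched concretely: with `A` stored m×k and `B` stored n×k, both row-major, every inner product of `A × Bᵀ` reads two contiguous rows, so no transposed copy of `B` is ever built (class and method names are illustrative):

```java
public class RowDot {
    // C = A × Bᵀ with A an m×k and B an n×k matrix, both flat row-major.
    // Each output element is a dot product of a row of A with a row of B —
    // both scans are sequential, so the kernel never needs a transposed B.
    static float[] matmulABt(float[] a, float[] b, int m, int n, int k) {
        float[] c = new float[m * n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int p = 0; p < k; p++)
                    sum += a[i * k + p] * b[j * k + p]; // contiguous × contiguous
                c[i * n + j] = sum;
            }
        return c;
    }

    public static void main(String[] args) {
        float[] a = {1, 2,
                     3, 4};   // 2×2 row-major
        float[] b = {5, 6,
                     7, 8};   // 2×2 row-major; logically we compute A × Bᵀ
        System.out.println(java.util.Arrays.toString(matmulABt(a, b, 2, 2, 2)));
        // [17.0, 23.0, 39.0, 53.0]
    }
}
```

This is the same access shape a SIMD kernel wants: both operands of every fused multiply-add come from sequential loads.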
Seen in¶
- sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — Netflix Ranker's BLAS evaluation flagged row-vs-column-major translation as one of the three load-bearing reasons BLAS lost the kernel competition to pure-Java SIMD: "Java's row-major layout doesn't match the column-major expectations of many BLAS routines, which can introduce conversion and temporary buffers." The JDK Vector API path kept the flat row-major Java buffers end-to-end, no translation.
Related¶
- concepts/cache-locality — the performance axis the layout choice actually lives on.
- concepts/jni-transition-overhead — the sibling JNI-to-BLAS cost that compounds with layout translation.
- concepts/matrix-multiplication-accumulate — the hardware primitive; kernels tune for a specific layout.
- patterns/flat-buffer-threadlocal-reuse — Netflix's row-major flat-buffer rewrite that made the Vector API kernel fast.