Row-major vs column-major layout¶
Two conventions for laying out a 2D matrix in linear memory:
- **Row-major** stores each row contiguously. For an M × N matrix `A`, element `A[i][j]` lives at offset `i*N + j` in a flat buffer. Walking along a row is a sequential memory scan; walking down a column strides by `N` elements.
- **Column-major** stores each column contiguously. Element `A[i][j]` lives at offset `j*M + i`. Walking down a column is sequential; walking along a row strides by `M`.
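The two offset formulas can be checked directly. A minimal sketch (the class and method names here are illustrative, not from the source):

```java
public class Layout {
    // Row-major: rows are contiguous, so stepping one column right moves one element,
    // and stepping one row down strides by the row length (cols).
    static int rowMajorOffset(int i, int j, int rows, int cols) {
        return i * cols + j;
    }

    // Column-major: columns are contiguous, so stepping one row down moves one element,
    // and stepping one column right strides by the column length (rows).
    static int colMajorOffset(int i, int j, int rows, int cols) {
        return j * rows + i;
    }

    public static void main(String[] args) {
        int M = 3, N = 4; // a 3 × 4 matrix
        // A[1][2] sits at 1*4 + 2 = 6 in row-major, but 2*3 + 1 = 7 in column-major.
        System.out.println(rowMajorOffset(1, 2, M, N)); // 6
        System.out.println(colMajorOffset(1, 2, M, N)); // 7
    }
}
```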
Which convention is "right" depends on which access direction the hot loop uses. Getting it wrong turns every load into a cache miss and can slow numerical kernels by an order of magnitude — see concepts/cache-locality.
Language / ecosystem conventions¶
| Ecosystem | Convention |
|---|---|
| C, C++, Java, Rust, Python NumPy (default), PyTorch, TensorFlow | Row-major |
| Fortran, MATLAB, Julia, R, classical BLAS / LAPACK / LINPACK | Column-major |
| GPU (CUDA cuBLAS default) | Column-major (Fortran heritage) |
The BLAS ecosystem is column-major because it was designed in Fortran. This creates a structural mismatch when calling BLAS from row-major languages: either the caller transposes matrices (O(M·N) extra memory traffic) or uses the BLAS "transpose flag" parameters (which work but force the kernel to do column-vs-row stride arithmetic internally, losing some optimised code paths).
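There is a third option worth knowing: a row-major M × N buffer is bit-identical to the column-major N × M buffer of its transpose, so a column-major GEMM can compute a row-major C = A·B by evaluating Cᵀ = Bᵀ·Aᵀ with swapped operands and no copies at all. A sketch with a naive pure-Java column-major kernel standing in for a real BLAS routine (the kernel and its signature are illustrative assumptions, not a BLAS binding):

```java
public class GemmTrick {
    // Naive column-major GEMM stand-in for a BLAS sgemm (illustrative, not BLAS).
    // Computes C = A * B where all buffers are interpreted column-major:
    // A is m×k, B is k×n, C is m×n; element (i,j) lives at offset j*rows + i.
    static void gemmColMajor(int m, int n, int k,
                             float[] a, float[] b, float[] c) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++) {
                float sum = 0f;
                for (int p = 0; p < k; p++)
                    sum += a[p * m + i] * b[j * k + p];
                c[j * m + i] = sum;
            }
    }

    public static void main(String[] args) {
        // Row-major A (2×3) and B (3×2); we want row-major C = A·B (2×2).
        float[] a = {1, 2, 3,
                     4, 5, 6};
        float[] b = {7,  8,
                     9, 10,
                    11, 12};
        float[] c = new float[4];
        // Key identity: a row-major X is bit-identical to column-major Xᵀ.
        // So the column-major product Bᵀ·Aᵀ (here 2×2) writes column-major Cᵀ,
        // which is bit-identical to row-major C — no transpose is materialised.
        gemmColMajor(2, 2, 3, b, a, c); // operands swapped
        System.out.println(java.util.Arrays.toString(c)); // [58.0, 64.0, 139.0, 154.0]
    }
}
```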
Why the mismatch bites in practice¶
- Transpose-before-call — allocates a temporary column-major copy per call. Fine for infrequent large calls; disastrous for hot loops on small tiles.
- Transpose flags — avoid the copy but force the BLAS kernel to use strided memory access internally, often triggering a slower code path than the contiguous one.
- Mixed pipelines — when the caller, the kernel, and the consumer all have different conventions, data "ping-pongs" through layout conversions that dominate total runtime.
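The transpose-before-call cost in the first bullet is easy to see in code: every call allocates a fresh buffer and touches all M·N elements, with one side of the copy always striding. A minimal sketch (helper name is illustrative):

```java
public class TransposeCopy {
    // Transpose-before-call: copy a row-major rows×cols buffer into a fresh
    // column-major buffer. O(rows·cols) extra memory traffic plus a temporary
    // allocation per call — the writes stride by `rows`, so one side of the
    // copy is always cache-unfriendly.
    static float[] toColMajor(float[] rowMajor, int rows, int cols) {
        float[] out = new float[rows * cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                out[j * rows + i] = rowMajor[i * cols + j];
        return out;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3,
                     4, 5, 6};                 // row-major 2×3
        float[] cm = toColMajor(a, 2, 3);      // column-major copy of the same matrix
        System.out.println(java.util.Arrays.toString(cm)); // [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
    }
}
```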
How to sidestep the mismatch¶
- Stay in one convention end-to-end. Pre-normalise inputs into the hot path's native layout and design the full pipeline around it (row-major for Java/C++, column-major for Fortran/cuBLAS).
- Use a SIMD kernel in the caller's language. Pure-Java SIMD via the JDK Vector API operates natively on row-major Java arrays, avoiding the BLAS mismatch entirely.
- Reshape the problem. A row × row dot product can be expressed as `A × Bᵀ` without ever materialising a transposed `B`, by picking an access order the kernel is already optimised for.
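The last bullet can be sketched concretely: with `A` stored m×k and `B` stored n×k, both row-major, every inner product of `A × Bᵀ` reads two contiguous rows, so no transposed copy of `B` is ever built (class and method names are illustrative):

```java
public class RowDot {
    // C = A × Bᵀ with A an m×k and B an n×k matrix, both flat row-major.
    // Each output element is a dot product of a row of A with a row of B —
    // both scans are sequential, so the kernel never needs a transposed B.
    static float[] matmulABt(float[] a, float[] b, int m, int n, int k) {
        float[] c = new float[m * n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int p = 0; p < k; p++)
                    sum += a[i * k + p] * b[j * k + p]; // contiguous × contiguous
                c[i * n + j] = sum;
            }
        return c;
    }

    public static void main(String[] args) {
        float[] a = {1, 2,
                     3, 4};   // 2×2 row-major
        float[] b = {5, 6,
                     7, 8};   // 2×2 row-major; logically we compute A × Bᵀ
        System.out.println(java.util.Arrays.toString(matmulABt(a, b, 2, 2, 2)));
        // [17.0, 23.0, 39.0, 53.0]
    }
}
```

This is the same access shape a SIMD kernel wants: both operands of every fused multiply-add come from sequential loads.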
Seen in¶
- sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — Netflix Ranker's BLAS evaluation flagged row-vs-column-major translation as one of the three load-bearing reasons BLAS lost the kernel competition to pure-Java SIMD: "Java's row-major layout doesn't match the column-major expectations of many BLAS routines, which can introduce conversion and temporary buffers." The JDK Vector API path kept the flat row-major Java buffers end-to-end, no translation.
Related¶
- concepts/cache-locality — the performance axis the layout choice actually lives on.
- concepts/jni-transition-overhead — the sibling JNI-to-BLAS cost that compounds with layout translation.
- concepts/matrix-multiplication-accumulate — the hardware primitive; kernels tune for a specific layout.
- patterns/flat-buffer-threadlocal-reuse — Netflix's row-major flat-buffer rewrite that made the Vector API kernel fast.