Friday, 26 December 2025

Loading and storing short anti-diagonal pieces of a 2D grid of i32

Sometimes when working with a 2D grid, the computation of a cell depends on the values of the cell directly above it and the cell directly to its left, and optionally also on the value of the cell anti-diagonally[1] to its top-left. This pattern of data dependencies shows up for example in the Paeth filter, and may show up in some dynamic programming contexts, although in those cases you may be able to reorient your grid to store anti-diagonals contiguously. At first glance, this data dependency pattern ruins all hope for parallelism: if you visit the cells in the usual top-to-bottom, left-to-right order, each cell depends on the previous cell (and on a couple of cells from the previous row, but those are not the problem). But there is still hope: when processing the grid anti-diagonally, all the cells on an anti-diagonal are independent - they depend only on cells in the previous and second-previous anti-diagonal.
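For reference, the Paeth filter's predictor (as defined in the PNG specification) is exactly this kind of computation: it picks whichever of the left (a), above (b), and above-left (c) neighbours is closest to the estimate a + b - c, which is why each cell needs precisely those three neighbours:

```cpp
#include <cstdlib> // std::abs

// The Paeth predictor from the PNG specification: a = left, b = above,
// c = above-left neighbour of the current cell.
static int paeth_predict(int a, int b, int c)
{
    int p  = a + b - c;       // initial estimate
    int pa = std::abs(p - a); // distance to left
    int pb = std::abs(p - b); // distance to above
    int pc = std::abs(p - c); // distance to above-left
    if (pa <= pb && pa <= pc) return a; // ties break toward a, then b
    if (pb <= pc) return b;
    return c;
}
```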

On a smaller scale, we can take some small anti-diagonal pieces and use SIMD to exploit the parallelism within each piece.

There may be other reasons why you would want to load or store data this way aside from that specific pattern of data dependencies.

How to load or store those pieces

There's a simple trick: load 4 small rows of 4 cells, slightly offset, then 4x4 transpose them. For storing, do the opposite.

// see https://randombit.net/bitbashing/posts/integer_matrix_transpose_in_sse2.html
void _transpose4(__m128i& R0, __m128i& R1, __m128i& R2, __m128i& R3)
{
    __m128i T0 = _mm_unpacklo_epi32(R0, R1);
    __m128i T1 = _mm_unpacklo_epi32(R2, R3);
    __m128i T2 = _mm_unpackhi_epi32(R0, R1);
    __m128i T3 = _mm_unpackhi_epi32(R2, R3);
    R0 = _mm_unpacklo_epi64(T0, T1);
    R1 = _mm_unpackhi_epi64(T0, T1);
    R2 = _mm_unpacklo_epi64(T2, T3);
    R3 = _mm_unpackhi_epi64(T2, T3);
}

...

__m128i D0 = _mm_loadu_si128((__m128i*)(ptr + (w * 0) + 12));
__m128i D1 = _mm_loadu_si128((__m128i*)(ptr + (w * 1) + 8));
__m128i D2 = _mm_loadu_si128((__m128i*)(ptr + (w * 2) + 4));
__m128i D3 = _mm_loadu_si128((__m128i*)(ptr + (w * 3) + 0));
_transpose4(D0, D1, D2, D3);

(note the additional offset decreasing by one 4-byte cell per row as we go down the rows - ptr is a byte pointer here and w is the row pitch in bytes, so each row's 4-cell window starts one cell further left)
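The store side ("do the opposite") can be sketched like this: transpose the anti-diagonal pieces back into small rows, then write the rows at the same staggered offsets the loads used. The function name store_antidiag4 is mine, not from the post, and _transpose4 is repeated so the sketch stands alone:

```cpp
#include <emmintrin.h> // SSE2
#include <cstddef>

// Repeated from above so this sketch compiles on its own.
static void _transpose4(__m128i& R0, __m128i& R1, __m128i& R2, __m128i& R3)
{
    __m128i T0 = _mm_unpacklo_epi32(R0, R1);
    __m128i T1 = _mm_unpacklo_epi32(R2, R3);
    __m128i T2 = _mm_unpackhi_epi32(R0, R1);
    __m128i T3 = _mm_unpackhi_epi32(R2, R3);
    R0 = _mm_unpacklo_epi64(T0, T1);
    R1 = _mm_unpackhi_epi64(T0, T1);
    R2 = _mm_unpacklo_epi64(T2, T3);
    R3 = _mm_unpackhi_epi64(T2, T3);
}

// Store 4 anti-diagonal pieces D0..D3: transpose them back into 4 small
// rows, then write the rows at the staggered offsets the loads used.
// ptr is a byte pointer, w the row pitch in bytes (assumptions of mine).
static void store_antidiag4(unsigned char* ptr, std::ptrdiff_t w,
                            __m128i D0, __m128i D1, __m128i D2, __m128i D3)
{
    _transpose4(D0, D1, D2, D3);
    _mm_storeu_si128((__m128i*)(ptr + (w * 0) + 12), D0);
    _mm_storeu_si128((__m128i*)(ptr + (w * 1) + 8),  D1);
    _mm_storeu_si128((__m128i*)(ptr + (w * 2) + 4),  D2);
    _mm_storeu_si128((__m128i*)(ptr + (w * 3) + 0),  D3);
}
```

A load followed by this store round-trips: the 4x4 transpose is its own inverse, so the staggered cells land back exactly where they came from.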

A 4x4 transpose costs 8 shuffles, which is not great but still better than loading 16 individual DWORDs and building vectors out of them.

Before the transpose, we have 4 pieces of plain old sequential data, shown as one colour per vector register[3]; the cells that have already been processed are in gray:

After transposing, we have:
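The same mapping in scalar form, as a sanity check (cell values encode 100*row + col, a toy convention of mine): piece r is loaded from cells (r, 3-r) through (r, 6-r), and after the transpose each result vector holds one 4-cell anti-diagonal:

```cpp
#include <cstdint>

// Scalar re-enactment of the staggered loads + 4x4 transpose on a toy
// 4x16 grid where cell (r, c) holds 100*r + c, so lane contents are
// readable at a glance. Returns lane `lane` of anti-diagonal piece `i`.
static std::int32_t antidiag_lane(int i, int lane)
{
    std::int32_t grid[4][16];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 16; ++c)
            grid[r][c] = 100 * r + c;

    // Staggered "loads": row r's 4-cell window starts at column 3 - r,
    // one cell further left per row, matching the decreasing offsets.
    std::int32_t D[4][4];
    for (int r = 0; r < 4; ++r)
        for (int j = 0; j < 4; ++j)
            D[r][j] = grid[r][(3 - r) + j];

    // The transpose: lane r of piece j is D[r][j]. Piece 0 thus holds
    // cells (0,3), (1,2), (2,1), (3,0) - a 4-cell anti-diagonal.
    return D[lane][i];
}
```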

Extras

This trick may be used to implement the PNG filters "Paeth" and "average"[2]. Using it for the "up" filter would obviously be silly, and using it for the "sub" filter puts it in competition with prefix sums. In my experiments involving stb_image, the trick was worth using in terms of performance, but maybe not in terms of the effort it takes to integrate into an existing PNG decoder that expects to work purely row-by-row - at least not for me.

I see two main ways to extend this trick to AVX2: either doing 8 128-bit loads and transposing that data into 4 256-bit anti-diagonal pieces, or doing 4 256-bit loads and transposing that data into 8 128-bit anti-diagonal pieces, initially awkwardly combined in 4 256-bit vectors that have different pieces in their lower and upper halves. Both are awkward in their own way, but perhaps still worth doing, depending on the context.

If you actually have the data dependency that I described at the start of the post, the vector of "B pixels" (in PNG terms) also presents a puzzle. The B pixels that go into a column are mostly the results from the previous column, shifted up by 1 place and with a pixel from the previous row inserted at the start of the vector. Working with 128-bit vectors and SSSE3, that can be done with PALIGNR. 256-bit vectors in AVX2 make it a bit more annoying, and higher-latency as well, since a cross-lane shuffle gets involved. The latency matters here: this shift-up-and-insert operation[4] is on the critical path, part of the loop-carried dependency between anti-diagonals. Although annoying, that extra latency is not necessarily disqualifying.


  [1] In the way I use this terminology, "diagonally" is reserved for a line connecting the top left with the bottom right, and "anti-diagonally" for a line connecting the top right with the bottom left.
  [2] Not every PNG has 32 bits per pixel. 24 bits per pixel is reasonable to handle by inserting dummy alpha values before the transpose. Other pixel formats are rare and can be handled by legacy code.
  [3] Picking 4 colours that were unlikely to be mistaken for RGB (plus one extra) was surprisingly tricky. These don't look great, but at least they're hard to misinterpret.
  [4] Is this what RVV calls vslide1up?