It's no secret that GF2P8AFFINEQB can be tricky to think about, even in the restricted context of bit-permutations. Thinking about more than one step (such as more than one GF2P8AFFINEQB back-to-back, or GF2P8AFFINEQB flanked by byte-wise shuffles) is just too much. Or perhaps you can do it; if so, tell me your secret.
A good way for mere mortals to reason about these kinds of permutations, I think, is to think in terms of the bits of the indices of the bits that are really being permuted. So we're 4 levels deep:
- The value whose bits are being permuted.
- The bits that are being permuted.
- The indices of those bits.
- The bits of those indices.
This can get a little confusing because a lot of the time the operation that will be performed on the bits of those indices is a permutation again, but it doesn't have to be. Another classic example is that a rotation corresponds to adding a constant to (or subtracting a constant from) the indices. Just keep in mind that we're 4 levels deep the entire time.
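To make the rotation remark concrete, here is a minimal scalar sketch in plain C. The helper names rotl64 and rotl64_index are made up for this illustration. It only shows that rotating the value and adding a constant to the bit index (modulo 64) describe the same thing.

```c
#include <assert.h>
#include <stdint.h>

/* Value-level view: rotate a 64-bit value left by r (0..63). */
uint64_t rotl64(uint64_t x, unsigned r)
{
    assert(r < 64);
    return r == 0 ? x : (x << r) | (x >> (64 - r));
}

/* Index-level view of the same operation: the bit at index i
   moves to index (i + r) mod 64. */
unsigned rotl64_index(unsigned i, unsigned r)
{
    assert(i < 64 && r < 64);
    return (i + r) & 63;
}
```

Feeding a single set bit through rotl64 and comparing its new position with rotl64_index is a quick way to convince yourself that the two views agree.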
Actually we don't need to go deeper.
The building blocks
Assuming we have 512 bits to work with, the indices of those bits are 0..511: 9-bit numbers. We will split that into 3 groups of 3 bits, denoted a,b,c where a locates a QWORD in the 512-bit register, b locates a byte within that QWORD, and c locates a bit within that byte.
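As a small sketch of that split (the struct and function names are hypothetical, chosen just for this example), a 9-bit index can be taken apart into a, b and c and put back together like this:

```c
#include <assert.h>

/* The three 3-bit groups of a 9-bit bit index. */
struct bit_index {
    unsigned a; /* which QWORD in the 512-bit register (0..7) */
    unsigned b; /* which byte within that QWORD (0..7) */
    unsigned c; /* which bit within that byte (0..7) */
};

/* Split a bit index 0..511 into its a, b, c groups. */
struct bit_index split_index(unsigned i)
{
    assert(i < 512);
    struct bit_index g = { i >> 6, (i >> 3) & 7, i & 7 };
    return g;
}

/* Pack a, b, c back into a single 9-bit index. */
unsigned join_index(struct bit_index g)
{
    assert(g.a < 8 && g.b < 8 && g.c < 8);
    return (g.a << 6) | (g.b << 3) | g.c;
}
```

For example, bit index 77 is bit 5 of byte 1 of QWORD 1, so it splits into a = 1, b = 1, c = 5.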
Here are some nice building blocks (given fairly arbitrary names):
- Pf(a,b,c) = a,b,f(c) aka "right GF2P8AFFINEQB", where f is any mapping from a 3-bit integer to a 3-bit integer. This building block can be implemented with _mm512_gf2p8affine_epi64_epi8(input, _mm512_set1_epi64(f_as_a_reversed_matrix), 0)
- Qf(a,b,c) = a,f(c),~b aka "left GF2P8AFFINEQB", where ~b is a 3-bit inversion, equivalent to 7 - b. f can often be the identity mapping: swapping the second and third groups of bits is useful on its own (though the "bonus" inversion can be annoying to deal with). This building block can be implemented with _mm512_gf2p8affine_epi64_epi8(_mm512_set1_epi64(f_as_a_matrix), input, 0)
- Sg(a,b,c) = g(a,b),c aka Shuffle, where g is any mapping from a 6-bit integer to a 6-bit integer. This building block can be implemented with _mm512_permutexvar_epi8(g_as_an_array, input), but in some cases also with another instruction that you may prefer, depending on the mapping.
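To play with P and Q without AVX-512 hardware, here is a scalar sketch of a single 64-bit lane of GF2P8AFFINEQB, written from my reading of the instruction's documented pseudocode. The names parity8, affine_qword and get_bit are made up for this example; treat it as a model for experimenting, not as a reference implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Parity (XOR of all bits) of a byte. */
unsigned parity8(uint8_t v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1;
}

/* Scalar model of one 64-bit lane of GF2P8AFFINEQB.
   x is the first (data) operand, A is the second (matrix) operand.
   Bit i of each result byte is parity(A.byte[7 - i] & x.byte) ^ imm.bit[i]. */
uint64_t affine_qword(uint64_t x, uint64_t A, uint8_t imm)
{
    uint64_t r = 0;
    for (int j = 0; j < 8; j++) {
        uint8_t xb = (uint8_t)(x >> (8 * j));
        uint64_t out = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t row = (uint8_t)(A >> (8 * (7 - i)));
            unsigned bit = parity8(row & xb) ^ ((imm >> i) & 1u);
            out |= (uint64_t)bit << i;
        }
        r |= out << (8 * j);
    }
    return r;
}

/* Read the bit at byte b, bit c of a QWORD. */
unsigned get_bit(uint64_t v, unsigned b, unsigned c)
{
    assert(b < 8 && c < 8);
    return (unsigned)(v >> (8 * b + c)) & 1;
}
```

With the constant 0x8040201008040201 as the first operand and the input as the second, this model moves the bit at (b,c) to (c,~b), which is Q with f = identity. With the operands the other way around, the same constant acts as P with f = inversion, reversing the bits of every byte.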
S, though it doesn't touch c, is quite powerful. As a couple of special cases that may be of interest, it can be used to swap a and b, invert a or b, or do a combined swap-and-invert.
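One detail that is easy to trip over: a byte shuffle like VPERMB gathers (output byte d is read from input byte idx[d]), while the notation here describes where each group of index bits goes. The index array is therefore the inverse of the "forward" mapping. The sketch below uses hypothetical helper names, and assumes the forward mapping is a bijection. It builds such an array for the swap-and-invert case.

```c
#include <assert.h>
#include <stdint.h>

/* A forward mapping on byte indices: source byte (a,b), packed as
   a*8 + b, goes to destination byte g(a*8 + b). */
typedef unsigned (*byte_map)(unsigned);

/* Example mapping: swap a and b while inverting a, (a,b) -> (b,~a). */
unsigned swap_and_invert_a(unsigned s)
{
    unsigned a = s >> 3, b = s & 7;
    return (b << 3) | (7 - a);
}

/* Fill idx[] for a gather-style byte shuffle (out[d] = in[idx[d]]).
   Because the shuffle gathers, idx is the inverse of the forward map g.
   g must be a bijection on 0..63, otherwise some entries stay unwritten. */
void build_gather_table(byte_map g, uint8_t idx[64])
{
    for (unsigned s = 0; s < 64; s++) {
        unsigned d = g(s);
        assert(d < 64);
        idx[d] = (uint8_t)s;
    }
}
```

For swap_and_invert_a the resulting array starts 56, 48, 40, 32, 24, 16, 8, 0, 57, 49 and so on, and could be loaded as the index operand of _mm512_permutexvar_epi8.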
We could further distinguish:
- S64f(a,b,c) = f(a),b,c aka VPERMQ. This building block can be implemented with, you guessed it, VPERMQ.
- S8f(a,b,c) = a,f(b),c aka PSHUFB. This building block can be implemented with, you guessed it, PSHUFB. PSHUFB allows a bit more freedom than is used here: the mapping could be from 4-bit integers to 4-bit integers, but that's not nice to think about in this framework of 3 groups of 3 bits.
Building something with the blocks
Let's say that we want to take a vector of 8 64-bit integers, and transpose it into a vector of 64 8-bit integers such that the k'th bit of the n'th uint64 ends up in the n'th bit of the k'th uint8. In terms of the bits of the indices of the bits (I swear it's not as confusing as it sounds) that means we want to build something that maps a,b,c to b,c,a. It's immediately clear that we need a Q operation at some point, since it's the only way to move another group of bits into the 3rd position. But if we start with a Q, we get ~b in the 3rd position while we need a. We can solve that by starting with an S that swaps a and b while also inverting a (I'm not going to bother defining what that looks like in terms of an index mapping function, just imagine that those functions are whatever they need to be in order to make it work):
Sf(a,b,c) = b,~a,c
Qid(b,~a,c) = b,c,a
Which translates into code like this:
    __m512i Transpose8x64(__m512i x)
    {
        // S: swap a and b while inverting a, so a,b,c becomes b,~a,c
        x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
            56, 48, 40, 32, 24, 16, 8, 0,
            57, 49, 41, 33, 25, 17, 9, 1,
            58, 50, 42, 34, 26, 18, 10, 2,
            59, 51, 43, 35, 27, 19, 11, 3,
            60, 52, 44, 36, 28, 20, 12, 4,
            61, 53, 45, 37, 29, 21, 13, 5,
            62, 54, 46, 38, 30, 22, 14, 6,
            63, 55, 47, 39, 31, 23, 15, 7), x);
        // Q with f = identity: b,~a,c becomes b,c,a
        __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
        x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
        return x;
    }
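When testing something like this, it helps to have a dumb scalar version of the specification to compare against. The following is a sketch of such a reference. The function name transpose8x64_ref is made up, and it works on plain arrays rather than on a __m512i, so it runs anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference transpose: bit k of in[n] ends up in bit n of out[k]. */
void transpose8x64_ref(const uint64_t in[8], uint8_t out[64])
{
    assert(in != NULL && out != NULL);
    memset(out, 0, 64);
    for (unsigned n = 0; n < 8; n++) {
        for (unsigned k = 0; k < 64; k++) {
            out[k] |= (uint8_t)(((in[n] >> k) & 1) << n);
        }
    }
}
```

Storing the vector result to a 64-byte buffer and comparing it with out[] for a handful of random inputs should catch a wrong shuffle table or a swapped operand quickly.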
Now let's say that we want to do the inverse of that, going back from b,c,a to a,b,c. Again it's clear that we need a Q, but we have some choice now. We could start by inverting the c in the middle first:
S8f1(b,c,a) = b,~c,a
Qid(b,~c,a) = b,a,c
Sf2(b,a,c) = a,b,c
Which translates into code like this:
    __m512i Transpose64x8(__m512i x)
    {
        // S8: reverse the bytes within each QWORD, inverting the middle group
        x = _mm512_shuffle_epi8(x, _mm512_setr_epi8(
            7, 6, 5, 4, 3, 2, 1, 0,
            15, 14, 13, 12, 11, 10, 9, 8,
            23, 22, 21, 20, 19, 18, 17, 16,
            31, 30, 29, 28, 27, 26, 25, 24,
            39, 38, 37, 36, 35, 34, 33, 32,
            47, 46, 45, 44, 43, 42, 41, 40,
            55, 54, 53, 52, 51, 50, 49, 48,
            63, 62, 61, 60, 59, 58, 57, 56));
        // Q with f = identity: swap the last two groups, undoing the inversion
        __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
        x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
        // S: swap the first two groups
        x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
            0, 8, 16, 24, 32, 40, 48, 56,
            1, 9, 17, 25, 33, 41, 49, 57,
            2, 10, 18, 26, 34, 42, 50, 58,
            3, 11, 19, 27, 35, 43, 51, 59,
            4, 12, 20, 28, 36, 44, 52, 60,
            5, 13, 21, 29, 37, 45, 53, 61,
            6, 14, 22, 30, 38, 46, 54, 62,
            7, 15, 23, 31, 39, 47, 55, 63), x);
        return x;
    }
Or we could start with a Q to get the a out of the third position, then use an S to swap the first and second positions and a P to invert c (in any order).
Qid(b,c,a) = b,a,~c
Sf1(b,a,~c) = a,b,~c
Pf2(a,b,~c) = a,b,c
Which translates into code like this:
    __m512i Transpose64x8(__m512i x)
    {
        // Q with f = identity: b,c,a becomes b,a,~c
        __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
        x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
        // S: swap the first two groups
        x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
            0, 8, 16, 24, 32, 40, 48, 56,
            1, 9, 17, 25, 33, 41, 49, 57,
            2, 10, 18, 26, 34, 42, 50, 58,
            3, 11, 19, 27, 35, 43, 51, 59,
            4, 12, 20, 28, 36, 44, 52, 60,
            5, 13, 21, 29, 37, 45, 53, 61,
            6, 14, 22, 30, 38, 46, 54, 62,
            7, 15, 23, 31, 39, 47, 55, 63), x);
        // P: as the right operand, the same constant reverses the bits of each byte
        x = _mm512_gf2p8affine_epi64_epi8(x, idmatrix, 0);
        return x;
    }
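Both routes for the inverse can be sanity-checked purely at the level of index groups, before any masks or shuffle tables exist. The sketch below models each building block as a function on a triple of 3-bit groups. All names are hypothetical, and the blocks are fixed to the specific mappings used above (identity for Q, inversion for P and S8, a plain swap for S).

```c
#include <assert.h>

/* One 9-bit bit index, held as its three 3-bit groups. */
struct idx3 { unsigned x, y, z; };

/* Q with f = identity: (x,y,z) -> (x, z, ~y). */
struct idx3 q_id(struct idx3 i)
{
    assert(i.y < 8);
    struct idx3 r = { i.x, i.z, 7 - i.y };
    return r;
}

/* P with f = inversion: (x,y,z) -> (x, y, ~z). */
struct idx3 p_inv(struct idx3 i)
{
    assert(i.z < 8);
    struct idx3 r = { i.x, i.y, 7 - i.z };
    return r;
}

/* S swapping the first two groups: (x,y,z) -> (y, x, z). */
struct idx3 s_swap(struct idx3 i)
{
    struct idx3 r = { i.y, i.x, i.z };
    return r;
}

/* S8 inverting the middle group: (x,y,z) -> (x, ~y, z). */
struct idx3 s8_inv(struct idx3 i)
{
    assert(i.y < 8);
    struct idx3 r = { i.x, 7 - i.y, i.z };
    return r;
}

/* Route 1: invert the middle group, then Q, then swap. */
struct idx3 route1(struct idx3 i) { return s_swap(q_id(s8_inv(i))); }

/* Route 2: Q, then swap, then P. */
struct idx3 route2(struct idx3 i) { return p_inv(s_swap(q_id(i))); }
```

Starting from b,c,a, both route1 and route2 should come back with a,b,c for every combination of the three groups, which is only 512 cases to loop over.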
I will probably keep using a SAT solver to solve the masks (using the same techniques as in (Not) transposing a 16x16 bitmatrix), but now at least I have a proper way to think about the shape of the solution, which makes it a lot easier to ask a SAT solver to fill in the specifics.
This framework could be extended with other bit-permutation operations such as QWORD rotates, but that quickly becomes tricky to think about.