As a reminder, `grev` is generalized bit-reversal, and performs a bit-permutation that corresponds to XORing the indices of the bits of the left operand by the number given in the right operand. A possible (not very suitable for actual use, but illustrative) implementation of `grev` is:

```python
import numpy as np

def grev(bits, k): return bits[np.arange(len(bits)) ^ k]
```

`grevmul` (see also this older post where I list some of its properties) can be defined in terms of `grev` (hence the name^{[1]}), with `grev` replacing the left shift in a carryless multiplication. But let's do something else first. An interesting way to look at multiplication (plain old multiplication between natural numbers) is as:

- Form the Cartesian product of the inputs, viewed as arrays of bits.
- For each pair in the Cartesian product, compute the AND of the two bits.
- Send the AND of each pair with index `(i, j)` to the bin `i + j`, to be accumulated with the function `(+)`.
- Resolve the carries.

Carryless multiplication works mostly the same way, except that accumulation is done with XOR, which makes the last step (resolving the carries) unnecessary. `grevmul` also works mostly the same way, with accumulation also done with XOR, but now the pair with index `(i, j)` is sent to the bin `i XOR j`.
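The three variants can be modeled directly from that description. A minimal Python sketch (my own reference implementations, not code from any library; the function names `imul`, `clmul`, `grevmul` are just illustrative): the only differences are which bin the pair `(i, j)` goes to and how bins are accumulated.

```python
def imul(a, b, n=64):
    # bin i + j, accumulated with +; carries resolved by ordinary addition
    total = 0
    for i in range(n):
        for j in range(n):
            total += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return total

def clmul(a, b, n=64):
    # bin i + j, accumulated with XOR; no carries to resolve
    r = 0
    for i in range(n):
        for j in range(n):
            r ^= (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return r

def grevmul(a, b, n=64):
    # bin i XOR j, accumulated with XOR
    r = 0
    for i in range(n):
        for j in range(n):
            r ^= (((a >> i) & 1) & ((b >> j) & 1)) << (i ^ j)
    return r
```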

|     | `+`  | `^`     |
|-----|------|---------|
| `+` | imul | clmul   |
| `^` | ???  | grevmul |

That won't be important for the implementation, but it may help you think about what `grevmul` does, and this will be on the test. OK, there is no test, but you can test yourself by explaining why `(popcnt(a & b) & 1) == (grevmul(a, b) & 1)`, based on reasoning about the Cartesian-product algorithm for `grevmul`.
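If you want to check rather than prove: bin 0 receives exactly the pairs with `i ^ j == 0`, i.e. `i == j`, which are precisely the set bits of `a & b`. A quick empirical sketch (the `grevmul` model is my own reference implementation):

```python
import random

def grevmul(a, b, n=64):
    # pair (i, j) goes to bin i ^ j, accumulated with XOR
    r = 0
    for i in range(n):
        for j in range(n):
            r ^= (((a >> i) & 1) & ((b >> j) & 1)) << (i ^ j)
    return r

random.seed(1)
for _ in range(50):
    a = random.getrandbits(64)
    b = random.getrandbits(64)
    # bit 0 of grevmul(a, b) is the parity of popcount(a & b)
    assert (bin(a & b).count("1") & 1) == (grevmul(a, b) & 1)
```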

## Implementing `grevmul` with `GF2P8AFFINEQB`

Some of you may have already seen a version of the code that I am about to discuss, although I have made some changes since. Nothing serious, but while thinking about how the code worked, I encountered some opportunities to polish it up.

`GF2P8AFFINEQB` computes, for each QWORD (so for now, let's concentrate on one QWORD of the result), the product of two bit-matrices, where the left matrix comes from the first input and the right matrix is the transpose-flip (or flip-transpose, however you prefer to think about it) of the second input. You can think of it as a mutant of the `mxor` operation found in Knuth's MMIX instruction set, and from here on I will use the name `bmatxor` for the version of this operation that simply multiplies two 8x8 bit-matrices, with none of that transpose-flip business^{[2]}. There is also a free XOR by a constant byte thrown in at the end, which can be safely ignored by setting it to zero. The transpose-flip, however, needs to be worked around or otherwise taken into account. You may also want to read (Ab)using gf2p8affineqb to turn indices into bits first, for some extra familiarity with / an alternate view of `GF2P8AFFINEQB`.

To implement `grevmul(a, b)`, we need some linear combination (controlled by `b`) of permuted (by `grev`) versions of `a` (the roles of `a` and `b` can be swapped since `grevmul` is commutative, which the Cartesian product-based algorithm makes clear). `GF2P8AFFINEQB` is all about linear combinations, but it works with 8x8 matrices, not 64x64 (put that in AVX-4096) which would have made it really simple. Fortunately, we can slice a `grevmul` however we want.

Now to start in the middle, let's say we have 8 copies of a byte (a byte of `b`), with the `i`'th copy `grev`'ed by `i`, concatenated into a QWORD that I will call `m`. `bmatxor(a, m)` would (thinking of a matrix multiplication AB as forming linear combinations of rows of B), for each row of the result, form a linear combination (controlled by `a`) of `grev`'ed copies of the byte from `b`. That may seem like it would be wrong, since every byte of `a` is done separately and uses the same `m`, so it's "missing" a `grev` by 8 for the second byte, 16 for the third byte, etc. But if `x` is a byte (not if and *only* if, just regular "if"), then `grev(x, 8 * i)` is the same as `x << (8 * i)`, and the second byte is indeed already in the second position anyway, so we get this for free. Thus, `bmatxor(a, m)` would allow us to `grevmul` a 64-bit number by an 8-bit number. If we could just do that 8 times (for each byte of `b`) and combine the results, we'd be done.
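This "64-bit times 8-bit" claim can be checked with a toy model. A sketch with my own reference implementations (`grev`, `grevmul`, `bmatxor` below are illustrative models, not the intrinsics; I pick the convention that byte `i` of a QWORD is row `i` and bit `k` of a row selects row `k` of the other matrix — any consistent convention gives the same identity):

```python
import random

def grev(x, k, n=64):
    # send bit i to bit i ^ k
    return sum(((x >> i) & 1) << (i ^ k) for i in range(n))

def grevmul(a, b, n=64):
    r = 0
    for i in range(n):
        for j in range(n):
            r ^= (((a >> i) & 1) & ((b >> j) & 1)) << (i ^ j)
    return r

def bmatxor(a, m):
    # 8x8 bit-matrix product over GF(2): result row i is the XOR of the
    # rows of m selected by the set bits of row i of a
    out = 0
    for i in range(8):
        arow = (a >> (8 * i)) & 0xFF
        row = 0
        for k in range(8):
            if (arow >> k) & 1:
                row ^= (m >> (8 * k)) & 0xFF
        out |= row << (8 * i)
    return out

random.seed(2)
for _ in range(50):
    a = random.getrandbits(64)
    b0 = random.getrandbits(8)
    # m: 8 copies of the byte b0, copy i grev'ed by i
    m = sum(grev(b0, i, 8) << (8 * i) for i in range(8))
    assert bmatxor(a, m) == grevmul(a, b0)
```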

But we don't have `bmatxor`, we have `GF2P8AFFINEQB` with its built-in transpose-flip, and that presents a choice: either put another `GF2P8AFFINEQB` before, or after, the "main" `GF2P8AFFINEQB` to counter that built-in transpose. Not the whole transpose-flip: let's put the flip aside for now. There is a small reason to favour the "extra `GF2P8AFFINEQB` after the main one" order, namely that it results in `GF2P8AFFINEQB(m, broadcast(a))` (as opposed to `GF2P8AFFINEQB(broadcast(a), m)`), and when `a` comes from memory it can be loaded and broadcasted directly with a `{1to8}` broadcast memory operand. That option would not be available if `a` were the first operand of the `GF2P8AFFINEQB`. This is a small matter, but we have to make the choice somehow, and there seems to be no difference aside from this.

At this point there are two (and a half) pieces of the puzzle left: forming `m`, and horizontally combining 8 results.

## Forming `m`

If we first formed 8 copies of a byte of `b` and then tried to `grev` those by their respective indices, that would be hard. But doing it the other way around is easy: broadcast `b` into each QWORD of a 512-bit vector, `grev` each QWORD by its index, then transpose the whole vector as an 8x8 matrix of bytes. Actually, for constant-reuse (loading fewer humongous vector constants is rarely bad) and because of the built-in transpose-*flip*, it turns out slightly better to transpose-flip that 8x8 matrix of bytes as well, which doesn't cost any more than just transposing it.
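That this produces the right `m` for every byte of `b` at once can be sanity-checked in a toy model (my own reference `grev` and a plain transpose for simplicity, leaving out the flip; `byte` is a hypothetical helper):

```python
import random

def grev(x, k, n=64):
    # send bit i to bit i ^ k
    return sum(((x >> i) & 1) << (i ^ k) for i in range(n))

def byte(x, i):
    return (x >> (8 * i)) & 0xFF

random.seed(3)
b = random.getrandbits(64)
v = [grev(b, q) for q in range(8)]  # broadcast b, grev QWORD q by q
# 8x8 byte transpose: QWORD j of t collects byte j of every v[q]
t = [sum(byte(v[q], j) << (8 * q) for q in range(8)) for j in range(8)]
for j in range(8):
    for q in range(8):
        # QWORD j now holds the 8 grev'ed copies of byte j of b
        assert byte(t[j], q) == grev(byte(b, j), q, 8)
```

This works because a `grev` by 0..7 never moves a bit out of its byte, so byte `j` of `grev(b, q)` is just `grev` of byte `j` of `b`.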

`grev`-ing each QWORD by its index is easy, perhaps easier than it sounds. A `grev` by 0..7 only rearranges the bits within each byte, which is easy to do with a single `GF2P8AFFINEQB` with a constant as the second operand (a "P" step).

An 8x8 transpose-flip is just a `VPERMB` by some index vector.^{[3]}
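The index vector for that `VPERMB` can be generated rather than written out by hand. A sketch, assuming the convention that output byte `8*i + j` (byte `j` of QWORD `i`) takes input byte `i` of QWORD `7 - j`, which matches the `tp_flip` constant in the final code:

```python
# transpose-flip of an 8x8 matrix of bytes, as a VPERMB index vector:
# output byte j of QWORD i takes input byte i of QWORD 7 - j
tp_flip = [8 * (7 - j) + i for i in range(8) for j in range(8)]
assert tp_flip[:8] == [56, 48, 40, 32, 24, 16, 8, 0]  # first output QWORD
assert sorted(tp_flip) == list(range(64))  # it is a permutation
```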

## Combining the results

After the "middle" step, if we went with the "extra `GF2P8AFFINEQB` before the main `GF2P8AFFINEQB`"-route, we would have `bmatxor(a, m)` in each QWORD (with a different `m` per QWORD), which need to be combined. If we were implementing a plain old integer multiplication, the value in the QWORD with the index `i` would be shifted left by `8 * i` and then all of them would be summed up. Since we're implementing a `grevmul`, that value is `grev`'ed by `8 * i` (which is just some byte permutation) and the resulting QWORDs are XORed.
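The combining step relies on `grevmul(a, b)` decomposing over the bytes of `b`: since `8*i + j = 8*i XOR j` for `j < 8`, sending byte `i` of `b` through a per-byte product and then `grev`-ing by `8*i` lands everything in the right bins. A sketch with my own reference models:

```python
import random

def grev(x, k, n=64):
    return sum(((x >> i) & 1) << (i ^ k) for i in range(n))

def grevmul(a, b, n=64):
    r = 0
    for i in range(n):
        for j in range(n):
            r ^= (((a >> i) & 1) & ((b >> j) & 1)) << (i ^ j)
    return r

random.seed(4)
for _ in range(10):
    a = random.getrandbits(64)
    b = random.getrandbits(64)
    acc = 0
    for i in range(8):
        part = grevmul(a, (b >> (8 * i)) & 0xFF)  # 64-bit x 8-bit product
        acc ^= grev(part, 8 * i)  # grev by a multiple of 8 only permutes bytes
    assert acc == grevmul(a, b)
```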

If we go with the "extra `GF2P8AFFINEQB` after the main `GF2P8AFFINEQB`"-route, which I chose, then there is a `GF2P8AFFINEQB` to do before we can start combining QWORDs. We really only need it to un-transpose the result of the "main" `GF2P8AFFINEQB`, the rest is just a byte-permutation and we're about to do a `VPERMB` anyway (if we choose the cool way of XORing the QWORDs together), *but* there is a neat opportunity here: if we re-use the same set of bit-matrices that was used to `grev` `b` by 0..7, in addition to the transpose that we wanted we also permute the bytes such that we `grev` each QWORD by `8 * i` as a bonus.

Now we could just XOR-fold the vector 3 times to end up with the XOR of all eight QWORDs, and that would be a totally valid implementation, but it would also be boring. Alternatively, if we transpose the vector as an 8x8 matrix of bytes again (it can be a transpose-flip, so we get to reuse the same index vector as before), then the i'th byte of each QWORD would be gathered in the i'th QWORD of the transpose and `bmatxor(0xFF, vector)` would XOR together all bytes with the same index and give us a vector that has one byte of the final result per QWORD, easily extractable with `VPMOVQB`. We still don't have `bmatxor` though, we have `bmatxor` with an extra transpose-flip, which can be countered the usual way, with yet another `GF2P8AFFINEQB`.
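The `bmatxor(0xFF, vector)` trick can be illustrated in the same toy model as before (again my own reference `bmatxor`, with `0xFF` meaning a QWORD of all-ones bytes): a row of all ones selects every row of the other operand, so each result row is the XOR of all 8 bytes.

```python
def bmatxor(a, m):
    # 8x8 bit-matrix product over GF(2): result row i is the XOR of the
    # rows of m selected by the set bits of row i of a
    out = 0
    for i in range(8):
        arow = (a >> (8 * i)) & 0xFF
        row = 0
        for k in range(8):
            if (arow >> k) & 1:
                row ^= (m >> (8 * k)) & 0xFF
        out |= row << (8 * i)
    return out

# with all-ones rows, every result row is the XOR of all 8 bytes of v
v = 0x0102030405060708
assert bmatxor(0xFFFFFFFFFFFFFFFF, v) == 0x0808080808080808
```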

As yet another alternative^{[4]}, after similar transpose trickery `bmatxor(vector, 0xFF)` would also XOR together bytes with the same index but leave the result in a form that can be extracted with `VPMOVB2M`, which puts the result in a mask register but it's not too bad to move it from there to a GPR.

## The code

In case anyone makes it this far, here is one possible embodiment^{[5]} of the algorithm described herein:

```c
uint64_t grevmul_avx512(uint64_t a, uint64_t b)
{
    uint64_t id = 0x0102040810204080;
    __m512i grev_by_index = _mm512_setr_epi64(
        id, grev(id, 1), grev(id, 2), grev(id, 3),
        grev(id, 4), grev(id, 5), grev(id, 6), grev(id, 7));
    __m512i tp_flip = _mm512_setr_epi8(
        56, 48, 40, 32, 24, 16, 8, 0,
        57, 49, 41, 33, 25, 17, 9, 1,
        58, 50, 42, 34, 26, 18, 10, 2,
        59, 51, 43, 35, 27, 19, 11, 3,
        60, 52, 44, 36, 28, 20, 12, 4,
        61, 53, 45, 37, 29, 21, 13, 5,
        62, 54, 46, 38, 30, 22, 14, 6,
        63, 55, 47, 39, 31, 23, 15, 7);
    __m512i m = _mm512_set1_epi64(b);
    m = _mm512_gf2p8affine_epi64_epi8(m, grev_by_index, 0);
    m = _mm512_permutexvar_epi8(tp_flip, m);
    __m512i t512 = _mm512_gf2p8affine_epi64_epi8(m, _mm512_set1_epi64(a), 0);
    t512 = _mm512_gf2p8affine_epi64_epi8(grev_by_index, t512, 0);
    t512 = _mm512_permutexvar_epi8(tp_flip, t512);
    t512 = _mm512_gf2p8affine_epi64_epi8(_mm512_set1_epi64(0x8040201008040201), t512, 0);
    t512 = _mm512_gf2p8affine_epi64_epi8(t512, _mm512_set1_epi64(0xFF), 0);
    return _mm512_movepi8_mask(t512);
}
```

## The end

As far as I know, no one really uses `grevmul` for anything, so being able to compute it somewhat efficiently (more efficiently than a naive scalar solution at least) is not immediately useful. On the other hand, if an operation is not known to be efficiently computable, that may preclude its use. But the point of this post is more to show something neat.

Originally I had found the sequence of `_mm512_gf2p8affine_epi64_epi8(m, _mm512_set1_epi64(a), 0)` and `_mm512_gf2p8affine_epi64_epi8(grev_by_index, t512, 0)` as a solution to `grevmul`-ing a QWORD by a (constant) byte, using a SAT solver (shout-out to togasat for being easy to add to a project - though admittedly I eventually bit the bullet and switched to MiniSAT). That formed the starting point of this investigation / puzzle-solving session. It may be possible to pull a more complete solution out of the void with a SAT/SMT-based technique such as CEGIS; perhaps the bitwise-bilinear nature of `grevmul` can be exploited (I used the bitwise-linear nature of `grevmul`-by-a-constant in my SAT experiments to represent the problem as a composition of matrices over GF(2)).

Almost half of the steps of this algorithm are *some* kind of transpose, which has also been the case with some other SIMD algorithms that I recently had a hand in. I used to think of a transpose as "not really doing anything", barely worth the notation when doing linear algebra, but maybe I was wrong.

[1] Maybe it's less "the name" and more "what I decided to call it". I'm not aware of any established name for this operation.

[2] This makes it sound more negative than it really is. The transpose-flip often needs to be worked around when we don't want it, but that's not *that* bad. Having no easy access to a transpose would be much worse to work around when we *do* need it. Separate `bmatxor` and `bmattranspose` instructions would have been nice.

[3] `GF2P8AFFINEQB` trickery is nice, but when I recently wrote some AVX2 code it was `VPERMB` that I missed the most.

[4] Can we stop with the alternatives and just pick something?

[5] No patents were read in the development of this algorithm, nor in the writing of this blog post.