Saturday, 23 September 2023

Permuting bits with GF2P8AFFINEQB

It's no secret that GF2P8AFFINEQB can be tricky to think about, even in the restricted context of bit-permutations. Thinking about more than one step (such as more than one GF2P8AFFINEQB back-to-back, or GF2P8AFFINEQB flanked by byte-wise shuffles) is just too much. Or perhaps you can do it, tell me your secret.

A good way for mere mortals to reason about these kinds of permutations, I think, is to think in terms of the bits of the indices of the bits that are really being permuted. So we're 4 levels deep:

  1. The value whose bits are being permuted.
  2. The bits that are being permuted.
  3. The indices of those bits.
  4. The bits of those indices.

This can get a little confusing because a lot of the time the operation that will be performed on the bits of those indices is a permutation again, but they don't have to be, another classic example is that a rotation corresponds to add/subtracting a constant to the indices. Just keep in mind that we're 4 levels deep the entire time.

Actually we don't need to go deeper.

The building blocks

Assuming we have 512 bits to work with, the indices of those bits are 0..511: 9-bit numbers. We will split that into 3 groups of 3 bits, denoted a,b,c where a locates a QWORD in the 512-bit register, b locates a byte within that QWORD, and c locates a bit within that byte.

Here are some nice building blocks (given fairly arbitrary names):

  • Pf(a,b,c) = a,b,f(c) aka "right GF2P8AFFINEQB", where f is any mapping from a 3-bit integer to a 3-bit integer. This building block can be implemented with _mm512_gf2p8affine_epi64_epi8(input, _mm512_set1_epi64(f_as_a_reversed_matrix), 0)
  • Qf(a,b,c) = a,f(c),~b aka "left GF2P8AFFINEQB", where ~b is a 3-bit inversion, equivalent to 7 - b. f can often be the identity mapping, swapping the second and third groups of bits is useful on its own (the "bonus" inversion can be annoying to deal with). This building block can be implemented with _mm512_gf2p8affine_epi64_epi8(_mm512_set1_epi64(f_as_a_matrix), input, 0)
  • Sg(a,b,c) = g(a,b),c aka Shuffle, where g is any mapping from a 6-bit integer to a 6-bit integer. This building block can be implemented with _mm512_permutexvar_epi8(g_as_an_array, input), but in some cases also with another instruction that you may prefer, depending on the mapping.

S, though it doesn't touch c, is quite powerful. As a couple of special cases that may be of interest, it can be used to swap a and b, invert a or b, or do a combined swap-and-invert.

We could further distinguish:

  • S64f(a,b,c) = f(a),b,c aka VPERMQ. This building block can be implemented with, you guessed it, VPERMQ.
  • S8f(a,b,c) = a,f(b),c aka PSHUFB. This building block can be implemented with, you guessed it, PSHUFB. PSHUFB allows a bit more freedom than is used here, the mapping could be from 4-bit integers to 4-bit integers, but that's not nice to think about in this framework of 3 groups of 3 bits.

Building something with the blocks

Let's say that we want to take a vector of 8 64-bit integers, and transpose it into a vector of 64 8-bit integers such that the k'th bit of the n'th uint64 ends up in the n'th bit of the k'th uint8. In terms of the bits of the indices of the bits (I swear it's not as confusing as it sounds) that means we want to build something that maps a,b,c to b,c,a. It's immediately clear that we need a Q operation at some point, since it's the only way to swap some other groups of bits into the 3rd position. But if we start with a Q, we get ~b in the 3rd position while we need a. We can solve that by starting with an S that swaps a and b while also inverting a (I'm not going to bother defining what that looks like in terms of an index mapping function, just imagine that those functions are whatever they need to be in order to make it work):

Sf(a,b,c) = b,~a,c
Qid(b,~a,c) = b,c,a

Which translates into code like this:

__m512i Transpose8x64(__m512i x)
{
    x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
        56, 48, 40, 32, 24, 16, 8, 0,
        57, 49, 41, 33, 25, 17, 9, 1,
        58, 50, 42, 34, 26, 18, 10, 2,
        59, 51, 43, 35, 27, 19, 11, 3,
        60, 52, 44, 36, 28, 20, 12, 4,
        61, 53, 45, 37, 29, 21, 13, 5,
        62, 54, 46, 38, 30, 22, 14, 6,
        63, 55, 47, 39, 31, 23, 15, 7), x);
    __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
    x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
    return x;
}

Now let's say that we want to do the inverse of that, going back from b,c,a to a,b,c. Again it's clear that we need a Q, but we have some choice now. We could start by inverting the c in the middle first:

S8f1(b,c,a) = b,~c,a
Qid(b,~c,a) = b,a,c
Sf2(b,a,c) = a,b,c

Which translates into code like this:

__m512i Transpose64x8(__m512i x)
{
    x = _mm512_shuffle_epi8(x, _mm512_setr_epi8(
        7, 6, 5, 4, 3, 2, 1, 0,
        15, 14, 13, 12, 11, 10, 9, 8,
        23, 22, 21, 20, 19, 18, 17, 16,
        31, 30, 29, 28, 27, 26, 25, 24,
        39, 38, 37, 36, 35, 34, 33, 32,
        47, 46, 45, 44, 43, 42, 41, 40,
        55, 54, 53, 52, 51, 50, 49, 48,
        63, 62, 61, 60, 59, 58, 57, 56));
    __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
    x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
    x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
        0, 8, 16, 24, 32, 40, 48, 56,
        1, 9, 17, 25, 33, 41, 49, 57,
        2, 10, 18, 26, 34, 42, 50, 58,
        3, 11, 19, 27, 35, 43, 51, 59,
        4, 12, 20, 28, 36, 44, 52, 60,
        5, 13, 21, 29, 37, 45, 53, 61,
        6, 14, 22, 30, 38, 46, 54, 62,
        7, 15, 23, 31, 39, 47, 55, 63), x);
    return x;
}

Or we could start with a Q to get the a out of the third position, then use an S to swap the first and second positions and a P to invert c (in any order).

Qid(b,c,a) = b,a,~c
Sf1(b,a,~c) = a,b,~c
Pf2(a,b,~c) = a,b,c

Which translates into code like this:

__m512i Transpose64x8(__m512i x)
{
    __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
    x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
    x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
        0, 8, 16, 24, 32, 40, 48, 56,
        1, 9, 17, 25, 33, 41, 49, 57,
        2, 10, 18, 26, 34, 42, 50, 58,
        3, 11, 19, 27, 35, 43, 51, 59,
        4, 12, 20, 28, 36, 44, 52, 60,
        5, 13, 21, 29, 37, 45, 53, 61,
        6, 14, 22, 30, 38, 46, 54, 62,
        7, 15, 23, 31, 39, 47, 55, 63), x);
    x = _mm512_gf2p8affine_epi64_epi8(x, idmatrix, 0);
    return x;
}

I will probably keep using a SAT solver to solve the masks (using the same techniques as in (Not) transposing a 16x16 bitmatrix), but now at least I have a proper way to think about the shape of the solution, which makes it a lot easier to ask a SAT solver to fill in the specifics.

This framework could be extended with other bit-permutation operatations such as QWORD rotates, but that quickly becomes tricky to think about.

Sunday, 2 July 2023

Propagating bounds through bitwise operations

This post is meant as a replacement/recap of some work that I did over a decade ago on propagating bounds through bitwise operations, which was intended as an improvement over the implementations given in Hacker's Delight chapter 4, Arithmetic Bounds.

The goal is, given two variables x and y, with known bounds a ≤ x ≤ b, c ≤ y ≤ d, compute the bounds of x | y and of x & y. Thanks to De Morgan, we have the equations (most also listed in Hacker's Delight, except the last one)

  • minAND(a, b, c, d) = ~maxOR(~b, ~a, ~d, ~c)
  • maxAND(a, b, c, d) = ~minOR(~b, ~a, ~d, ~c)
  • minXOR(a, b, c, d) = minAND(a, b, ~d, ~c) | minAND(~b, ~a, c, d)
  • maxXOR(a, b, c, d) = maxOR(a, b, c, d) & ~minAND(a, b, c, d)

Everything can be written in terms of only minOR and maxOR and some basic operations.

maxOR

To compute the upper bound of the OR of x and y, what we need to do is find is the leftmost bit (henceforth the "target bit") such that it is both:

  1. set in both b and d (the upper bounds of x and y) and,
  2. changing an upper bound (either one of them, doesn't matter, but never both) by resetting the target bit and setting the bits that are less significant, keeps it greater-or-equal than the corresponding lower bound.

The explanation of why that works can be found in Hacker's Delight, along with a more of less direct transcription into code, but we can do better than a direct transcription.

Finding the leftmost bit that passes only the first condition would be easy, its the highest set bit in b & d. The second condition is a bit more complex to handle, but still surprisingly easy thanks to one simple observation: the bits that can pass it, are precisely those bits that are at (or to the right of) the leftmost bit where the upper and lower bound differ. Imagine two numbers in binary, one being the lower bound and the other the upper bound. The number have some equal prefix (possibly zero bits long, up to all bits) and then if they differ, they must differ by a bit in the upper bound being 1 while the corresponding bit in the lower bound is 0. Lowering the upper bound by resetting that bit while setting all bits the right of it, cannot make it lower than the lower bound.

For one of the inputs, say x, the position at which that second condition start being false (looking at that bit and to the left of it) can be computed directly with 64 - lzcnt(a ^ b). We actually need the maximum of that across both pairs of bounds, but there's no need to compute that for both bounds and then take the maximum, we can use this to let the lzcnt find the maximum automatically: 64 - lzcnt((a ^ b) | (c ^ d)).

bzhi(m, k) is an operation that resets the bits in m starting at index k. It can be emulated by shifting or masking, but an advantage of bzhi is that it is well defined for any relevant k, including when k is equal to the size of the integer in bits. bzhi is not strictly required here, but it is more convenient than "classic" bitwise operations, and available on most x64 processors today[1]. Using bzhi, it's simple to take the position calculated in the previous paragraph and reset all the bits in b & d that do not pass the second condition: bzhi(b & d, 64 - lzcnt((a ^ b) | (c ^ d))).

With that bitmask in hand, all we need to do is apply it to one of the upper bounds. We can skip the "reset the target bit" part, since that bit will be set in the other upper bound and therefore also in the result. It also does not matter which upper bound is changed, regardless of which bound we were conceptually changing. Let's pick b for no particular reason. Then in total, the implementation could be:

uint64_t maxOR(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t index = 64 - _lzcnt_u64((a ^ b) | (c ^ d));
    uint64_t candidates = _bzhi_u64(b & d, index);
    if (candidates) {
        uint64_t target = highestSetBit(candidates);
        b |= target - 1;
    }
    return b | d;
}

For the highestSetBit function you can choose any way you like to isolate the highest set bit in an integer.

minOR

Computing the lower bound of x | y surprisingly seems to be more complex. The basic principles are similar, but this time bits are being reset in one of the lower bounds, and it does matter in which lower bound that happens. The computation of the mask of candidate bits also "splits" into separate candidates for each lower bound, unless there's some trick that I've missed. This whole "splitting" thing cannot be avoided by defining minOR in terms of maxAND either, because the same things happen there. But it's not too bad, a little bit of extra arithmetic. Anyway, let's see some code.

uint64_t minOR(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t candidatesa = _bzhi_u64(~a & c, 64 - _lzcnt_u64(a ^ b));
    uint64_t candidatesc = _bzhi_u64(a & ~c, 64 - _lzcnt_u64(c ^ d));
    uint64_t target = highestSetBit(candidatesa | candidatesc);
    if (a & target) {
        c &= -target;
    }
    if (c & target) {
        a &= -target;
    }
    return a | c;
}

A Fun Fact here is that the target bit cannot be set in both bounds, opposite to what happens in maxOR where the target bit is always set in both bounds. You may be tempted to turn the second if into else if, but in my tests it was quite important that the ifs are compiled into conditional moves rather than branches (which of the lower bounds the target bit is found in was essentially random), and using else if here apparently discourages compilers (MSVC at least) from using conditional moves.

candidatesa | candidatesc can be zero, although that is very rare, at least in my usage of the function. As written, the code assumes that highestSetBit deals with that gracefully by returning zero if its input is zero. Branching here is (unlike in the two ifs at the end of minOR) not a big deal since this case is so rare (and therefore predictable).

Conclusion

In casually benchmarking these functions, I found them to be a bit faster than the ones that I came up with over a decade ago, and significantly faster than the ones from Hacker's Delight. That basic conclusion probably translates to different scenarios, but the exact ratios will vary a lot based on how predictable the branches are in that case, on your CPU, and on arbitrary codegen decisions made by your compiler.

In any case these new versions look nicer to me.

There are probably much simpler solutions if the bounds were stored in bit-reversed form, but that doesn't seem convenient.

Someone on a certain link aggregation site asked about signed integers. As Hacker's Delight explains via a table, things can go wrong if one (or both) bounds cross the negative/positive boundary - but the solution in those cases is still easy to compute. The way I see it, the basic problem is that a signed bound that crosses the negative/positive boundary effectively encodes two different unsigned intervals, one starting at zero and one ending at the greatest unsigned integer, and the basic unsigned minOR and so on cannot (by themselves) handle those "split" intervals.


[1] Sadly not all, some low-end Intel processors have AVX disabled, which apparently is done by disabling the entire VEX encoding and it takes out BMI2 as collateral damage.

Monday, 12 June 2023

Some ways to check whether an unsigned sum wraps

When computing x + y, does the sum wrap? There are various ways to find out, some of them well known, some less. Some of these are probably totally unknown, in some cases deservedly so.

This is not meant to be an exhaustive list.

  • ~x < y

    A cute trick in case you don't want to compute the sum, for whatever reason.

    Basically a variation of how MISRA C recommends checking for wrapping if you use the precondition test since ~x = UINT_MAX - x.

  • (x + y) < x

    (x + y) < y

    The Classic™. Useful for being cheap to compute: x + y is often needed anyway, in which case this effectively only costs a comparison. Also recommended by MISRA C, in case you want to express the "has the addition wrapped"-test as a postcondition test.

  • avg_down(x, y) <s 0

    avg_down(x, y) & 0x80000000 // adjust to integer size

    <s is signed-less-than. Performed on unsigned integers here, so be it.

    avg_down is the unsigned average rounded down. avg_up is the unsigned average rounded up.

    Since avg_down is the sum (the full sum, without wrapping) shifted right by 1, what would have been the carry out of the top of the sum becomes the top bit of the average. So, checking the top bit of avg_down(x, y) is equivalent to checking the carry out of x + y.

    Can be converted into avg_up(~x, ~y) >=s 0 through the equivalence avg_down(x, y) = ~avg_up(~x, ~y).

  • (x + y) < min(x, y)

    (x + y) < max(x, y)

    (x + y) < avg(x, y)

    (x + y) < (x | y)

    (x + y) < (x & y)

    ~(x | y) < (x & y)

    Variants of The Classic™. They all work for essentially the same reason: addition is commutative, including at the bit-level. So if we have (x + y) < x, then we also have (x + y) < y, and together they imply that instead of putting x or y on the right hand side of the comparison, we could arbitrarily select one of them, or anything between them too. Bit-level commutativity takes care of the bottom three variants in a similar way.

    Wait, is that a signed or unsigned min? Does avg round up or down or either way depending on the phase of the moon? It doesn't matter, all of those variants work and more.

  • (x + y) != addus(x, y)

    (x + y) < addus(x, y)

    addus is addition with unsigned saturation, meaning that instead of wrapping the result would be UINT_MAX.

    When are normal addition and addition with unsigned saturation different? Precisely when one wraps and the other saturates. Wrapping addition cannot "wrap all the way back" to UINT_MAX, the highest result when the addition wraps is UINT_MAX + UINT_MAX = UINT_MAX - 1.

    When the normal sum and saturating sum are different, the normal sum must be the smaller of the two (it certainly couldn't be greater than UINT_MAX), hence the second variant.

  • subus(y, ~x) != 0

    subus(x, ~y) != 0

    addus(~x, ~y) != UINT_MAX

    subus is subtraction with unsigned saturation.

    Strange variants of ~x < y. Since subus(a, b) will be zero when a <= b, it will be non-zero when b < a, therefore subus(y, ~x) != 0 is equivalent to ~x < y.

    subus(a, b) = ~addus(~a, b) lets us turn the subus variant into the addus variant.

  • (x + y) < subus(y, ~x)

    Looks like a cursed hybrid of (x + y) < avg(x, y) and subus(y, ~x) != 0, but the mechanism is (at least the way I see it) different from both of them.

    subus(y, ~x) will be zero when ~x >= y, which is exactly when the sum x + y would not wrap. x + y certainly cannot be unsigned-less-than zero, so overall the condition (x + y) < subus(y, ~x) must be false (which is good, it's supposed to be false when x + y would not wrap).

    In the other case, when ~x < y, we know that x + y will wrap and subus(y, ~x) won't be zero (and therefore cannot saturate). Perhaps there is a nicer way to show what happens, but at least under those conditions (predictable wrapping and no saturation) it is easy to do algebra:

    • (x + y) < subus(y, ~x)
    • x + y - 2k < y - (2k - 1 - x)
    • x + y - 2k < y - 2k + 1 + x
    • x + y < y + 1 + x
    • 0 < 1

    So the overall condition (x + y) < subus(y, ~x) is true IFF x + y wraps.

  • ~x < avg_up(~x, y)

    Similar to ~x < y, but stranger. Averaging y with ~x cannot take a low y to above ~x, nor a high y to below ~x. The direction of rounding is important: avg_down(~x, y) could take an y that's just one higher than ~x down to ~x itself, making it no longer higher than ~x. avg_up(~x, y) cannot do that thanks to rounding up.

Monday, 22 May 2023

grevmul

grev (generalized bit-reverse) is an operation that implements bit-permutations corresponding to XOR-ing the indices by some value. It has been proposed to be part of the Zbp extension of RISC-V, with this reference implementation (source: release v0.93)

uint32_t grev32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 31;
    if (shamt &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (shamt &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (shamt &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (shamt &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (shamt & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}

grev looks in some ways similar to bit-shifts and rotates: the left and right operands have distinct roles with the right operand being a mask of k bits if the left operand has 2k bits[1].

Carry-less multiplication normally has a left-shift in it, grevmul is what you get when that left-shift is replaced with grev.

uint32_t grevmul32(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int k = 0; k < 32; k++) {
        if (y & (1 << k))
            r ^= grev32(x, k);
    }
    return x;
}

grevmul is, at its core, very similar to clmul: take single-bit products (logical AND) of every bit of the left operand with every bit of the right operand, then do some XOR-reduction. The difference is in which partial products are grouped together. For clmul, the partial products that contribute to bit k of the result are pairs with indices i,j such that i + j = k. For grevmul, it's the pairs with indices such that i ^ j = k. This goes back to grev permuting the bits by XOR-ing their indices by some value, and that value is k here.

Now that grevmul has been defined, let's look at some of its properties, comparing it to clmul and plain old imul.

grevmul clmul imul
zero[2] 0 0 0
identity 1 1 1
commutative yes yes yes
associative yes yes yes
distributes over xor xor addition
op(x, 1 << k) is grev(x, k) x << k x << k
x has inverse if
popcnt(x) & 1 x & 1 x & 1
op(x, x) is popcnt(x) & 1 pdep(x, 0x55555555)

What is the "grevmul inverse" of x?

Time for some algebra. Looking just at the table above, and forgetting the actual definition of grevmul, can we say something about the solutions of grevmul(x, y) == 1? Surprisingly, yes.

Assuming we have some x with odd hamming weight (numbers with even hamming weight do not have inverses, so let's ignore them for now), we know that grevmul(x, x) == 1. The inverse in a monoid is unique so x is not just some inverse of x, it is the (unique) inverse of x.

Since the "addition operator" is XOR (for which negation is the identity function), this is a non-trivial example of a ring in which x = -x = x-1, when x-1 exists. Strange, isn't it?

We also have that f(x) = grevmul(x, c) (for appropriate choices of c) is a (non-trivial) involution, so it may be a contenter for the "middle operation" of an involutary bit finalizer, but probably useless without an efficient implementation.

I was going to write about implementing grevmul by an 8-bit constant with two GF2P8AFFINEQBs but I've had enough for now, maybe later. E: see Implementing grevmul with GF2P8AFFINEQB where I went ahead and implemented the whole thing, not only the "multiply by 8-bit constant" case.


[1] The right operand of a shift is often called the shift count, but it can also be interpreted as a mask indicating some subset of shift-by-2i operations to perform. That interpretation is useful for example when implementing a shift-by-variable operation on a machine that only has a shift-by-constant instruction, following the same pattern as the reference implementation of grev32.

[2] This looks like a joke, but I mean that the numeric value 0 acts as the zero element of the corresponding semigroup.

Wednesday, 12 April 2023

Transposing a 16x16 bitmatrix

Inverting a 16-element permutation may done like this:

for (int i = 0; i < 16; i++)
    inv[perm[i]] = i;

Computing a histogram of 16 nibbles may done like this:

for (int i = 0; i < 16; i++)
    hist[data[i]] += 1;

These different-sounding but already similar-looking tasks have something in common: they can be both be built around a 16x16 bitmatrix transpose. That sounds silly, why would anyone want to first construct a 16x16 bitmatrix, transpose it, and then do yet more processing to turn the resulting bitmatrix back into an array of numbers?

Because it turns out to be an efficiently-implementable operation, on some modern processors anyway.

If you know anything about the off-label application of GF2P8AFFINEQB, you may already suspect that it will be involved somehow (merely left-GF2P8AFFINEQB-ing by the identity matrix already results in some sort of 8x8 transpose, just horizontally mirrored), and it will be, but that's not the whole story.

First I will show not only how to do it with GF2P8AFFINEQB, but also how to find that solution programmatically using a SAT solver. There is nothing that fundamentally prevents a human from finding a solution by hand, but it seems difficult. Using a SAT solver to find a solution ex nihilo (requiring it to find both a sequence of instructions and their operands) is not that easy either (though that technique also exists). Thankfully, Geoff Langdale suggested a promising sequence of instructions:

The problem we have now (which the SAT solver will solve) is, under the constraint that for all X, f(X) = PERMB(GF2P8AFFINE(B, PERMB(X, A)), C) computes the transpose of X, what is a possible valuation of the variables A, B, C. Note that the variables in the SAT problem correspond to constants in the resulting code, and the variable in the resulting code (X) is quantified out of the problem.

If you know a bit about SAT solving, that "for all X" sounds like trouble, requiring either creating a set of constraints for every possible value of X (henceforth, concrete values of X will be known as "examples"), or some advanced technique such as CEGIS to dynamically discover a smaller set of examples to base the constraints on. Luckily, since we are dealing with a bit-permutation, there are simple and small sets of examples that together sufficiently constrain the problem. For a 16-bit permutation, this set of values could be used:

  • 1010101010101010
  • 1100110011001100
  • 1111000011110000
  • 1111111100000000

For a 256-bit permutation, a similar pattern can be used, where each of the examples has 256 bits and there would be 8 of them. Note that if you read the columns of the values, they list out the indices of the corresponding columns, which is no coincidence. Using that set of examples to constrain the problem with, essentially means that we assert that f when applied to the sequence 0..n-1 must result in the desired permutation. The way that I actually implemented this puts a column into one "abstract bit", so that it represents the index of the bit all in one place instead of spread out.

Implementing a "left GF2P8AFFINEQB" (multiplying a constant matrix on the left by a variable matrix on the right) in CNF, operating on "abstract bits" (8 variables each), is relatively straight forward. Every (abstract) bit of the result is the XOR of the AND of some (abstract) bits, writing that down is mostly a chore, but there is one interesting aspect: the XOR can be turned into an OR, since we know that we're multiplying by a permutation matrix. In CNF, OR is simpler than XOR, and easier for the solver to reason through.

VPERMB is more difficult to implement, given that the permutation operand is a variable (if it was a constant, we could just permute the abstract bits without generating any new constraints). To make it easier, I represent the permutation operand as a 32x32 permutation matrix, letting me create a bunch of simple ternary constraints of the form (¬P(i, j) ∨ ¬A(j) ∨ R(i)) ∧ (¬P(i, j) ∨ A(j) ∨ ¬R(i)) (read: if P(i, j), then A(j) must be equal to R(i)). The same thing can be used to implement VPSHUFB, with additional constraints on the permutation matrix (to prevent cross-slice movement).

Running that code, at least on my PC at this time[1], results in (with some whitespace manually added):

__m256i t0 = _mm256_permutexvar_epi8(_mm256_setr_epi8(
    14, 12, 10, 8, 6, 4, 2, 0,
    30, 28, 26, 24, 22, 20, 18, 16,
    15, 13, 11, 9, 7, 5, 3, 1,
    31, 29, 27, 25, 23, 21, 19, 17), input);
__m256i t1 = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x1080084004200201), t0, 0);
__m256i t2 = _mm256_shuffle_epi8(t1, _mm256_setr_epi8(
    0, 8, 1, 9, 3, 11, 5, 13,
    7, 15, 2, 10, 4, 12, 6, 14,
    0, 8, 1, 9, 3, 11, 5, 13,
    7, 15, 2, 10, 4, 12, 6, 14));

So that's it. That's the answer[2]. If you want to transpose a 16x16 bitmatrix, on a modern PC (this code requires AVX512_VBMI and AVX512_GFNI[3]), it's fairly easy and cheap, it's just not so easy to find this solution to begin with.

Using this transpose to invert a 16-element permutation is pretty easy, for example using _mm256_sllv_epi16 to construct the matrix and _mm256_popcnt_epi16(_mm256_sub_epi16(t2, _mm256_set1_epi16(1))) (sadly there is no SIMD version of TZCNT .. yet) to convert the bit-masks back into indices. It may be tempting to try to use a mirrored matrix and leading-zero count, which AVX512 does offer, but it only offers the DWORD and QWORD versions VPLZCNTD/Q.

Making a histogram is even simpler, using only _mm256_popcnt_epi16(t2) to convert the matrix into counts.


[1] While MiniSAT (which this program uses as its SAT solver) is a "deterministic solver" in the sense of definitely finding a satifying valuation if there is one, it is not deterministic in the sense of guaranteeing that the same satisfying valuation is found every time the solver is run on the same input.

[2] Not the unique answer, there are multiple solutions.

[3] But not 512-bit vectors.

Sunday, 9 April 2023

Column-oriented row reduction

In mathematics, Gaussian elimination, also known as row reduction, is an algorithm for solving systems of linear equations.

The usual implementation of row reduction of a bit-matrix[1] takes an array of bit-packed rows, and applies row operations to the matrix in a row-wise manner. As a reminder, it works something like this:

  1. Find a row with a non-zero element[2] in the current column.
  2. Swap it into position.
  3. XOR that row into any other row that has a non-zero element in the current column.
  4. Go to step 1 for the next column, if there is one.

For the application that I had mind when writing this blog post[3], step 2 is not necessary and will be skipped.[4]

Note that at the single-bit-level, the XOR-logic of step 3 comes down to if M(i, j) and M(j, k) then flip M(i, k). That is, in order for the XOR to flip M(i, k), the corresponding row must have a 1 in the current column (condition A), and the row that was selected to pivot on must have a 1 in the column k (condition B).

Turning it into a column-oriented algorithm

In the row-oriented algorithm, condition A is handled by conditionally XOR-ing into only those rows that have a 1 in the current column, and condition B is handled by the XOR itself (which is a conditional bit-flip afterall). A column-oriented algorithm would do it the other way around, using XOR to implement condition B, and skipping columns for which condition A is false:

  1. Find a row that hasn't already been pivoted on, with a non-zero element in the current column.
  2. Don't swap it into position.
  3. XOR each column that has a 1 in the row that we're pivoting on, with a mask formed by taking the current column and resetting the bit that corresponds to the pivot row.
  4. If there are any rows left that haven't been pivoted on, go to step 1 for the next column.

Step 1 may sound somewhat annoying, but it is really simple: AND the current column with a mask of the rows that have already been pivoted on, had, and use the old Isolate Lowest Set Bit[5] trick x & -x to extract a mask corresponding to the first row that passes both conditions.

Here is an example of what an implementation might look like (shown in C#, or maybe this is Java code with a couple of capitalization errors)

static void reduceRows(long[] M)
{
    long had = 0;
    for (int i = 0; i < M.Length; i++)
    {
        long r = lowestSetBit(M[i] & ~had);
        if (r == 0)
            continue;
        long mask = M[i] & ~r;
        for (int j = 0; j < M.Length; j++)
        {
            if ((M[j] & r) != 0)
                M[j] ^= mask;
        }
        had |= r;
    }
}

static long lowestSetBit(long x)
{
    return x & -x;
}

The stuff that goes at the end of the post

While the column-oriented approach makes the step "search for a row that has a 1 in the current column" easy, transposing a matrix just to be able to do this does not seem worthwhile to me.

Despite the if inside the inner loops of both the row-oriented and the column-oriented algorithm, both of them can be accellerated using SIMD. The conditional code can easily be rewritten into arithmetic-based masking[6] or "real" masked operations (in eg AVX512). "But don't compilers autovectorize these days?" Yes they do, but not always in a good way, note that both GCC and Clang both used a masked store, which is worse than writing back some unchanged values with a normal store (especially on AMD processors, and also especially considering that the stored values will be loaded again soon). Rumour has it that there are Reasons™ for that quirk, but that doesn't improve the outcome.


[1] Of course this blog post is not about floating point matrices, this is the bitmath blog.

[2] Sometimes known as 1 (one).

[3] XOR-clause simplification

[4] Row-swaps are particularly expensive to apply to a column-major bit-matrix, so perhaps this is just an excuse to justify the column-based approach. You decide.

[5] AKA BLSI, which AMD describes as "Isolate Lowest Set Bit", which makes sense, because well, that is what the operation does. Meanwhile Intel writes "Extract Lowest Set Isolated Bit" (isolated bit? what?) in their manuals.

[6] Arithmetic-based masking can also be used in scalar code.

Tuesday, 3 January 2023

Weighted popcnt

Since I showed the code below on Twitter, and some people understood it and some didn't, I suppose I should explain how it works and what the more general technique is.

// sum of indexes of set bits
int A073642(uint64_t n)
{
    return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
          (__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);
}

A sum of several multi-digit numbers, such as 12 + 34, can be rewritten to make the place-value notation explicit, giving 1*10 + 2*1 + 3*10 + 4*1. Then the distributive property of multiplication can be used to group digits of equal place-value together, giving (1 + 3)*10 + (2 + 4)*1. I don't bring up something so basic as an insult to the reader, the basis of that possibly-intimidating piece of code really is this basic.

The masks 0xAAAAAAAAAAAAAAAA, 0xCCCCCCCCCCCCCCCC, etc, represent nothing more than the numbers 0..63, each written vertically in binary, starting with the least significant bit. Each column of bits contains its own index. AND-ing the masks with n leaves, in each column, either zero (if the corresponding bit of n was zero) or the index of that column.

The first popcnt then adds together all the digits of the "filtered" indexes with a place-value of 1, the second popcnt adds together the digits with a place-value of 2, and so on. The result of each popcnt is shifted to match the corresponding place-value, and all of them are added together.

Generalizing the technique

The numbers in the columns did not have to be 0..63, the columns can contain arbitrary numbers, even negative numbers. This gives some sort of weighted popcnt: rather than each bit having a weight of 1, we are free to choose an arbitrary weight for each bit.

In general, starting with some arbitrary numbers, interpret them as a binary matrix. Transpose that matrix. Every row of the transposed matrix is the mask that goes into the corresponding popcnt-step.

Some simplifications may be possible:

  • If a row of the matrix is zero, the corresponding step can be left out.
  • If two or more rows are equal, their popcnt-steps can be merged into one with a weight that is the sum of the weights of the steps that are merged.
  • If a row is a linear combination of some set of other rows, the popcnt corresponding to that row can be computed as a linear combination of the popcnts corresponding to those other rows.
  • If a row has exactly one bit set, its step can be implemented with an AND and a shift, without popcnt, by shifting the bit to its target position.
  • If every row has exactly one bit set, we're really dealing with a bit-permutation and there are better ways to implement them.

For example, let's construct "the sum of squares of indexes of set bits" (with 1-indexed indexes) aka oeis.org/A003995. Transposing the squares 1,2,4,9..4096 gives these masks:

0x5555555555555555
0x0000000000000000
0x2222222222222222
0x1414141414141414
0x0d580d580d580d58
0x0335566003355660
0x00f332d555a66780
0x555a5b6666387800
0x66639c78783f8000
0x787c1f807fc00000
0x7f801fff80000000
0x7fffe00000000000
0x8000000000000000

The second mask is zero because a square cannot be congruent to 2 or 3 modulo 4. The last mask has exactly one bit set, so its step can be implemented without popcnt. Putting it together:

int A003995(uint64_t n)
{
    return __popcnt64(n & 0x5555555555555555) +
          (__popcnt64(n & 0x2222222222222222) << 2) +
          (__popcnt64(n & 0x1414141414141414) << 3) +
          (__popcnt64(n & 0x0d580d580d580d58) << 4) +
          (__popcnt64(n & 0x0335566003355660) << 5) +
          (__popcnt64(n & 0x00f332d555a66780) << 6) +
          (__popcnt64(n & 0x555a5b6666387800) << 7) +
          (__popcnt64(n & 0x66639c78783f8000) << 8) +
          (__popcnt64(n & 0x787c1f807fc00000) << 9) +
          (__popcnt64(n & 0x7f801fff80000000) << 10) +
          (__popcnt64(n & 0x7fffe00000000000) << 11) +
          ((n & 0x8000000000000000) >> 51);
}

Aside from implementing strange sequences from the OEIS, possible uses of this technique may include

  • Summing the face-values of a hand of cards. Unfortunately this cannot take "special combinations" into account, unlike other techniques.
  • A basic evaluation of a board state in Reversi/Othello, based on weights for each captured square that differ by their position.
  • Determining the price for a pizza based on a set of selected toppings, I don't know, I'm grasping at straws here.

Addendum

What if the input is small, and the weights are also small? Here is a different trick that is applicable in some of those cases, depending on whether the temporary result fits in 64 bits. The trick this time is much simpler, or at least sounds much simpler: for bit i of weight w[i], make w[i] copies of bit i, then popcnt everything.

A common trick to make k copies of only one bit is (bit << k) - bit. Assuming that pdep exists and is efficient, that trick can be generalized to making different numbers of copies of different bits. The simplest version of that trick would sacrifice one bit of padding per input bit, which may be acceptable depending on whether that all still fits in 64 bits. For example A073642 with a 10-bit input would work:

// requires: n is a 10-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n, 0x0040100808104225);
    uint64_t bot = _pdep_u64(n, 0x000020101020844b);
    return __popcnt64(top - bot);
}

That can be extended to an 11-bit input like this:

// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n >> 1, 0x0040100808104225);
    uint64_t bot = _pdep_u64(n >> 1, 0x000020101020844b);
    return __popcnt64(top - bot | top);
}

Or like this

// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n, 0x0040100808104225) >> 1;
    uint64_t bot = _pdep_u64(n, 0x008020101020844b) >> 1;
    return __popcnt64(top - bot);
}

pdep cannot move bits to the right of their original position, in some cases (if there is no space for padding bits) you may need to hack around that, as in the example above. Rotates and right-shift are good candidates to do that with, and in general you may gather the bits with non-zero weights with pext.

This approach relies strongly on being able to efficiently implement the step "make w[i] copies of bit i", it is probably not possible to do that efficiently using only the plain old standard integer operations. Also, pdep is not efficient on all processors that support it, unfortunately making the CPUID feature flag for BMI2 useless for deciding which implementation to use.