Bits, Math and Performance(?)

From Boolean logic to bitmath and SIMD: transitive closure of tiny graphs

2025-06-06T19:43:00.000-07:00

Let's say that we have a graph of at most 8 nodes (you can also think of it as a relation between 8 things), represented as an 8 by 8 Boolean matrix, and we want to compute its transitive closure. A basic translation of Warshall's algorithm (Warshall's formulation of the Floyd–Warshall algorithm computes the transitive closure instead of shortest paths) into what I imagine a C++ programmer might write would look like this:

// version 0
std::bitset<64> closure8x8(std::bitset<64> g)
{
    for (size_t k = 0; k < 8; k++) {
        for (size_t i = 0; i < 8; i++) {
            if (g[i * 8 + k]) {
                for (size_t j = 0; j < 8; j++) {
                    if (g[k * 8 + j])
                        g.set(i * 8 + j);
                }
            }
        }
    }
    return g;
}

Some may be tempted to suggest switching to a non-bit-packed representation instead of std::bitset, but then you're reading the wrong blog. Working with individual Booleans one-by-one is a dead end whether they're packed or not, and a packed representation has more opportunities to work with multiple bits at once, which is exactly what we're doing to do. The "trick" here, if it can even be called a trick, is to extract the k-th row of the 8x8 matrix and OR it into the i-th row, if and only if node k is directly reachable from node i.

//version 1
uint64_t closure8x8(uint64_t x)
{
    for (size_t k = 0; k < 8; k++) {
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        for (size_t i = 0; i < 8; i++) {
            if (x & (UINT64_C(1) << (i * 8 + k)))
                x |= kth_row << (i * 8);
        }
    }
    return x;
}

Although x is obviously modified in the i-loop, it is modified in a way such that the test doesn't really depend on the new value, we could use a copy of x that is taken just before entering the i-loop (challenge for compiler writers: automate this optimization). This isn't just for show, it's measurably more efficient (at least it was with my tests, on my PC, using a specific compiler..) and makes the next step less of a leap.

//version 2
uint64_t closure8x8(uint64_t x)
{
    for (size_t k = 0; k < 8; k++) {
        uint64_t x_old = x;
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        for (size_t i = 0; i < 8; i++) {
            if (x_old & (UINT64_C(1) << (i * 8 + k)))
                x |= kth_row << (i * 8);
        }
    }
    return x;
}

Now we can unroll the i-loop. The 8 condition bits, the bits that control whether the if is entered or not, can be pulled out with (x >> k) & 0x0101010101010101. In each byte of that mask we get either 0, if the k-th row shouldn't be ORed into the corresponding row, or 1, if it should be. Multiplying the mask by the k-th row keeps every zero, but changes every 1 to a copy of the k-th row, precisely what we need to OR by.

//version 3
uint64_t closure8x8(uint64_t x)
{
    uint64_t lsb_of_byte = 0x0101010101010101;
    for (size_t k = 0; k < 8; k++)
    {
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        uint64_t m = (x >> k) & lsb_of_byte;
        x |= kth_row * m;
    }
    return x;
}

That's decent, if you're not into weird operations or SIMD you can stop there.

MOR

Similar to how (Floyd-)Warshall "looks like" a kind of matrix multiplication, transitive closure "look like" MMIX's MOR operation (which is also a kind of matrix multiplication), which I implemented like this:

uint64_t MOR(uint64_t a, uint64_t b)
{
    uint64_t lsb_of_byte = 0x0101010101010101;
    uint64_t r = 0;
    for (size_t i = 0; i < 8; i++)
    {
        uint64_t ith_row = (a >> (8 * i)) & 0xFF;
        uint64_t m = (b >> i) & lsb_of_byte;
        r |= ith_row * m;
    }
    return r;
}

Beyond merely looking similar, if we had an efficient implementation of MOR (do we?) it would be possible to use that to implement the transitive closure, although that's not as simple as taking the "MOR square" of the matrix. The "internal dependency" of Warshall's algorithm may look like a small matter on the surface but it's a crucial difference. Essentially, if A is the matrix (with Boolean entries and using OR-AND-algebra instead of the more familiar PLUS-MULTIPLY) representing our graph, we need the sum of A¹, A², ... A⁸, not just A². That can be done with only 3 uses of MOR and 3 bitwise OR. Squaring A¹ + A² gives us A² + A³ + A⁴, adding A¹ and squaring again gives us A² + ... A⁸ (1+1=1 in this context so multiple copies of A raised to the same exponent collapse into one when summed).

//version 4
uint64_t closure8x8(uint64_t x)
{
    x |= MOR(x, x);
    x |= MOR(x, x);
    x |= MOR(x, x);
    return x;
}

Although MOR cannot easily be built on top of GF2P8AFFINEQB, there is a trick that uses it (but only for its ability to transpose):

uint64_t MOR(uint64_t a, uint64_t b)
{
    __m512i va = _mm512_set1_epi64(a);
    __m512i z = _mm512_setzero_si512();
    __m512i m_id = _mm512_set1_epi64(0x8040201008040201);
    __m512i masked = _mm512_mask_blend_epi8(b, z, va);
    __m512i maskedt = _mm512_gf2p8affine_epi64_epi8(m_id, masked, 0);
    return _mm512_cmpneq_epi8_mask(maskedt, z);
}

On my Intel Rocket Lake, version 4 (using the AVX-512 trick for MOR) is not great. The latency is surprisingly decent. In terms of throughput, it beats version 3 by some 35% .. but that's bad. This approach "wastes" the entire SIMD width on processing a single graph. If we have multiple graphs, which we probably do if we're talking about throughput, this is not the way to go. There may be a niche use for it in a context where there's only one graph to process, and the latency isn't critical, and it's competing with another computation.

No more MOR

Taking the same approach as version 3, but using real SIMD instead of "faking SIMD" with scalar operations, the SIMD width is still available to process 4 graphs at once with AVX2:

// version 5
__m256i closure8x8_avx2(__m256i g)
{
    uint64_t ones = 0x0101010101010101;
    __m256i z = _mm256_setzero_si256();
    __m256i m, shuf;

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(0, ones * 8));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 7));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones, ones * 9));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 6));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 2, ones * 10));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 5));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 3, ones * 11));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 4));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 4, ones * 12));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 3));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 5, ones * 13));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 2));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 6, ones * 14));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 1));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 7, ones * 15));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), g);
    g = _mm256_or_si256(g, m);

    return g;
}

An AVX-512 version could process 8 graphs at once, but it would have to change a bit because there is no 512-bit vpblendvb. Instead you could use vpblendmb, and vptestmb to extract the masks.

Bit-permuting 16 u32s at once with AVX-512

2024-12-10T04:09:00.002-08:00

The basic trick to apply the same bit-permutation to each of the u32s is to view them as matrix of 16 rows by 32 columns, transpose it into a 32 u16s, permute those u16s in the same way that we wanted to permute the bits of the u32s [1], then transpose back to 16 u32s. Easy:

__m512i permbits_16x32(__m512i data, __m512i indices)
{
    __m512i x = data;
    x = transpose_16_dwords_to_32_words(x);
    x = _mm512_permutexvar_epi16(indices, x);
    x = transpose_32_words_to_16_dwords(x);
    return x;
}

transpose_16_dwords_to_32_words and transpose_32_words_to_16_dwords may be tricky to implement by hand, but haroldbot.nl/avx512bpc.html can solve them. In order to transpose 16 dwords into 32 words, the 4 most significant bits of every bit-index (the 9-bit bit-index of every bit in a 512-bit __m512i) should become the 4 least significant bits, and the 5 least significant bits of the index should become the most significant bits. In other words, the bit order of the bit-indices should become 5,6,7,8,0,1,2,3,4. For the inverse, 4,5,6,7,8,0,1,2,3.

Taking the solutions from that AVX512 BPC permute solver and making them the bread of the transpose-shuffle-transpose sandwich is a valid solution, but there is an opportunity for improvement: that sandwich ends up with 3 back-to-back permutes in the middle, perhaps they can be merged somehow.

One solution I've found, is to use the bit-index-bit-order 5,6,7,0,1,2,3,4,8. Keeping the most-significant bit in place simplifies the first "transpose" to no longer need a shuffle at the end, but now instead of shuffling each pair of adjacent bytes the same way (which a permutation of 16-bit elements accomplishes for free) we need to shuffle the low and high half of a 64-byte vector the same way, which requires some pre-processing of the index-vector: duplicate a 32-byte vector into the top and bottom of a 64-byte vector, and add 32 to every byte in the top half. The second not-quite-transpose in the new sandwich, corresponding to a bit-order of 3,4,5,6,7,0,1,2,8, starts with a simple permute: byte-reverse every u64. This permute can be absorbed into the pre-processing of the index-vector. In general that would not help, but in cases in which the same index-vector is reused multiple times (probably a common usage pattern, I've already used it that way myself) moving that permute into the pre-processing step can move it out of the loop.

The code

Putting it all together, permbits_16x32 can be implemented like this:

__m512i permbits_16x32_weirdindex(__m512i x, __m512i p) {
    
    __m512i mID = _mm512_set1_epi64(0x8040201008040201);
    __m512i s1 = _mm512_set_epi8(
        35, 39, 43, 47, 51, 55, 59, 63,
        34, 38, 42, 46, 50, 54, 58, 62,
        33, 37, 41, 45, 49, 53, 57, 61,
        32, 36, 40, 44, 48, 52, 56, 60,
        3, 7, 11, 15, 19, 23, 27, 31,
        2, 6, 10, 14, 18, 22, 26, 30,
        1, 5, 9, 13, 17, 21, 25, 29,
        0, 4, 8, 12, 16, 20, 24, 28);
    __m512i s2 = _mm512_set_epi8(
        63, 55, 47, 39, 62, 54, 46, 38,
        61, 53, 45, 37, 60, 52, 44, 36,
        59, 51, 43, 35, 58, 50, 42, 34,
        57, 49, 41, 33, 56, 48, 40, 32,
        31, 23, 15, 7, 30, 22, 14, 6,
        29, 21, 13, 5, 28, 20, 12, 4,
        27, 19, 11, 3, 26, 18, 10, 2,
        25, 17, 9, 1, 24, 16, 8, 0);
    x = _mm512_permutexvar_epi8(s1, x);
    x = _mm512_gf2p8affine_epi64_epi8(mID, x, 0);
    x = _mm512_permutexvar_epi8(p, x);
    x = _mm512_gf2p8affine_epi64_epi8(mID, x, 0);
    x = _mm512_permutexvar_epi8(s2, x);
    return x;
}

__m512i permbits_16x32(__m512i x, __m256i indices)
{
    __m512i s1 = _mm512_set_epi8(
        24, 25, 26, 27, 28, 29, 30, 31,
        16, 17, 18, 19, 20, 21, 22, 23,
        8, 9, 10, 11, 12, 13, 14, 15,
        0, 1, 2, 3, 4, 5, 6, 7,
        24, 25, 26, 27, 28, 29, 30, 31,
        16, 17, 18, 19, 20, 21, 22, 23,
        8, 9, 10, 11, 12, 13, 14, 15,
        0, 1, 2, 3, 4, 5, 6, 7);
    __m512i p = _mm512_permutexvar_epi8(s1, _mm512_castsi256_si512(indices));
    uint64_t m = 0x2020202020202020;
    p = _mm512_add_epi8(p, _mm512_set_epi64(m, m, m, m, 0, 0, 0, 0));
    return permbits_16x32_weirdindex(x, p);
}

[1]: As a bonus, you can put other permutation-like operations between the transposes, such as vpcompressd to perform a pext on every u32 (but with only one shared mask used for all 16 u32s), or even operations that are not permutation-like.

Histogramming bytes with positional popcount (GF2P8AFFINEQB edition)

2024-11-10T04:47:00.000-08:00

A while ago, after some back and forth on twitter/X with @corsix, I dropped some implementation of byte histogramming without explaining anything. This post aims to rectify that lack of explanation.

`pospopcnt`

Imagine an array of N k-bit words as an N-by-k matrix of bits. It's easy to take a plain old popcnt across a row of the matrix. pospopcnt is the count per position, ie summing across a column.

The amount of "heavy vertical summation" (assuming that it is a heavy operation) can be reduced by putting the rows through some "vertical" carry-save adders (which works out to some cheap bitwise operations, especially cheap on AVX512 / AVX10 thanks to ternlog), this technique is also discussed in for example Efficient Computation of Positional Population Counts Using SIMD Instructions.

That reduction, though efficient, produces an intermediate result in an awkward format. If we have four 512-bit vectors as the intermediate result and are taking the pospopcnt of 64-bit QWORDs, each of the vectors now holds 8 groups of QWORDs, where the k'th bit of each QWORD in the i'th vector contributes a value of 2ⁱ to the k'th position of the pospopcnt. "Efficient Computation of Positional Population Counts Using SIMD Instructions" and the associated code show various ways to do this without using GF2P8AFFINEQB. That paper and code have a context of pospopcnt-ing 16-bit words, which makes GF2P8AFFINEQB-free approaches more viable.

With GF2P8AFFINEQB (and VPERMB), the bits in each vector can be regouped so that the 8 k'th bit of each QWORD are grouped together in the k'th byte of the vector. VPOPCNTB easily sums those bits, from here on all we need to do is shift each vector by i and add them up.

Now that we have a pospopcnt, we can use that to make a histogram of some data after transforming that data via 1 << data.

Binning

The pospopcnt as I described it in the previous paragraph only produces 64 counts, so it's suitable for making a histogram of data in the range from 0 up to 64. For bytes, the pospopcnt could be generalized to count across 256 positions, which is how one of my early prototypes of this algorithm worked. Doing it that way results in an somewhat awkward left shift (it's surprisingly not-horrible to set the k'th bit of a 256-bit vector, but still more awkward than just VPSLLVQ-ing 1 by the data), and in having to reduce/sum vectors in which only 1 out of every 256 bits is set. The 64-bit case already suffers from mostly summing zeroes, with the reduction operations mostly reducing nothing to nothing, and the 256-bit case makes that 4 times as bad.

For a while I thought that limited the applicability of the "pospopcnt histogram algorith" to 6-bit data, but then I had another idea: if the data is not already in the range 0..64, split it into bins that are. As long as that can be done quickly enough, and it turned out that it can be done quickly enough based on VPCOMPRESSB, the extra time spent on binning the data pays for itself by enabling the use of the 64-bit pospopcnt rather than its slow 256-bit generalization.

Extras

According to my own benchmarks on an Intel 11600K (Rocket Lake), this algorithm is almost twice as fast as Powturbo's hist_8_64 on both random and non-random data, reaching about 8.8GB/s vs 4.6GB/s for hist_8_64 either way, being mostly immune to patterns in the data (unlike the good old hist[data[i]]++ algorithm, which is infamous for slowing down for some kinds of patterned data).

Benchmarks have also been done on Zen5 (9700X), which as a side effect also make my 11600K look like a joke. Interestingly, from this benchmark it looks like some kinds of patterned data increase performance on Zen5, though I did not see that effect on my 11600K.

Some improvements have been found by Joern Engel and Curtis Bezault.

Below is a diagram summarizing the major components of the algorithm.

Enumerating identities, part 2

2024-08-22T06:01:00.000-07:00

Part 2 of Enumerating all mathematical identities (in fixed-size bitvector arithmetic with a restricted set of operations) of a certain size.

To recap, the approach in part 1 was broadly:

Use a CEGIS-style loop to find a pair of expressions that are equal for all possible inputs.
After finding such a pair, block it from being found again.
Repeat until no more pairs can be found.

One weakness (in addition to the other weaknesses) of that approach is that the "block list" keeps growing as more identities are found. Here is an alternative approach that does not use an ever-growing block list:

Interpreting the raw bit-string that represents a pair of expressions as an integer, find the lowest pair of equivalent expressions.
After finding the lowest pair, interpreted as the number X, set the lower bound for the next pair to X + 1.
Repeat until no more pairs can be found.

Finding the lowest pair of expressions

SAT by itself does not try to find the lowest solution, nor does the CEGIS-loop built on top of it. But we can use the CEGIS-loop as an oracle to answer the question: is there any solution in (a restricted part of) the search space, and that lets us do a bitwise binary search - the "find one bit at the time" variant of binary search, typically discussed in a completely different context. Bitwise binary search maps well to SAT, not directly of course, but in the following way:

Initialize the prefix to an empty list.
If the prefix has the same size as a pair of expressions, directly return the result of the CEGIS-oracle.
Extend the prefix with false.
Ask the CEGIS-oracle if there is a solution that starts with the prefix, if there is, go back to step 2.
Otherwise, turn the false at the end of the prefix into true, then go back to step 2.

Asking a SAT solver for a solution that starts with a given prefix is easy and tends to make the SAT instance easier to solve (especially when the prefix is long), this only involves forcing some variables (the ones covered by the prefix) to be true or false, using single-literal clauses.

A very useful optimization can be done in step 4: when the CEGIS-oracle says there is a solution with the current prefix, the solution may have some extra zeroes after the prefix which we can use directly to extend the prefix for free. That saves a lot of SAT solves when the solutions tend to have a lot of zeroes in them, which in my case they do, due to extensive use of one-hot encoding.

Encoding the lower bound

There is a neat way to encode that an integer must be greater-than-or-equal-to some given constant, which I haven't seen people talk about (perhaps it's part of the folklore?), using only popcnt(lower_bound) clauses. The idea here is that for every bit that's set in the bound, at least one of the following bits must be set in the solution: that bit itself, or any more-significant bit that is zero in the lower bound. That only takes one clause to encode, a clause containing the variable that corresponds to the set bit, and the variables corresponding to the zeroes to the left of that set bit.

Code

For concreteness, here's how I implemented constraining solutions to conform to the prefix, it's really simple:

// set the prefix (used by binary search)
for (size_t i = 0; i < prefix.size(); i++)
{
    if (prefix[i])
        s.addClause(allprogbits[i]);
    else
        s.addClause(~allprogbits[i]);
}

Here's how I set the lower bound:

// if there is a lower bound (used by binary search), enforce it
if (!lower_bound.empty())
{
    vector<Lit> cl;
    for (size_t i = 0; i < lower_bound.size(); i++) {
        if (lower_bound[i]) {
            cl.push_back(allprogbits[i]);
            s.addClause(cl);
            cl.pop_back();
        }
        else
        {
            cl.push_back(allprogbits[i]);
        }
    }
}

And binary search (with the optimization to keep the extra zeroes that the solver gives for free) looks like this:

optional<pair<vector<InstrB>, vector<InstrB>>> find_lowest(vector<bool>& prefix,
      vector<vector<int>>& inputs)
{
    do {
        if (prefix.size() == progbits) {
            return format_progbits(synthesize(prefix, inputs),
                inputcount, lhs_size, rhs_size);
        }
        else {
            prefix.push_back(false);
            auto f = synthesize(prefix, inputs);
            if (f.has_value()) {
                // if the next bits are already zero in the solution, keep them zero
                auto bits = *f;
                while (prefix.size() < progbits && !bits[prefix.size()])
                    prefix.push_back(false);
                continue;
            }

            prefix.pop_back();
            prefix.push_back(true);
        }
    } while (true);
}

The full code of my implementation of this idea is available on gitlab. It's a bit crap but at least it should show any detail that you may still be wondering about. In the code I also constrain expressions to not be "a funny way to write zero" and to not be "a complicated way to do nothing", otherwise a lot of less-interesting identities would be generated.

Sorting the nibbles of a u64

2024-06-18T04:14:00.000-07:00

I was reminded (on mastodon) of this nibble-sorting technique (it could be adapted to other element sizes), which I apparently had only vaguely tweeted about in the past. It deserves a post, so here it is.

Binary LSD radix sort can be expressed as a sequence of stable-partitions, first stable-partitioning based on the least-significant bit, then on the second-to-least-significant bit and so on.

In modern x86, pext essentially implements half of a stable partition, only the half that moves a subset of the elements down towards lower indices. If we do that twice, the second time with an inverted mask, and shift the subset of elements where the mask is set left to put it at the top, we get a gadget that partitions a u64 based on a mask:

(pext(x, mask) << popcount(~mask)) | pext(x, ~mask)

This is sometimes called the sheep-and-goats operation.

For radix sort the masks that we need are, in order, the least significant bit of each element, each broadcasted to cover the whole corresponding element, then the same thing but with the second-to-least-signficant bit and so on. One way to express that is by shifting the whole thing right to put the bit that we want to broadcast in the the least significant position of the element, and then multiplying by 15 to broadcast that bit into every bit of the element. Different compilers handled that multiplication by 15 differently (there are alternative ways to express that).

static ulong sort_nibbles(ulong x)
{
    ulong m = 0x1111111111111111;
    ulong t = (x & m) *15;
    x = (Bmi2.X64.ParallelBitExtract(x, t) << BitOperations.PopCount(~t)) |
        Bmi2.X64.ParallelBitExtract(x, ~t);
    t = ((x >> 1) & m) * 15;
    x = (Bmi2.X64.ParallelBitExtract(x, t) << BitOperations.PopCount(~t)) |
        Bmi2.X64.ParallelBitExtract(x, ~t);
    t = ((x >> 2) & m) * 15;
    x = (Bmi2.X64.ParallelBitExtract(x, t) << BitOperations.PopCount(~t)) |
        Bmi2.X64.ParallelBitExtract(x, ~t);
    t = ((x >> 3) & m) * 15;
    x = (Bmi2.X64.ParallelBitExtract(x, t) << BitOperations.PopCount(~t)) |
        Bmi2.X64.ParallelBitExtract(x, ~t);
    return x;
}

It's easy to extend this to a key-value sort. Hypothetically you could use that key-value sort to invert a permutation (sorting the values 0..15 by the permutation), but you can do much better with AVX512.

Sharpening a lower bound with KnownBits information

2024-06-04T17:22:00.000-07:00

I have written about this before, but that was a long time ago, I've had a lot more practice with similar things since then. This topic came up on Mastodon, inspiring me to give it another try. Actually the title is a bit of a lie, I will be using Miné's bitfield domain in which we have bitvectors z indicating the bits that can be zero and o indicating the bits that can be one (as opposed to bitvectors which indicate the bits known to be zero and the bits known to be one respectively, or bitvectors ⁰b and ¹b as used in the paper linked below where ¹b has bits set that are known to be set and ⁰b has bits unset that are known to be unset). The exact representation doesn't really matter.

The problem, to be clear, is that suppose we have a lower bound on some variable, along with some knowledge about its bits (knowing that some bits have a fixed value, which others do not), for example we may know that a variable is even (its least significant bit is known to be zero) and at least 5. "Sharpening" the lower bound means increasing it, if possible, so that the lower bound "fits" the knowledge we have about the bits. If a value is even and at least 5, it is also at least 6, so we can increase the lower bound.

As a more recent reference for an algorithm that is better than my old one, you can read Sharpening Constraint Programming approaches for Bit-Vector Theory.

As that paper notes, we need to find the highest bit in the current lower bound that doesn't "fit" the KnownBits (or z, o pair from the bitfield domain) information, and then either:

If that bit was not set but should be, we need to set it, and reset any lower bits that are not required to be set (lower bound must go up, but only as little as possible).
If that bit was set but shouldn't be, we need to reset it, and in order to do that we need to set a higher bit that wasn't set yet, and also reset any lower bits that are not required to be set.

So far so good. What that paper doesn't tell you, is that these are essentially the same case, and we can do:

Starting from the highest "wrong" bit in the lower bound, find the lowest bit that is unset but could be set, set it, and clear any lower bits that are not required to be set.

That mostly sounds like the second case, what allows the original two cases to be unified is the fact that the bit we find is the same as the bit that needs to be set in the first case too.

As a reminder, x & -x is a common technique used to extract or isolate the lowest set bit aka blsi. It can also be written as x & (~x + 1), and if we change the 1 to some other constant, we can use this technique to find the lowest set bit but starting from some position that is not necessarily the least significant bit. So if we start from highestSetBit(~low & o), we find the bit we're looking for. Actually the annoying part is highestSetBit. Putting the rest together, we may get an implementation like this:

uint64_t sharpen_low(uint64_t low, uint64_t z, uint64_t o)
{
    uint64_t m = (~low & ~z) | (low & ~o);
    if (m) {
        uint64_t target = ~low & o;
        target &= ~target + highestSetBit(m);
        low = (low & -target) | target;
        low |= ~z;
    }
    return low;
}

The branch on m is a bit annoying, but on the plus side it means that the input of highestSetBit is always non-zero. Zero is otherwise a bit of an annoying case to handle in highestSetBit. In modern C++, you can use std::bit_floor for highestSetBit.

Sharpening the upper bound is symmetric, it can be implemented as ~sharpen_low(~high, o, z) or you could push the bitwise flips "inside" the algorithm and do some algebra to cancel them out.

Multiplying 64x64 bit-matrices with GF2P8AFFINEQB

2024-06-03T00:23:00.000-07:00

This is a relatively simple use of GF2P8AFFINEQB. By itself GF2P8AFFINEQB essentially multiplies two 8x8 bit-matrices (but with transpose-flip applied to the second operand, and an extra XOR by a byte that we can set to zero). A 64x64 matrix can be seen as a block-matrix where each block is an 8x8 matrix. You can also view this as taking the ring of 8x8 matrices over GF(2), and then working with 8x8 matrices with elements from that ring. All we really need to do is write an 8x8 matrix multiplication and let GF2P8AFFINEQB take care of the complicated part.

Using the "full" 512-bit version of VGF2P8AFFINEQB from AVX-512, one VGF2P8AFFINEQB instructions performs 8 of those products. A convenient way to use them is by broadcasting an element (which is really an 8x8 bit-matrix but let's put that aside for now) from the left matrix to all QWORD elements of a ZMM vector, and multiplying that by a row of the right matrix. That way we end up with a row of the result, which is nice to work with: no reshaping or horizontal-SIMD required. XOR-ing QWORDs together horizontally could be done relatively reasonably with another GF2P8AFFINEQB trick, which is neat but avoiding it is even better. All we need to do to compute a row of the result (still viewed as an 8x8 matrix) is 8 broadcasts, 8 VGF2P8AFFINEQB, and XOR-ing the 8 results together, which doesn't take 8 VPXORQ because VPTERNLOGQ can XOR three vectors together. Then just do this for each row of the result.

There are two things that I've skipped so far. First, the built-in transpose-flip of GF2P8AFFINEQB needs to be cancelled out with a flip-transpose (unless explicitly working with a matrix in a weird format is OK). Second, working with an 8x8 block-matrix is mathematically "free" by imagining some dotted lines running through the matrix, but in order to get the right data into GF2P8AFFINEQB we have to actually rearrange it (again: unless the weird format is OK).

One way to implement a flip-transpose (ie the inverse of the bit-permutation that GF2P8AFFINEQB applies to its second operand) is by reversing the bytes in each QWORD and then left-multiplying (in the second of GF2P8AFFINEQB-ing with constant as the first operand) by a flipped identity matrix, which as a QWORD looks like: 0x0102040810204080. Reversing the bytes in each QWORD could be done with a VPERMB, there are other ways, but we're about to have a VPERMB anyway.

Rearranging the data between a fully row-major layout and an 8x8 matrix in which each element is an 8x8 bit-matrix is easy, that's just an 8x8 transpose after all, so just VPERMB. That's needed both for the inputs and the output. The input that is the right-hand operand of the overall matrix multiplication also needs to have a byte-reverse applied to each QWORD, the same VPERMB that does that transpose can also do that byte-reverse.

Here's one way to put that all together:

array<uint64_t, 64> mmul_gf2_avx512(const array<uint64_t, 64>& A, const array<uint64_t, 64>& B)
{
    __m512i id = _mm512_set1_epi64(0x0102040810204080);
    __m512i tp = _mm512_setr_epi8(
        0, 8, 16, 24, 32, 40, 48, 56,
        1, 9, 17, 25, 33, 41, 49, 57,
        2, 10, 18, 26, 34, 42, 50, 58,
        3, 11, 19, 27, 35, 43, 51, 59,
        4, 12, 20, 28, 36, 44, 52, 60,
        5, 13, 21, 29, 37, 45, 53, 61,
        6, 14, 22, 30, 38, 46, 54, 62,
        7, 15, 23, 31, 39, 47, 55, 63);
    __m512i tpr = _mm512_setr_epi8(
        56, 48, 40, 32, 24, 16, 8, 0,
        57, 49, 41, 33, 25, 17, 9, 1,
        58, 50, 42, 34, 26, 18, 10, 2,
        59, 51, 43, 35, 27, 19, 11, 3,
        60, 52, 44, 36, 28, 20, 12, 4,
        61, 53, 45, 37, 29, 21, 13, 5,
        62, 54, 46, 38, 30, 22, 14, 6,
        63, 55, 47, 39, 31, 23, 15, 7);
    array<uint64_t, 64> res;

    __m512i b_0 = _mm512_permutexvar_epi8(tpr, _mm512_loadu_epi64(&B[0]));
    __m512i b_1 = _mm512_permutexvar_epi8(tpr, _mm512_loadu_epi64(&B[8]));
    __m512i b_2 = _mm512_permutexvar_epi8(tpr, _mm512_loadu_epi64(&B[16]));
    __m512i b_3 = _mm512_permutexvar_epi8(tpr, _mm512_loadu_epi64(&B[24]));
    __m512i b_4 = _mm512_permutexvar_epi8(tpr, _mm512_loadu_epi64(&B[32]));
    __m512i b_5 = _mm512_permutexvar_epi8(tpr, _mm512_loadu_epi64(&B[40]));
    __m512i b_6 = _mm512_permutexvar_epi8(tpr, _mm512_loadu_epi64(&B[48]));
    __m512i b_7 = _mm512_permutexvar_epi8(tpr, _mm512_loadu_epi64(&B[56]));
    
    b_0 = _mm512_gf2p8affine_epi64_epi8(id, b_0, 0);
    b_1 = _mm512_gf2p8affine_epi64_epi8(id, b_1, 0);
    b_2 = _mm512_gf2p8affine_epi64_epi8(id, b_2, 0);
    b_3 = _mm512_gf2p8affine_epi64_epi8(id, b_3, 0);
    b_4 = _mm512_gf2p8affine_epi64_epi8(id, b_4, 0);
    b_5 = _mm512_gf2p8affine_epi64_epi8(id, b_5, 0);
    b_6 = _mm512_gf2p8affine_epi64_epi8(id, b_6, 0);
    b_7 = _mm512_gf2p8affine_epi64_epi8(id, b_7, 0);

    for (size_t i = 0; i < 8; i++)
    {
        __m512i a_tiles = _mm512_loadu_epi64(&A[i * 8]);
        a_tiles = _mm512_permutexvar_epi8(tp, a_tiles);
        __m512i row = _mm512_ternarylogic_epi64(
            _mm512_gf2p8affine_epi64_epi8(_mm512_permutexvar_epi64(_mm512_set1_epi64(0), a_tiles), b_0, 0),
            _mm512_gf2p8affine_epi64_epi8(_mm512_permutexvar_epi64(_mm512_set1_epi64(1), a_tiles), b_1, 0),
            _mm512_gf2p8affine_epi64_epi8(_mm512_permutexvar_epi64(_mm512_set1_epi64(2), a_tiles), b_2, 0), 0x96);
        row = _mm512_ternarylogic_epi64(row,
            _mm512_gf2p8affine_epi64_epi8(_mm512_permutexvar_epi64(_mm512_set1_epi64(3), a_tiles), b_3, 0),
            _mm512_gf2p8affine_epi64_epi8(_mm512_permutexvar_epi64(_mm512_set1_epi64(4), a_tiles), b_4, 0), 0x96);
        row = _mm512_ternarylogic_epi64(row,
            _mm512_gf2p8affine_epi64_epi8(_mm512_permutexvar_epi64(_mm512_set1_epi64(5), a_tiles), b_5, 0),
            _mm512_gf2p8affine_epi64_epi8(_mm512_permutexvar_epi64(_mm512_set1_epi64(6), a_tiles), b_6, 0), 0x96);
        row = _mm512_xor_epi64(row,
            _mm512_gf2p8affine_epi64_epi8(_mm512_permutexvar_epi64(_mm512_set1_epi64(7), a_tiles), b_7, 0));
        row = _mm512_permutexvar_epi8(tp, row);
        _mm512_storeu_epi64(&res[i * 8], row);
    }
    
    return res;
}

When performing multiple matrix multiplications in a row, it may make sense to leave the intermediate results in the format of an 8x8 matrix of 8x8 bit-matrices. The B matrix needs to be permuted anyway, but the two _mm512_permutexvar_epi8 in the loop can be removed. And obviously, if the same matrix is used as the B matrix several times, it only needs to be permuted once. You may need to manually inline the code to convince your compiler to keep the matrix in registers.

Crude benchmarks

A very boring conventional implementation of this 64x64 matrix multiplication may look like this:

array<uint64_t, 64> mmul_gf2_scalar(const array<uint64_t, 64>& A, const array<uint64_t, 64>& B)
{
    array<uint64_t, 64> res;
    for (size_t i = 0; i < 64; i++) {
        uint64_t result_row = 0;
        for (size_t j = 0; j < 64; j++) {
            if (A[i] & (1ULL << j))
                result_row ^= B[j];
        }
        res[i] = result_row;
    }
    return res;
}

There are various ways to write this slightly differently, some of which may be a bit faster, that's not really the point.

On my PC, which has a 11600K (Rocket Lake) in it, mmul_gf2_scalar runs around 500 times as slow (in terms of the time taken to perform a chain of dependent multiplications) as the AVX-512 implementation. Really, it's that slow - but that is partly due to my choice of data: I mainly benchmarked this on random matrices where each bit has a 50% chance of being set. The AVX-512 implementation does not care about that at all, while the above scalar implementation has thousands (literally) of branch mispredictions. That can be fixed without using SIMD, for example:

array<uint64_t, 64> mmul_gf2_branchfree(const array<uint64_t, 64>& A, const array<uint64_t, 64>& B)
{
    array<uint64_t, 64> res;
    for (size_t i = 0; i < 64; i++) {
        uint64_t result_row = 0;
        for (size_t j = 0; j < 64; j++) {
            result_row ^= B[j] & -((A[i] >> j) & 1);
        }
        res[i] = result_row;
    }
    return res;
}

That was, in my benchmark, already about 8 times as fast as the branching version, if I don't let MSVC auto-vectorize. If I do let it auto-vectorize (with AVX-512), this implementation becomes 25 times as fast as the branching version. This was not supposed to be a post about branch (mis)prediction, but be careful out there, you might get snagged on a branch.

It would be interesting to compare the AVX-512 version against established finite-field (or GF(2)-specific) linear algebra packages, such as FFLAS-FFPACK and M4RI. Actually I tried to include M4RI in the benchmark, but it ended up being 10 times as slow as mmul_gf2_branchfree (when it is auto-vectorized). That's bizarrely bad so I probably did something wrong, but I've already sunk 4 times as much time into getting M4RI to work at all as it took to write the code which this blog post is really about, so I'll just accept that I did it wrong and leave it at that.

Implementing grevmul with GF2P8AFFINEQB

2024-05-28T20:23:00.000-07:00

As a reminder, grev is generalized bit-reversal, and performs a bit-permutation that corresponds to XORing the indices of the bits of the left operand by the number given in the right operand. A possible (not very suitable for actual use, but illustrative) implementation of grev is:

def grev(bits, k):
    return bits[np.arange(len(bits)) ^ k]

grevmul (see also this older post where I list some of its properties) can be defined in terms of grev (hence the name^[1]), with grev replacing the left shift in a carryless multiplication. But let's do something else first. An interesting way to look at multiplication (plain old multiplication between natural numbers) is as:

Form the Cartesian product of the inputs, viewed as arrays of bits.
For each pair in the Cartesian product, compute the AND of the two bits.
Send the AND of each pair with index (i, j) to the bin i + j, to be accumulated with the function (+).
Resolve the carries.

Carryless multiplication works mostly the same way except that accumulation is done with XOR, which makes step 4 unnecessary. grevmul also works mostly the same way, with accumulation also done with XOR, but now the pair with index (i, j) is sent to the bin i XOR j.

	+	^
+	imul	clmul
^	???	grevmul

That won't be important for the implementation, but it may help you think about what grevmul does, and this will be on the test. OK there is no test, but you can test yourself by explaining why (popcnt(a & b) & 1) == (grevmul(a, b) & 1), based on reasoning about the Cartesian-product-algorithm for grevmul.

Implementing `grevmul` with `GF2P8AFFINEQB`

Some of you may have already seen a version of the code that I am about to discuss, although I have made some changes since. Nothing serious, but while thinking about how the code worked, I encountered some opportunities to make polish it up.

GF2P8AFFINEQB computes, for each QWORD (so for now, let's concentrate on one QWORD of the result), the product of two bit-matrices, where the left matrix comes from the first input and the right matrix is the transpose-flip (or flip-transpose, however you prefer to think about it) of the second input. You can think of it as a mutant of the mxor operation found in Knuth's MMIX instruction set, and from here on I will use the name bmatxor for the version of this operation that simply multiplies two 8x8 bit-matrices, with none of that transpose-flip business^[2]. There is also a free XOR by a constant byte thrown in at the end, which can be safely ignored by setting it to zero. The transpose-flip however need to be worked around or otherwise taken into account. You may also want to read (Ab)using gf2p8affineqb to turn indices into bits first for some extra familiarity with / alternate view of GF2P8AFFINEQB.

To implement grevmul(a, b), we need some linear combination (controlled by b) of permuted (by grev) versions of a (the roles of a and b can be swapped since grevmul is commutative, which the Cartesian product-based algorithm makes clear). GF2P8AFFINEQB is all about linear combinations, but it works with 8x8 matrices, not 64x64 (put that in AVX-4096) which would have made it really simple. Fortunately, we can slice a grevmul however we want.

Now to start in the middle, let's say we have 8 copies of a byte (a byte of b), with the i'th copy grev'ed by i, concatenated into a QWORD that I will call m. bmatxor(a, m) would (thinking of a matrix multiplication AB as forming linear combinations of rows of B), for each row of the result, form a linear combination (controlled by a) of grev'ed copies of the byte from b. That may seem like it would be wrong, since every byte of a is done separately and uses the same m, so it's "missing" a grev by 8 for the second byte, 16 for the third byte, etc. But if x is a byte (not if and only if, just regular "if"), then grev(x, 8 * i) is the same as x << (8 * i) and the second byte is indeed already in the second position anyway, so we get this for free. Thus, mxor(a, m) would allow us to grevmul a 64-bit number by an 8-bit number. If we could just do that 8 times (for each byte of b) and combine the results, we're done.

But we don't have bmatxor, we have GF2P8AFFINEQB with its built-in transpose-flip, and that presents a choice: either put another GF2P8AFFINEQB before, or after, the "main" GF2P8AFFINEQB to counter that built-in transpose. Not the whole transpose-flip, let's put the flip aside for now. There is a small reason to favour the "extra GF2P8AFFINEQB after the main one" order, namely that that results in GF2P8AFFINEQB(m, broadcast(a)) (as opposed to GF2P8AFFINEQB(broadcast(a), m)) and when a comes from memory it can be loaded and broadcasted directly with a {1to8} broadcasted memory operand. That option would not be available if a was the first operand of the GF2P8AFFINEQB. This is a small matter, but we have to make the choice somehow, and there seems to be no difference aside from this.

At this point there are two (and a half) pieces of the puzzle left: forming m, and horizontally combining 8 results.

Forming `m`

If we would first form 8 copies of a byte of b and then try to grev those by their respective indices, that would be hard. But doing it the other way around is easy, broadcast b into each QWORD of a 512-bit vector, grev each QWORD by its index, then transpose the whole vector as an 8x8 matrix of bytes. Actually for constant-reuse (loading fewer humongous vector constants is rarely bad) and because of the built-in transpose-flip it turns out slightly better to transpose-flip that 8x8 matrix of bytes as well, that doesn't cost any more than just transposing it.

grev-ing each QWORD by its index is easy, perhaps easier than it sounds. A grev by 0..7 only rearranges the bits within each byte, which is easy to do with a single GF2P8AFFINEQB with a constant as the second operand (a "P" step).

An 8x8 transpose-flip is just a VPERMB by some index vector.^[3]

Combining the results

After the "middle" step, if we went with the "extra GF2P8AFFINEQB before the main GF2P8AFFINEQB"-route, we would have bmatxor(a, m) in each QWORD (with a different m per QWORD), which need to be combined. If we were implementing a plain old integer multiplication, the value in the QWORD with the index i would be shifted left by 8 * i and then all of them would be summed up. Since we're implementing a grevmul, that value is grev'ed by 8 * i (which is just some byte permutation) and the resulting QWORDs are XORed.

If we go with the "extra GF2P8AFFINEQB after the main GF2P8AFFINEQB"-route, which I chose, then there is a GF2P8AFFINEQB to do before we can start combining QWORDs. We really only need it to un-transpose the result of the "main" GF2P8AFFINEQB, the rest is just a byte-permutation and we're about to do a VPERMB anyway (if we choose the cool way of XORing the QWORDs together), but there is a neat opportunity here: if we re-use the same set of bit-matrices that was used to grev b by 0..7, in addition to the transpose that we wanted we also permute the bytes such that we grev each QWORD by 8 * i as a bonus.

Now we could just XOR-fold the vector 3 times to end up with the XOR of all eight QWORDs, and that would be a totally valid implementation, but it would also be boring. Alternatively, if we transpose the vector as an 8x8 matrix of bytes again (it can be a transpose-flip, so we get to reuse the same index vector as before), then the i'th byte of each QWORD would be gathered in the i'th QWORD of the transpose and bmatxor(0xFF, vector) would XOR together all bytes with the same index and give us a vector that has one byte of the final result per QWORD, easily extractable with VPMOVQB. We still don't have bmatxor though, we have bmatxor with an extra transpose-flip, which can be countered the usual way, with yet another GF2P8AFFINEQB.

As yet another alternative^[4], after similar transpose trickery bmatxor(vector, 0xFF) would also XOR together bytes with the same index but leave the result in a form that can be extracted with VPMOVB2M, which puts the result in a mask register but it's not too bad to move it from there to a GPR.

The code

In case anyone makes it this far, here is one possible embodiment^[5] of the algorithm described herein:

uint64_t grevmul_avx512(uint64_t a, uint64_t b)
{
    uint64_t id = 0x0102040810204080;
    __m512i grev_by_index = _mm512_setr_epi64(
        id,
        grev(id, 1),
        grev(id, 2),
        grev(id, 3),
        grev(id, 4),
        grev(id, 5),
        grev(id, 6),
        grev(id, 7));
    __m512i tp_flip = _mm512_setr_epi8(
        56, 48, 40, 32, 24, 16, 8, 0,
        57, 49, 41, 33, 25, 17, 9, 1,
        58, 50, 42, 34, 26, 18, 10, 2,
        59, 51, 43, 35, 27, 19, 11, 3,
        60, 52, 44, 36, 28, 20, 12, 4,
        61, 53, 45, 37, 29, 21, 13, 5,
        62, 54, 46, 38, 30, 22, 14, 6,
        63, 55, 47, 39, 31, 23, 15, 7);
    __m512i m = _mm512_set1_epi64(b);
    m = _mm512_gf2p8affine_epi64_epi8(m, grev_by_index, 0);
    m = _mm512_permutexvar_epi8(tp_flip, m);
    __m512i t512 = _mm512_gf2p8affine_epi64_epi8(m, _mm512_set1_epi64(a), 0);
    t512 = _mm512_gf2p8affine_epi64_epi8(grev_by_index, t512, 0);
    t512 = _mm512_permutexvar_epi8(tp_flip, t512);
    t512 = _mm512_gf2p8affine_epi64_epi8(_mm512_set1_epi64(0x8040201008040201), t512, 0);
    t512 = _mm512_gf2p8affine_epi64_epi8(t512, _mm512_set1_epi64(0xFF), 0);
    return _mm512_movepi8_mask(t512);
}

The end

As far as I know, no one really uses grevmul for anything, so being able to compute it somewhat efficiently (more efficiently than a naive scalar solution at least) is not immediately useful. On the other hand, if an operation is not known to be efficiently computable, that may preclude its use. But the point of this post is more to show something neat.

Originally I had found the sequence of _mm512_gf2p8affine_epi64_epi8(m, _mm512_set1_epi64(a), 0) and _mm512_gf2p8affine_epi64_epi8(grev_by_index, t512, 0) as a solution to grevmul-ing a QWORD by a (constant) byte, using a SAT solver (shout-out to togasat for being easy to add to a project - though admittedly I eventually bit the bullet and switched to MiniSAT). That formed the starting point of this investigation / puzzle-solving session. It may be possible to pull a more complete solution out of the void with a SAT/SMT based technique such as CEGIS, perhaps the bitwise-bilinear nature of grevmul can be exploited (I used the bitwise-linear nature of grevmul-by-a-constant in my SAT-experiments to represent the problem as a composition of matrices over GF(2)).

Almost half of the steps of this algorithm are some kind of transpose, which has also been the case with some other SIMD algorithms that I recently had a hand in. I used to think of a transpose as "not really doing anything", barely worth the notation when doing linear algebra, but maybe I was wrong.

[1] Maybe it's less "the name" and more "what I decided to call it". I'm not aware of any established name for this operation.
[2] This makes it sound more negative than it really is. The transpose-flip often needs to be worked around when we don't want it, but that's not that bad. Having no easy access to a transpose would be much worse to work around when we do need it. Separate bmatxor and bmattranspose instructions would have been nice.
[3] GF2P8AFFINEQB trickery is nice, but when I recently wrote some AVX2 code it was VPERMB that I missed the most.
[4] Can we stop with the alternatives and just pick something?
[5] No patents were read in the development of this algorithm, nor in the writing of this blog post.

Enumerating all mathematical identities (in fixed-size bitvector arithmetic with a restricted set of operations) of a certain size

2024-04-04T20:29:00.000-07:00

Once again a boring-but-specific title, I don't want to clickbait the audience after all. Even so, let's get some "disclaimers" out of the way.

"All" includes both boring and interesting identities. I tried to remove the most boring ones, so it's no longer truly all identities, but still a lot of boring ones remain. The way I see it, the biggest problem which the approach that I describe in this blog post has, is generating too much "true but boring" junk.
This approach is, as far as I know, absolutely limited to fixed-size bitvectors, but that's what I'm interested in anyway. To keep things reasonable, the size should be small, which does result in some differences with eg the arithmetic of 64-bit bitvectors. Most of the results either directly transfer to larger sizes, or generalize to larger sizes.
The set of operations is restricted to those that are cheap to implement in CNF SAT, that is not a hard limitation but a practical one.
"Of a certain size" means we have to pick in advance the number of operations on both sides of the mathematical identity, and then only identities with exactly that number of operations (counted in the underlying representation, which may represent a seemingly larger expression if the result of an operation is used more than once) are found. This can be repeated for any size we're interested in, but this approach is not very scalable and tends to run out of memory if there are too many identities of the requested size.

The results look something like this. Let's say we want "all" identities that involve 2 variables, 2 operations on the left, and 0 operations (ie only a variable) on the right. The result would be the following, keep in mind that a bunch of redundant and boring identities are filtered out.

(a - (a - b)) == b
(a + (b - a)) == b
((a + b) - a) == b
(b | (a & b)) == b
(a ^ (a ^ b)) == b
(b & (a | b)) == b
// Done in 0s. Used a set of 3 inputs.

Nothing too interesting so far, but then we didn't ask for much. Here are a couple of selected "more interesting" (but not by any means new or unknown) identities that this approach can also enumerate:

((a & b) + (a | b)) == (a + b)
((a | b) - (a & b)) == (a ^ b)
((a ^ b) | (a & b)) == (a | b)
((a & b) ^ (a | b)) == (a ^ b)
((~ a) - (~ b)) == (b - a)
(~ (b + (~ a))) == (a - b)

Now that the expectations have been set accurately (hopefully), let's get into what the approach is.

The Approach

The core mechanism I use is CounterExample-Guided Inductive Synthesis (CEGIS) based on a SAT solver. Glucose worked well, other solvers can be used. Rather than asking CEGIS to generate a snippet of code that performs some specific task however, I ask it to generate two snippets that are equivalent. That does not fundementally change how it operates, which is still a loop of:

Synthesize code that does the right thing for each input in the set of inputs to check.
Check whether the code matches the specification. If it does, we're done. If it doesn't, add the counter-example to the set of inputs to check.

Both synthesis and checking could be performed by a SAT solver, but I only use a SAT solver for synthesis. For checking, since 4-bit bitvectors have so few combinations, I just brute force every possible valuation of the variables.

When a pair of equivalent expressions has been found, I add its negation as a single clause to prevent the same thing from being synthesized again. This is what enables pulling out one identity after the other. In my imagination, that looks like Thor smashing his mug and asking for another.

Solving for programs may seem odd, the trick here is to represent a program as a sequence of instructions that are constructed out of boolean variables, the SAT solver is then invoked to solve for those variables.

The code is available on gitlab.

Original motivation

The examples I gave earlier only involve "normal" bitvector arithmetic. Originally what I set out to do is discover what sorts of mathematical identities are true in the context of trapping arithmetic (in which subtraction and addition trap on signed overflow), using the rule that two expressions are equivalent if and only if they have the same behaviour, in the following sense: for all valuations of the variables, the two expressions either yield the same value, or they both trap. That rule is also implemented in the linked source.

Many of the identities found in that context involve a trapping operation that can never actually trap. For example the trapping subtraction (the t-suffix in -t indicates that it is the trapping version of subtraction) in (b & (~ a)) == (b -t (a & b)) cannot trap (boring to prove so I won't bother). But the "infamous" (among who, maybe just me) (-t (-t (-t a))) == (-t a) is also enumerated and the sole remaining negation can trap but does so in exactly the same case as the original three-negations-in-a-row (namely when a is the bitvector with only its sign-bit set). Here is a small selection of nice identities that hold in trapping arithmetic:

((a & b) +t (a | b)) == (a +t b)
((a | b) -t (a & b)) == (a ^ b)
((~ b) -t (~ a)) == (a -t b)
(~ (b +t (~ a))) == (a -t b)
(a -t (a -t b)) == (a - (a -t b))  // note: one of the subtractions is a non-trapping subtraction

Future directions

A large source of boring identities is the fact that if f(x) == g(x), then also f(x) + x == g(x) + x and f(x) & x == g(x) & x and so on, which causes "small" identities to show up again as part of larger ones, without introducing any new information, and multiplied in myriad ways. If there was a good way to prevent them from being enumerated (it would have to be sufficiently easy to state in terms of CNF SAT clauses, to prevent slowing down the solver too much), or to summarize the full output, that could make the output of the enumeration more human-digestible.

There is a part 2 for this post.

The solutions to 𝚙𝚘𝚙𝚌𝚗𝚝(𝚡) < 𝚝𝚣𝚌𝚗𝚝(𝚡) and why there are Fibonacci[n] of them below 2ⁿ

2024-03-09T23:43:00.000-08:00

popcnt(x) < tzcnt(x) asks the question "does x have fewer set bits than it has trailing zeroes". It's a simple question with a simple answer, but cute enough to think about on a Sunday morning.[1]

Here are the solutions for 8 bits, in order: 0, 4, 8, 16, 24, 32, 40, 48, 64, 72, 80, 96, 112, 128, 136, 144, 160, 176, 192, 208, 224[2]

In case you find decimal hard to do read (as I do), here they are again in binary: 00000000, 00000100, 00001000, 00010000, 00011000, 00100000, 00101000, 00110000, 01000000, 01001000, 01010000, 01100000, 01110000, 10000000, 10001000, 10010000, 10100000, 10110000, 11000000, 11010000, 11100000

Simply staring at the values doesn't do much for me. To get a better handle on what's going on, let's recursively (de-)construct the set of n-bit solutions.

The most significant bit of an n-bit solution is either 0 or 1:

If it is 0, then that bit affects neither the popcnt nor the tzcnt so removing it must yield an (n-1)-bit solution.
If it is 1, then removing it along with the least significant bit (which must be zero, there are no odd solutions since their tzcnt would be zero) would decrease the both popcnt and the tzcnt by 1, yielding an (n-2)-bit solution.

This "deconstructive" recursion is slightly awkward. The constructive version would be: you can take the (n-1)-bit solutions and prepend a zero to them, and you can take the (n-2)-bit solutions and prepend a one and append a zero to them. However, it is less clear then (to me anyway) that those are the only n-bit solutions. The "deconstructive" version starts with all n-bit solutions and splits them into two obviously-disjoint groups, removing the possibility of solutions getting lost or being counted double.

The F(n) = F(n - 1) + F(n - 2) structure of the number of solutions is clear, but there are different sequences that follow that same recurrence that differ in their base cases. Here we have 1 solution for 1-bit integers (namely zero) and 1 solution for 2-bit integers (also zero), so the base cases are 1 and 1 as in the Fibonacci sequence.

This is probably all useless, and it's barely even bitmath.

[1] Or whenever, but it happens to be a Sunday morning for me right now.
[2] This sequence does not seem to be on the OEIS at the time of writing.

Partial sums of popcount

2024-01-17T03:06:00.000-08:00

The partial sums of popcount, aka A000788: Total number of 1's in binary expansions of 0, ..., n can be computed fairly efficiently with some mysterious code found through its OEIS entry (see the link Fast C++ function for computing a(n)): (reformatted slightly to reduce width)

unsigned A000788(unsigned n)
{
    unsigned v = 0;
    for (unsigned bit = 1; bit <= n; bit <<= 1)
        v += ((n>>1)&~(bit-1)) +
             ((n&bit) ? (n&((bit<<1)-1))-(bit-1) : 0);
    return v;
}

Knowing what we (or I, anyway) know from computing the partial sums of blsi and blsmsk, let's try to improve on that code. "Improve" is a vague goal, let's say we don't want to loop over the bits, but also not just unroll by 64x to do this for a 64-bit integer.

First let's split this thing into the sum of an easy problem and a harder problem, the easy problem being the sum of (n>>1)&~(bit-1) (reminder that ~(bit-1) == -bit, unsigned negation is safe, UB-free, and does exactly what we need, even on hypothetical non-two's-complement hardware). This is the same thing we saw in the partial sum of blsi, bit k of n occurs k times in the sum, which we can evaluate like this:

uint64_t v = 
    ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) +
    ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) +
    ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) +
    ((n & 0xFF00'FF00'FF00'FF00) << 2) +
    ((n & 0xFFFF'0000'FFFF'0000) << 3) +
    ((n & 0xFFFF'FFFF'0000'0000) << 4);

The harder problem, the contribution from ((n&bit) ? (n&((bit<<1)-1))-(bit-1) : 0), has a similar pattern but more annoying in three ways. Here's an example of the pattern, starting with n in the first row and listing the values being added together below the horizontal line:

00100011000111111100001010101111
--------------------------------
00000000000000000000000000000001
00000000000000000000000000000010
00000000000000000000000000000100
00000000000000000000000000001000
00000000000000000000000000010000
00000000000000000000000000110000
00000000000000000000000010110000
00000000000000000000001010110000
00000000000000000100001010110000
00000000000000001100001010110000
00000000000000011100001010110000
00000000000000111100001010110000
00000000000001111100001010110000
00000000000011111100001010110000
00000000000111111100001010110000
00000001000111111100001010110000
00000011000111111100001010110000

Some anomalous thing happens for the contiguous group of rightmost set bits.
The weights are based not on the column index, but sort of dynamic based on the number of set bits ...
... to the left of the bit we're looking at. That's significant, "to the right" would have been a lot nicer to deal with.

For problem 1, I'm just going to state without proof that we can add 1 to n and ignore the problem, as long as we add n & ~(n + 1) to the final sum. Problems 2 and 3 are more interesting. If we had problem 2 but counting the bits to the right of the bit we're looking at, that would have nice and easy, instead of (n & 0xAAAA'AAAA'AAAA'AAAA) we would have _pdep_u64(0xAAAA'AAAA'AAAA'AAAA, n), problem solved. If we had a "pdep but from left to right" (aka expand_left) named _pdepl_u64 we could have done this:

uint64_t u = 
    ((_pdepl_u64(0x5555'5555'5555'5555, m) >> shift) << 0) +
    ((_pdepl_u64(0x3333'3333'3333'3333, m) >> shift) << 1) +
    ((_pdepl_u64(0x0F0F'0F0F'0F0F'0F0F, m) >> shift) << 2) +
    ((_pdepl_u64(0x00FF'00FF'00FF'00FF, m) >> shift) << 3) +
    ((_pdepl_u64(0x0000'FFFF'0000'FFFF, m) >> shift) << 4) +
    ((_pdepl_u64(0x0000'0000'FFFF'FFFF, m) >> shift) << 5);

But as far as I know, ~~that requires bit-reversing the inputs~~ (see the update below) of a normal _pdep_u64 and bit-reversing the result, which is not so nice at least on current x64 hardware. Every ISA should have a Generalized Reverse operation like the grevi instruction which used to be in the drafts of the RISC-V Bitmanip Extension prior to version 1.

Update:

It turned out there is a reasonable way to implement _pdepl_u64(v, m) in plain scalar code after all, namely as _pdep_u64(v >> (std::popcount(~m) & 63), m). The & 63 isn't meaningful, it's just to prevent UB at the C++ level.

This approach turned out to be more efficient than the AVX512 approach, so that's obsolete now, but maybe still interesting to borrow ideas from. Here's the scalar implementation in full:

uint64_t _pdepl_u64(uint64_t v, uint64_t m)
{
    return _pdep_u64(v >> (std::popcount(~m) & 63), m);
}

uint64_t partialSumOfPopcnt(uint64_t n)
{
    uint64_t v =
        ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) +
        ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) +
        ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) +
        ((n & 0xFF00'FF00'FF00'FF00) << 2) +
        ((n & 0xFFFF'0000'FFFF'0000) << 3) +
        ((n & 0xFFFF'FFFF'0000'0000) << 4);
    uint64_t m = n + 1;
    int shift = std::countl_zero(m);
    m = m << shift;
    uint64_t u =
        ((_pdepl_u64(0x5555'5555'5555'5555, m) >> shift) << 0) +
        ((_pdepl_u64(0x3333'3333'3333'3333, m) >> shift) << 1) +
        ((_pdepl_u64(0x0F0F'0F0F'0F0F'0F0F, m) >> shift) << 2) +
        ((_pdepl_u64(0x00FF'00FF'00FF'00FF, m) >> shift) << 3) +
        ((_pdepl_u64(0x0000'FFFF'0000'FFFF, m) >> shift) << 4) +
        ((_pdepl_u64(0x0000'0000'FFFF'FFFF, m) >> shift) << 5);
    return u + (n & ~(n + 1)) + v;
}

Repeatedly calling _pdepl_u64 with the same mask creates some common-subexpressions, they could be manually factored out but compilers do that anyway, even MSVC only uses one actual popcnt instruction (but MSVC, annoyingly, actually performs the meaningless & 63).

Enter AVX512

Using AVX512, we could more easily reverse the bits of a 64-bit integer, there are various ways to do that. But just using that and then going back to scalar pdep would be a waste of a good opportunity to implement the whole thing in AVX512, pdep and all. The trick to doing a pdep in AVX512, if you have several 64-bit integers that you want to pdep with the same mask, is to transpose 8x 64-bit integers into 64x 8-bit integers, use vpexpandb, then transpose back. In this case the first operand of the pdep is a constant, so the first transpose is not necessary. We still have to reverse the mask though. Since vpexpandb takes the mask input in a mask register and we only have one thing to reverse, this trick to bit-permute integers seems like a better fit than Wunk's whole-vector bit-reversal or some variant thereof.

I sort of glossed over the fact that we're supposed to be bit-reversing relative to the most significant set bit in the mask, but that's easy to do by shifting left by std::countl_zero(m) and then doing a normal bit-reverse, so in the end it still comes down to a normal bit-reverse. The result of the pdeps have to be shifted right by the same amount to compensate.

Here's the whole thing: (note that this is less efficient than the updated approach without AVX512)

uint64_t partialSumOfPopcnt(uint64_t n)
{    
    uint64_t v = 
        ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) +
        ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) +
        ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) +
        ((n & 0xFF00'FF00'FF00'FF00) << 2) +
        ((n & 0xFFFF'0000'FFFF'0000) << 3) +
        ((n & 0xFFFF'FFFF'0000'0000) << 4);
    // 0..63
    __m512i weights = _mm512_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23,
        24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39,
        40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63);
    // 63..0
    __m512i rev = _mm512_set_epi8(
        0, 1, 2, 3, 4, 5, 6, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23,
        24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39,
        40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63);
    uint64_t m = n + 1;
    // bit-reverse the mask to implement expand-left
    int shift = std::countl_zero(m);
    m = (m << shift);
    __mmask64 revm = _mm512_bitshuffle_epi64_mask(_mm512_set1_epi64(m), rev);
    // the reversal of expand-right with reversed inputs is expand-left
    __m512i leftexpanded = _mm512_permutexvar_epi8(rev, 
        _mm512_mask_expand_epi8(_mm512_setzero_si512(), revm, weights));
    // transpose back to 8x 64-bit integers
    leftexpanded = Transpose64x8(leftexpanded);
    // compensate for having shifted m left
    __m512i masks = _mm512_srlv_epi64(leftexpanded, _mm512_set1_epi64(shift));
    // scale and sum results
    __m512i parts = _mm512_sllv_epi64(masks,
        _mm512_set_epi64(7, 6, 5, 4, 3, 2, 1, 0));
    __m256i parts2 = _mm256_add_epi64(
        _mm512_castsi512_si256(parts), 
        _mm512_extracti64x4_epi64(parts, 1));
    __m128i parts3 = _mm_add_epi64(
        _mm256_castsi256_si128(parts2), 
        _mm256_extracti64x2_epi64(parts2, 1));
    uint64_t u = 
        _mm_cvtsi128_si64(parts3) + 
        _mm_extract_epi64(parts3, 1);
    
    return u + (n & ~(n + 1)) + v;
}

As for Transpose64x8, you can find out how to implement that in Permuting bits with GF2P8AFFINEQB.

Permuting bits with GF2P8AFFINEQB

2023-09-23T16:33:00.005-07:00

It's no secret that GF2P8AFFINEQB can be tricky to think about, even in the restricted context of bit-permutations. Thinking about more than one step (such as more than one GF2P8AFFINEQB back-to-back, or GF2P8AFFINEQB flanked by byte-wise shuffles) is just too much. Or perhaps you can do it, tell me your secret.

A good way for mere mortals to reason about these kinds of permutations, I think, is to think in terms of the bits of the indices of the bits that are really being permuted. So we're 4 levels deep:

The value whose bits are being permuted.
The bits that are being permuted.
The indices of those bits.
The bits of those indices.

This can get a little confusing because a lot of the time the operation that will be performed on the bits of those indices is a permutation again, but they don't have to be, another classic example is that a rotation corresponds to add/subtracting a constant to the indices. Just keep in mind that we're 4 levels deep the entire time.

Actually we don't need to go deeper.

The building blocks

Assuming we have 512 bits to work with, the indices of those bits are 0..511: 9-bit numbers. We will split that into 3 groups of 3 bits, denoted a,b,c where a locates a QWORD in the 512-bit register, b locates a byte within that QWORD, and c locates a bit within that byte.

Here are some nice building blocks (given fairly arbitrary names):

P_f(a,b,c) = a,b,f(c) aka "right GF2P8AFFINEQB", where f is any mapping from a 3-bit integer to a 3-bit integer. This building block can be implemented with _mm512_gf2p8affine_epi64_epi8(input, _mm512_set1_epi64(f_as_a_reversed_matrix), 0)
Q_f(a,b,c) = a,f(c),~b aka "left GF2P8AFFINEQB", where ~b is a 3-bit inversion, equivalent to 7 - b. f can often be the identity mapping, swapping the second and third groups of bits is useful on its own (the "bonus" inversion can be annoying to deal with). This building block can be implemented with _mm512_gf2p8affine_epi64_epi8(_mm512_set1_epi64(f_as_a_matrix), input, 0)
S_g(a,b,c) = g(a,b),c aka Shuffle, where g is any mapping from a 6-bit integer to a 6-bit integer. This building block can be implemented with _mm512_permutexvar_epi8(g_as_an_array, input), but in some cases also with another instruction that you may prefer, depending on the mapping.

S, though it doesn't touch c, is quite powerful. As a couple of special cases that may be of interest, it can be used to swap a and b, invert a or b, or do a combined swap-and-invert.

We could further distinguish:

S64_f(a,b,c) = f(a),b,c aka VPERMQ. This building block can be implemented with, you guessed it, VPERMQ.
S8_f(a,b,c) = a,f(b),c aka PSHUFB. This building block can be implemented with, you guessed it, PSHUFB. PSHUFB allows a bit more freedom than is used here, the mapping could be from 4-bit integers to 4-bit integers, but that's not nice to think about in this framework of 3 groups of 3 bits.

Building something with the blocks

Let's say that we want to take a vector of 8 64-bit integers, and transpose it into a vector of 64 8-bit integers such that the k'th bit of the n'th uint64 ends up in the n'th bit of the k'th uint8. In terms of the bits of the indices of the bits (I swear it's not as confusing as it sounds) that means we want to build something that maps a,b,c to b,c,a. It's immediately clear that we need a Q operation at some point, since it's the only way to swap some other groups of bits into the 3rd position. But if we start with a Q, we get ~b in the 3rd position while we need a. We can solve that by starting with an S that swaps a and b while also inverting a (I'm not going to bother defining what that looks like in terms of an index mapping function, just imagine that those functions are whatever they need to be in order to make it work):

S_f(a,b,c) = b,~a,c
Q_id(b,~a,c) = b,c,a

Which translates into code like this:

__m512i Transpose8x64(__m512i x)
{
    x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
        56, 48, 40, 32, 24, 16, 8, 0,
        57, 49, 41, 33, 25, 17, 9, 1,
        58, 50, 42, 34, 26, 18, 10, 2,
        59, 51, 43, 35, 27, 19, 11, 3,
        60, 52, 44, 36, 28, 20, 12, 4,
        61, 53, 45, 37, 29, 21, 13, 5,
        62, 54, 46, 38, 30, 22, 14, 6,
        63, 55, 47, 39, 31, 23, 15, 7), x);
    __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
    x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
    return x;
}

Now let's say that we want to do the inverse of that, going back from b,c,a to a,b,c. Again it's clear that we need a Q, but we have some choice now. We could start by inverting the c in the middle first:

S8_f1(b,c,a) = b,~c,a
Q_id(b,~c,a) = b,a,c
S_f2(b,a,c) = a,b,c

Which translates into code like this:

__m512i Transpose64x8(__m512i x)
{
    x = _mm512_shuffle_epi8(x, _mm512_setr_epi8(
        7, 6, 5, 4, 3, 2, 1, 0,
        15, 14, 13, 12, 11, 10, 9, 8,
        23, 22, 21, 20, 19, 18, 17, 16,
        31, 30, 29, 28, 27, 26, 25, 24,
        39, 38, 37, 36, 35, 34, 33, 32,
        47, 46, 45, 44, 43, 42, 41, 40,
        55, 54, 53, 52, 51, 50, 49, 48,
        63, 62, 61, 60, 59, 58, 57, 56));
    __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
    x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
    x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
        0, 8, 16, 24, 32, 40, 48, 56,
        1, 9, 17, 25, 33, 41, 49, 57,
        2, 10, 18, 26, 34, 42, 50, 58,
        3, 11, 19, 27, 35, 43, 51, 59,
        4, 12, 20, 28, 36, 44, 52, 60,
        5, 13, 21, 29, 37, 45, 53, 61,
        6, 14, 22, 30, 38, 46, 54, 62,
        7, 15, 23, 31, 39, 47, 55, 63), x);
    return x;
}

Or we could start with a Q to get the a out of the third position, then use an S to swap the first and second positions and a P to invert c (in any order).

Q_id(b,c,a) = b,a,~c
S_f1(b,a,~c) = a,b,~c
P_f2(a,b,~c) = a,b,c

Which translates into code like this:

__m512i Transpose64x8(__m512i x)
{
    __m512i idmatrix = _mm512_set1_epi64(0x8040201008040201);
    x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
    x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
        0, 8, 16, 24, 32, 40, 48, 56,
        1, 9, 17, 25, 33, 41, 49, 57,
        2, 10, 18, 26, 34, 42, 50, 58,
        3, 11, 19, 27, 35, 43, 51, 59,
        4, 12, 20, 28, 36, 44, 52, 60,
        5, 13, 21, 29, 37, 45, 53, 61,
        6, 14, 22, 30, 38, 46, 54, 62,
        7, 15, 23, 31, 39, 47, 55, 63), x);
    x = _mm512_gf2p8affine_epi64_epi8(x, idmatrix, 0);
    return x;
}

I will probably keep using a SAT solver to solve the masks (using the same techniques as in (Not) transposing a 16x16 bitmatrix), but now at least I have a proper way to think about the shape of the solution, which makes it a lot easier to ask a SAT solver to fill in the specifics.

This framework could be extended with other bit-permutation operatations such as QWORD rotates, but that quickly becomes tricky to think about.

Propagating bounds through bitwise operations

2023-07-02T02:26:00.002-07:00

This post is meant as a replacement/recap of some work that I did over a decade ago on propagating bounds through bitwise operations, which was intended as an improvement over the implementations given in Hacker's Delight chapter 4, Arithmetic Bounds.

The goal is, given two variables x and y, with known bounds a ≤ x ≤ b, c ≤ y ≤ d, compute the bounds of x | y and of x & y. Thanks to De Morgan, we have the equations (most also listed in Hacker's Delight, except the last one)

minAND(a, b, c, d) = ~maxOR(~b, ~a, ~d, ~c)
maxAND(a, b, c, d) = ~minOR(~b, ~a, ~d, ~c)
minXOR(a, b, c, d) = minAND(a, b, ~d, ~c) | minAND(~b, ~a, c, d)
maxXOR(a, b, c, d) = maxOR(a, b, c, d) & ~minAND(a, b, c, d)

Everything can be written in terms of only minOR and maxOR and some basic operations.

`maxOR`

To compute the upper bound of the OR of x and y, what we need to do is find is the leftmost bit (henceforth the "target bit") such that it is both:

set in both b and d (the upper bounds of x and y) and,
changing an upper bound (either one of them, doesn't matter, but never both) by resetting the target bit and setting the bits that are less significant, keeps it greater-or-equal than the corresponding lower bound.

The explanation of why that works can be found in Hacker's Delight, along with a more of less direct transcription into code, but we can do better than a direct transcription.

Finding the leftmost bit that passes only the first condition would be easy, its the highest set bit in b & d. The second condition is a bit more complex to handle, but still surprisingly easy thanks to one simple observation: the bits that can pass it, are precisely those bits that are at (or to the right of) the leftmost bit where the upper and lower bound differ. Imagine two numbers in binary, one being the lower bound and the other the upper bound. The number have some equal prefix (possibly zero bits long, up to all bits) and then if they differ, they must differ by a bit in the upper bound being 1 while the corresponding bit in the lower bound is 0. Lowering the upper bound by resetting that bit while setting all bits the right of it, cannot make it lower than the lower bound.

For one of the inputs, say x, the position at which that second condition start being false (looking at that bit and to the left of it) can be computed directly with 64 - lzcnt(a ^ b). We actually need the maximum of that across both pairs of bounds, but there's no need to compute that for both bounds and then take the maximum, we can use this to let the lzcnt find the maximum automatically: 64 - lzcnt((a ^ b) | (c ^ d)).

bzhi(m, k) is an operation that resets the bits in m starting at index k. It can be emulated by shifting or masking, but an advantage of bzhi is that it is well defined for any relevant k, including when k is equal to the size of the integer in bits. bzhi is not strictly required here, but it is more convenient than "classic" bitwise operations, and available on most x64 processors today^[1]. Using bzhi, it's simple to take the position calculated in the previous paragraph and reset all the bits in b & d that do not pass the second condition: bzhi(b & d, 64 - lzcnt((a ^ b) | (c ^ d))).

With that bitmask in hand, all we need to do is apply it to one of the upper bounds. We can skip the "reset the target bit" part, since that bit will be set in the other upper bound and therefore also in the result. It also does not matter which upper bound is changed, regardless of which bound we were conceptually changing. Let's pick b for no particular reason. Then in total, the implementation could be:

uint64_t maxOR(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t index = 64 - _lzcnt_u64((a ^ b) | (c ^ d));
    uint64_t candidates = _bzhi_u64(b & d, index);
    if (candidates) {
        uint64_t target = highestSetBit(candidates);
        b |= target - 1;
    }
    return b | d;
}

For the highestSetBit function you can choose any way you like to isolate the highest set bit in an integer.

`minOR`

Computing the lower bound of x | y surprisingly seems to be more complex. The basic principles are similar, but this time bits are being reset in one of the lower bounds, and it does matter in which lower bound that happens. The computation of the mask of candidate bits also "splits" into separate candidates for each lower bound, unless there's some trick that I've missed. This whole "splitting" thing cannot be avoided by defining minOR in terms of maxAND either, because the same things happen there. But it's not too bad, a little bit of extra arithmetic. Anyway, let's see some code.

uint64_t minOR(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t candidatesa = _bzhi_u64(~a & c, 64 - _lzcnt_u64(a ^ b));
    uint64_t candidatesc = _bzhi_u64(a & ~c, 64 - _lzcnt_u64(c ^ d));
    uint64_t target = highestSetBit(candidatesa | candidatesc);
    if (a & target) {
        c &= -target;
    }
    if (c & target) {
        a &= -target;
    }
    return a | c;
}

A Fun Fact here is that the target bit cannot be set in both bounds, opposite to what happens in maxOR where the target bit is always set in both bounds. You may be tempted to turn the second if into else if, but in my tests it was quite important that the ifs are compiled into conditional moves rather than branches (which of the lower bounds the target bit is found in was essentially random), and using else if here apparently discourages compilers (MSVC at least) from using conditional moves.

candidatesa | candidatesc can be zero, although that is very rare, at least in my usage of the function. As written, the code assumes that highestSetBit deals with that gracefully by returning zero if its input is zero. Branching here is (unlike in the two ifs at the end of minOR) not a big deal since this case is so rare (and therefore predictable).

Conclusion

In casually benchmarking these functions, I found them to be a bit faster than the ones that I came up with over a decade ago, and significantly faster than the ones from Hacker's Delight. That basic conclusion probably translates to different scenarios, but the exact ratios will vary a lot based on how predictable the branches are in that case, on your CPU, and on arbitrary codegen decisions made by your compiler.

In any case these new versions look nicer to me.

There are probably much simpler solutions if the bounds were stored in bit-reversed form, but that doesn't seem convenient.

Someone on a certain link aggregation site asked about signed integers. As Hacker's Delight explains via a table, things can go wrong if one (or both) bounds cross the negative/positive boundary - but the solution in those cases is still easy to compute. The way I see it, the basic problem is that a signed bound that crosses the negative/positive boundary effectively encodes two different unsigned intervals, one starting at zero and one ending at the greatest unsigned integer, and the basic unsigned minOR and so on cannot (by themselves) handle those "split" intervals.

[1] Sadly not all, some low-end Intel processors have AVX disabled, which apparently is done by disabling the entire VEX encoding and it takes out BMI2 as collateral damage.

Some ways to check whether an unsigned sum wraps

2023-06-12T09:02:00.002-07:00

When computing x + y, does the sum wrap? There are various ways to find out, some of them well known, some less. Some of these are probably totally unknown, in some cases deservedly so.

This is not meant to be an exhaustive list.

~x < y

A cute trick in case you don't want to compute the sum, for whatever reason.

Basically a variation of how MISRA C recommends checking for wrapping if you use the precondition test since ~x = UINT_MAX - x.
(x + y) < x

(x + y) < y

The Classic™. Useful for being cheap to compute: x + y is often needed anyway, in which case this effectively only costs a comparison. Also recommended by MISRA C, in case you want to express the "has the addition wrapped"-test as a postcondition test.
avg_down(x, y) <s 0

avg_down(x, y) & 0x80000000 // adjust to integer size

<s is signed-less-than. Performed on unsigned integers here, so be it.

avg_down is the unsigned average rounded down. avg_up is the unsigned average rounded up.

Since avg_down is the sum (the full sum, without wrapping) shifted right by 1, what would have been the carry out of the top of the sum becomes the top bit of the average. So, checking the top bit of avg_down(x, y) is equivalent to checking the carry out of x + y.

Can be converted into avg_up(~x, ~y) >=s 0 through the equivalence avg_down(x, y) = ~avg_up(~x, ~y).
(x + y) < min(x, y)

(x + y) < max(x, y)

(x + y) < avg(x, y)

(x + y) < (x | y)

(x + y) < (x & y)

~(x | y) < (x & y)

Variants of The Classic™. They all work for essentially the same reason: addition is commutative, including at the bit-level. So if we have (x + y) < x, then we also have (x + y) < y, and together they imply that instead of putting x or y on the right hand side of the comparison, we could arbitrarily select one of them, or anything between them too. Bit-level commutativity takes care of the bottom three variants in a similar way.

Wait, is that a signed or unsigned min? Does avg round up or down or either way depending on the phase of the moon? It doesn't matter, all of those variants work and more.
(x + y) != addus(x, y)

(x + y) < addus(x, y)

addus is addition with unsigned saturation, meaning that instead of wrapping the result would be UINT_MAX.

When are normal addition and addition with unsigned saturation different? Precisely when one wraps and the other saturates. Wrapping addition cannot "wrap all the way back" to UINT_MAX, the highest result when the addition wraps is UINT_MAX + UINT_MAX = UINT_MAX - 1.

When the normal sum and saturating sum are different, the normal sum must be the smaller of the two (it certainly couldn't be greater than UINT_MAX), hence the second variant.
subus(y, ~x) != 0

subus(x, ~y) != 0

addus(~x, ~y) != UINT_MAX

subus is subtraction with unsigned saturation.

Strange variants of ~x < y. Since subus(a, b) will be zero when a <= b, it will be non-zero when b < a, therefore subus(y, ~x) != 0 is equivalent to ~x < y.

subus(a, b) = ~addus(~a, b) lets us turn the subus variant into the addus variant.
(x + y) < subus(y, ~x)

Looks like a cursed hybrid of (x + y) < avg(x, y) and subus(y, ~x) != 0, but the mechanism is (at least the way I see it) different from both of them.

subus(y, ~x) will be zero when ~x >= y, which is exactly when the sum x + y would not wrap. x + y certainly cannot be unsigned-less-than zero, so overall the condition (x + y) < subus(y, ~x) must be false (which is good, it's supposed to be false when x + y would not wrap).

In the other case, when ~x < y, we know that x + y will wrap and subus(y, ~x) won't be zero (and therefore cannot saturate). Perhaps there is a nicer way to show what happens, but at least under those conditions (predictable wrapping and no saturation) it is easy to do algebra:
- (x + y) < subus(y, ~x)
- x + y - 2^k < y - (2^k - 1 - x)
- x + y - 2^k < y - 2^k + 1 + x
- x + y < y + 1 + x
- 0 < 1
So the overall condition (x + y) < subus(y, ~x) is true IFF x + y wraps.
~x < avg_up(~x, y)

Similar to ~x < y, but stranger. Averaging y with ~x cannot take a low y to above ~x, nor a high y to below ~x. The direction of rounding is important: avg_down(~x, y) could take an y that's just one higher than ~x down to ~x itself, making it no longer higher than ~x. avg_up(~x, y) cannot do that thanks to rounding up.

grevmul

2023-05-22T02:48:00.004-07:00

grev (generalized bit-reverse) is an operation that implements bit-permutations corresponding to XOR-ing the indices by some value. It has been proposed to be part of the Zbp extension of RISC-V, with this reference implementation (source: release v0.93)

uint32_t grev32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 31;
    if (shamt &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (shamt &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (shamt &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (shamt &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (shamt & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}

grev looks in some ways similar to bit-shifts and rotates: the left and right operands have distinct roles with the right operand being a mask of k bits if the left operand has 2^k bits^[1].

Carry-less multiplication normally has a left-shift in it, grevmul is what you get when that left-shift is replaced with grev.

uint32_t grevmul32(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int k = 0; k < 32; k++) {
        if (y & (1 << k))
            r ^= grev32(x, k);
    }
    return x;
}

grevmul is, at its core, very similar to clmul: take single-bit products (logical AND) of every bit of the left operand with every bit of the right operand, then do some XOR-reduction. The difference is in which partial products are grouped together. For clmul, the partial products that contribute to bit k of the result are pairs with indices i,j such that i + j = k. For grevmul, it's the pairs with indices such that i ^ j = k. This goes back to grev permuting the bits by XOR-ing their indices by some value, and that value is k here.

Now that grevmul has been defined, let's look at some of its properties, comparing it to clmul and plain old imul.

	grevmul	clmul	imul
zero^[2]	0	0	0
identity	1	1	1
commutative	yes	yes	yes
associative	yes	yes	yes
distributes over	xor	xor	addition
op(x, 1 << k) is	grev(x, k)	x << k	x << k
x has inverse if	popcnt(x) & 1	x & 1	x & 1
op(x, x) is	popcnt(x) & 1	pdep(x, 0x55555555)	x²

What is the "grevmul inverse" of `x`?

Time for some algebra. Looking just at the table above, and forgetting the actual definition of grevmul, can we say something about the solutions of grevmul(x, y) == 1? Surprisingly, yes.

Assuming we have some x with odd hamming weight (numbers with even hamming weight do not have inverses, so let's ignore them for now), we know that grevmul(x, x) == 1. The inverse in a monoid is unique so x is not just some inverse of x, it is the (unique) inverse of x.

Since the "addition operator" is XOR (for which negation is the identity function), this is a non-trivial example of a ring in which x = -x = x^-1, when x^-1 exists. Strange, isn't it?

We also have that f(x) = grevmul(x, c) (for appropriate choices of c) is a (non-trivial) involution, so it may be a contenter for the "middle operation" of an involutary bit finalizer, but probably useless without an efficient implementation.

~~I was going to write about implementing grevmul by an 8-bit constant with two GF2P8AFFINEQBs but I've had enough for now, maybe later.~~ E: see Implementing grevmul with GF2P8AFFINEQB where I went ahead and implemented the whole thing, not only the "multiply by 8-bit constant" case.

[1] The right operand of a shift is often called the shift count, but it can also be interpreted as a mask indicating some subset of shift-by-2ⁱ operations to perform. That interpretation is useful for example when implementing a shift-by-variable operation on a machine that only has a shift-by-constant instruction, following the same pattern as the reference implementation of grev32.

[2] This looks like a joke, but I mean that the numeric value 0 acts as the zero element of the corresponding semigroup.

(Not) transposing a 16x16 bitmatrix

2023-04-12T17:11:00.006-07:00

Inverting a 16-element permutation may done like this:

for (int i = 0; i < 16; i++)
    inv[perm[i]] = i;

Computing a histogram of 16 nibbles may done like this:

for (int i = 0; i < 16; i++)
    hist[data[i]] += 1;

These different-sounding but already similar-looking tasks have something in common: they can be both be built around a 16x16 bitmatrix transpose. That sounds silly, why would anyone want to first construct a 16x16 bitmatrix, transpose it, and then do yet more processing to turn the resulting bitmatrix back into an array of numbers?

Because it turns out to be an efficiently-implementable operation, on some modern processors anyway.

If you know anything about the off-label application of GF2P8AFFINEQB, you may already suspect that it will be involved somehow (merely left-GF2P8AFFINEQB-ing by the identity matrix already results in some sort of 8x8 transpose, just horizontally mirrored), and it will be, but that's not the whole story.

First I will show not only how to do it with GF2P8AFFINEQB, but also how to find that solution programmatically using a SAT solver. There is nothing that fundamentally prevents a human from finding a solution by hand, but it seems difficult. Using a SAT solver to find a solution ex nihilo (requiring it to find both a sequence of instructions and their operands) is not that easy either (though that technique also exists). Thankfully, Geoff Langdale suggested a promising sequence of instructions:

nibble value, so now we have 16 bitfields with 1 bit set.

At this point, what we really want is a 16x16 transpose, not 4 8x8 transposes, but I'm pretty sure we can fake it by using VPERMB to redistribute our bytes (probably first grouping all top 8 bytes into the first 128b ...
— Geoff Langdale (@geofflangdale) January 7, 2023

The problem we have now (which the SAT solver will solve) is, under the constraint that for all X, f(X) = PERMB(GF2P8AFFINE(B, PERMB(X, A)), C) computes the transpose of X, what is a possible valuation of the variables A, B, C. Note that the variables in the SAT problem correspond to constants in the resulting code, and the variable in the resulting code (X) is quantified out of the problem.

If you know a bit about SAT solving, that "for all X" sounds like trouble, requiring either creating a set of constraints for every possible value of X (henceforth, concrete values of X will be known as "examples"), or some advanced technique such as CEGIS to dynamically discover a smaller set of examples to base the constraints on. Luckily, since we are dealing with a bit-permutation, there are simple and small sets of examples that together sufficiently constrain the problem. For a 16-bit permutation, this set of values could be used:

1010101010101010
1100110011001100
1111000011110000
1111111100000000

For a 256-bit permutation, a similar pattern can be used, where each of the examples has 256 bits and there would be 8 of them. Note that if you read the columns of the values, they list out the indices of the corresponding columns, which is no coincidence. Using that set of examples to constrain the problem with, essentially means that we assert that f when applied to the sequence 0..n-1 must result in the desired permutation. The way that I actually implemented this puts a column into one "abstract bit", so that it represents the index of the bit all in one place instead of spread out.

Implementing a "left GF2P8AFFINEQB" (multiplying a constant matrix on the left by a variable matrix on the right) in CNF, operating on "abstract bits" (8 variables each), is relatively straight forward. Every (abstract) bit of the result is the XOR of the AND of some (abstract) bits, writing that down is mostly a chore, but there is one interesting aspect: the XOR can be turned into an OR, since we know that we're multiplying by a permutation matrix. In CNF, OR is simpler than XOR, and easier for the solver to reason through.

VPERMB is more difficult to implement, given that the permutation operand is a variable (if it was a constant, we could just permute the abstract bits without generating any new constraints). To make it easier, I represent the permutation operand as a 32x32 permutation matrix, letting me create a bunch of simple ternary constraints of the form (¬P(i, j) ∨ ¬A(j) ∨ R(i)) ∧ (¬P(i, j) ∨ A(j) ∨ ¬R(i)) (read: if P(i, j), then A(j) must be equal to R(i)). The same thing can be used to implement VPSHUFB, with additional constraints on the permutation matrix (to prevent cross-slice movement).

Running that code, at least on my PC at this time^[1], results in (with some whitespace manually added):

__m256i t0 = _mm256_permutexvar_epi8(_mm256_setr_epi8(
    14, 12, 10, 8, 6, 4, 2, 0,
    30, 28, 26, 24, 22, 20, 18, 16,
    15, 13, 11, 9, 7, 5, 3, 1,
    31, 29, 27, 25, 23, 21, 19, 17), input);
__m256i t1 = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x1080084004200201), t0, 0);
__m256i t2 = _mm256_shuffle_epi8(t1, _mm256_setr_epi8(
    0, 8, 1, 9, 3, 11, 5, 13,
    7, 15, 2, 10, 4, 12, 6, 14,
    0, 8, 1, 9, 3, 11, 5, 13,
    7, 15, 2, 10, 4, 12, 6, 14));

So that's it. That's the answer^[2]. If you want to transpose a 16x16 bitmatrix, on a modern PC (this code requires AVX512_VBMI and AVX512_GFNI^[3]), it's fairly easy and cheap, it's just not so easy to find this solution to begin with.

Using this transpose to invert a 16-element permutation is pretty easy, for example using _mm256_sllv_epi16 to construct the matrix and _mm256_popcnt_epi16(_mm256_sub_epi16(t2, _mm256_set1_epi16(1))) (sadly there is no SIMD version of TZCNT .. yet) to convert the bit-masks back into indices. It may be tempting to try to use a mirrored matrix and leading-zero count, which AVX512 does offer, but it only offers the DWORD and QWORD versions VPLZCNTD/Q.

Making a histogram is even simpler, using only _mm256_popcnt_epi16(t2) to convert the matrix into counts.

And for my next trick, I will now not transpose the matrix

What if we didn't transpose that matrix. Does that even make sense? Well, at least for the two applications that I focused on, what we really need is not so much the transpose of the matrix, but any matrix such that:

Every bit of the original matrix occurs exactly once in the result.
Each row of the result contains all bits from a particular column.
The permutation within each row is "regular" enough that we can work with it. We don't need this when making a histogram (as Geoff already noted in one of his tweets).

There is no particular requirement on the order of the rows, any row-permutation we end up with is easy to undo.

The first two constraints leave plenty of options open, but the last constraint is quite vague. Too vague for me to do something such as searching for the best not-quite-transpose, so I don't promise to have found it. But here is a solution: rotate every row by its index, then rotate every column by its index.

At least, that's the starting point. Rotating the columns requires 3 rounds of blending a vector with cross-slice-permuted copy of that vector, and a VPERMQ sandwiched by two VPSHUFBs to rotate the last 8 columns by 8. That's a lot of cross-slice permuting, most of it can be avoided by modifying the overall permutation slightly:

Exchange the off-diagonal quadrants.
Rotate each row by its index.
For each quadrant individually, rotate each column by its index.

Here is some attempt at illustrating that process, feel free to skip past it

These three steps are implementable in AVX2:

Exchanging the off-diagonal quadrants can be done by gathering the quadrants into QWORDs, permuting them, and shuffling the QWORDs back into quadrants.
Rotating the rows can be done with VPMULLW (used as a variable shift-left), VPMULHUW (used as a variable shift-right), and VPOR.
Rotating the columns can be done by conditionally rotating the columns with odd indices by 1, conditionally rotating the columns that have the second bit of their index set by 2, and conditionally rotating the columns that have the third bit of their index set by 4. The rotations can be done using VPALIGNR^[4], the conditionality can be implemented with blending, but since this needs to be bit-granular blend, it cannot be performed using VPBLENDVB.

In total, here is how I don't transpose a 16x16 matrix with AVX2, hopefully there is a better way:

__m256i nottranspose16x16(__m256i x)
{
    // exchange off-diagonal quadrants
    x = _mm256_shuffle_epi8(x, _mm256_setr_epi8(
        0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15,
        0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15));
    x = _mm256_permute4x64_epi64(x, _MM_SHUFFLE(3, 1, 2, 0));
    x = _mm256_shuffle_epi8(x, _mm256_setr_epi8(
        0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15,
        0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15));
    // rotate every row by its y coordinate
    __m256i shifts = _mm256_setr_epi16(
        1 << 0, 1 << 1, 1 << 2, 1 << 3,
        1 << 4, 1 << 5, 1 << 6, 1 << 7,
        1 << 8, 1 << 9, 1 << 10, 1 << 11,
        1 << 12, 1 << 13, 1 << 14, 1 << 15);
    __m256i sll = _mm256_mullo_epi16(x, shifts);
    __m256i srl = _mm256_mulhi_epu16(x, shifts);
    x = _mm256_or_si256(sll, srl);
    // within each quadrant independently, 
    // rotate every column by its x coordinate
    __m256i x0, x1, m;
    // rotate by 4
    m = _mm256_set1_epi8(0x0F);
    x0 = _mm256_and_si256(x, m);
    x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 8));
    x = _mm256_or_si256(x0, x1);
    // rotate by 2
    m = _mm256_set1_epi8(0x33);
    x0 = _mm256_and_si256(x, m);
    x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 4));
    x = _mm256_or_si256(x0, x1);
    // rotate by 1
    m = _mm256_set1_epi8(0x55);
    x0 = _mm256_and_si256(x, m);
    x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 2));
    x = _mm256_or_si256(x0, x1);
    return x;
}

Using that not-transpose to invert a 16-element permutation takes some extra steps that, without AVX512, are about as annoying as not-transposing the matrix was.

Constructing the matrix is more difficult. AVX2 has shift-by-variable, but not for 16-bit element.^[5] There are various work-arounds, such as using DWORDs and then narrowing, of course (boring). Another (funnier) option is to duplicate every byte, add 0xF878 to every word, then use VPSHUFB in lookup-table-mode to index into a table of powers of two. Having added 0x78 to every low byte of every word, that byte will mapped to zero if it was 8 or higher, or otherwise two to the power of that byte. The high byte, having 0xF8 added to it, will be mapped to 0 if it was below 8, or otherwise to two to the power of that byte minus 8. As wild as that sounds, it is pretty fast, costing only 5 cheap instructions (whereas widening to DWORDs, shifting, and narrowing, would be worse than it sounds). Perhaps there is a better way.
Converting masks back into indices is more difficult due to the lack of trailing zero count, leading zero count, or even popcount. What AVX2 does have, is .. VPSHUFB again. We can multiply by an order-4 de Bruijn sequence and use VPSHUFB to map the results to the indices of the set bits.
Then we have indices, but since the rows and columns were somewhat arbitrarily permuted, they must still be mapped back into something that makes sense. Fortunately that's no big deal, a modular subtraction (or addition, same thing really) cancels out the row-rotations, and yet another VPSHUFB cancels out the strange order that the rows are in. Fun detail: the constants that are subtracted and the permutation are both 0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9.

All put together:

void invert_permutation_avx2(uint8_t *p, uint8_t *inv)
{
    __m256i v = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i*)p));
    // indexes to masks
    v = _mm256_or_si256(v, _mm256_slli_epi64(v, 8));
    v = _mm256_add_epi8(v, _mm256_set1_epi16(0xF878));
    __m256i m = _mm256_shuffle_epi8(_mm256_setr_epi8(
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128), v);
    // ???
    m = nottranspose16x16(m);
    // masks to indexes
    __m256i deBruijn = _mm256_and_si256(_mm256_mulhi_epu16(m, _mm256_set1_epi16(0x9AF0)), _mm256_set1_epi16(0x000F));
    __m128i i = _mm_packs_epi16(_mm256_castsi256_si128(deBruijn), _mm256_extracti128_si256(deBruijn, 1));
    i = _mm_shuffle_epi8(_mm_setr_epi8(
        0, 1, 2, 5, 3, 9, 6, 11, 15, 4, 8, 10, 14, 7, 13, 12), i);
    // un-mess-up the indexes
    i = _mm_sub_epi8(i, _mm_setr_epi8(0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9));
    i = _mm_and_si128(i, _mm_set1_epi8(0x0F));
    i = _mm_shuffle_epi8(i, _mm_setr_epi8(0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9));
    _mm_storeu_si128((__m128i*)inv, i);
}

To make a histogram, emulate VPOPCNTW using, you guessed it, PSHUFB.

The end

This post is, I think, one of the many examples of how AVX512 can be an enormous improvement compared to AVX2 even when not using 512-bit vectors. Every step of every problem had a simple solution in AVX512 (even if it was not always easy to find it). With AVX2, everything felt "only barely possible".

"As complicated as it is, is this actually faster than scalar code?" Yes actually, but feel free to benchmark it yourself. The AVX2 version being somewhat more efficient than scalar code is not really the point of this post anyway. The AVX512 version is nice and efficient, I'm showing an AVX2 version mostly to show how hard it is to create it.^[6]

Transposing larger matrices with AVX512 can be done by first doing some quadrant-swapping (also used at the start of the not-transpose) until the bits that need to end up together in one 512-bit block are all in there, and then a VPERMB, VGF2P8AFFINEQB, VPERMB sequence with the right constants (which can be found using the techniques that I described) can put the bits in their final positions. But well, I already did that, so there you go.

A proper transpose can be done in AVX2 of course, for example using 4 rounds of quadrant-swapping. Implementations of that already exist so I thought that would be boring to talk about, but there is an interesting aspect to that technique that is often not mentioned: every round of quadrant-swapping can be seen as exchanging two bits of the indices. Swapping the big 8x8 quadrants swaps bits 3 and 7 of the indices, transposing the 2x2 submatrices swaps bits 0 and 4 of the indices. From that point of view, it's easy to see that the order in which the four steps are performed does not matter - no matter the order, the lower nibble of the index is swapped with the higher nibble of the index.

[1] While MiniSAT (which this program uses as its SAT solver) is a "deterministic solver" in the sense of definitely finding a satifying valuation if there is one, it is not deterministic in the sense of guaranteeing that the same satisfying valuation is found every time the solver is run on the same input.

[2] Not the unique answer, there are multiple solutions.

[3] But not 512-bit vectors.

[4] Nice! It's not common to see a 256-bit VPALIGNR being useful, due to it not being the natural widening of 128-bit PALIGNR, but acting more like two PALIGNRs side-by-side (with the same shifting distance).

[5] Intel, why do you keep doing this.

[6] Also as an excuse to use PSHUFB for everything.

Column-oriented row reduction

2023-04-09T16:54:00.003-07:00

In mathematics, Gaussian elimination, also known as row reduction, is an algorithm for solving systems of linear equations.
-- The Wikipedia entry of Gaussian elimination

The usual implementation of row reduction of a bit-matrix^[1] takes an array of bit-packed rows, and applies row operations to the matrix in a row-wise manner. As a reminder, it works something like this:

Find a row with a non-zero element^[2] in the current column.
Swap it into position.
XOR that row into any other row that has a non-zero element in the current column.
Go to step 1 for the next column, if there is one.

For the application that I had mind when writing this blog post^[3], step 2 is not necessary and will be skipped.^[4]

Note that at the single-bit-level, the XOR-logic of step 3 comes down to if M(i, j) and M(j, k) then flip M(i, k). That is, in order for the XOR to flip M(i, k), the corresponding row must have a 1 in the current column (condition A), and the row that was selected to pivot on must have a 1 in the column k (condition B).

Turning it into a column-oriented algorithm

In the row-oriented algorithm, condition A is handled by conditionally XOR-ing into only those rows that have a 1 in the current column, and condition B is handled by the XOR itself (which is a conditional bit-flip afterall). A column-oriented algorithm would do it the other way around, using XOR to implement condition B, and skipping columns for which condition A is false:

Find a row that hasn't already been pivoted on, with a non-zero element in the current column.
Don't swap it into position.
XOR each column that has a 1 in the row that we're pivoting on, with a mask formed by taking the current column and resetting the bit that corresponds to the pivot row.
If there are any rows left that haven't been pivoted on, go to step 1 for the next column.

Step 1 may sound somewhat annoying, but it is really simple: AND the current column with a mask of the rows that have already been pivoted on, had, and use the old Isolate Lowest Set Bit^[5] trick x & -x to extract a mask corresponding to the first row that passes both conditions.

Here is an example of what an implementation might look like (shown in C#, or maybe this is Java code with a couple of capitalization errors)

static void reduceRows(long[] M)
{
    long had = 0;
    for (int i = 0; i < M.Length; i++)
    {
        long r = lowestSetBit(M[i] & ~had);
        if (r == 0)
            continue;
        long mask = M[i] & ~r;
        for (int j = 0; j < M.Length; j++)
        {
            if ((M[j] & r) != 0)
                M[j] ^= mask;
        }
        had |= r;
    }
}

static long lowestSetBit(long x)
{
    return x & -x;
}

The stuff that goes at the end of the post

While the column-oriented approach makes the step "search for a row that has a 1 in the current column" easy, transposing a matrix just to be able to do this does not seem worthwhile to me.

Despite the if inside the inner loops of both the row-oriented and the column-oriented algorithm, both of them can be accellerated using SIMD. The conditional code can easily be rewritten into arithmetic-based masking^[6] or "real" masked operations (in eg AVX512). "But don't compilers autovectorize these days?" Yes they do, but not always in a good way, note that both GCC and Clang both used a masked store, which is worse than writing back some unchanged values with a normal store (especially on AMD processors, and also especially considering that the stored values will be loaded again soon). Rumour has it that there are Reasons™ for that quirk, but that doesn't improve the outcome.

[1] Of course this blog post is not about floating point matrices, this is the bitmath blog.

[2] Sometimes known as 1 (one).

[3] XOR-clause simplification

[4] Row-swaps are particularly expensive to apply to a column-major bit-matrix, so perhaps this is just an excuse to justify the column-based approach. You decide.

[5] AKA BLSI, which AMD describes as "Isolate Lowest Set Bit", which makes sense, because well, that is what the operation does. Meanwhile Intel writes "Extract Lowest Set Isolated Bit" (isolated bit? what?) in their manuals.

[6] Arithmetic-based masking can also be used in scalar code.

Weighted popcnt

2023-01-03T08:02:00.002-08:00

Since I showed the code below on Twitter, and some people understood it and some didn't, I suppose I should explain how it works and what the more general technique is.

// sum of indexes of set bits
int A073642(uint64_t n)
{
    return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
          (__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);
}

A sum of several multi-digit numbers, such as 12 + 34, can be rewritten to make the place-value notation explicit, giving 1*10 + 2*1 + 3*10 + 4*1. Then the distributive property of multiplication can be used to group digits of equal place-value together, giving (1 + 3)*10 + (2 + 4)*1. I don't bring up something so basic as an insult to the reader, the basis of that possibly-intimidating piece of code really is this basic.

The masks 0xAAAAAAAAAAAAAAAA, 0xCCCCCCCCCCCCCCCC, etc, represent nothing more than the numbers 0..63, each written vertically in binary, starting with the least significant bit. Each column of bits contains its own index. AND-ing the masks with n leaves, in each column, either zero (if the corresponding bit of n was zero) or the index of that column.

The first popcnt then adds together all the digits of the "filtered" indexes with a place-value of 1, the second popcnt adds together the digits with a place-value of 2, and so on. The result of each popcnt is shifted to match the corresponding place-value, and all of them are added together.

Generalizing the technique

The numbers in the columns did not have to be 0..63, the columns can contain arbitrary numbers, even negative numbers. This gives some sort of weighted popcnt: rather than each bit having a weight of 1, we are free to choose an arbitrary weight for each bit.

In general, starting with some arbitrary numbers, interpret them as a binary matrix. Transpose that matrix. Every row of the transposed matrix is the mask that goes into the corresponding popcnt-step.

Some simplifications may be possible:

If a row of the matrix is zero, the corresponding step can be left out.
If two or more rows are equal, their popcnt-steps can be merged into one with a weight that is the sum of the weights of the steps that are merged.
If a row is a linear combination of some set of other rows, the popcnt corresponding to that row can be computed as a linear combination of the popcnts corresponding to those other rows.
If a row has exactly one bit set, its step can be implemented with an AND and a shift, without popcnt, by shifting the bit to its target position.
If every row has exactly one bit set, we're really dealing with a bit-permutation and there are better ways to implement them.

For example, let's construct "the sum of squares of indexes of set bits" (with 1-indexed indexes) aka oeis.org/A003995. Transposing the squares 1,2,4,9..4096 gives these masks:

0x5555555555555555
0x0000000000000000
0x2222222222222222
0x1414141414141414
0x0d580d580d580d58
0x0335566003355660
0x00f332d555a66780
0x555a5b6666387800
0x66639c78783f8000
0x787c1f807fc00000
0x7f801fff80000000
0x7fffe00000000000
0x8000000000000000

The second mask is zero because a square cannot be congruent to 2 or 3 modulo 4. The last mask has exactly one bit set, so its step can be implemented without popcnt. Putting it together:

int A003995(uint64_t n)
{
    return __popcnt64(n & 0x5555555555555555) +
          (__popcnt64(n & 0x2222222222222222) << 2) +
          (__popcnt64(n & 0x1414141414141414) << 3) +
          (__popcnt64(n & 0x0d580d580d580d58) << 4) +
          (__popcnt64(n & 0x0335566003355660) << 5) +
          (__popcnt64(n & 0x00f332d555a66780) << 6) +
          (__popcnt64(n & 0x555a5b6666387800) << 7) +
          (__popcnt64(n & 0x66639c78783f8000) << 8) +
          (__popcnt64(n & 0x787c1f807fc00000) << 9) +
          (__popcnt64(n & 0x7f801fff80000000) << 10) +
          (__popcnt64(n & 0x7fffe00000000000) << 11) +
          ((n & 0x8000000000000000) >> 51);
}

Aside from implementing strange sequences from the OEIS, possible uses of this technique may include

Summing the face-values of a hand of cards. Unfortunately this cannot take "special combinations" into account, unlike other techniques.
A basic evaluation of a board state in Reversi/Othello, based on weights for each captured square that differ by their position.
Determining the price for a pizza based on a set of selected toppings, I don't know, I'm grasping at straws here.

Addendum

What if the input is small, and the weights are also small? Here is a different trick that is applicable in some of those cases, depending on whether the temporary result fits in 64 bits. The trick this time is much simpler, or at least sounds much simpler: for bit i of weight w[i], make w[i] copies of bit i, then popcnt everything.

A common trick to make k copies of only one bit is (bit << k) - bit. Assuming that pdep exists and is efficient, that trick can be generalized to making different numbers of copies of different bits. The simplest version of that trick would sacrifice one bit of padding per input bit, which may be acceptable depending on whether that all still fits in 64 bits. For example A073642 with a 10-bit input would work:

// requires: n is a 10-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n, 0x0040100808104225);
    uint64_t bot = _pdep_u64(n, 0x000020101020844b);
    return __popcnt64(top - bot);
}

That can be extended to an 11-bit input like this:

// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n >> 1, 0x0040100808104225);
    uint64_t bot = _pdep_u64(n >> 1, 0x000020101020844b);
    return __popcnt64(top - bot | top);
}

Or like this

// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n, 0x0040100808104225) >> 1;
    uint64_t bot = _pdep_u64(n, 0x008020101020844b) >> 1;
    return __popcnt64(top - bot);
}

pdep cannot move bits to the right of their original position, in some cases (if there is no space for padding bits) you may need to hack around that, as in the example above. Rotates and right-shift are good candidates to do that with, and in general you may gather the bits with non-zero weights with pext.

This approach relies strongly on being able to efficiently implement the step "make w[i] copies of bit i", it is probably not possible to do that efficiently using only the plain old standard integer operations. Also, pdep is not efficient on all processors that support it, unfortunately making the CPUID feature flag for BMI2 useless for deciding which implementation to use.

Bit-level commutativity and "sum gap" revisited

2022-07-22T07:04:00.003-07:00

In an earlier post, bit-level commutativity, it was shown that addition is not only invariant under swapping the operands, but also under swapping individual bits between the operands, provided that they are of equal place-value. For example the least significant bits of the operands of a sum could be exchanged without changing the value of the sum. Bit-level commutativity of addition implies that A + B = (A & B) + (A | B).

In the range a sum cannot fall within, it was shown that the "gap" which a sum cannot fall within can be widened from: $$A + B \begin{cases} \ge \max(A,B) & \text{ if addition does not carry} \\ \le \min(A,B) & \text{ if addition does carry} \end{cases}\tag{Eq1}$$ To: $$A + B \begin{cases} \ge A\:|\:B & \text{ if addition does not carry} \\ \le A\:\&\:B & \text{ if addition does carry} \end{cases}\tag{Eq2}$$ This makes the gap wider because A & B is often less than min(A, B), and A | B is often greater than max(A, B).

To start with, let's assume that $$A + B \begin{cases} \ge A & \text{ if addition does not carry} \\ \le A & \text{ if addition does carry} \end{cases}\tag{Eq3}$$ Since addition is commutative, the roles of A and B could be exchanged without affecting the truth of equation Eq3, so this must also hold: $$A + B \begin{cases} \ge B & \text{ if addition does not carry} \\ \le B & \text{ if addition does carry} \end{cases}\tag{Eq4}$$ If A + B is greater than or equal to both A and B, then it must also be greater than and equal to max(A, B), and a similar argument applies to the upper bound. Therefore, Eq1 follows from Eq3.

Substituting A and B with A & B and A | B (respectively) in Eq3 and Eq4 gives $$(A\:\&\:B) + (A\:|\:B) \begin{cases} \ge (A\:\&\:B) & \text{ if addition does not carry} \\ \le (A\:\&\:B) & \text{ if addition does carry} \end{cases}\tag{Eq5}$$ And $$(A\:\&\:B) + (A\:|\:B) \begin{cases} \ge (A\:|\:B) & \text{ if addition does not carry} \\ \le (A\:|\:B) & \text{ if addition does carry} \end{cases}\tag{Eq6}$$ Picking the "does carry" case from Eq5 and the "does not carry" case from Eq6, and using that A + B = (A & B) + (A | B), gives Eq2.

Integer promotion does not help performance

2021-07-10T09:43:00.004-07:00

There is a rule in the C language which roughly says that arithmetic operations on short integer types implicitly convert their operands to normal-sized integers, and also give their result as a normal-sized integer. For example in C:

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

Various other languages have similar rules. For example C#, the specification of which is not as jargon-infested as the C specifiction:

In the case of integral types, those operators (except the ++ and -- operators) are defined for the int, uint, long, and ulong types. When operands are of other integral types (sbyte, byte, short, ushort, or char), their values are converted to the int type, which is also the result type of an operation.

There may be various reasons to include such a rule in a programming language (and some reasons not to), but one that is commonly mentioned is that "the CPU prefers to work at its native word size, other sizes are slower". There is just one problem with that: it is based on an incorrect assumption about how the compiler will need to implement "narrow arithmetic".

To give that supposed reason the biggest benefit that I can fairly give it, I will be using MIPS for the assembly examples. MIPS completely lacks narrow arithmetic operations.

Implementing narrow arithmetic as a programmer

Narrow arithmetic is often required, even though various languages make it a bit cumbersome. C# and Java both demand that you explicitly convert the result back to a narrow type. Despite that, code that needs to perform several steps of narrow arithmetic is usually not littered with casts. The usual pattern is to do the arithmetic without intermediate casts, then only in the end use one cast just to make the compiler happy. In C, even that final cast is not necessary.

For example, let's reverse the bits of a byte in C#. This code was written by Igor Ostrovsky, in his blog post Programming job interview challenge. It's not a unique or special case, and I don't mean that negatively: it's good code that anyone proficient could have written, a job well done. Code that senselessly casts back to byte after every step is also sometimes seen, perhaps because in that case, the author does not really understand what they are doing.

// Reverses bits in a byte 
static byte Reverse(byte b)
{
    int rev = (b >> 4) | ((b & 0xf) << 4);
    rev = ((rev & 0xcc) >> 2) | ((rev & 0x33) << 2);
    rev = ((rev & 0xaa) >> 1) | ((rev & 0x55) << 1); 

    return (byte)rev;
}

Morally, all of the operations in this function are really narrow operations, but C# cannot express that. A special property of this code is that none of the intermediate results exceed the limits of a byte, so in a language without integer promotion it could be written in much the same way, but without going through int for the intermediate results.

Implementing narrow arithmetic as a compiler

The central misconception that (I think) gave rise to the myth that integer promotion helps performance, is the assumption that without integer promotion, the compiler must implement narrow operations by inserting an explicit narrowing operation after every arithmetic operation. But that's not true, a compiler for a language that lacks integer promotion can use the same approach that programmers use to implement narrow arithmetic in languages that do have integer promotion. For example, what if two bytes were added together (from memory, and storing the result in memory) in a hypothetical language that lacks integer promotion, and what if that code was compiled for MIPS? The assumption is that it will cost an additional operation to get rid of the "trash bits", but it does not:

        lbu     $2,0($4)
        lbu     $3,0($5)
        addu    $2,$2,$3
        sb      $2,0($4)

The sb instruction does not care about any "trash" in the upper 24 bits, those bits simply won't be stored. This is not a cherry-picked case. Even if there were more arithmetic operations, in most cases the "trash" in the upper bits could safely be left there, being mostly isolated from the bits of interest by the fact that carries only propagate from the least significant bit up, never down. For example, let's throw in a multiplication by 42 and a shift-left by 3 as well for good measure:

        lbu     $6,0($4)
        li      $3,42
        lbu     $2,0($5)
        mul     $5,$3,$6
        addu    $2,$5,$2
        sll     $2,$2,3
        sb      $2,0($4)

What is true, is that before some operations, the trash in the upper bits must be cleared. For example before a division, shift-right, or comparison. That is not an exhaustive list, but the list of operations that require the upper bits to be clean is shorter (and have less "total frequency") than the list of operations that do not require that, see for example which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? "Before some operations" is not the same thing as "after every operation", but that still sounds like an additional cost. However, the trash-clearing operations that a compiler for a language that lacks integer promotion would have to insert are not additional operations: they are the same ones that a programmer would write explicitly in a language with integer promotion.

It may be possible to construct a contrived case in which a human would know that the high bits of an integer are clean, while a compiler would struggle to infer that. For example, a compiler may have more trouble reasoning about bits that were cancelled by an XOR, or worse, by a multiplication. Such cases are not likely to be the reason behind the myth. A more likely reason is that many programmers are not as familiar with basic integer arithmetic as they perhaps ought to be.

So integer promotion is useless?

Integer promotion may prevent accidental use of narrow operations where wide operations were intended, whether that is worth it is another question. All I wanted to say with this post, is that "the CPU prefers to work at its native word size" is a bogus argument. Even when it is true, it is irrelevant.

Partial sums of blsi and blsmsk

2021-06-09T21:50:00.004-07:00

blsi is an x86 operation which extracts the rightmost set bit from a number, it can be implemented efficiently in terms of more familiar operations as i&-i. blsmsk is a closely related operation which extracts the rightmost set bit and also "smears" it right, setting the bits to the right of that bit as well. blsmsk can be implemented as i^(i-1). The smearing makes the result of blsmsk almost (but not quite) twice as high as the result of blsi for the same input: blsmsk(i) = floor(blsi(i) * (2 - ε)).

The partial sums of blsi and blsmsk can be defined as b(n) = sum(i=1, n, blsi(i)) and a(n) = sum(i=1, n, blsmsk(i)) respectively. These sequences are on OEIS as A006520 and A080277. Direct evaluation of those definitions would be inefficient, is there a better way?

Unrecursing the recursive definition

Let's look at the partial sums of blsmsk first. Its OEIS entry suggests the recursion below, which is already significantly better than the naive summation:

a(1) = 1, a(2*n) = 2*a(n) + 2*n, a(2*n+1) = 2*a(n) + 2*n + 1

A "more code-ish"/"less mathy" way to implement that could be like this:

int PartialSumBLSMSK(int i) {
    if (i == 1)
        return 1;
    return (PartialSumBLSMSK(i >> 1) << 1) + i;
}

Let's understand what this actually computes, and then find an other way to do it. The overal big picture of the recursion is that n is being shifted right on the way "down", and the results of the recursive calls are being shifted left on the way "up", in a way that cancels each other. So in total, what happens is that a bunch of "copies" of n is added up, except that at the kth step of the recursion, the kth bit of n is reset.

This non-tail recursion can be turned into tail-recursion using a standard trick: turning the "sum on the way"-logic into "sum on the way down", by passing two accumulators in extra arguments, one to keep track of the sum, and an other to keep track of how much to multiply i by:

int PartialSumBLSMSKTail(int i, int accum_a, int accum_m) {
    if (i <= 1)
        return accum_a + i * accum_m;
    return PartialSumBLSMSKTail(i >> 1, accum_a + i * accum_m, accum_m * 2);
}

Such a tail-recursive function is then simple to transform into a loop. Shockingly, GCC (even quite old versions) manages to compile the original non-tail recursive function into a loop as well without much help (just changing the left-shift into the equivalent multiplication), although some details differ.

Anyway, let's put in a value and see what happens. For example, if the input was 11101001, then the following numbers would be added up:

11101001 (reset bit 0)
11101000 (reset bit 1, it's already 0)
11101000 (reset bit 2)
11101000 (reset bit 3)
11100000 (reset bit 4)
11100000 (reset bit 5)
11000000 (reset bit 6)
10000000 (base case)

Look at the columns of the matrix of numbers above, the column for bit 3 has four ones in it, the column for bit 5 has six ones in it. If bit k is set bit in n, there is a pattern that that bit is set in k+1 rows.

Using the pattern

Essentially what that pattern means, is that a(n) can be expressed as the dot-product between n viewed as a vector of bits (weighted according to their position) (𝓷), and a constant vector (𝓬) with entries 1, 2, 3, 4, etc, up to the size of the integer. For example for n=5, 𝓷 would be (1, 0, 4), and the dot-product with 𝓬 would be 13. A dot-product like that can be implemented with some bitwise trickery, by using bit-slicing. The trick there is that instead of multiplying the entries of 𝓷 by the entries of 𝓬 directly, we multiply the entries of 𝓷 by the least-significant bits of the entries of 𝓬, and then seperately multiply it by all the second bits of the entries of 𝓬, and so on. Multiplying every entry of 𝓷 at once by a bit of an entry of 𝓬 can be implemented using just a bitwise-AND operation.

Although this trick lends itself well to any vector 𝓬, I will use 0,1,2,3.. and add an extra n separately (this corresponds to factoring out the +1 that appears at the end of the recursive definition), because that way part of the code can be reused directly by the solution of the partial sums of blsi (and also because it looks nicer). The masks that correspond to the chosen vector 𝓬 are easy to compute: each column across the masks is an entry of that vector. In this case, for 32bit integers:

c0 10101010101010101010101010101010
c1 11001100110011001100110011001100
c2 11110000111100001111000011110000
c3 11111111000000001111111100000000
c4 11111111111111110000000000000000

The whole function could look like this:

int PartialSumBLSMSK(int n)
{
    int sum = n;
    sum += (n & 0xAAAAAAAA);
    sum += (n & 0xCCCCCCCC) << 1;
    sum += (n & 0xF0F0F0F0) << 2;
    sum += (n & 0xFF00FF00) << 3;
    sum += (n & 0xFFFF0000) << 4;
    return sum;
}

PartialSumBLSI works almost the same way, with its recursive formula being b(1) = 1, b(2n) = 2b(n) + n, b(2n+1) = 2b(n) + n + 1. The +1 can be factored out as before, and the other part (n instead of 2*n) is exactly half of what it was before. Dividing 𝓬 by half seems like a problem, but it can be done implicitly by shifting the bit-slices of the product to the right by 1 bit. There are no problems with bits being lost that way, because the least significant bit is always zero in this case (𝓬 has zero as its first element).

int PartialSumBLSI(int n)
{
    int sum = n;
    sum += (n & 0xAAAAAAAA) >> 1;
    sum += (n & 0xCCCCCCCC);
    sum += (n & 0xF0F0F0F0) << 1;
    sum += (n & 0xFF00FF00) << 2;
    sum += (n & 0xFFFF0000) << 3;
    return sum;
}

Wrapping up

The particular set of constants I used is very useful and appears in more tricks, such as collecting indexes of set bits. They are the bitwise complements of a set of masks that Knuth (in The Art of Computer Programming volume 4, section 7.1.3) calls "magic masks", labeled µ_k.

This post was inspired by this question on Stack Overflow and partially based on my own answer and the answer of Eric Postpischil, without which I probably would not have come up with any of this, although I used a used a different derivation and explanation for this post.

The range a sum cannot fall within

2020-09-26T06:27:00.006-07:00

Throughout this post, the variables A, B, and R are used, with R defined as R = A + B, and A ≤ B. Arithmetic in this post is unsigned and modulo 2^k. Note that A ≤ B is not a restriction on the input, it is a choice to label the smaller input as A and the larger input as B. Addition is commutative, so this choice can be made without loss of generality.

`R < A || R ≥ B`

The sum is less than A iff the addition wraps (1), otherwise it has to be at least B (2).

B cannot be so high that the addition can wrap all the way up to or past A. To make A + B add up to A, B would have had to be 2^k, which is one beyond the maximum value it can be. R = A is possible only if B is zero, in which case R ≥ B holds instead.
Since A is at least zero, in the absence of wrapping there is no way to reduce the value below the inputs.

Perhaps that all looks obvious, but this has a useful application: if the carry-out of the addition is not available, it can be computed via carry = (x + y) < x, which is a relatively well-known trick. It does not matter which of x or y is the smaller or larger input, the sum cannot fall within the "forbidden zone" between them. The occasionally seen carry = (x + y) < max(x, y) adds an unnecessary complication.

`R < (A & B) || R ≥ (A | B)`

This is a stronger statement, because A & B is usually smaller than A and A | B is usually greater than B.

If no wrapping occurs, then R ≥ (A | B). This can be seen for example by splitting the addition into a XOR and adding the carries separately, (A + B) = (A ^ B) + (A & B) * 2, while bitwise OR can be decomposed similarly into (A | B) = (A ^ B) + (A & B)^{(see below)}. Since there is no wrapping (by assumption), (A & B) * 2 ≥ (A & B) and therefore (A + B) ≥ (A | B). Or, with less algebra: addition sometimes produces a zero where the bitwise OR produces a one, but then addition compensates doubly for it by carrying into the next position.

For the case in which wrapping occurs I will take a bit-by-bit view. In order to wrap, the carry out of bit k-1 must be 1. In order for the sum to be greater than or equal to A & B, bit k-1 of the sum must be greater than or equal to bit k-1 of A & B. That combination means that the carry into bit k-1 of the sum must have been 1 as well. Furthermore, bit k-1 of the sum can't be greater than bit k-1 of A & B, at most it can be equal, which means bit k-2 must be examined as well. The same argument applies to bit k-2 and so on, until finally for the least-significant bit it becomes impossible for it to be carried into, so the whole thing falls down: by contradiction, A + B must be less than A & B when the sum wraps.

What about `(A | B) = (A ^ B) + (A & B)` though?

The more obvious version is (A | B) = (A ^ B) | (A & B), compensating for the bits reset by the XOR by ORing exactly those bits back in. Adding them back in also works, because the set bits in A ^ B and A & B are disjoint: a bit being set in the XOR means that exactly one of the input bits was set, which makes their AND zero.

Information on incrementation

2020-05-03T16:28:00.000-07:00

Defining `increment`

Just to avoid any confusion, the operation that this post is about is adding 1 (one) to a value: $$\text{increment}(x) = x + 1$$ Specifically, performing that operation in the domain of bit-vectors.

Incrementing is very closely related to negating. After all, -x = ~x + 1 and therefore x + 1 = -~x, though putting it that way feel oddly reversed to me.

Bit-string notation

In bit-string notation (useful for analysing compositions of operations at the bit level), increment can be represented as: $$a01^k + 1 = a10^k$$

An "English" interpretation of that form is that an increment carries through the trailing set bits, turning them to zero, and then carries into the right-most unset bit, setting it.

That "do something special with the right-most unset bit" aspect of increment is the basis for various right-most bit manipulations, some of which were implemented in AMD Trailing Bit Manipulation (TBM) (which has been discontinued).

For example, the right-most unset bit in x can be set using x | (x + 1), which has a nice symmetry with the more widely known trick for unsetting the right-most set bit, x & (x - 1).

Increment by XOR

As was the case with negation, there is a way to define increment in terms of XOR. The bits that flip during an increment are all the trailing set bits and the right-most unset bit, the TBM instruction for which is BLCMSK. While that probably does not seem very useful yet, the fact that x ^ (x + 1) takes the form of some number of leading zeroes followed by some number of trailing ones turns out to be useful.

Suppose one wants to increment a bit-reversed integer, a possible (and commonly seen) approach is looping of the bits from top the bottom and implementing the "carry through the ones, into the first zero" logic by hand. However, if the non-reversed value was also available (let's call it i), the bit-reversed increment could be implemented by calculating the number of ones in the mask as tzcnt(i + 1) + 1 (or popcnt(i ^ (i + 1))) and forming a mask with that number of ones located at the desired place within an integer:

// i   = normal counter
// rev = bit-reversed counter
// N   = 1 << number_of_bits
int maskLen = tzcnt(i + 1) + 1;
rev ^= N - (N >> maskLen);

That may still not seem useful, but this enables an implementation of the bit-reversal permutation (not a bit-reversal itself, but the permutation that results from bit-reversing the indices). The bit-reversal permutation is sometimes used to re-order the result of a non-auto-sorting Fast Fourier Transform algorithm into the "natural" order. For example,

// X = array of data
// N = length of X, power of two
for (uint32_t i = 0, rev = 0; i < N; ++i)
{
    if (i < rev)
        swap(X[i], X[rev]);
    int maskLen = tzcnt(i + 1) + 1;
    rev ^= N - (N >> maskLen);
}

This makes no special effort to be cache-efficient.

Square root of bitwise NOT

2019-10-10T07:43:00.000-07:00

The square root of bitwise NOT, if it exists, would be some function f such that f(f(x)) = not x, or in other words, f²(x) = not x. It is similar in concept to the √NOT gate in Quantum Computing, but in a different domain which makes the solution very different.

Before trying to find any specific f, it may be interesting to wonder what properties it would have to have (and lack).

f must be bijective, because its square is bijective.
f² is an involution but f cannot be an involution, because its square would then be the identity.
f viewed as a permutation (which can be done, because it has to be bijective) must be a derangement, if it had any fixed point then that would also be a fixed point in f² and the not function does not have a fixed point.

Does f exist?

In general, a permutation has a square root if and only if the number of cycles of same even length is even. The not function, being an involution, can only consist of swaps and fixed points, and we already knew it has no fixed points so it must consist of only swaps. A swap is a cycle of length 2, so an even length. Since the not function operates on k bits, the size of its domain is a power of two, 2^k. That almost always guarantees an even number of swaps, except when k = 1. So, the not function on a single bit has no square root, but for more than 1 bit there are solutions.

f for even k

For 2 bits, the not function is the permutation (0 3) (1 2). An even number of even-length cycles, as predicted. The square root can be found by interleaving the cycles, giving (0 1 3 2) or (1 0 2 3). In bits, the first looks like:

in out 00 01 01 11 10 00 11 10

Which corresponds to swapping the bits and then inverting the lsb, the other variant corresponds to inverting the lsb first and then swapping the bits.

That solution can be applied directly to other even numbers of bits, swapping the even and odd bits and then inverting the even bits, but the square root is not unique and there are multiple variants. The solution can be generalized a bit, combining a step that inverts half of the bits with a permutation that brings each half of the bits into the positions that are inverted when it is applied twice, so that half the bits are inverted the first time and the other half of the bits are inverted the second time. For example for 32 bits, there is a nice solution in x86 assembly:

bswap eax
xor eax, 0xFFFF

f for odd k

Odd k makes things less easy. Consider k=3, so (0 7) (1 6) (2 5) (3 4). There are different ways to pair up and interleave the cycles, leading to several distinct square roots:

(0 1 7 6) (2 3 5 4)
(0 2 7 5) (1 3 6 4)
(0 3 7 4) (1 2 6 5)
etc..

in 1 2 3 000 001 010 011 001 111 011 010 010 011 111 110 011 101 110 111 100 010 001 000 101 100 000 001 110 000 100 101 111 110 101 100

These correspond to slightly tricky functions, for example the first one has as its three from lsb to msb: the msb but inverted, the parity of the input, and finally the lsb. The other ones also incorporate the parity of the input in some way.

abs and its "extra" result

2018-10-03T03:39:00.000-07:00

The abs function has, in its usual (most useful) formulation, one extra value in its codomain than just "all non-negative values". That extra value is the most negative integer, which satisfies abs(x) == x despite being negative. Even accepting that the absolute value of the most negative integer is itself, it may still seem strange (for an operation that is supposed to have such a nice symmetry) that the size of the codomain is not exactly half of the size of the domain.

That there is an "extra" value in the codomain, and that it is specifically the most negative integer, may be more intuitively obvious when the action of abs on the number ~~line~~ circle is depicted as "folding" the circle symmetrically in half across the center and through zero (around which abs is supposed to be symmetric), folding the negative numbers onto the corresponding positive numbers:

Clearly both zero and the most negative integer (which is also on the "folding line") stay in place in such a folding operation and remain part of the resulting half-circle. That there is an "extra" value in the codomain is the usual fencepost effect: the resulting half-circle is half the size of the original circle in some sense, but the "folding line" cuts through two points that have now become endpoints.

By the way the "ones' complement alternative" to the usual abs, let's call it OnesAbs(x) = x < 0 ? ~x : x (there is a nice branch-free formulation too) does have a codomain with a size exactly half of the size of its domain. The possible results are exactly the non-negative values. It has to pay for that by, well, not being the usual abs. The "folding line" for OnesAbs runs between points, avoiding the fencepost issue:

in	1	2	3
000	001	010	011
001	111	011	010
010	011	111	110
011	101	110	111
100	010	001	000
101	100	000	001
110	000	100	101
111	110	101	100

in	1	2	3
000	001	010	011
001	111	011	010
010	011	111	110
011	101	110	111
100	010	001	000
101	100	000	001
110	000	100	101
111	110	101	100

Bits, Math and Performance(?)

From Boolean logic to bitmath and SIMD: transitive closure of tiny graphs

MOR

No more MOR

Bit-permuting 16 u32s at once with AVX-512

The code

Histogramming bytes with positional popcount (GF2P8AFFINEQB edition)

pospopcnt

Binning

Extras

Enumerating identities, part 2

Finding the lowest pair of expressions

Encoding the lower bound

Code

Sorting the nibbles of a u64

Sharpening a lower bound with KnownBits information

Multiplying 64x64 bit-matrices with GF2P8AFFINEQB

Crude benchmarks

Implementing grevmul with GF2P8AFFINEQB

Implementing grevmul with GF2P8AFFINEQB

Forming m

Combining the results

The code

The end

Enumerating all mathematical identities (in fixed-size bitvector arithmetic with a restricted set of operations) of a certain size

The Approach

Original motivation

Future directions

The solutions to 𝚙𝚘𝚙𝚌𝚗𝚝(𝚡) < 𝚝𝚣𝚌𝚗𝚝(𝚡) and why there are Fibonacci[n] of them below 2ⁿ

Partial sums of popcount

Update:

Enter AVX512

Permuting bits with GF2P8AFFINEQB

The building blocks

Building something with the blocks

Propagating bounds through bitwise operations

maxOR

minOR

Conclusion

Some ways to check whether an unsigned sum wraps

~x < y

(x + y) < x

(x + y) < y

avg_down(x, y) <s 0

avg_down(x, y) & 0x80000000 // adjust to integer size

(x + y) < min(x, y)

(x + y) < max(x, y)

(x + y) < avg(x, y)

(x + y) < (x | y)

(x + y) < (x & y)

~(x | y) < (x & y)

(x + y) != addus(x, y)

(x + y) < addus(x, y)

subus(y, ~x) != 0

subus(x, ~y) != 0

addus(~x, ~y) != UINT_MAX

(x + y) < subus(y, ~x)

~x < avg_up(~x, y)

grevmul

What is the "grevmul inverse" of x?

(Not) transposing a 16x16 bitmatrix

And for my next trick, I will now not transpose the matrix

The end

Column-oriented row reduction

Turning it into a column-oriented algorithm

The stuff that goes at the end of the post

Weighted popcnt

Generalizing the technique

Addendum

Bit-level commutativity and "sum gap" revisited

Integer promotion does not help performance

Implementing narrow arithmetic as a programmer

Implementing narrow arithmetic as a compiler

So integer promotion is useless?

Partial sums of blsi and blsmsk

Unrecursing the recursive definition

Using the pattern

Wrapping up

The range a sum cannot fall within

R < A || R ≥ B

`pospopcnt`

Implementing `grevmul` with `GF2P8AFFINEQB`

Forming `m`

`maxOR`

`minOR`

`~x < y`

`(x + y) < x`

`(x + y) < y`

`avg_down(x, y) <s 0`

`avg_down(x, y) & 0x80000000 // adjust to integer size`

`(x + y) < min(x, y)`

`(x + y) < max(x, y)`

`(x + y) < avg(x, y)`

`(x + y) < (x | y)`

`(x + y) < (x & y)`

`~(x | y) < (x & y)`

`(x + y) != addus(x, y)`

`(x + y) < addus(x, y)`

`subus(y, ~x) != 0`

`subus(x, ~y) != 0`

`addus(~x, ~y) != UINT_MAX`

`(x + y) < subus(y, ~x)`

`~x < avg_up(~x, y)`

What is the "grevmul inverse" of `x`?

`R < A || R ≥ B`

`R < (A & B) || R ≥ (A | B)`

What about `(A | B) = (A ^ B) + (A & B)` though?

Defining `increment`

in	1	2	3
000	001	010	011
001	111	011	010
010	011	111	110
011	101	110	111
100	010	001	000
101	100	000	001
110	000	100	101
111	110	101	100