Monday, 22 May 2023

grevmul

grev (generalized bit-reverse) is an operation that implements bit-permutations corresponding to XOR-ing the indices by some value. It has been proposed to be part of the Zbp extension of RISC-V, with this reference implementation (source: release v0.93)

uint32_t grev32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 31;
    if (shamt &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (shamt &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (shamt &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (shamt &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (shamt & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}

grev looks in some ways similar to bit-shifts and rotates: the left and right operands have distinct roles with the right operand being a mask of k bits if the left operand has 2k bits[1].

Carry-less multiplication normally has a left-shift in it, grevmul is what you get when that left-shift is replaced with grev.

uint32_t grevmul32(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int k = 0; k < 32; k++) {
        if (y & (1 << k))
            r ^= grev32(x, k);
    }
    return x;
}

grevmul is, at its core, very similar to clmul: take single-bit products (logical AND) of every bit of the left operand with every bit of the right operand, then do some XOR-reduction. The difference is in which partial products are grouped together. For clmul, the partial products that contribute to bit k of the result are pairs with indices i,j such that i + j = k. For grevmul, it's the pairs with indices such that i ^ j = k. This goes back to grev permuting the bits by XOR-ing their indices by some value, and that value is k here.

Now that grevmul has been defined, let's look at some of its properties, comparing it to clmul and plain old imul.

grevmul clmul imul
zero[2] 0 0 0
identity 1 1 1
commutative yes yes yes
associative yes yes yes
distributes over xor xor addition
op(x, 1 << k) is grev(x, k) x << k x << k
x has inverse if
popcnt(x) & 1 x & 1 x & 1
op(x, x) is popcnt(x) & 1 pdep(x, 0x55555555)

What is the "grevmul inverse" of x?

Time for some algebra. Looking just at the table above, and forgetting the actual definition of grevmul, can we say something about the solutions of grevmul(x, y) == 1? Surprisingly, yes.

Assuming we have some x with odd parity (numbers with even parity do not have inverses, so let's ignore them for now), we know that grevmul(x, x) == 1. The inverse in a monoid is unique so x is not just some inverse of x, it is the (unique) inverse of x.

Since the "addition operator" is XOR (for which negation is the identity function), this is a non-trivial example of a ring in which x = -x = x-1, when x-1 exists. Strange, isn't it?

We also have that f(x) = grevmul(x, c) (for appropriate choices of c) is a (non-trivial) involution, so it may be a contenter for the "middle operation" of an involutary bit finalizer, but probably useless without an efficient implementation.

I was going to write about implementing grevmul by an 8-bit constant with two GF2P8AFFINEQBs but I've had enough for now, maybe later.


[1] The right operand of a shift is often called the shift count, but it can also be interpreted as a mask indicating some subset of shift-by-2i operations to perform. That interpretation is useful for example when implementing a shift-by-variable operation on a machine that only has a shift-by-constant instruction, following the same pattern as the reference implementation of grev32.

[2] This looks like a joke, but I mean that the numeric value 0 acts as the zero element of the corresponding semigroup.

Wednesday, 12 April 2023

(Not) transposing a 16x16 bitmatrix

Inverting a 16-element permutation may done like this:

for (int i = 0; i < 16; i++)
    inv[perm[i]] = i;

Computing a histogram of 16 nibbles may done like this:

for (int i = 0; i < 16; i++)
    hist[data[i]] += 1;

These different-sounding but already similar-looking tasks have something in common: they can be both be built around a 16x16 bitmatrix transpose. That sounds silly, why would anyone want to first construct a 16x16 bitmatrix, transpose it, and then do yet more processing to turn the resulting bitmatrix back into an array of numbers?

Because it turns out to be an efficiently-implementable operation, on some modern processors anyway.

If you know anything about the off-label application of GF2P8AFFINEQB, you may already suspect that it will be involved somehow (merely left-GF2P8AFFINEQB-ing by the identity matrix already results in some sort of 8x8 transpose, just horizontally mirrored), and it will be, but that's not the whole story.

First I will show not only how to do it with GF2P8AFFINEQB, but also how to find that solution programmatically using a SAT solver. There is nothing that fundamentally prevents a human from finding a solution by hand, but it seems difficult. Using a SAT solver to find a solution ex nihilo (requiring it to find both a sequence of instructions and their operands) is not that easy either (though that technique also exists). Thankfully, Geoff Langdale suggested a promising sequence of instructions:

The problem we have now (which the SAT solver will solve) is, under the constraint that for all X, f(X) = PERMB(GF2P8AFFINE(B, PERMB(X, A)), C) computes the transpose of X, what is a possible valuation of the variables A, B, C. Note that the variables in the SAT problem correspond to constants in the resulting code, and the variable in the resulting code (X) is quantified out of the problem.

If you know a bit about SAT solving, that "for all X" sounds like trouble, requiring either creating a set of constraints for every possible value of X (henceforth, concrete values of X will be known as "examples"), or some advanced technique such as CEGIS to dynamically discover a smaller set of examples to base the constraints on. Luckily, since we are dealing with a bit-permutation, there are simple and small sets of examples that together sufficiently constrain the problem. For a 16-bit permutation, this set of values could be used:

  • 1010101010101010
  • 1100110011001100
  • 1111000011110000
  • 1111111100000000

For a 256-bit permutation, a similar pattern can be used, where each of the examples has 256 bits and there would be 8 of them. Note that if you read the columns of the values, they list out the indices of the corresponding columns, which is no coincidence. Using that set of examples to constrain the problem with, essentially means that we assert that f when applied to the sequence 0..n-1 must result in the desired permutation. The way that I actually implemented this puts a column into one "abstract bit", so that it represents the index of the bit all in one place instead of spread out.

Implementing a "left GF2P8AFFINEQB" (multiplying a constant matrix on the left by a variable matrix on the right) in CNF, operating on "abstract bits" (8 variables each), is relatively straight forward. Every (abstract) bit of the result is the XOR of the AND of some (abstract) bits, writing that down is mostly a chore, but there is one interesting aspect: the XOR can be turned into an OR, since we know that we're multiplying by a permutation matrix. In CNF, OR is simpler than XOR, and easier for the solver to reason through.

VPERMB is more difficult to implement, given that the permutation operand is a variable (if it was a constant, we could just permute the abstract bits without generating any new constraints). To make it easier, I represent the permutation operand as a 32x32 permutation matrix, letting me create a bunch of simple ternary constraints of the form (¬P(i, j) ∨ ¬A(j) ∨ R(i)) ∧ (¬P(i, j) ∨ A(j) ∨ ¬R(i)) (read: if P(i, j), then A(j) must be equal to R(i)). The same thing can be used to implement VPSHUFB, with additional constraints on the permutation matrix (to prevent cross-slice movement).

Running that code, at least on my PC at this time[1], results in (with some whitespace manually added):

__m256i t0 = _mm256_permutexvar_epi8(_mm256_setr_epi8(
    14, 12, 10, 8, 6, 4, 2, 0,
    30, 28, 26, 24, 22, 20, 18, 16,
    15, 13, 11, 9, 7, 5, 3, 1,
    31, 29, 27, 25, 23, 21, 19, 17), input);
__m256i t1 = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x1080084004200201), t0, 0);
__m256i t2 = _mm256_shuffle_epi8(t1, _mm256_setr_epi8(
    0, 8, 1, 9, 3, 11, 5, 13,
    7, 15, 2, 10, 4, 12, 6, 14,
    0, 8, 1, 9, 3, 11, 5, 13,
    7, 15, 2, 10, 4, 12, 6, 14));

So that's it. That's the answer[2]. If you want to transpose a 16x16 bitmatrix, on a modern PC (this code requires AVX512_VBMI and AVX512_GFNI[3]), it's fairly easy and cheap, it's just not so easy to find this solution to begin with.

Using this transpose to invert a 16-element permutation is pretty easy, for example using _mm256_sllv_epi16 to construct the matrix and _mm256_popcnt_epi16(_mm256_sub_epi16(t2, _mm256_set1_epi16(1))) (sadly there is no SIMD version of TZCNT .. yet) to convert the bit-masks back into indices. It may be tempting to try to use a mirrored matrix and leading-zero count, which AVX512 does offer, but it only offers the DWORD and QWORD versions VPLZCNTD/Q.

Making a histogram is even simpler, using only _mm256_popcnt_epi16(t2) to convert the matrix into counts.

And for my next trick, I will now not transpose the matrix

What if we didn't transpose that matrix. Does that even make sense? Well, at least for the two applications that I focused on, what we really need is not so much the transpose of the matrix, but any matrix such that:

  1. Every bit of the original matrix occurs exactly once in the result.
  2. Each row of the result contains all bits from a particular column.
  3. The permutation within each row is "regular" enough that we can work with it. We don't need this when making a histogram (as Geoff already noted in one of his tweets).

There is no particular requirement on the order of the rows, any row-permutation we end up with is easy to undo.

The first two constraints leave plenty of options open, but the last constraint is quite vague. Too vague for me to do something such as searching for the best not-quite-transpose, so I don't promise to have found it. But here is a solution: rotate every row by its index, then rotate every column by its index.

At least, that's the starting point. Rotating the columns requires 3 rounds of blending a vector with cross-slice-permuted copy of that vector, and a VPERMQ sandwiched by two VPSHUFBs to rotate the last 8 columns by 8. That's a lot of cross-slice permuting, most of it can be avoided by modifying the overall permutation slightly:

  1. Exchange the off-diagonal quadrants.
  2. Rotate each row by its index.
  3. For each quadrant individually, rotate each column by its index.

Here is some attempt at illustrating that process, feel free to skip past it

These three steps are implementable in AVX2:

  1. Exchanging the off-diagonal quadrants can be done by gathering the quadrants into QWORDs, permuting them, and shuffling the QWORDs back into quadrants.
  2. Rotating the rows can be done with VPMULLW (used as a variable shift-left), VPMULHUW (used as a variable shift-right), and VPOR.
  3. Rotating the columns can be done by conditionally rotating the columns with odd indices by 1, conditionally rotating the columns that have the second bit of their index set by 2, and conditionally rotating the columns that have the third bit of their index set by 4. The rotations can be done using VPALIGNR[4], the conditionality can be implemented with blending, but since this needs to be bit-granular blend, it cannot be performed using VPBLENDVB.

In total, here is how I don't transpose a 16x16 matrix with AVX2, hopefully there is a better way:

__m256i nottranspose16x16(__m256i x)
{
    // exchange off-diagonal quadrants
    x = _mm256_shuffle_epi8(x, _mm256_setr_epi8(
        0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15,
        0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15));
    x = _mm256_permutex_epi64(x, _MM_SHUFFLE(3, 1, 2, 0));
    x = _mm256_shuffle_epi8(x, _mm256_setr_epi8(
        0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15,
        0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15));
    // rotate every row by its y coordinate
    __m256i shifts = _mm256_setr_epi16(
        1 << 0, 1 << 1, 1 << 2, 1 << 3,
        1 << 4, 1 << 5, 1 << 6, 1 << 7,
        1 << 8, 1 << 9, 1 << 10, 1 << 11,
        1 << 12, 1 << 13, 1 << 14, 1 << 15);
    __m256i sll = _mm256_mullo_epi16(x, shifts);
    __m256i srl = _mm256_mulhi_epu16(x, shifts);
    x = _mm256_or_si256(sll, srl);
    // within each quadrant independently, 
    // rotate every column by its x coordinate
    __m256i x0, x1, m;
    // rotate by 4
    m = _mm256_set1_epi8(0x0F);
    x0 = _mm256_and_si256(x, m);
    x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 8));
    x = _mm256_or_si256(x0, x1);
    // rotate by 2
    m = _mm256_set1_epi8(0x33);
    x0 = _mm256_and_si256(x, m);
    x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 4));
    x = _mm256_or_si256(x0, x1);
    // rotate by 1
    m = _mm256_set1_epi8(0x55);
    x0 = _mm256_and_si256(x, m);
    x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 2));
    x = _mm256_or_si256(x0, x1);
    return x;
}

Using that not-transpose to invert a 16-element permutation takes some extra steps that, without AVX512, are about as annoying as not-transposing the matrix was.

  • Constructing the matrix is more difficult. AVX2 has shift-by-variable, but not for 16-bit element.[5] There are various work-arounds, such as using DWORDs and then narrowing, of course (boring). Another (funnier) option is to duplicate every byte, add 0xF878 to every word, then use VPSHUFB in lookup-table-mode to index into a table of powers of two. Having added 0x78 to every low byte of every word, that byte will mapped to zero if it was 8 or higher, or otherwise two to the power of that byte. The high byte, having 0xF8 added to it, will be mapped to 0 if it was below 8, or otherwise to two to the power of that byte minus 8. As wild as that sounds, it is pretty fast, costing only 5 cheap instructions (whereas widening to DWORDs, shifting, and narrowing, would be worse than it sounds). Perhaps there is a better way.
  • Converting masks back into indices is more difficult due to the lack of trailing zero count, leading zero count, or even popcount. What AVX2 does have, is .. VPSHUFB again. We can multiply by an order-4 de Bruijn sequence and use VPSHUFB to map the results to the indices of the set bits.
  • Then we have indices, but since the rows and columns were somewhat arbitrarily permuted, they must still be mapped back into something that makes sense. Fortunately that's no big deal, a modular subtraction (or addition, same thing really) cancels out the row-rotations, and yet another VPSHUFB cancels out the strange order that the rows are in. Fun detail: the constants that are subtracted and the permutation are both 0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9.

All put together:

void invert_permutation_avx2(uint8_t *p, uint8_t *inv)
{
    __m256i v = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i*)p));
    // indexes to masks
    v = _mm256_or_si256(v, _mm256_slli_epi64(v, 8));
    v = _mm256_add_epi8(v, _mm256_set1_epi16(0xF878));
    __m256i m = _mm256_shuffle_epi8(_mm256_setr_epi8(
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128), v);
    // ???
    m = nottranspose16x16(m);
    // masks to indexes
    __m256i deBruijn = _mm256_and_si256(_mm256_mulhi_epu16(m, _mm256_set1_epi16(0x9AF0)), _mm256_set1_epi16(0x000F));
    __m128i i = _mm_packs_epi16(_mm256_castsi256_si128(deBruijn), _mm256_extracti128_si256(deBruijn, 1));
    i = _mm_shuffle_epi8(_mm_setr_epi8(
        0, 1, 2, 5, 3, 9, 6, 11, 15, 4, 8, 10, 14, 7, 13, 12), i);
    // un-mess-up the indexes
    i = _mm_sub_epi8(i, _mm_setr_epi8(0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9));
    i = _mm_and_si128(i, _mm_set1_epi8(0x0F));
    i = _mm_shuffle_epi8(i, _mm_setr_epi8(0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9));
    _mm_storeu_si128((__m128i*)inv, i);
}

To make a histogram, emulate VPOPCNTW using, you guessed it, PSHUFB.

The end

This post is, I think, one of the many examples of how AVX512 can be an enormous improvement compared to AVX2 even when not using 512-bit vectors. Every step of every problem had a simple solution in AVX512 (even if it was not always easy to find it). With AVX2, everything felt "only barely possible".

"As complicated as it is, is this actually faster than scalar code?" Yes actually, but feel free to benchmark it yourself. The AVX2 version being somewhat more efficient than scalar code is not really the point of this post anyway. The AVX512 version is nice and efficient, I'm showing an AVX2 version mostly to show how hard it is to create it.[6]

Transposing larger matrices with AVX512 can be done by first doing some quadrant-swapping (also used at the start of the not-transpose) until the bits that need to end up together in one 512-bit block are all in there, and then a VPERMB, VGF2P8AFFINEQB, VPERMB sequence with the right constants (which can be found using the techniques that I described) can put the bits in their final positions. But well, I already did that, so there you go.

A proper transpose can be done in AVX2 of course, for example using 4 rounds of quadrant-swapping. Implementations of that already exist so I thought that would be boring to talk about, but there is an interesting aspect to that technique that is often not mentioned: every round of quadrant-swapping can be seen as exchanging two bits of the indices. Swapping the big 8x8 quadrants swaps bits 3 and 7 of the indices, transposing the 2x2 submatrices swaps bits 0 and 4 of the indices. From that point of view, it's easy to see that the order in which the four steps are performed does not matter - no matter the order, the lower nibble of the index is swapped with the higher nibble of the index.


[1] While MiniSAT (which this program uses as its SAT solver) is a "deterministic solver" in the sense of definitely finding a satifying valuation if there is one, it is not deterministic in the sense of guaranteeing that the same satisfying valuation is found every time the solver is run on the same input.

[2] Not the unique answer, there are multiple solutions.

[3] But not 512-bit vectors.

[4] Nice! It's not common to see a 256-bit VPALIGNR being useful, due to it not being the natural widening of 128-bit PALIGNR, but acting more like two PALIGNRs side-by-side (with the same shifting distance).

[5] Intel, why do you keep doing this.

[6] Also as an excuse to use PSHUFB for everything.

Sunday, 9 April 2023

Column-oriented row reduction

In mathematics, Gaussian elimination, also known as row reduction, is an algorithm for solving systems of linear equations.

The usual implementation of row reduction of a bit-matrix[1] takes an array of bit-packed rows, and applies row operations to the matrix in a row-wise manner. As a reminder, it works something like this:

  1. Find a row with a non-zero element[2] in the current column.
  2. Swap it into position.
  3. XOR that row into any other row that has a non-zero element in the current column.
  4. Go to step 1 for the next column, if there is one.

For the application that I had mind when writing this blog post[3], step 2 is not necessary and will be skipped.[4]

Note that at the single-bit-level, the XOR-logic of step 3 comes down to if M(i, j) and M(j, k) then flip M(i, k). That is, in order for the XOR to flip M(i, k), the corresponding row must have a 1 in the current column (condition A), and the row that was selected to pivot on must have a 1 in the column k (condition B).

Turning it into a column-oriented algorithm

In the row-oriented algorithm, condition A is handled by conditionally XOR-ing into only those rows that have a 1 in the current column, and condition B is handled by the XOR itself (which is a conditional bit-flip afterall). A column-oriented algorithm would do it the other way around, using XOR to implement condition B, and skipping columns for which condition A is false:

  1. Find a row that hasn't already been pivoted on, with a non-zero element in the current column.
  2. Don't swap it into position.
  3. XOR each column that has a 1 in the row that we're pivoting on, with a mask formed by taking the current column and resetting the bit that corresponds to the pivot row.
  4. If there are any rows left that haven't been pivoted on, go to step 1 for the next column.

Step 1 may sound somewhat annoying, but it is really simple: AND the current column with a mask of the rows that have already been pivoted on, had, and use the old Isolate Lowest Set Bit[5] trick x & -x to extract a mask corresponding to the first row that passes both conditions.

Here is an example of what an implementation might look like (shown in C#, or maybe this is Java code with a couple of capitalization errors)

static void reduceRows(long[] M)
{
    long had = 0;
    for (int i = 0; i < M.Length; i++)
    {
        long r = lowestSetBit(M[i] & ~had);
        if (r == 0)
            continue;
        long mask = M[i] & ~r;
        for (int j = 0; j < M.Length; j++)
        {
            if ((M[j] & r) != 0)
                M[j] ^= mask;
        }
        had |= r;
    }
}

static long lowestSetBit(long x)
{
    return x & -x;
}

The stuff that goes at the end of the post

While the column-oriented approach makes the step "search for a row that has a 1 in the current column" easy, transposing a matrix just to be able to do this does not seem worthwhile to me.

Despite the if inside the inner loops of both the row-oriented and the column-oriented algorithm, both of them can be accellerated using SIMD. The conditional code can easily be rewritten into arithmetic-based masking[6] or "real" masked operations (in eg AVX512). "But don't compilers autovectorize these days?" Yes they do, but not always in a good way, note that both GCC and Clang both used a masked store, which is worse than writing back some unchanged values with a normal store (especially on AMD processors, and also especially considering that the stored values will be loaded again soon). Rumour has it that there are Reasons™ for that quirk, but that doesn't improve the outcome.


[1] Of course this blog post is not about floating point matrices, this is the bitmath blog.

[2] Sometimes known as 1 (one).

[3] XOR-clause simplification

[4] Row-swaps are particularly expensive to apply to a column-major bit-matrix, so perhaps this is just an excuse to justify the column-based approach. You decide.

[5] AKA BLSI, which AMD describes as "Isolate Lowest Set Bit", which makes sense, because well, that is what the operation does. Meanwhile Intel writes "Extract Lowest Set Isolated Bit" (isolated bit? what?) in their manuals.

[6] Arithmetic-based masking can also be used in scalar code.

Tuesday, 3 January 2023

Weighted popcnt

Since I showed the code below on Twitter, and some people understood it and some didn't, I suppose I should explain how it works and what the more general technique is.

// sum of indexes of set bits
int A073642(uint64_t n)
{
    return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
          (__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);
}

A sum of several multi-digit numbers, such as 12 + 34, can be rewritten to make the place-value notation explicit, giving 1*10 + 2*1 + 3*10 + 4*1. Then the distributive property of multiplication can be used to group digits of equal place-value together, giving (1 + 3)*10 + (2 + 4)*1. I don't bring up something so basic as an insult to the reader, the basis of that possibly-intimidating piece of code really is this basic.

The masks 0xAAAAAAAAAAAAAAAA, 0xCCCCCCCCCCCCCCCC, etc, represent nothing more than the numbers 0..63, each written vertically in binary, starting with the least significant bit. Each column of bits contains its own index. AND-ing the masks with n leaves, in each column, either zero (if the corresponding bit of n was zero) or the index of that column.

The first popcnt then adds together all the digits of the "filtered" indexes with a place-value of 1, the second popcnt adds together the digits with a place-value of 2, and so on. The result of each popcnt is shifted to match the corresponding place-value, and all of them are added together.

Generalizing the technique

The numbers in the columns did not have to be 0..63, the columns can contain arbitrary numbers, even negative numbers. This gives some sort of weighted popcnt: rather than each bit having a weight of 1, we are free to choose an arbitrary weight for each bit.

In general, starting with some arbitrary numbers, interpret them as a binary matrix. Transpose that matrix. Every row of the transposed matrix is the mask that goes into the corresponding popcnt-step.

Some simplifications may be possible:

  • If a row of the matrix is zero, the corresponding step can be left out.
  • If two or more rows are equal, their popcnt-steps can be merged into one with a weight that is the sum of the weights of the steps that are merged.
  • If a row is a linear combination of some set of other rows, the popcnt corresponding to that row can be computed as a linear combination of the popcnts corresponding to those other rows.
  • If a row has exactly one bit set, its step can be implemented with an AND and a shift, without popcnt, by shifting the bit to its target position.
  • If every row has exactly one bit set, we're really dealing with a bit-permutation and there are better ways to implement them.

For example, let's construct "the sum of squares of indexes of set bits" (with 1-indexed indexes) aka oeis.org/A003995. Transposing the squares 1,2,4,9..4096 gives these masks:

0x5555555555555555
0x0000000000000000
0x2222222222222222
0x1414141414141414
0x0d580d580d580d58
0x0335566003355660
0x00f332d555a66780
0x555a5b6666387800
0x66639c78783f8000
0x787c1f807fc00000
0x7f801fff80000000
0x7fffe00000000000
0x8000000000000000

The second mask is zero because a square cannot be congruent to 2 or 3 modulo 4. The last mask has exactly one bit set, so its step can be implemented without popcnt. Putting it together:

int A003995(uint64_t n)
{
    return __popcnt64(n & 0x5555555555555555) +
          (__popcnt64(n & 0x2222222222222222) << 2) +
          (__popcnt64(n & 0x1414141414141414) << 3) +
          (__popcnt64(n & 0x0d580d580d580d58) << 4) +
          (__popcnt64(n & 0x0335566003355660) << 5) +
          (__popcnt64(n & 0x00f332d555a66780) << 6) +
          (__popcnt64(n & 0x555a5b6666387800) << 7) +
          (__popcnt64(n & 0x66639c78783f8000) << 8) +
          (__popcnt64(n & 0x787c1f807fc00000) << 9) +
          (__popcnt64(n & 0x7f801fff80000000) << 10) +
          (__popcnt64(n & 0x7fffe00000000000) << 11) +
          ((n & 0x8000000000000000) >> 51);
}

Aside from implementing strange sequences from the OEIS, possible uses of this technique may include

  • Summing the face-values of a hand of cards. Unfortunately this cannot take "special combinations" into account, unlike other techniques.
  • A basic evaluation of a board state in Reversi/Othello, based on weights for each captured square that differ by their position.
  • Determining the price for a pizza based on a set of selected toppings, I don't know, I'm grasping at straws here.

Addendum

What if the input is small, and the weights are also small? Here is a different trick that is applicable in some of those cases, depending on whether the temporary result fits in 64 bits. The trick this time is much simpler, or at least sounds much simpler: for bit i of weight w[i], make w[i] copies of bit i, then popcnt everything.

A common trick to make k copies of only one bit is (bit << k) - bit. Assuming that pdep exists and is efficient, that trick can be generalized to making different numbers of copies of different bits. The simplest version of that trick would sacrifice one bit of padding per input bit, which may be acceptable depending on whether that all still fits in 64 bits. For example A073642 with a 10-bit input would work:

// requires: n is a 10-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n, 0x0040100808104225);
    uint64_t bot = _pdep_u64(n, 0x000020101020844b);
    return __popcnt64(top - bot);
}

That can be extended to an 11-bit input like this:

// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n >> 1, 0x0040100808104225);
    uint64_t bot = _pdep_u64(n >> 1, 0x000020101020844b);
    return __popcnt64(top - bot | top);
}

Or like this

// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n, 0x0040100808104225) >> 1;
    uint64_t bot = _pdep_u64(n, 0x008020101020844b) >> 1;
    return __popcnt64(top - bot);
}

pdep cannot move bits to the right of their original position, in some cases (if there is no space for padding bits) you may need to hack around that, as in the example above. Rotates and right-shift are good candidates to do that with, and in general you may gather the bits with non-zero weights with pext.

This approach relies strongly on being able to efficiently implement the step "make w[i] copies of bit i", it is probably not possible to do that efficiently using only the plain old standard integer operations. Also, pdep is not efficient on all processors that support it, unfortunately making the CPUID feature flag for BMI2 useless for deciding which implementation to use.

Friday, 22 July 2022

Bit-level commutativity and "sum gap" revisited

In an earlier post, bit-level commutativity, it was shown that addition is not only invariant under swapping the operands, but also under swapping individual bits between the operands, provided that they are of equal place-value. For example the least significant bits of the operands of a sum could be exchanged without changing the value of the sum. Bit-level commutativity of addition implies that A + B = (A & B) + (A | B).

In the range a sum cannot fall within, it was shown that the "gap" which a sum cannot fall within can be widened from: $$A + B \begin{cases} \ge \max(A,B) & \text{ if addition does not carry} \\ \le \min(A,B) & \text{ if addition does carry} \end{cases}\tag{Eq1}$$ To: $$A + B \begin{cases} \ge A\:|\:B & \text{ if addition does not carry} \\ \le A\:\&\:B & \text{ if addition does carry} \end{cases}\tag{Eq2}$$ This makes the gap wider because A & B is often less than min(A, B), and A | B is often greater than max(A, B).

To start with, let's assume that $$A + B \begin{cases} \ge A & \text{ if addition does not carry} \\ \le A & \text{ if addition does carry} \end{cases}\tag{Eq3}$$ Since addition is commutative, the roles of A and B could be exchanged without affecting the truth of equation Eq3, so this must also hold: $$A + B \begin{cases} \ge B & \text{ if addition does not carry} \\ \le B & \text{ if addition does carry} \end{cases}\tag{Eq4}$$ If A + B is greater than or equal to both A and B, then it must also be greater than and equal to max(A, B), and a similar argument applies to the upper bound. Therefore, Eq1 follows from Eq3.

Substituting A and B with A & B and A | B (respectively) in Eq3 and Eq4 gives $$(A\:\&\:B) + (A\:|\:B) \begin{cases} \ge (A\:\&\:B) & \text{ if addition does not carry} \\ \le (A\:\&\:B) & \text{ if addition does carry} \end{cases}\tag{Eq5}$$ And $$(A\:\&\:B) + (A\:|\:B) \begin{cases} \ge (A\:|\:B) & \text{ if addition does not carry} \\ \le (A\:|\:B) & \text{ if addition does carry} \end{cases}\tag{Eq6}$$ Picking the "does carry" case from Eq5 and the "does not carry" case from Eq6, and using that A + B = (A & B) + (A | B), gives Eq2.

Saturday, 10 July 2021

Integer promotion does not help performance

There is a rule in the C language which roughly says that arithmetic operations on short integer types implicitly convert their operands to normal-sized integers, and also give their result as a normal-sized integer. For example in C:

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

Various other languages have similar rules. For example C#, the specification of which is not as jargon-infested as the C specifiction:

In the case of integral types, those operators (except the ++ and -- operators) are defined for the int, uint, long, and ulong types. When operands are of other integral types (sbyte, byte, short, ushort, or char), their values are converted to the int type, which is also the result type of an operation.

There may be various reasons to include such a rule in a programming language (and some reasons not to), but one that is commonly mentioned is that "the CPU prefers to work at its native word size, other sizes are slower". There is just one problem with that: it is based on an incorrect assumption about how the compiler will need to implement "narrow arithmetic".

To give that supposed reason the biggest benefit that I can fairly give it, I will be using MIPS for the assembly examples. MIPS completely lacks narrow arithmetic operations.

Implementing narrow arithmetic as a programmer

Narrow arithmetic is often required, even though various languages make it a bit cumbersome. C# and Java both demand that you explicitly convert the result back to a narrow type. Despite that, code that needs to perform several steps of narrow arithmetic is usually not littered with casts. The usual pattern is to do the arithmetic without intermediate casts, then only in the end use one cast just to make the compiler happy. In C, even that final cast is not necessary.

For example, let's reverse the bits of a byte in C#. This code was written by Igor Ostrovsky, in his blog post Programming job interview challenge. It's not a unique or special case, and I don't mean that negatively: it's good code that anyone proficient could have written, a job well done. Code that senselessly casts back to byte after every step is also sometimes seen, perhaps because in that case, the author does not really understand what they are doing.

// Reverses bits in a byte 
static byte Reverse(byte b)
{
    int rev = (b >> 4) | ((b & 0xf) << 4);
    rev = ((rev & 0xcc) >> 2) | ((rev & 0x33) << 2);
    rev = ((rev & 0xaa) >> 1) | ((rev & 0x55) << 1); 

    return (byte)rev;
}

Morally, all of the operations in this function are really narrow operations, but C# cannot express that. A special property of this code is that none of the intermediate results exceed the limits of a byte, so in a language without integer promotion it could be written in much the same way, but without going through int for the intermediate results.

Implementing narrow arithmetic as a compiler

The central misconception that (I think) gave rise to the myth that integer promotion helps performance, is the assumption that without integer promotion, the compiler must implement narrow operations by inserting an explicit narrowing operation after every arithmetic operation. But that's not true, a compiler for a language that lacks integer promotion can use the same approach that programmers use to implement narrow arithmetic in languages that do have integer promotion. For example, what if two bytes were added together (from memory, and storing the result in memory) in a hypothetical language that lacks integer promotion, and what if that code was compiled for MIPS? The assumption is that it will cost an additional operation to get rid of the "trash bits", but it does not:

        lbu     $2,0($4)
        lbu     $3,0($5)
        addu    $2,$2,$3
        sb      $2,0($4)

The sb instruction does not care about any "trash" in the upper 24 bits, those bits simply won't be stored. This is not a cherry-picked case. Even if there were more arithmetic operations, in most cases the "trash" in the upper bits could safely be left there, being mostly isolated from the bits of interest by the fact that carries only propagate from the least significant bit up, never down. For example, let's throw in a multiplication by 42 and a shift-left by 3 as well for good measure:

        lbu     $6,0($4)
        li      $3,42
        lbu     $2,0($5)
        mul     $5,$3,$6
        addu    $2,$5,$2
        sll     $2,$2,3
        sb      $2,0($4)

What is true, is that before some operations, the trash in the upper bits must be cleared. For example before a division, shift-right, or comparison. That is not an exhaustive list, but the list of operations that require the upper bits to be clean is shorter (and have less "total frequency") than the list of operations that do not require that, see for example which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? "Before some operations" is not the same thing as "after every operation", but that still sounds like an additional cost. However, the trash-clearing operations that a compiler for a language that lacks integer promotion would have to insert are not additional operations: they are the same ones that a programmer would write explicitly in a language with integer promotion.

It may be possible to construct a contrived case in which a human would know that the high bits of an integer are clean, while a compiler would struggle to infer that. For example, a compiler may have more trouble reasoning about bits that were cancelled by an XOR, or worse, by a multiplication. Such cases are not likely to be the reason behind the myth. A more likely reason is that many programmers are not as familiar with basic integer arithmetic as they perhaps ought to be.

So integer promotion is useless?

Integer promotion may prevent accidental use of narrow operations where wide operations were intended, whether that is worth it is another question. All I wanted to say with this post, is that "the CPU prefers to work at its native word size" is a bogus argument. Even when it is true, it is irrelevant.

Wednesday, 9 June 2021

Partial sums of blsi and blsmsk

blsi is an x86 operation which extracts the rightmost set bit from a number, it can be implemented efficiently in terms of more familiar operations as i&-i. blsmsk is a closely related operation which extracts the rightmost set bit and also "smears" it right, setting the bits to the right of that bit as well. blsmsk can be implemented as i^(i-1). The smearing makes the result of blsmsk almost (but not quite) twice as high as the result of blsi for the same input: blsmsk(i) = floor(blsi(i) * (2 - ฮต)).

The partial sums of blsi and blsmsk can be defined as b(n) = sum(i=1, n, blsi(i)) and a(n) = sum(i=1, n, blsmsk(i)) respectively. These sequences are on OEIS as A006520 and A080277. Direct evaluation of those definitions would be inefficient, is there a better way?

Unrecursing the recursive definition

Let's look at the partial sums of blsmsk first. Its OEIS entry suggests the recursion below, which is already significantly better than the naive summation:

a(1) = 1, a(2*n) = 2*a(n) + 2*n, a(2*n+1) = 2*a(n) + 2*n + 1

A "more code-ish"/"less mathy" way to implement that could be like this:

int PartialSumBLSMSK(int i) {
    if (i == 1)
        return 1;
    return (PartialSumBLSMSK(i >> 1) << 1) + i;
}

Let's understand what this actually computes, and then find an other way to do it. The overal big picture of the recursion is that n is being shifted right on the way "down", and the results of the recursive calls are being shifted left on the way "up", in a way that cancels each other. So in total, what happens is that a bunch of "copies" of n is added up, except that at the kth step of the recursion, the kth bit of n is reset.

This non-tail recursion can be turned into tail-recursion using a standard trick: turning the "sum on the way"-logic into "sum on the way down", by passing two accumulators in extra arguments, one to keep track of the sum, and an other to keep track of how much to multiply i by:

int PartialSumBLSMSKTail(int i, int accum_a, int accum_m) {
    if (i <= 1)
        return accum_a + i * accum_m;
    return PartialSumBLSMSKTail(i >> 1, accum_a + i * accum_m, accum_m * 2);
}

Such a tail-recursive function is then simple to transform into a loop. Shockingly, GCC (even quite old versions) manages to compile the original non-tail recursive function into a loop as well without much help (just changing the left-shift into the equivalent multiplication), although some details differ.

Anyway, let's put in a value and see what happens. For example, if the input was 11101001, then the following numbers would be added up:

11101001 (reset bit 0)
11101000 (reset bit 1, it's already 0)
11101000 (reset bit 2)
11101000 (reset bit 3)
11100000 (reset bit 4)
11100000 (reset bit 5)
11000000 (reset bit 6)
10000000 (base case)

Look at the columns of the matrix of numbers above, the column for bit 3 has four ones in it, the column for bit 5 has six ones in it. If bit k is set bit in n, there is a pattern that that bit is set in k+1 rows.

Using the pattern

Essentially what that pattern means, is that a(n) can be expressed as the dot-product between n viewed as a vector of bits (weighted according to their position) (๐“ท), and a constant vector (๐“ฌ) with entries 1, 2, 3, 4, etc, up to the size of the integer. For example for n=5, ๐“ท would be (1, 0, 4), and the dot-product with ๐“ฌ would be 13. A dot-product like that can be implemented with some bitwise trickery, by using bit-slicing. The trick there is that instead of multiplying the entries of ๐“ท by the entries of ๐“ฌ directly, we multiply the entries of ๐“ท by the least-significant bits of the entries of ๐“ฌ, and then seperately multiply it by all the second bits of the entries of ๐“ฌ, and so on. Multiplying every entry of ๐“ท at once by a bit of an entry of ๐“ฌ can be implemented using just a bitwise-AND operation.

Although this trick lends itself well to any vector ๐“ฌ, I will use 0,1,2,3.. and add an extra n separately (this corresponds to factoring out the +1 that appears at the end of the recursive definition), because that way part of the code can be reused directly by the solution of the partial sums of blsi (and also because it looks nicer). The masks that correspond to the chosen vector ๐“ฌ are easy to compute: each column across the masks is an entry of that vector. In this case, for 32bit integers:

c0 10101010101010101010101010101010
c1 11001100110011001100110011001100
c2 11110000111100001111000011110000
c3 11111111000000001111111100000000
c4 11111111111111110000000000000000

The whole function could look like this:

int PartialSumBLSMSK(int n)
{
    int sum = n;
    sum += (n & 0xAAAAAAAA);
    sum += (n & 0xCCCCCCCC) << 1;
    sum += (n & 0xF0F0F0F0) << 2;
    sum += (n & 0xFF00FF00) << 3;
    sum += (n & 0xFFFF0000) << 4;
    return sum;
}

PartialSumBLSI works almost the same way, with its recursive formula being b(1) = 1, b(2n) = 2b(n) + n, b(2n+1) = 2b(n) + n + 1. The +1 can be factored out as before, and the other part (n instead of 2*n) is exactly half of what it was before. Dividing ๐“ฌ by half seems like a problem, but it can be done implicitly by shifting the bit-slices of the product to the right by 1 bit. There are no problems with bits being lost that way, because the least significant bit is always zero in this case (๐“ฌ has zero as its first element).

int PartialSumBLSI(int n)
{
    int sum = n;
    sum += (n & 0xAAAAAAAA) >> 1;
    sum += (n & 0xCCCCCCCC);
    sum += (n & 0xF0F0F0F0) << 1;
    sum += (n & 0xFF00FF00) << 2;
    sum += (n & 0xFFFF0000) << 3;
    return sum;
}

Wrapping up

The particular set of constants I used is very useful and appears in more tricks, such as collecting indexes of set bits. They are the bitwise complements of a set of masks that Knuth (in The Art of Computer Programming volume 4, section 7.1.3) calls "magic masks", labeled ยตk.

This post was inspired by this question on Stack Overflow and partially based on my own answer and the answer of Eric Postpischil, without which I probably would not have come up with any of this, although I used a used a different derivation and explanation for this post.