Bits, Math and Performance(?): Weighted popcnt

Since I showed the code below on Twitter, and some people understood it and some didn't, I suppose I should explain how it works and what the more general technique is.

// sum of indexes of set bits
int A073642(uint64_t n)
{
    return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
          (__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);
}

A sum of several multi-digit numbers, such as 12 + 34, can be rewritten to make the place-value notation explicit, giving 1*10 + 2*1 + 3*10 + 4*1. Then the distributive property of multiplication can be used to group digits of equal place-value together, giving (1 + 3)*10 + (2 + 4)*1. I don't bring up something so basic as an insult to the reader, the basis of that possibly-intimidating piece of code really is this basic.

The masks 0xAAAAAAAAAAAAAAAA, 0xCCCCCCCCCCCCCCCC, etc, represent nothing more than the numbers 0..63, each written vertically in binary, starting with the least significant bit. Each column of bits contains its own index. AND-ing the masks with n leaves, in each column, either zero (if the corresponding bit of n was zero) or the index of that column.

The first popcnt then adds together all the digits of the "filtered" indexes with a place-value of 1, the second popcnt adds together the digits with a place-value of 2, and so on. The result of each popcnt is shifted to match the corresponding place-value, and all of them are added together.

Generalizing the technique

The numbers in the columns did not have to be 0..63, the columns can contain arbitrary numbers, even negative numbers. This gives some sort of weighted popcnt: rather than each bit having a weight of 1, we are free to choose an arbitrary weight for each bit.

In general, starting with some arbitrary numbers, interpret them as a binary matrix. Transpose that matrix. Every row of the transposed matrix is the mask that goes into the corresponding popcnt-step.

Some simplifications may be possible:

If a row of the matrix is zero, the corresponding step can be left out.
If two or more rows are equal, their popcnt-steps can be merged into one with a weight that is the sum of the weights of the steps that are merged.
If a row is a linear combination of some set of other rows, the popcnt corresponding to that row can be computed as a linear combination of the popcnts corresponding to those other rows.
If a row has exactly one bit set, its step can be implemented with an AND and a shift, without popcnt, by shifting the bit to its target position.
If every row has exactly one bit set, we're really dealing with a bit-permutation and there are better ways to implement them.

For example, let's construct "the sum of squares of indexes of set bits" (with 1-indexed indexes) aka oeis.org/A003995. Transposing the squares 1,2,4,9..4096 gives these masks:

0x5555555555555555
0x0000000000000000
0x2222222222222222
0x1414141414141414
0x0d580d580d580d58
0x0335566003355660
0x00f332d555a66780
0x555a5b6666387800
0x66639c78783f8000
0x787c1f807fc00000
0x7f801fff80000000
0x7fffe00000000000
0x8000000000000000

The second mask is zero because a square cannot be congruent to 2 or 3 modulo 4. The last mask has exactly one bit set, so its step can be implemented without popcnt. Putting it together:

int A003995(uint64_t n)
{
    return __popcnt64(n & 0x5555555555555555) +
          (__popcnt64(n & 0x2222222222222222) << 2) +
          (__popcnt64(n & 0x1414141414141414) << 3) +
          (__popcnt64(n & 0x0d580d580d580d58) << 4) +
          (__popcnt64(n & 0x0335566003355660) << 5) +
          (__popcnt64(n & 0x00f332d555a66780) << 6) +
          (__popcnt64(n & 0x555a5b6666387800) << 7) +
          (__popcnt64(n & 0x66639c78783f8000) << 8) +
          (__popcnt64(n & 0x787c1f807fc00000) << 9) +
          (__popcnt64(n & 0x7f801fff80000000) << 10) +
          (__popcnt64(n & 0x7fffe00000000000) << 11) +
          ((n & 0x8000000000000000) >> 51);
}

Aside from implementing strange sequences from the OEIS, possible uses of this technique may include

Summing the face-values of a hand of cards. Unfortunately this cannot take "special combinations" into account, unlike other techniques.
A basic evaluation of a board state in Reversi/Othello, based on weights for each captured square that differ by their position.
Determining the price for a pizza based on a set of selected toppings, I don't know, I'm grasping at straws here.

Addendum

What if the input is small, and the weights are also small? Here is a different trick that is applicable in some of those cases, depending on whether the temporary result fits in 64 bits. The trick this time is much simpler, or at least sounds much simpler: for bit i of weight w[i], make w[i] copies of bit i, then popcnt everything.

A common trick to make k copies of only one bit is (bit << k) - bit. Assuming that pdep exists and is efficient, that trick can be generalized to making different numbers of copies of different bits. The simplest version of that trick would sacrifice one bit of padding per input bit, which may be acceptable depending on whether that all still fits in 64 bits. For example A073642 with a 10-bit input would work:

// requires: n is a 10-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n, 0x0040100808104225);
    uint64_t bot = _pdep_u64(n, 0x000020101020844b);
    return __popcnt64(top - bot);
}

That can be extended to an 11-bit input like this:

// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n >> 1, 0x0040100808104225);
    uint64_t bot = _pdep_u64(n >> 1, 0x000020101020844b);
    return __popcnt64(top - bot | top);
}

Or like this

// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
    uint64_t top = _pdep_u64(n, 0x0040100808104225) >> 1;
    uint64_t bot = _pdep_u64(n, 0x008020101020844b) >> 1;
    return __popcnt64(top - bot);
}

pdep cannot move bits to the right of their original position, in some cases (if there is no space for padding bits) you may need to hack around that, as in the example above. Rotates and right-shift are good candidates to do that with, and in general you may gather the bits with non-zero weights with pext.

This approach relies strongly on being able to efficiently implement the step "make w[i] copies of bit i", it is probably not possible to do that efficiently using only the plain old standard integer operations. Also, pdep is not efficient on all processors that support it, unfortunately making the CPUID feature flag for BMI2 useless for deciding which implementation to use.

Tuesday 3 January 2023

Weighted popcnt

Generalizing the technique

Addendum