Bits, Math and Performance(?): January 2024

The partial sums of popcount, aka A000788: Total number of 1's in binary expansions of 0, ..., n can be computed fairly efficiently with some mysterious code found through its OEIS entry (see the link Fast C++ function for computing a(n)): (reformatted slightly to reduce width)

unsigned A000788(unsigned n)
{
    unsigned v = 0;
    for (unsigned bit = 1; bit <= n; bit <<= 1)
        v += ((n>>1)&~(bit-1)) +
             ((n&bit) ? (n&((bit<<1)-1))-(bit-1) : 0);
    return v;
}

Knowing what we (or I, anyway) know from computing the partial sums of blsi and blsmsk, let's try to improve on that code. "Improve" is a vague goal, let's say we don't want to loop over the bits, but also not just unroll by 64x to do this for a 64-bit integer.

First let's split this thing into the sum of an easy problem and a harder problem, the easy problem being the sum of (n>>1)&~(bit-1) (reminder that ~(bit-1) == -bit, unsigned negation is safe, UB-free, and does exactly what we need, even on hypothetical non-two's-complement hardware). This is the same thing we saw in the partial sum of blsi, bit k of n occurs k times in the sum, which we can evaluate like this:

uint64_t v = 
    ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) +
    ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) +
    ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) +
    ((n & 0xFF00'FF00'FF00'FF00) << 2) +
    ((n & 0xFFFF'0000'FFFF'0000) << 3) +
    ((n & 0xFFFF'FFFF'0000'0000) << 4);

The harder problem, the contribution from ((n&bit) ? (n&((bit<<1)-1))-(bit-1) : 0), has a similar pattern but more annoying in three ways. Here's an example of the pattern, starting with n in the first row and listing the values being added together below the horizontal line:

00100011000111111100001010101111
--------------------------------
00000000000000000000000000000001
00000000000000000000000000000010
00000000000000000000000000000100
00000000000000000000000000001000
00000000000000000000000000010000
00000000000000000000000000110000
00000000000000000000000010110000
00000000000000000000001010110000
00000000000000000100001010110000
00000000000000001100001010110000
00000000000000011100001010110000
00000000000000111100001010110000
00000000000001111100001010110000
00000000000011111100001010110000
00000000000111111100001010110000
00000001000111111100001010110000
00000011000111111100001010110000

Some anomalous thing happens for the contiguous group of rightmost set bits.
The weights are based not on the column index, but sort of dynamic based on the number of set bits ...
... to the left of the bit we're looking at. That's significant, "to the right" would have been a lot nicer to deal with.

For problem 1, I'm just going to state without proof that we can add 1 to n and ignore the problem, as long as we add n & ~(n + 1) to the final sum. Problems 2 and 3 are more interesting. If we had problem 2 but counting the bits to the right of the bit we're looking at, that would have nice and easy, instead of (n & 0xAAAA'AAAA'AAAA'AAAA) we would have _pdep_u64(0xAAAA'AAAA'AAAA'AAAA, n), problem solved. If we had a "pdep but from left to right" (aka expand_left) named _pdepl_u64 we could have done this:

uint64_t u = 
    ((_pdepl_u64(0x5555'5555'5555'5555, m) >> shift) << 0) +
    ((_pdepl_u64(0x3333'3333'3333'3333, m) >> shift) << 1) +
    ((_pdepl_u64(0x0F0F'0F0F'0F0F'0F0F, m) >> shift) << 2) +
    ((_pdepl_u64(0x00FF'00FF'00FF'00FF, m) >> shift) << 3) +
    ((_pdepl_u64(0x0000'FFFF'0000'FFFF, m) >> shift) << 4) +
    ((_pdepl_u64(0x0000'0000'FFFF'FFFF, m) >> shift) << 5);

But as far as I know, ~~that requires bit-reversing the inputs~~ (see the update below) of a normal _pdep_u64 and bit-reversing the result, which is not so nice at least on current x64 hardware. Every ISA should have a Generalized Reverse operation like the grevi instruction which used to be in the drafts of the RISC-V Bitmanip Extension prior to version 1.

Update:

It turned out there is a reasonable way to implement _pdepl_u64(v, m) in plain scalar code after all, namely as _pdep_u64(v >> (std::popcount(~m) & 63), m). The & 63 isn't meaningful, it's just to prevent UB at the C++ level.

This approach turned out to be more efficient than the AVX512 approach, so that's obsolete now, but maybe still interesting to borrow ideas from. Here's the scalar implementation in full:

uint64_t _pdepl_u64(uint64_t v, uint64_t m)
{
    return _pdep_u64(v >> (std::popcount(~m) & 63), m);
}

uint64_t partialSumOfPopcnt(uint64_t n)
{
    uint64_t v =
        ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) +
        ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) +
        ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) +
        ((n & 0xFF00'FF00'FF00'FF00) << 2) +
        ((n & 0xFFFF'0000'FFFF'0000) << 3) +
        ((n & 0xFFFF'FFFF'0000'0000) << 4);
    uint64_t m = n + 1;
    int shift = std::countl_zero(m);
    m = m << shift;
    uint64_t u =
        ((_pdepl_u64(0x5555'5555'5555'5555, m) >> shift) << 0) +
        ((_pdepl_u64(0x3333'3333'3333'3333, m) >> shift) << 1) +
        ((_pdepl_u64(0x0F0F'0F0F'0F0F'0F0F, m) >> shift) << 2) +
        ((_pdepl_u64(0x00FF'00FF'00FF'00FF, m) >> shift) << 3) +
        ((_pdepl_u64(0x0000'FFFF'0000'FFFF, m) >> shift) << 4) +
        ((_pdepl_u64(0x0000'0000'FFFF'FFFF, m) >> shift) << 5);
    return u + (n & ~(n + 1)) + v;
}

Repeatedly calling _pdepl_u64 with the same mask creates some common-subexpressions, they could be manually factored out but compilers do that anyway, even MSVC only uses one actual popcnt instruction (but MSVC, annoyingly, actually performs the meaningless & 63).

Enter AVX512

Using AVX512, we could more easily reverse the bits of a 64-bit integer, there are various ways to do that. But just using that and then going back to scalar pdep would be a waste of a good opportunity to implement the whole thing in AVX512, pdep and all. The trick to doing a pdep in AVX512, if you have several 64-bit integers that you want to pdep with the same mask, is to transpose 8x 64-bit integers into 64x 8-bit integers, use vpexpandb, then transpose back. In this case the first operand of the pdep is a constant, so the first transpose is not necessary. We still have to reverse the mask though. Since vpexpandb takes the mask input in a mask register and we only have one thing to reverse, this trick to bit-permute integers seems like a better fit than Wunk's whole-vector bit-reversal or some variant thereof.

I sort of glossed over the fact that we're supposed to be bit-reversing relative to the most significant set bit in the mask, but that's easy to do by shifting left by std::countl_zero(m) and then doing a normal bit-reverse, so in the end it still comes down to a normal bit-reverse. The result of the pdeps have to be shifted right by the same amount to compensate.

Here's the whole thing: (note that this is less efficient than the updated approach without AVX512)

uint64_t partialSumOfPopcnt(uint64_t n)
{    
    uint64_t v = 
        ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) +
        ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) +
        ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) +
        ((n & 0xFF00'FF00'FF00'FF00) << 2) +
        ((n & 0xFFFF'0000'FFFF'0000) << 3) +
        ((n & 0xFFFF'FFFF'0000'0000) << 4);
    // 0..63
    __m512i weights = _mm512_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23,
        24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39,
        40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63);
    // 63..0
    __m512i rev = _mm512_set_epi8(
        0, 1, 2, 3, 4, 5, 6, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23,
        24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39,
        40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63);
    uint64_t m = n + 1;
    // bit-reverse the mask to implement expand-left
    int shift = std::countl_zero(m);
    m = (m << shift);
    __mmask64 revm = _mm512_bitshuffle_epi64_mask(_mm512_set1_epi64(m), rev);
    // the reversal of expand-right with reversed inputs is expand-left
    __m512i leftexpanded = _mm512_permutexvar_epi8(rev, 
        _mm512_mask_expand_epi8(_mm512_setzero_si512(), revm, weights));
    // transpose back to 8x 64-bit integers
    leftexpanded = Transpose64x8(leftexpanded);
    // compensate for having shifted m left
    __m512i masks = _mm512_srlv_epi64(leftexpanded, _mm512_set1_epi64(shift));
    // scale and sum results
    __m512i parts = _mm512_sllv_epi64(masks,
        _mm512_set_epi64(7, 6, 5, 4, 3, 2, 1, 0));
    __m256i parts2 = _mm256_add_epi64(
        _mm512_castsi512_si256(parts), 
        _mm512_extracti64x4_epi64(parts, 1));
    __m128i parts3 = _mm_add_epi64(
        _mm256_castsi256_si128(parts2), 
        _mm256_extracti64x2_epi64(parts2, 1));
    uint64_t u = 
        _mm_cvtsi128_si64(parts3) + 
        _mm_extract_epi64(parts3, 1);
    
    return u + (n & ~(n + 1)) + v;
}

As for Transpose64x8, you can find out how to implement that in Permuting bits with GF2P8AFFINEQB.

Bits, Math and Performance(?)

Wednesday, 17 January 2024

Partial sums of popcount

Update:

Enter AVX512