The partial sums of popcount, aka A000788: Total number of 1's in binary expansions of 0, ..., n can be computed fairly efficiently with some mysterious code found through its OEIS entry (see the link Fast C++ function for computing a(n)): (reformatted slightly to reduce width)

unsigned A000788(unsigned n) { unsigned v = 0; for (unsigned bit = 1; bit <= n; bit <<= 1) v += ((n>>1)&~(bit-1)) + ((n&bit) ? (n&((bit<<1)-1))-(bit-1) : 0); return v; }

Knowing what we (or I, anyway) know from computing the partial sums of blsi and blsmsk, let's try to improve on that code. "Improve" is a vague goal, let's say we don't want to loop over the bits, but also not just unroll by 64x to do this for a 64-bit integer.

First let's split this thing into the sum of an easy problem and a harder problem, the easy problem being the sum of `(n>>1)&~(bit-1)` (reminder that `~(bit-1) == -bit`, unsigned negation is safe, UB-free, and does exactly what we need, even on hypothetical non-two's-complement hardware). This is the same thing we saw in the partial sum of blsi, bit `k` of `n` occurs `k` times in the sum, which we can evaluate like this:

uint64_t v = ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) + ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) + ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) + ((n & 0xFF00'FF00'FF00'FF00) << 2) + ((n & 0xFFFF'0000'FFFF'0000) << 3) + ((n & 0xFFFF'FFFF'0000'0000) << 4);

The harder problem, the contribution from `((n&bit) ? (n&((bit<<1)-1))-(bit-1) : 0)`, has a similar pattern but more annoying in three ways. Here's an example of the pattern, starting with `n` in the first row and listing the values being added together below the horizontal line:

00100011000111111100001010101111 -------------------------------- 00000000000000000000000000000001 00000000000000000000000000000010 00000000000000000000000000000100 00000000000000000000000000001000 00000000000000000000000000010000 00000000000000000000000000110000 00000000000000000000000010110000 00000000000000000000001010110000 00000000000000000100001010110000 00000000000000001100001010110000 00000000000000011100001010110000 00000000000000111100001010110000 00000000000001111100001010110000 00000000000011111100001010110000 00000000000111111100001010110000 00000001000111111100001010110000 00000011000111111100001010110000

- Some anomalous thing happens for the contiguous group of rightmost set bits.
- The weights are based not on the column index, but sort of dynamic based on the number of set bits ...
- ... to the
*left*of the bit we're looking at. That's significant, "to the right" would have been a lot nicer to deal with.

For problem 1, I'm just going to state without proof that we can add 1 to `n` and ignore the problem, as long as we add `n & ~(n + 1)` to the final sum. Problems 2 and 3 are more interesting. If we had problem 2 but counting the bits to the *right* of the bit we're looking at, that would have nice and easy, instead of `(n & 0xAAAA'AAAA'AAAA'AAAA)` we would have `_pdep_u64(0xAAAA'AAAA'AAAA'AAAA, n)`, problem solved. If we had a "pdep but from left to right" (aka `expand_left`) named `_pdepl_u64` we could have done this:

uint64_t u = ((_pdepl_u64(0x5555'5555'5555'5555, m) >> shift) << 0) + ((_pdepl_u64(0x3333'3333'3333'3333, m) >> shift) << 1) + ((_pdepl_u64(0x0F0F'0F0F'0F0F'0F0F, m) >> shift) << 2) + ((_pdepl_u64(0x00FF'00FF'00FF'00FF, m) >> shift) << 3) + ((_pdepl_u64(0x0000'FFFF'0000'FFFF, m) >> shift) << 4) + ((_pdepl_u64(0x0000'0000'FFFF'FFFF, m) >> shift) << 5);

But as far as I know, ~~that requires bit-reversing the inputs~~ (see the update below) of a normal `_pdep_u64` and bit-reversing the result, which is not so nice at least on current x64 hardware. Every ISA should have a Generalized Reverse operation like the `grevi` instruction which used to be in the drafts of the RISC-V Bitmanip Extension prior to version 1.

### Update:

It turned out there *is* a reasonable way to implement `_pdepl_u64(v, m)` in plain scalar code after all, namely as `_pdep_u64(v >> (std::popcount(~m) & 63), m)`. The `& 63` isn't meaningful, it's just to prevent UB at the C++ level.

This approach turned out to be more efficient than the AVX512 approach, so that's obsolete now, but maybe still interesting to borrow ideas from. Here's the scalar implementation in full:

uint64_t _pdepl_u64(uint64_t v, uint64_t m) { return _pdep_u64(v >> (std::popcount(~m) & 63), m); } uint64_t partialSumOfPopcnt(uint64_t n) { uint64_t v = ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) + ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) + ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) + ((n & 0xFF00'FF00'FF00'FF00) << 2) + ((n & 0xFFFF'0000'FFFF'0000) << 3) + ((n & 0xFFFF'FFFF'0000'0000) << 4); uint64_t m = n + 1; int shift = std::countl_zero(m); m = m << shift; uint64_t u = ((_pdepl_u64(0x5555'5555'5555'5555, m) >> shift) << 0) + ((_pdepl_u64(0x3333'3333'3333'3333, m) >> shift) << 1) + ((_pdepl_u64(0x0F0F'0F0F'0F0F'0F0F, m) >> shift) << 2) + ((_pdepl_u64(0x00FF'00FF'00FF'00FF, m) >> shift) << 3) + ((_pdepl_u64(0x0000'FFFF'0000'FFFF, m) >> shift) << 4) + ((_pdepl_u64(0x0000'0000'FFFF'FFFF, m) >> shift) << 5); return u + (n & ~(n + 1)) + v; }

Repeatedly calling `_pdepl_u64` with the same mask creates some common-subexpressions, they could be manually factored out but compilers do that anyway, even MSVC only uses one actual `popcnt` instruction (but MSVC, annoyingly, actually performs the meaningless `& 63`).

## Enter AVX512

Using AVX512, we could more easily reverse the bits of a 64-bit integer, there are various ways to do that. But just using that and then going back to scalar `pdep` would be a waste of a good opportunity to implement the whole thing in AVX512, `pdep` and all. The trick to doing a `pdep` in AVX512, if you have several 64-bit integers that you want to `pdep` with the same mask, is to transpose 8x 64-bit integers into 64x 8-bit integers, use `vpexpandb`, then transpose back. In this case the first operand of the `pdep` is a constant, so the first transpose is not necessary. We still have to reverse the mask though. Since `vpexpandb` takes the mask input in a mask register and we only have one thing to reverse, this trick to bit-permute integers seems like a better fit than Wunk's whole-vector bit-reversal or some variant thereof.

I sort of glossed over the fact that we're supposed to be bit-reversing relative to the most significant set bit in the mask, but that's easy to do by shifting left by `std::countl_zero(m)` and then doing a normal bit-reverse, so in the end it still comes down to a normal bit-reverse. The result of the `pdep`s have to be shifted right by the same amount to compensate.

Here's the whole thing: (note that this is less efficient than the updated approach without AVX512)

uint64_t partialSumOfPopcnt(uint64_t n) { uint64_t v = ((n & 0xAAAA'AAAA'AAAA'AAAA) >> 1) + ((n & 0xCCCC'CCCC'CCCC'CCCC) << 0) + ((n & 0xF0F0'F0F0'F0F0'F0F0) << 1) + ((n & 0xFF00'FF00'FF00'FF00) << 2) + ((n & 0xFFFF'0000'FFFF'0000) << 3) + ((n & 0xFFFF'FFFF'0000'0000) << 4); // 0..63 __m512i weights = _mm512_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63); // 63..0 __m512i rev = _mm512_set_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63); uint64_t m = n + 1; // bit-reverse the mask to implement expand-left int shift = std::countl_zero(m); m = (m << shift); __mmask64 revm = _mm512_bitshuffle_epi64_mask(_mm512_set1_epi64(m), rev); // the reversal of expand-right with reversed inputs is expand-left __m512i leftexpanded = _mm512_permutexvar_epi8(rev, _mm512_mask_expand_epi8(_mm512_setzero_si512(), revm, weights)); // transpose back to 8x 64-bit integers leftexpanded = Transpose64x8(leftexpanded); // compensate for having shifted m left __m512i masks = _mm512_srlv_epi64(leftexpanded, _mm512_set1_epi64(shift)); // scale and sum results __m512i parts = _mm512_sllv_epi64(masks, _mm512_set_epi64(7, 6, 5, 4, 3, 2, 1, 0)); __m256i parts2 = _mm256_add_epi64( _mm512_castsi512_si256(parts), _mm512_extracti64x4_epi64(parts, 1)); __m128i parts3 = _mm_add_epi64( _mm256_castsi256_si128(parts2), _mm256_extracti64x2_epi64(parts2, 1)); uint64_t u = _mm_cvtsi128_si64(parts3) + _mm_extract_epi64(parts3, 1); return u + (n & ~(n + 1)) + v; }

As for `Transpose64x8`, you can find out how to implement that in Permuting bits with GF2P8AFFINEQB.