Sunday, 10 November 2024

Histogramming bytes with positional popcount (GF2P8AFFINEQB edition)

A while ago, after some back and forth on twitter/X with @corsix, I dropped some implementation of byte histogramming without explaining anything. This post aims to rectify that lack of explanation.

pospopcnt

Imagine an array of N k-bit words as an N-by-k matrix of bits. It's easy to take a plain old popcnt across a row of the matrix. pospopcnt is the count per position, ie summing across a column.

The amount of "heavy vertical summation" (assuming that it is a heavy operation) can be reduced by putting the rows through some "vertical" carry-save adders (which works out to some cheap bitwise operations, especially cheap on AVX512 / AVX10 thanks to ternlog), this technique is also discussed in for example Efficient Computation of Positional Population Counts Using SIMD Instructions.

That reduction, though efficient, produces an intermediate result in an awkward format. If we have four 512-bit vectors as the intermediate result and are taking the pospopcnt of 64-bit QWORDs, each of the vectors now holds 8 groups of QWORDs, where the k'th bit of each QWORD in the i'th vector contributes a value of 2i to the k'th position of the pospopcnt. "Efficient Computation of Positional Population Counts Using SIMD Instructions" and the associated code show various ways to do this without using GF2P8AFFINEQB. That paper and code have a context of pospopcnt-ing 16-bit words, which makes GF2P8AFFINEQB-free approaches more viable.

With GF2P8AFFINEQB (and VPERMB), the bits in each vector can be regouped so that the 8 k'th bit of each QWORD are grouped together in the k'th byte of the vector. VPOPCNTB easily sums those bits, from here on all we need to do is shift each vector by i and add them up.

Now that we have a pospopcnt, we can use that to make a histogram of some data after transforming that data via 1 << data.

Binning

The pospopcnt as I described it in the previous paragraph only produces 64 counts, so it's suitable for making a histogram of data in the range from 0 up to 64. For bytes, the pospopcnt could be generalized to count across 256 positions, which is how one of my early prototypes of this algorithm worked. Doing it that way results in an somewhat awkward left shift (it's surprisingly not-horrible to set the k'th bit of a 256-bit vector, but still more awkward than just VPSLLVQ-ing 1 by the data), and in having to reduce/sum vectors in which only 1 out of every 256 bits is set. The 64-bit case already suffers from mostly summing zeroes, with the reduction operations mostly reducing nothing to nothing, and the 256-bit case makes that 4 times as bad.

For a while I thought that limited the applicability of the "pospopcnt histogram algorith" to 6-bit data, but then I had another idea: if the data is not already in the range 0..64, split it into bins that are. As long as that can be done quickly enough, and it turned out that it can be done quickly enough based on VPCOMPRESSB, the extra time spent on binning the data pays for itself by enabling the use of the 64-bit pospopcnt rather than its slow 256-bit generalization.

Extras

According to my own benchmarks on an Intel 11600K (Rocket Lake), this algorithm is almost twice as fast as Powturbo's hist_8_64 on both random and non-random data, reaching about 8.8GB/s vs 4.6GB/s for hist_8_64 either way, being mostly immune to patterns in the data (unlike the good old hist[data[i]]++ algorithm, which is infamous for slowing down for some kinds of patterned data).

Benchmarks have also been done on Zen5 (9700X), which as a side effect also make my 11600K look like a joke. Interestingly, from this benchmark it looks like some kinds of patterned data increase performance on Zen5, though I did not see that effect on my 11600K.

Below is a diagram summarizing the major components of the algorithm.