I was reminded (on Mastodon) of this nibble-sorting technique (it could be adapted to other element sizes), which I apparently had only vaguely tweeted about in the past. It deserves a post, so here it is.
Binary LSD radix sort can be expressed as a sequence of stable-partitions, first stable-partitioning based on the least-significant bit, then on the second-to-least-significant bit and so on.
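To make the "sequence of stable partitions" view concrete, here is a minimal sketch on a plain array, before any bit tricks are involved. The class and method names (`RadixBySplit`, `StablePartition`, `Sort`) are made up for this illustration.

```csharp
using System.Linq;

static class RadixBySplit
{
    // Stable partition on one bit: elements with the bit clear come first,
    // then elements with the bit set; each group keeps its original order.
    static int[] StablePartition(int[] a, int bit)
    {
        var zeros = a.Where(v => ((v >> bit) & 1) == 0);
        var ones = a.Where(v => ((v >> bit) & 1) == 1);
        return zeros.Concat(ones).ToArray();
    }

    // Binary LSD radix sort: one stable partition per bit, starting at the
    // least significant bit. Assumes non-negative values below 2^bits.
    public static int[] Sort(int[] a, int bits)
    {
        for (int bit = 0; bit < bits; bit++)
            a = StablePartition(a, bit);
        return a;
    }
}
```

Stability is what makes this work: the partition on a higher bit must not disturb the order that the earlier partitions established among elements that agree on that bit.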
In modern x86, pext essentially implements half of a stable partition, only the half that moves a subset of the elements down towards lower indices. If we do that twice, the second time with an inverted mask, and shift the subset of elements where the mask is set left to put it at the top, we get a gadget that partitions a u64 based on a mask:
    (pext(x, mask) << popcount(~mask)) | pext(x, ~mask)
This is sometimes called the sheep-and-goats operation.
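As a sketch of that gadget in isolation, the following wraps it in a helper. `SheepAndGoats`, `PextSoft`, `Pext` and `Partition` are names invented here, and the bit-by-bit `PextSoft` loop is only a portable reference that lets the example run on machines without BMI2; it is not how one would implement this for speed.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics.X86;

static class SheepAndGoats
{
    // Portable reference for pext: gathers the bits of x selected by mask
    // into the low bits of the result, preserving their relative order.
    public static ulong PextSoft(ulong x, ulong mask)
    {
        ulong result = 0;
        for (int k = 0; mask != 0; k++)
        {
            ulong lowest = mask & (~mask + 1);       // isolate lowest set bit of mask
            if ((x & lowest) != 0) result |= 1UL << k;
            mask &= mask - 1;                        // clear that bit
        }
        return result;
    }

    // Uses the hardware instruction when available, the reference loop otherwise.
    public static ulong Pext(ulong x, ulong mask) =>
        Bmi2.X64.IsSupported ? Bmi2.X64.ParallelBitExtract(x, mask) : PextSoft(x, mask);

    // Bits where mask is set move to the top, the rest move to the bottom,
    // and both groups keep their internal order.
    public static ulong Partition(ulong x, ulong mask) =>
        (Pext(x, mask) << BitOperations.PopCount(~mask)) | Pext(x, ~mask);
}
```

For example, `SheepAndGoats.Partition(0xABCD, 0x0F0F)` evaluates to `0xBD000000000000AC`: the selected nibbles B and D end up at the top, while A, C and the zero nibbles above them pack down to the bottom. When `mask` is zero the shift count is 64, which C# reduces to 0 for a 64-bit operand; since `pext(x, 0)` is zero anyway, the result is still `x`.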
For radix sort, the masks that we need are, in order: the least significant bit of each element, broadcast to cover the whole corresponding element, then the same thing with the second-to-least-significant bit, and so on. One way to express that is to shift the whole thing right so that the bit we want to broadcast lands in the least significant position of each element, mask off the other bits, and multiply by 15 to broadcast that bit into every bit of the element. Different compilers handled that multiplication by 15 differently (there are alternative ways to express that).
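A small sketch of the mask computation on its own, together with two of the possible alternative spellings of the broadcast. The names (`NibbleMasks`, `BroadcastMul`, `BroadcastShiftSub`, `BroadcastOr`) are invented for this example, and which form compiles to the best code is something to check per compiler rather than assume.

```csharp
static class NibbleMasks
{
    const ulong LowBits = 0x1111111111111111UL;

    // Multiply form: select bit `bit` of every nibble, then multiply by 15
    // to smear it across the nibble. No carries cross nibble boundaries
    // because each nibble of the operand is 0 or 1.
    public static ulong BroadcastMul(ulong x, int bit) =>
        ((x >> bit) & LowBits) * 15;

    // Shift-and-subtract form: 16*t - t equals 15*t modulo 2^64.
    public static ulong BroadcastShiftSub(ulong x, int bit)
    {
        ulong t = (x >> bit) & LowBits;
        return unchecked((t << 4) - t);   // wraparound is intended here
    }

    // OR-of-shifts form: copy the bit into the three positions above it.
    public static ulong BroadcastOr(ulong x, int bit)
    {
        ulong t = (x >> bit) & LowBits;
        return t | (t << 1) | (t << 2) | (t << 3);
    }
}
```

Taking `x = 0x35` (nibbles 5 and 3), bit 0 is set in both nibbles, so the mask is `0xFF`; bit 1 is set only in the 3, so the mask is `0xF0`; bit 2 is set only in the 5, so the mask is `0x0F`.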
    using System.Numerics;               // BitOperations
    using System.Runtime.Intrinsics.X86; // Bmi2

    static ulong sort_nibbles(ulong x)
    {
        ulong m = 0x1111111111111111;
        // Partition on bit 0 of each nibble: t has 0xF in every nibble whose bit 0 is set.
        ulong t = (x & m) * 15;
        x = (Bmi2.X64.ParallelBitExtract(x, t) << BitOperations.PopCount(~t)) | Bmi2.X64.ParallelBitExtract(x, ~t);
        // Partition on bit 1.
        t = ((x >> 1) & m) * 15;
        x = (Bmi2.X64.ParallelBitExtract(x, t) << BitOperations.PopCount(~t)) | Bmi2.X64.ParallelBitExtract(x, ~t);
        // Partition on bit 2.
        t = ((x >> 2) & m) * 15;
        x = (Bmi2.X64.ParallelBitExtract(x, t) << BitOperations.PopCount(~t)) | Bmi2.X64.ParallelBitExtract(x, ~t);
        // Partition on bit 3; the smallest nibble is now in the lowest position.
        t = ((x >> 3) & m) * 15;
        x = (Bmi2.X64.ParallelBitExtract(x, t) << BitOperations.PopCount(~t)) | Bmi2.X64.ParallelBitExtract(x, ~t);
        return x;
    }
It's easy to extend this to a key-value sort. Hypothetically you could use that key-value sort to invert a permutation (sorting the values 0..15 by the permutation), but you can do much better with AVX512.
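One way the key-value extension could look, as a sketch rather than a tuned implementation: compute each mask from the keys, then apply the same partition to both the key word and the value word. All names here (`NibbleKeyValueSort`, `Sort`, `InvertPermutation`) are made up for illustration, and the software `Pext` fallback is only there so the example also runs without BMI2.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics.X86;

static class NibbleKeyValueSort
{
    const ulong LowBits = 0x1111111111111111UL;

    static ulong Pext(ulong x, ulong mask)
    {
        if (Bmi2.X64.IsSupported) return Bmi2.X64.ParallelBitExtract(x, mask);
        // Portable reference loop, one selected bit per iteration.
        ulong result = 0;
        for (int k = 0; mask != 0; k++)
        {
            ulong lowest = mask & (~mask + 1);
            if ((x & lowest) != 0) result |= 1UL << k;
            mask &= mask - 1;
        }
        return result;
    }

    static ulong Partition(ulong x, ulong mask) =>
        (Pext(x, mask) << BitOperations.PopCount(~mask)) | Pext(x, ~mask);

    // Sorts the 16 key nibbles ascending (smallest in the lowest nibble)
    // and moves each value nibble along with its key.
    public static (ulong Keys, ulong Values) Sort(ulong keys, ulong values)
    {
        for (int bit = 0; bit < 4; bit++)
        {
            ulong t = ((keys >> bit) & LowBits) * 15; // mask comes from the keys only
            values = Partition(values, t);            // same move for the values...
            keys = Partition(keys, t);                // ...and for the keys
        }
        return (keys, values);
    }

    // Nibble i of perm holds p[i]; sorting the indices 0..15 by p[i]
    // leaves index i in position p[i], which is the inverse permutation.
    public static ulong InvertPermutation(ulong perm) =>
        Sort(perm, 0xFEDCBA9876543210UL).Values;
}
```

For instance, the rotation `0x0FEDCBA987654321` (nibble i holds i+1 mod 16) inverts to `0xEDCBA9876543210F` (nibble j holds j-1 mod 16). Compared with the key-only sort, this costs two extra `pext` per pass, since the mask and the popcount are shared between keys and values.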