Bits, Math and Performance(?): April 2023

Inverting a 16-element permutation may done like this:

for (int i = 0; i < 16; i++)
    inv[perm[i]] = i;

Computing a histogram of 16 nibbles may done like this:

for (int i = 0; i < 16; i++)
    hist[data[i]] += 1;

These different-sounding but already similar-looking tasks have something in common: they can be both be built around a 16x16 bitmatrix transpose. That sounds silly, why would anyone want to first construct a 16x16 bitmatrix, transpose it, and then do yet more processing to turn the resulting bitmatrix back into an array of numbers?

Because it turns out to be an efficiently-implementable operation, on some modern processors anyway.

If you know anything about the off-label application of GF2P8AFFINEQB, you may already suspect that it will be involved somehow (merely left-GF2P8AFFINEQB-ing by the identity matrix already results in some sort of 8x8 transpose, just horizontally mirrored), and it will be, but that's not the whole story.

First I will show not only how to do it with GF2P8AFFINEQB, but also how to find that solution programmatically using a SAT solver. There is nothing that fundamentally prevents a human from finding a solution by hand, but it seems difficult. Using a SAT solver to find a solution ex nihilo (requiring it to find both a sequence of instructions and their operands) is not that easy either (though that technique also exists). Thankfully, Geoff Langdale suggested a promising sequence of instructions:

nibble value, so now we have 16 bitfields with 1 bit set.

At this point, what we really want is a 16x16 transpose, not 4 8x8 transposes, but I'm pretty sure we can fake it by using VPERMB to redistribute our bytes (probably first grouping all top 8 bytes into the first 128b ...
— Geoff Langdale (@geofflangdale) January 7, 2023

The problem we have now (which the SAT solver will solve) is, under the constraint that for all X, f(X) = PERMB(GF2P8AFFINE(B, PERMB(X, A)), C) computes the transpose of X, what is a possible valuation of the variables A, B, C. Note that the variables in the SAT problem correspond to constants in the resulting code, and the variable in the resulting code (X) is quantified out of the problem.

If you know a bit about SAT solving, that "for all X" sounds like trouble, requiring either creating a set of constraints for every possible value of X (henceforth, concrete values of X will be known as "examples"), or some advanced technique such as CEGIS to dynamically discover a smaller set of examples to base the constraints on. Luckily, since we are dealing with a bit-permutation, there are simple and small sets of examples that together sufficiently constrain the problem. For a 16-bit permutation, this set of values could be used:

1010101010101010
1100110011001100
1111000011110000
1111111100000000

For a 256-bit permutation, a similar pattern can be used, where each of the examples has 256 bits and there would be 8 of them. Note that if you read the columns of the values, they list out the indices of the corresponding columns, which is no coincidence. Using that set of examples to constrain the problem with, essentially means that we assert that f when applied to the sequence 0..n-1 must result in the desired permutation. The way that I actually implemented this puts a column into one "abstract bit", so that it represents the index of the bit all in one place instead of spread out.

Implementing a "left GF2P8AFFINEQB" (multiplying a constant matrix on the left by a variable matrix on the right) in CNF, operating on "abstract bits" (8 variables each), is relatively straight forward. Every (abstract) bit of the result is the XOR of the AND of some (abstract) bits, writing that down is mostly a chore, but there is one interesting aspect: the XOR can be turned into an OR, since we know that we're multiplying by a permutation matrix. In CNF, OR is simpler than XOR, and easier for the solver to reason through.

VPERMB is more difficult to implement, given that the permutation operand is a variable (if it was a constant, we could just permute the abstract bits without generating any new constraints). To make it easier, I represent the permutation operand as a 32x32 permutation matrix, letting me create a bunch of simple ternary constraints of the form (¬P(i, j) ∨ ¬A(j) ∨ R(i)) ∧ (¬P(i, j) ∨ A(j) ∨ ¬R(i)) (read: if P(i, j), then A(j) must be equal to R(i)). The same thing can be used to implement VPSHUFB, with additional constraints on the permutation matrix (to prevent cross-slice movement).

Running that code, at least on my PC at this time^[1], results in (with some whitespace manually added):

__m256i t0 = _mm256_permutexvar_epi8(_mm256_setr_epi8(
    14, 12, 10, 8, 6, 4, 2, 0,
    30, 28, 26, 24, 22, 20, 18, 16,
    15, 13, 11, 9, 7, 5, 3, 1,
    31, 29, 27, 25, 23, 21, 19, 17), input);
__m256i t1 = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x1080084004200201), t0, 0);
__m256i t2 = _mm256_shuffle_epi8(t1, _mm256_setr_epi8(
    0, 8, 1, 9, 3, 11, 5, 13,
    7, 15, 2, 10, 4, 12, 6, 14,
    0, 8, 1, 9, 3, 11, 5, 13,
    7, 15, 2, 10, 4, 12, 6, 14));

So that's it. That's the answer^[2]. If you want to transpose a 16x16 bitmatrix, on a modern PC (this code requires AVX512_VBMI and AVX512_GFNI^[3]), it's fairly easy and cheap, it's just not so easy to find this solution to begin with.

Using this transpose to invert a 16-element permutation is pretty easy, for example using _mm256_sllv_epi16 to construct the matrix and _mm256_popcnt_epi16(_mm256_sub_epi16(t2, _mm256_set1_epi16(1))) (sadly there is no SIMD version of TZCNT .. yet) to convert the bit-masks back into indices. It may be tempting to try to use a mirrored matrix and leading-zero count, which AVX512 does offer, but it only offers the DWORD and QWORD versions VPLZCNTD/Q.

Making a histogram is even simpler, using only _mm256_popcnt_epi16(t2) to convert the matrix into counts.

[1] While MiniSAT (which this program uses as its SAT solver) is a "deterministic solver" in the sense of definitely finding a satifying valuation if there is one, it is not deterministic in the sense of guaranteeing that the same satisfying valuation is found every time the solver is run on the same input.

[2] Not the unique answer, there are multiple solutions.

[3] But not 512-bit vectors.

In mathematics, Gaussian elimination, also known as row reduction, is an algorithm for solving systems of linear equations.
-- The Wikipedia entry of Gaussian elimination

The usual implementation of row reduction of a bit-matrix^[1] takes an array of bit-packed rows, and applies row operations to the matrix in a row-wise manner. As a reminder, it works something like this:

Find a row with a non-zero element^[2] in the current column.
Swap it into position.
XOR that row into any other row that has a non-zero element in the current column.
Go to step 1 for the next column, if there is one.

For the application that I had mind when writing this blog post^[3], step 2 is not necessary and will be skipped.^[4]

Note that at the single-bit-level, the XOR-logic of step 3 comes down to if M(i, j) and M(j, k) then flip M(i, k). That is, in order for the XOR to flip M(i, k), the corresponding row must have a 1 in the current column (condition A), and the row that was selected to pivot on must have a 1 in the column k (condition B).

Turning it into a column-oriented algorithm

In the row-oriented algorithm, condition A is handled by conditionally XOR-ing into only those rows that have a 1 in the current column, and condition B is handled by the XOR itself (which is a conditional bit-flip afterall). A column-oriented algorithm would do it the other way around, using XOR to implement condition B, and skipping columns for which condition A is false:

Find a row that hasn't already been pivoted on, with a non-zero element in the current column.
Don't swap it into position.
XOR each column that has a 1 in the row that we're pivoting on, with a mask formed by taking the current column and resetting the bit that corresponds to the pivot row.
If there are any rows left that haven't been pivoted on, go to step 1 for the next column.

Step 1 may sound somewhat annoying, but it is really simple: AND the current column with a mask of the rows that have already been pivoted on, had, and use the old Isolate Lowest Set Bit^[5] trick x & -x to extract a mask corresponding to the first row that passes both conditions.

Here is an example of what an implementation might look like (shown in C#, or maybe this is Java code with a couple of capitalization errors)

static void reduceRows(long[] M)
{
    long had = 0;
    for (int i = 0; i < M.Length; i++)
    {
        long r = lowestSetBit(M[i] & ~had);
        if (r == 0)
            continue;
        long mask = M[i] & ~r;
        for (int j = 0; j < M.Length; j++)
        {
            if ((M[j] & r) != 0)
                M[j] ^= mask;
        }
        had |= r;
    }
}

static long lowestSetBit(long x)
{
    return x & -x;
}

The stuff that goes at the end of the post

While the column-oriented approach makes the step "search for a row that has a 1 in the current column" easy, transposing a matrix just to be able to do this does not seem worthwhile to me.

Despite the if inside the inner loops of both the row-oriented and the column-oriented algorithm, both of them can be accellerated using SIMD. The conditional code can easily be rewritten into arithmetic-based masking^[6] or "real" masked operations (in eg AVX512). "But don't compilers autovectorize these days?" Yes they do, but not always in a good way, note that both GCC and Clang both used a masked store, which is worse than writing back some unchanged values with a normal store (especially on AMD processors, and also especially considering that the stored values will be loaded again soon). Rumour has it that there are Reasons™ for that quirk, but that doesn't improve the outcome.

[1] Of course this blog post is not about floating point matrices, this is the bitmath blog.

[2] Sometimes known as 1 (one).

[3] XOR-clause simplification

[4] Row-swaps are particularly expensive to apply to a column-major bit-matrix, so perhaps this is just an excuse to justify the column-based approach. You decide.

[5] AKA BLSI, which AMD describes as "Isolate Lowest Set Bit", which makes sense, because well, that is what the operation does. Meanwhile Intel writes "Extract Lowest Set Isolated Bit" (isolated bit? what?) in their manuals.

[6] Arithmetic-based masking can also be used in scalar code.

Bits, Math and Performance(?)

Wednesday, 12 April 2023

Transposing a 16x16 bitmatrix

Sunday, 9 April 2023

Column-oriented row reduction

Turning it into a column-oriented algorithm

The stuff that goes at the end of the post