Bits, Math and Performance(?): Column-oriented row reduction

In mathematics, Gaussian elimination, also known as row reduction, is an algorithm for solving systems of linear equations.
-- The Wikipedia entry of Gaussian elimination

The usual implementation of row reduction of a bit-matrix^[1] takes an array of bit-packed rows, and applies row operations to the matrix in a row-wise manner. As a reminder, it works something like this:

Find a row with a non-zero element^[2] in the current column.
Swap it into position.
XOR that row into any other row that has a non-zero element in the current column.
Go to step 1 for the next column, if there is one.

For the application that I had mind when writing this blog post^[3], step 2 is not necessary and will be skipped.^[4]

Note that at the single-bit-level, the XOR-logic of step 3 comes down to if M(i, j) and M(j, k) then flip M(i, k). That is, in order for the XOR to flip M(i, k), the corresponding row must have a 1 in the current column (condition A), and the row that was selected to pivot on must have a 1 in the column k (condition B).

Turning it into a column-oriented algorithm

In the row-oriented algorithm, condition A is handled by conditionally XOR-ing into only those rows that have a 1 in the current column, and condition B is handled by the XOR itself (which is a conditional bit-flip afterall). A column-oriented algorithm would do it the other way around, using XOR to implement condition B, and skipping columns for which condition A is false:

Find a row that hasn't already been pivoted on, with a non-zero element in the current column.
Don't swap it into position.
XOR each column that has a 1 in the row that we're pivoting on, with a mask formed by taking the current column and resetting the bit that corresponds to the pivot row.
If there are any rows left that haven't been pivoted on, go to step 1 for the next column.

Step 1 may sound somewhat annoying, but it is really simple: AND the current column with a mask of the rows that have already been pivoted on, had, and use the old Isolate Lowest Set Bit^[5] trick x & -x to extract a mask corresponding to the first row that passes both conditions.

Here is an example of what an implementation might look like (shown in C#, or maybe this is Java code with a couple of capitalization errors)

static void reduceRows(long[] M)
{
    long had = 0;
    for (int i = 0; i < M.Length; i++)
    {
        long r = lowestSetBit(M[i] & ~had);
        if (r == 0)
            continue;
        long mask = M[i] & ~r;
        for (int j = 0; j < M.Length; j++)
        {
            if ((M[j] & r) != 0)
                M[j] ^= mask;
        }
        had |= r;
    }
}

static long lowestSetBit(long x)
{
    return x & -x;
}

The stuff that goes at the end of the post

While the column-oriented approach makes the step "search for a row that has a 1 in the current column" easy, transposing a matrix just to be able to do this does not seem worthwhile to me.

Despite the if inside the inner loops of both the row-oriented and the column-oriented algorithm, both of them can be accellerated using SIMD. The conditional code can easily be rewritten into arithmetic-based masking^[6] or "real" masked operations (in eg AVX512). "But don't compilers autovectorize these days?" Yes they do, but not always in a good way, note that both GCC and Clang both used a masked store, which is worse than writing back some unchanged values with a normal store (especially on AMD processors, and also especially considering that the stored values will be loaded again soon). Rumour has it that there are Reasons™ for that quirk, but that doesn't improve the outcome.

[1] Of course this blog post is not about floating point matrices, this is the bitmath blog.

[2] Sometimes known as 1 (one).

[3] XOR-clause simplification

[4] Row-swaps are particularly expensive to apply to a column-major bit-matrix, so perhaps this is just an excuse to justify the column-based approach. You decide.

[5] AKA BLSI, which AMD describes as "Isolate Lowest Set Bit", which makes sense, because well, that is what the operation does. Meanwhile Intel writes "Extract Lowest Set Isolated Bit" (isolated bit? what?) in their manuals.

[6] Arithmetic-based masking can also be used in scalar code.

Sunday, 9 April 2023

Column-oriented row reduction

Turning it into a column-oriented algorithm

The stuff that goes at the end of the post