Monday, 22 December 2025

Computing 2^64 mod N, where N is a u64

Before we go any further, the answer is (usually) -N % N.

"Usually", because for example MSVC with "SDL checks" enabled is offended by the negation of an unsigned integer and you may need to subtract from zero (this may also apply to your programming language of choice), or you may know something about N that enables a better solution, or something else that I don't care to list.

In case such a simple expression needs an explanation, what's happening here is first we unconditionally subtract N from 2^64 (that is what -N evaluates to on a 64-bit unsigned integer, and it is allowed because N < 2^64 by definition: N is a 64-bit unsigned integer), then reduce that modulo N. That's it. There's nothing complex going on.
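In code, that looks like this (a minimal sketch; the function name is just for illustration):

uint64_t pow2_64_mod(uint64_t N)
{
    // N must be non-zero; -N on a 64-bit unsigned integer wraps to 2^64 - N,
    // and (2^64 - N) mod N == 2^64 mod N
    return -N % N; // or (0 - N) % N if your compiler objects to negating an unsigned integer
}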

Despite this lack of complexity, I thought this trick was worth dedicating a blog post to anyway, because more complex suggestions are regularly made. A classic "internet commenter's favourite" is implementing modular exponentiation with the exponentiation by squaring algorithm, and then just calling modpow(2, 64, N). It is indeed possible to compute 2^64 mod N this way, but not without some trickery, and most of the computation will turn out to be useless.

The usual, not "problem" exactly but let's call it a "difficulty", with modular exponentiation is that you generally don't have a nice built-in modular multiplication to base it on; you need to implement that too, and it's non-trivial: the product, before reducing it modulo N, can be twice as wide as whatever integer type you were trying to work with. There are various ways to deal with that, of which I will describe none; this is not a post about modular multiplication.

Even with that difficulty surmounted: only the last iteration of the exponentiation by squaring algorithm would be useful. Iterations before the last one either calculate pure powers of two (while the intermediate result stays under N), or already incorporate some reduction but do so redundantly: the last iteration may as well do all the reduction. That's not to say that intermediate reductions are useless in modular exponentiation in general, far from it, but in this case they are useless. Given that you have implemented modular multiplication, you may as well directly take the modular square of 2^32 mod N (some implementations of modular multiplication may not allow that if N is small; that can be worked around too, but let's ignore this extra complication). Depending on the implementation, you may also be able to directly reduce 2^64 mod N using half of a modular multiplication.
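To sketch the "square of 2^32 mod N" route, with a plain 128-bit product standing in for a real modular multiplication routine (this assumes unsigned __int128 is available, as it is on GCC and Clang; the function name is again just for illustration):

uint64_t pow2_64_mod_by_squaring(uint64_t N)
{
    uint64_t r = UINT64_C(0x100000000) % N;                 // 2^32 mod N
    // (2^32 mod N)^2 mod N == 2^64 mod N, and the product comfortably fits in 128 bits
    return (uint64_t)(((unsigned __int128)r * r) % N);
}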

It would still be more complicated than -N % N. Using _udiv128 or an equivalent intrinsic, which allows 2^64 to be passed as the dividend, may not be more complicated, but it may be slower: passing a dividend with a non-zero upper half tends to trigger slow paths, with the specifics depending on the microarchitecture.
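For completeness, the intrinsic route could look like this (MSVC x64, _udiv128 is declared in <intrin.h>; note it needs N > 1, otherwise the quotient doesn't fit in 64 bits and the division faults):

uint64_t pow2_64_mod_by_udiv128(uint64_t N)
{
    uint64_t remainder;
    // divide the 128-bit value 2^64 (high half 1, low half 0) by N, keep only the remainder
    _udiv128(1, 0, N, &remainder);
    return remainder;
}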

Some of you may have noticed that while using exponentiation by squaring to compute 2^64 mod N, you don't need to handle a full 128-bit product. The biggest it can get is 65 bits, and only for one specific value: 2^64 itself. Maybe you can handle it in a special way and avoid the more complicated general case? Actually yes, but if you do that you're halfway to reinventing -N % N.

Friday, 6 June 2025

From Boolean logic to bitmath and SIMD: transitive closure of tiny graphs

Let's say that we have a graph of at most 8 nodes (you can also think of it as a relation between 8 things), represented as an 8 by 8 Boolean matrix, and we want to compute its transitive closure. A basic translation of Warshall's algorithm (Warshall's formulation of the Floyd–Warshall algorithm computes the transitive closure instead of shortest paths) into what I imagine a C++ programmer might write would look like this:

// version 0
std::bitset<64> closure8x8(std::bitset<64> g)
{
    for (size_t k = 0; k < 8; k++) {
        for (size_t i = 0; i < 8; i++) {
            if (g[i * 8 + k]) {
                for (size_t j = 0; j < 8; j++) {
                    if (g[k * 8 + j])
                        g.set(i * 8 + j);
                }
            }
        }
    }
    return g;
}

Some may be tempted to suggest switching to a non-bit-packed representation instead of std::bitset, but then you're reading the wrong blog. Working with individual Booleans one-by-one is a dead end whether they're packed or not, and a packed representation has more opportunities to work with multiple bits at once, which is exactly what we're going to do. The "trick" here, if it can even be called a trick, is to extract the k-th row of the 8x8 matrix and OR it into the i-th row, if and only if node k is directly reachable from node i.

// version 1
uint64_t closure8x8(uint64_t x)
{
    for (size_t k = 0; k < 8; k++) {
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        for (size_t i = 0; i < 8; i++) {
            if (x & (UINT64_C(1) << (i * 8 + k)))
                x |= kth_row << (i * 8);
        }
    }
    return x;
}

Although x is obviously modified in the i-loop, it is modified in a way such that the test doesn't really depend on the new value, so we could use a copy of x taken just before entering the i-loop (challenge for compiler writers: automate this optimization). This isn't just for show, it's measurably more efficient (at least it was in my tests, on my PC, using a specific compiler...) and it makes the next step less of a leap.

// version 2
uint64_t closure8x8(uint64_t x)
{
    for (size_t k = 0; k < 8; k++) {
        uint64_t x_old = x;
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        for (size_t i = 0; i < 8; i++) {
            if (x_old & (UINT64_C(1) << (i * 8 + k)))
                x |= kth_row << (i * 8);
        }
    }
    return x;
}

Now we can unroll the i-loop. The 8 condition bits, the bits that control whether the if is entered or not, can be pulled out with (x >> k) & 0x0101010101010101. In each byte of that mask we get either 0, if the k-th row shouldn't be ORed into the corresponding row, or 1, if it should be. Multiplying the mask by the k-th row keeps every zero, but changes every 1 into a copy of the k-th row, which is precisely what we need to OR into x.

// version 3
uint64_t closure8x8(uint64_t x)
{
    uint64_t lsb_of_byte = 0x0101010101010101;
    for (size_t k = 0; k < 8; k++)
    {
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        uint64_t m = (x >> k) & lsb_of_byte;
        x |= kth_row * m;
    }
    return x;
}

That's decent; if you're not into weird operations or SIMD you can stop there.

MOR

Similar to how (Floyd-)Warshall "looks like" a kind of matrix multiplication, transitive closure "looks like" MMIX's MOR operation (which is also a kind of matrix multiplication), which I implemented like this:

uint64_t MOR(uint64_t a, uint64_t b)
{
    uint64_t lsb_of_byte = 0x0101010101010101;
    uint64_t r = 0;
    for (size_t i = 0; i < 8; i++)
    {
        uint64_t ith_row = (a >> (8 * i)) & 0xFF;
        uint64_t m = (b >> i) & lsb_of_byte;
        r |= ith_row * m;
    }
    return r;
}

Beyond merely looking similar, if we had an efficient implementation of MOR (do we?) it would be possible to use that to implement the transitive closure, although that's not as simple as taking the "MOR square" of the matrix. The "internal dependency" of Warshall's algorithm may look like a small matter on the surface but it's a crucial difference. Essentially, if A is the matrix (with Boolean entries and using OR-AND algebra instead of the more familiar PLUS-MULTIPLY) representing our graph, we need the sum of A^1, A^2, ... A^8, not just A^2. That can be done with only 3 uses of MOR and 3 bitwise ORs. Squaring A^1 + A^2 gives us A^2 + A^3 + A^4, adding A^1 and squaring again gives us A^2 + ... + A^8 (1+1=1 in this context, so multiple copies of A raised to the same exponent collapse into one when summed).

// version 4
uint64_t closure8x8(uint64_t x)
{
    x |= MOR(x, x);
    x |= MOR(x, x);
    x |= MOR(x, x);
    return x;
}

Although MOR cannot easily be built on top of GF2P8AFFINEQB, there is a trick that uses it (but only for its ability to transpose):

uint64_t MOR(uint64_t a, uint64_t b)
{
    __m512i va = _mm512_set1_epi64(a);
    __m512i z = _mm512_setzero_si512();
    __m512i m_id = _mm512_set1_epi64(0x8040201008040201); // identity matrix, for the transpose trick below
    // each of the 64 bytes corresponds to one bit of b: keep a copy of the matching row of a
    // where that bit is set, zero the byte where it isn't
    __m512i masked = _mm512_mask_blend_epi8(b, z, va);
    // the GF2P8AFFINEQB "transpose trick": byte j of the result collects bit j of every byte
    // of masked (in reversed order, which doesn't matter here)
    __m512i maskedt = _mm512_gf2p8affine_epi64_epi8(m_id, masked, 0);
    // a byte of maskedt is non-zero exactly when some selected row of a has that bit set,
    // so the 64-bit comparison mask is the result matrix
    return _mm512_cmpneq_epi8_mask(maskedt, z);
}

On my Intel Rocket Lake, version 4 (using the AVX-512 trick for MOR) is not great. The latency is surprisingly decent. In terms of throughput, it beats version 3 by some 35%... but that's bad. This approach "wastes" the entire SIMD width on processing a single graph. If we have multiple graphs, which we probably do if we're talking about throughput, this is not the way to go. There may be a niche use for it in a context where there's only one graph to process, the latency isn't critical, and it's competing with another computation.

No more MOR

If we take the same approach as version 3, but use real SIMD instead of "faking SIMD" with scalar operations, the SIMD width becomes available to process 4 graphs at once with AVX2:

// version 5
__m256i closure8x8_avx2(__m256i g)
{
    uint64_t ones = 0x0101010101010101;
    __m256i z = _mm256_setzero_si256();
    __m256i m, shuf;

    // Each step handles one k: the byte shuffle broadcasts row k of each graph to all 8 bytes
    // of that graph, the shift moves bit k of every row into the sign bit of its byte (the bit
    // vpblendvb looks at), and the blend zeroes the copies for rows that shouldn't change.
    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(0, ones * 8));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 7));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones, ones * 9));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 6));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 2, ones * 10));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 5));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 3, ones * 11));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 4));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 4, ones * 12));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 3));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 5, ones * 13));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 2));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 6, ones * 14));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 1));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 7, ones * 15));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), g);
    g = _mm256_or_si256(g, m);

    return g;
}

An AVX-512 version could process 8 graphs at once, but it would have to change a bit because there is no 512-bit vpblendvb. Instead you could use vpblendmb, and vptestmb to extract the masks.
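A rough, untested sketch of what that could look like (assuming AVX-512BW and <immintrin.h>; the function name is mine, and the unrolled steps are rolled back up into a loop):

__m512i closure8x8_avx512(__m512i g)
{
    uint64_t ones = 0x0101010101010101;
    __m512i z = _mm512_setzero_si512();
    for (int k = 0; k < 8; k++)
    {
        // broadcast row k of each graph to all 8 bytes of that graph (vpshufb shuffles within 128-bit lanes)
        __m512i shuf = _mm512_broadcast_i32x4(_mm_set_epi64x(ones * (k + 8), ones * k));
        __m512i kth_row = _mm512_shuffle_epi8(g, shuf);
        // vptestmb: one mask bit per byte, set iff bit k of that row is set
        __mmask64 cond = _mm512_test_epi8_mask(g, _mm512_set1_epi8((char)(1 << k)));
        // vpblendmb stands in for the missing 512-bit vpblendvb, then OR into the graphs
        g = _mm512_or_si512(g, _mm512_mask_blend_epi8(cond, z, kth_row));
    }
    return g;
}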