Monday, 6 October 2025

The non-problem of unsigned integers in Java

Java famously has no unsigned integer types[1]. This is sometimes treated as a problem that needs a solution. Assuming that it is a problem, let's see some solutions.

Solution 1: use a wider type

[...] Although it is possible to get around this problem using conversion code and larger data types, it makes using Java cumbersome for handling unsigned data. While a 32-bit signed integer may be used to hold a 16-bit unsigned value losslessly, and a 64-bit signed integer a 32-bit unsigned integer, there is no larger type to hold a 64-bit unsigned integer. In all cases, the memory consumed may double, and typically any logic relying on two's complement overflow must be rewritten. [...]

Wikipedia: Criticism of Java

This is a legitimate approach, and not as cumbersome as the quote makes it sound. Adding two variables of type u32-in-long a and b would look like a + b. Wow, very cumbersome. But this is only representative of the "middle" of the solution: you may (and only "may") need the occasional x & 0xffffffffL to hammer the value back into the expected range, which I assume is what "logic relying on two's complement overflow must be rewritten" refers to[2]. You can get away with using that very sparingly though, since having some garbage in the upper 32 bits of the long is mostly harmless. If you're paying attention, that is a hint that we mostly didn't need a long to begin with (carrying around garbage bits that don't contribute to the result doesn't sound terribly helpful), leading to the second solution.
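To make that concrete, here is a minimal sketch of u32-in-long arithmetic (the helper names are mine, purely for illustration):

static long u32Add(long a, long b) {
    // Garbage may accumulate above bit 31, which is harmless for +, -, *,
    // << and the bitwise operations: their low 32 result bits depend only
    // on the low 32 bits of the inputs.
    return a + b;
}

static boolean u32LessThan(long a, long b) {
    // A comparison does look at the upper bits, so mask first.
    return (a & 0xffffffffL) < (b & 0xffffffffL);
}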

Solution 2: don't use a wider type

[...] Alternatively, it is possible to use Java's signed integers to emulate unsigned integers of the same size, but this requires detailed knowledge of bitwise operations.[9] Some support for unsigned integer types was provided in JDK 8, but not for unsigned bytes and with no support in the Java language. [...]

Wikipedia: Criticism of Java

The level of "detailed knowledge" required for this approach is .. actually that part of the quote may just be bogus. There aren't going to be many bitwise operations. The link in that quote [9] leads to an article which doesn't discuss or use this approach: it uses a u32-in-long at one point, several u8-in-ints, and a char.

In this case, adding two variables of type u32-in-int a and b would look like a + b. Wow, detailed knowledge of bitwise operations required.

But showing only addition is unfair

For some operations there is a bit more going on. Conveniently, Java 8 added functions such as Integer.compareUnsigned and Integer.divideUnsigned (and corresponding functions in Long) to handle that for you, so you again don't need detailed knowledge of bitwise operations. Perhaps that quote predates Java 8; back then, if you wanted to manually implement something like "is this long unsigned-less-than this other long", you finally might have needed detailed knowledge of bitwise operations to be able to write:

long m = 1L << 63;
if ((a ^ m) < (b ^ m))
  // a unsigned-less-than b

But instead of XOR you could have added or subtracted the offset (still the same m) that "rotates" the number line such that the low half of the unsigned range is mapped to the low half of the signed range, and the high half of the unsigned range is mapped to the high half of the signed range. That's a rotation by a half turn, so it doesn't matter which direction you do it in (you can add or subtract m) and it's also the same as swapping the halves (which is how you can interpret the XOR by m). So I'm still not sure where the quote thinks the bitwise operations come into play.
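For illustration, here is the additive spelling wrapped in a hypothetical helper (in practice you'd just call Long.compareUnsigned):

static boolean unsignedLessThan(long a, long b) {
    long m = 1L << 63;
    // Adding m flips bit 63, just like a ^ m or a - m would:
    // the carry or borrow out of the top bit simply falls off the end.
    return (a + m) < (b + m);
}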

Anyway, let's see an example with different kinds of operations. Picking an example is tricky, but let's go with this:

uint32_t murmur_32_scramble(uint32_t k) {
    k *= 0xcc9e2d51;
    k = (k << 15) | (k >> 17);
    k *= 0x1b873593;
    return k;
}

Using u32-in-long, we could do this:

static long murmur_32_scramble(long k) {
    k *= 0xcc9e2d51;
    k = (k << 15) | ((k & 0xffffffffL) >> 17);
    k *= 0x1b873593;
    return k & 0xffffffffL;
}

The final & 0xffffffffL on the return value is not necessary if the caller of this function can be relied upon to ignore the upper 32 bits. Anyway, two extra AND operations; it's not that cumbersome. You do need to know where to put those extra AND operations though. That's not hard, but it relies on a human not making mistakes, which is always a risk, especially if the human in question is a programmer. For that reason I'd say that it's the u32-in-long approach, more so than the u32-in-int approach, that requires knowledge of bitwise operations - though really it's knowing when you need them, which is more about bit-level knowledge of arithmetic operations. Alternatively you could spam & 0xffffffffL at every opportunity "just in case" as a safety net; at that point u32-in-long starts to look pretty silly compared to u32-in-int, which for comparison could look like:

static int murmur_32_scramble(int k) {
    k *= 0xcc9e2d51;
    k = (k << 15) | (k >>> 17);
    k *= 0x1b873593;
    return k;
}

Not much to see here: I only needed to change >> to >>>. Which is obvious - if we're doing unsigned arithmetic, the right shift had better be unsigned. This was just a one-to-one transliteration of the C code to Java; the two shifts and the OR form a rotate, which could be spelled out directly:

static int murmur_32_scramble(int k) {
    k *= 0xcc9e2d51;
    k = Integer.rotateLeft(k, 15);
    k *= 0x1b873593;
    return k;
}

Opinions

In general I've found that u32-in-int is quite nice actually, perhaps preferable to having to no-op-cast between signed and unsigned integers of the same size as C# and C++ make me do, to the point that I don't consider it a solution to a problem: it's a nice and natural way to work with integers (signed or unsigned, which is often a non-distinction) to begin with. The occasional call of a library function such as Integer.divideUnsigned to replace an operator is the worst part, which is not that bad and also not that common. Similarly, u64-in-long is mostly a natural way to work with 64-bit integers, and it doesn't have a credible alternative. There is BigInteger - now that is cumbersome, thanks to the lack of operator overloading, and also we cannot be quite as lazy about clearing the high bits, because a BigInteger can otherwise hoard a problematic amount of garbage in the bits that we're going to ignore in the end.
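For a taste of how cumbersome, here is a sketch of the scramble function from earlier redone with BigInteger (this needs import java.math.BigInteger, and assumes k comes in already reduced to 32 bits):

static final BigInteger MASK32 = BigInteger.valueOf(0xffffffffL);

static BigInteger murmur_32_scramble(BigInteger k) {
    // Every step must mask, otherwise the value just keeps growing.
    k = k.multiply(BigInteger.valueOf(0xcc9e2d51L)).and(MASK32);
    k = k.shiftLeft(15).or(k.shiftRight(17)).and(MASK32);
    k = k.multiply(BigInteger.valueOf(0x1b873593L)).and(MASK32);
    return k;
}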

To be fair there are at least two real downsides to not having unsigned types:

  • An integer type by itself does not describe the interpretation of that integer, only its size, and the "default interpretation" that all integers are signed can be misleading. This is especially an issue at interface boundaries.
  • Conversions to a wider integer type sign-extend, changing the bit-pattern you're working with, even implicitly. Java code that works with bytes tends to spam & 0xff all over the place to counteract this, as sketched below.
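A minimal sketch of the byte issue:

byte b = (byte) 0x80;     // bit pattern 1000_0000
int sext = b;             // sign-extends to 0xFFFFFF80, i.e. -128
int fixed = b & 0xff;     // 0x00000080, i.e. 128 - the usual fix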

  [1] Yes, I am going to ignore char.
  [2] But why is the quote talking about "two's complement overflow" in an unsigned context?

Friday, 6 June 2025

From Boolean logic to bitmath and SIMD: transitive closure of tiny graphs

Let's say that we have a graph of at most 8 nodes (you can also think of it as a relation between 8 things), represented as an 8 by 8 Boolean matrix, and we want to compute its transitive closure. A basic translation of Warshall's algorithm (Warshall's formulation of the Floyd–Warshall algorithm computes the transitive closure instead of shortest paths) into what I imagine a C++ programmer might write would look like this:

// version 0
std::bitset<64> closure8x8(std::bitset<64> g)
{
    for (size_t k = 0; k < 8; k++) {
        for (size_t i = 0; i < 8; i++) {
            if (g[i * 8 + k]) {
                for (size_t j = 0; j < 8; j++) {
                    if (g[k * 8 + j])
                        g.set(i * 8 + j);
                }
            }
        }
    }
    return g;
}

Some may be tempted to suggest switching to a non-bit-packed representation instead of std::bitset, but then you're reading the wrong blog. Working with individual Booleans one-by-one is a dead end whether they're packed or not, and a packed representation has more opportunities to work with multiple bits at once, which is exactly what we're going to do. The "trick" here, if it can even be called a trick, is to extract the k-th row of the 8x8 matrix and OR it into the i-th row, if and only if node k is directly reachable from node i.

// version 1
uint64_t closure8x8(uint64_t x)
{
    for (size_t k = 0; k < 8; k++) {
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        for (size_t i = 0; i < 8; i++) {
            if (x & (UINT64_C(1) << (i * 8 + k)))
                x |= kth_row << (i * 8);
        }
    }
    return x;
}

Although x is obviously modified in the i-loop, it is modified in such a way that the test doesn't really depend on the new value, so we could use a copy of x that is taken just before entering the i-loop (challenge for compiler writers: automate this optimization). This isn't just for show: it's measurably more efficient (at least it was with my tests, on my PC, using a specific compiler..) and makes the next step less of a leap.

// version 2
uint64_t closure8x8(uint64_t x)
{
    for (size_t k = 0; k < 8; k++) {
        uint64_t x_old = x;
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        for (size_t i = 0; i < 8; i++) {
            if (x_old & (UINT64_C(1) << (i * 8 + k)))
                x |= kth_row << (i * 8);
        }
    }
    return x;
}

Now we can unroll the i-loop. The 8 condition bits, the bits that control whether the if is entered or not, can be pulled out with (x >> k) & 0x0101010101010101. In each byte of that mask we get either 0, if the k-th row shouldn't be ORed into the corresponding row, or 1, if it should be. Multiplying the mask by the k-th row keeps every zero, but changes every 1 to a copy of the k-th row, precisely what we need to OR by.

// version 3
uint64_t closure8x8(uint64_t x)
{
    uint64_t lsb_of_byte = 0x0101010101010101;
    for (size_t k = 0; k < 8; k++)
    {
        uint64_t kth_row = (x >> (k * 8)) & 0xff;
        uint64_t m = (x >> k) & lsb_of_byte;
        x |= kth_row * m;
    }
    return x;
}

That's decent, if you're not into weird operations or SIMD you can stop there.

MOR

Similar to how (Floyd-)Warshall "looks like" a kind of matrix multiplication, transitive closure "looks like" MMIX's MOR operation (which is also a kind of matrix multiplication), which I implemented like this:

uint64_t MOR(uint64_t a, uint64_t b)
{
    uint64_t lsb_of_byte = 0x0101010101010101;
    uint64_t r = 0;
    for (size_t i = 0; i < 8; i++)
    {
        uint64_t ith_row = (a >> (8 * i)) & 0xFF;
        uint64_t m = (b >> i) & lsb_of_byte;
        r |= ith_row * m;
    }
    return r;
}

Beyond merely looking similar, if we had an efficient implementation of MOR (do we?) it would be possible to use that to implement the transitive closure, although that's not as simple as taking the "MOR square" of the matrix. The "internal dependency" of Warshall's algorithm may look like a small matter on the surface but it's a crucial difference. Essentially, if A is the matrix (with Boolean entries and using OR-AND algebra instead of the more familiar PLUS-MULTIPLY) representing our graph, we need the sum of A^1, A^2, ..., A^8, not just A^2. That can be done with only 3 uses of MOR and 3 bitwise ORs. Squaring A^1 + A^2 gives us A^2 + A^3 + A^4, and adding A^1 and squaring again gives us A^2 + ... + A^8 (1 + 1 = 1 in this context, so multiple copies of A raised to the same exponent collapse into one when summed).

// version 4
uint64_t closure8x8(uint64_t x)
{
    x |= MOR(x, x);
    x |= MOR(x, x);
    x |= MOR(x, x);
    return x;
}

Although MOR cannot easily be built on top of GF2P8AFFINEQB, there is a trick that uses it (but only for its ability to transpose):

uint64_t MOR(uint64_t a, uint64_t b)
{
    __m512i va = _mm512_set1_epi64(a);
    __m512i z = _mm512_setzero_si512();
    __m512i m_id = _mm512_set1_epi64(0x8040201008040201);
    __m512i masked = _mm512_mask_blend_epi8(b, z, va);
    __m512i maskedt = _mm512_gf2p8affine_epi64_epi8(m_id, masked, 0);
    return _mm512_cmpneq_epi8_mask(maskedt, z);
}

On my Intel Rocket Lake, version 4 (using the AVX-512 trick for MOR) is not great. The latency is surprisingly decent. In terms of throughput, it beats version 3 by some 35% .. but that's bad. This approach "wastes" the entire SIMD width on processing a single graph. If we have multiple graphs, which we probably do if we're talking about throughput, this is not the way to go. There may be a niche use for it in a context where there's only one graph to process, and the latency isn't critical, and it's competing with another computation.

No more MOR

Taking the same approach as version 3, but using real SIMD instead of "faking SIMD" with scalar operations, the SIMD width is still available to process 4 graphs at once with AVX2:

// version 5
__m256i closure8x8_avx2(__m256i g)
{
    uint64_t ones = 0x0101010101010101;
    __m256i z = _mm256_setzero_si256();
    __m256i m, shuf;

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(0, ones * 8));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 7));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones, ones * 9));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 6));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 2, ones * 10));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 5));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 3, ones * 11));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 4));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 4, ones * 12));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 3));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 5, ones * 13));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 2));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 6, ones * 14));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 1));
    g = _mm256_or_si256(g, m);

    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 7, ones * 15));
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), g);
    g = _mm256_or_si256(g, m);

    return g;
}

An AVX-512 version could process 8 graphs at once, but it would have to change a bit, because there is no 512-bit vpblendvb. Instead you could use vpblendmb, with vptestmb to extract the masks.
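A sketch of what that could look like, written as a loop rather than unrolled (this assumes <immintrin.h> with AVX512F and AVX512BW; I haven't benchmarked it):

// version 6, a sketch: 8 graphs at once, one 64-bit lane each
__m512i closure8x8_avx512(__m512i g)
{
    uint64_t ones = 0x0101010101010101;
    __m512i z = _mm512_setzero_si512();
    for (int k = 0; k < 8; k++) {
        // broadcast byte k of every 64-bit lane to all 8 bytes of that lane
        __m512i shuf = _mm512_broadcast_i32x4(
            _mm_setr_epi64x(ones * k, ones * (k + 8)));
        __m512i kth_row = _mm512_shuffle_epi8(g, shuf);
        // vptestmb: one mask bit per byte, set iff that row has bit k set
        __mmask64 m = _mm512_test_epi8_mask(g, _mm512_set1_epi8((char)(1u << k)));
        // vpblendmb stands in for the missing 512-bit vpblendvb
        g = _mm512_or_si512(g, _mm512_mask_blend_epi8(m, z, kth_row));
    }
    return g;
}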