Bits, Math and Performance(?): July 2023

This post is meant as a replacement/recap of some work that I did over a decade ago on propagating bounds through bitwise operations, which was intended as an improvement over the implementations given in Hacker's Delight chapter 4, Arithmetic Bounds.

The goal is, given two variables x and y, with known bounds a ≤ x ≤ b, c ≤ y ≤ d, compute the bounds of x | y and of x & y. Thanks to De Morgan, we have the equations (most also listed in Hacker's Delight, except the last one)

minAND(a, b, c, d) = ~maxOR(~b, ~a, ~d, ~c)
maxAND(a, b, c, d) = ~minOR(~b, ~a, ~d, ~c)
minXOR(a, b, c, d) = minAND(a, b, ~d, ~c) | minAND(~b, ~a, c, d)
maxXOR(a, b, c, d) = maxOR(a, b, c, d) & ~minAND(a, b, c, d)

Everything can be written in terms of only minOR and maxOR and some basic operations.

`maxOR`

To compute the upper bound of the OR of x and y, what we need to do is find is the leftmost bit (henceforth the "target bit") such that it is both:

set in both b and d (the upper bounds of x and y) and,
changing an upper bound (either one of them, doesn't matter, but never both) by resetting the target bit and setting the bits that are less significant, keeps it greater-or-equal than the corresponding lower bound.

The explanation of why that works can be found in Hacker's Delight, along with a more of less direct transcription into code, but we can do better than a direct transcription.

Finding the leftmost bit that passes only the first condition would be easy, its the highest set bit in b & d. The second condition is a bit more complex to handle, but still surprisingly easy thanks to one simple observation: the bits that can pass it, are precisely those bits that are at (or to the right of) the leftmost bit where the upper and lower bound differ. Imagine two numbers in binary, one being the lower bound and the other the upper bound. The number have some equal prefix (possibly zero bits long, up to all bits) and then if they differ, they must differ by a bit in the upper bound being 1 while the corresponding bit in the lower bound is 0. Lowering the upper bound by resetting that bit while setting all bits the right of it, cannot make it lower than the lower bound.

For one of the inputs, say x, the position at which that second condition start being false (looking at that bit and to the left of it) can be computed directly with 64 - lzcnt(a ^ b). We actually need the maximum of that across both pairs of bounds, but there's no need to compute that for both bounds and then take the maximum, we can use this to let the lzcnt find the maximum automatically: 64 - lzcnt((a ^ b) | (c ^ d)).

bzhi(m, k) is an operation that resets the bits in m starting at index k. It can be emulated by shifting or masking, but an advantage of bzhi is that it is well defined for any relevant k, including when k is equal to the size of the integer in bits. bzhi is not strictly required here, but it is more convenient than "classic" bitwise operations, and available on most x64 processors today^[1]. Using bzhi, it's simple to take the position calculated in the previous paragraph and reset all the bits in b & d that do not pass the second condition: bzhi(b & d, 64 - lzcnt((a ^ b) | (c ^ d))).

With that bitmask in hand, all we need to do is apply it to one of the upper bounds. We can skip the "reset the target bit" part, since that bit will be set in the other upper bound and therefore also in the result. It also does not matter which upper bound is changed, regardless of which bound we were conceptually changing. Let's pick b for no particular reason. Then in total, the implementation could be:

uint64_t maxOR(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t index = 64 - _lzcnt_u64((a ^ b) | (c ^ d));
    uint64_t candidates = _bzhi_u64(b & d, index);
    if (candidates) {
        uint64_t target = highestSetBit(candidates);
        b |= target - 1;
    }
    return b | d;
}

For the highestSetBit function you can choose any way you like to isolate the highest set bit in an integer.

`minOR`

Computing the lower bound of x | y surprisingly seems to be more complex. The basic principles are similar, but this time bits are being reset in one of the lower bounds, and it does matter in which lower bound that happens. The computation of the mask of candidate bits also "splits" into separate candidates for each lower bound, unless there's some trick that I've missed. This whole "splitting" thing cannot be avoided by defining minOR in terms of maxAND either, because the same things happen there. But it's not too bad, a little bit of extra arithmetic. Anyway, let's see some code.

uint64_t minOR(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t candidatesa = _bzhi_u64(~a & c, 64 - _lzcnt_u64(a ^ b));
    uint64_t candidatesc = _bzhi_u64(a & ~c, 64 - _lzcnt_u64(c ^ d));
    uint64_t target = highestSetBit(candidatesa | candidatesc);
    if (a & target) {
        c &= -target;
    }
    if (c & target) {
        a &= -target;
    }
    return a | c;
}

A Fun Fact here is that the target bit cannot be set in both bounds, opposite to what happens in maxOR where the target bit is always set in both bounds. You may be tempted to turn the second if into else if, but in my tests it was quite important that the ifs are compiled into conditional moves rather than branches (which of the lower bounds the target bit is found in was essentially random), and using else if here apparently discourages compilers (MSVC at least) from using conditional moves.

candidatesa | candidatesc can be zero, although that is very rare, at least in my usage of the function. As written, the code assumes that highestSetBit deals with that gracefully by returning zero if its input is zero. Branching here is (unlike in the two ifs at the end of minOR) not a big deal since this case is so rare (and therefore predictable).

Conclusion

In casually benchmarking these functions, I found them to be a bit faster than the ones that I came up with over a decade ago, and significantly faster than the ones from Hacker's Delight. That basic conclusion probably translates to different scenarios, but the exact ratios will vary a lot based on how predictable the branches are in that case, on your CPU, and on arbitrary codegen decisions made by your compiler.

In any case these new versions look nicer to me.

There are probably much simpler solutions if the bounds were stored in bit-reversed form, but that doesn't seem convenient.

Someone on a certain link aggregation site asked about signed integers. As Hacker's Delight explains via a table, things can go wrong if one (or both) bounds cross the negative/positive boundary - but the solution in those cases is still easy to compute. The way I see it, the basic problem is that a signed bound that crosses the negative/positive boundary effectively encodes two different unsigned intervals, one starting at zero and one ending at the greatest unsigned integer, and the basic unsigned minOR and so on cannot (by themselves) handle those "split" intervals.

[1] Sadly not all, some low-end Intel processors have AVX disabled, which apparently is done by disabling the entire VEX encoding and it takes out BMI2 as collateral damage.

Bits, Math and Performance(?)

Sunday, 2 July 2023

Propagating bounds through bitwise operations

`maxOR`

`minOR`

Conclusion