## Wednesday, 9 June 2021

### Partial sums of blsi and blsmsk

blsi is an x86 operation which extracts the rightmost set bit from a number, it can be implemented efficiently in terms of more familiar operations as i&-i. blsmsk is a closely related operation which extracts the rightmost set bit and also "smears" it right, setting the bits to the right of that bit as well. blsmsk can be implemented as i^(i-1). The smearing makes the result of blsmsk almost (but not quite) twice as high as the result of blsi for the same input: blsmsk(i) = floor(blsi(i) * (2 - ε)).

The partial sums of blsi and blsmsk can be defined as b(n) = sum(i=1, n, blsi(i)) and a(n) = sum(i=1, n, blsmsk(i)) respectively. These sequences are on OEIS as A006520 and A080277. Direct evaluation of those definitions would be inefficient, is there a better way?

### Unrecursing the recursive definition

Let's look at the partial sums of blsmsk first. Its OEIS entry suggests the recursion below, which is already significantly better than the naive summation:

a(1) = 1, a(2*n) = 2*a(n) + 2*n, a(2*n+1) = 2*a(n) + 2*n + 1

A "more code-ish"/"less mathy" way to implement that could be like this:

int PartialSumBLSMSK(int i) {
if (i == 1)
return 1;
return (PartialSumBLSMSK(i >> 1) << 1) + i;
}

Let's understand what this actually computes, and then find an other way to do it. The overal big picture of the recursion is that n is being shifted right on the way "down", and the results of the recursive calls are being shifted left on the way "up", in a way that cancels each other. So in total, what happens is that a bunch of "copies" of n is added up, except that at the kth step of the recursion, the kth bit of n is reset.

This non-tail recursion can be turned into tail-recursion using a standard trick: turning the "sum on the way"-logic into "sum on the way down", by passing two accumulators in extra arguments, one to keep track of the sum, and an other to keep track of how much to multiply i by:

int PartialSumBLSMSKTail(int i, int accum_a, int accum_m) {
if (i <= 1)
return accum_a + i * accum_m;
return PartialSumBLSMSKTail(i >> 1, accum_a + i * accum_m, accum_m * 2);
}

Such a tail-recursive function is then simple to transform into a loop. Shockingly, GCC (even quite old versions) manages to compile the original non-tail recursive function into a loop as well without much help (just changing the left-shift into the equivalent multiplication), although some details differ.

Anyway, let's put in a value and see what happens. For example, if the input was 11101001, then the following numbers would be added up:

11101001 (reset bit 0)
11101000 (reset bit 1, it's already 0)
11101000 (reset bit 2)
11101000 (reset bit 3)
11100000 (reset bit 4)
11100000 (reset bit 5)
11000000 (reset bit 6)
10000000 (base case)

Look at the columns of the matrix of numbers above, the column for bit 3 has four ones in it, the column for bit 5 has six ones in it. If bit k is set bit in n, there is a pattern that that bit is set in k+1 rows.

### Using the pattern

Essentially what that pattern means, is that a(n) can be expressed as the dot-product between n viewed as a vector of bits (weighted according to their position) (𝓷), and a constant vector (𝓬) with entries 1, 2, 3, 4, etc, up to the size of the integer. For example for n=5, 𝓷 would be (1, 0, 4), and the dot-product with 𝓬 would be 13. A dot-product like that can be implemented with some bitwise trickery, by using bit-slicing. The trick there is that instead of multiplying the entries of 𝓷 by the entries of 𝓬 directly, we multiply the entries of 𝓷 by the least-significant bits of the entries of 𝓬, and then seperately multiply it by all the second bits of the entries of 𝓬, and so on. Multiplying every entry of 𝓷 at once by a bit of an entry of 𝓬 can be implemented using just a bitwise-AND operation.

Although this trick lends itself well to any vector 𝓬, I will use 0,1,2,3.. and add an extra n separately (this corresponds to factoring out the +1 that appears at the end of the recursive definition), because that way part of the code can be reused directly by the solution of the partial sums of blsi (and also because it looks nicer). The masks that correspond to the chosen vector 𝓬 are easy to compute: each column across the masks is an entry of that vector. In this case, for 32bit integers:

c0 10101010101010101010101010101010
c1 11001100110011001100110011001100
c2 11110000111100001111000011110000
c3 11111111000000001111111100000000
c4 11111111111111110000000000000000

The whole function could look like this:

int PartialSumBLSMSK(int n)
{
int sum = n;
sum += (n & 0xAAAAAAAA);
sum += (n & 0xCCCCCCCC) << 1;
sum += (n & 0xF0F0F0F0) << 2;
sum += (n & 0xFF00FF00) << 3;
sum += (n & 0xFFFF0000) << 4;
return sum;
}

PartialSumBLSI works almost the same way, with its recursive formula being b(1) = 1, b(2n) = 2b(n) + n, b(2n+1) = 2b(n) + n + 1. The +1 can be factored out as before, and the other part (n instead of 2*n) is exactly half of what it was before. Dividing 𝓬 by half seems like a problem, but it can be done implicitly by shifting the bit-slices of the product to the right by 1 bit. There are no problems with bits being lost that way, because the least significant bit is always zero in this case (𝓬 has zero as its first element).

int PartialSumBLSI(int n)
{
int sum = n;
sum += (n & 0xAAAAAAAA) >> 1;
sum += (n & 0xCCCCCCCC);
sum += (n & 0xF0F0F0F0) << 1;
sum += (n & 0xFF00FF00) << 2;
sum += (n & 0xFFFF0000) << 3;
return sum;
}

### Wrapping up

The particular set of constants I used is very useful and appears in more tricks, such as collecting indexes of set bits. They are the bitwise complements of a set of masks that Knuth (in The Art of Computer Programming volume 4, section 7.1.3) calls "magic masks", labeled µk.

This post was inspired by this question on Stack Overflow and partially based on my own answer and the answer of Eric Postpischil, without which I probably would not have come up with any of this, although I used a used a different derivation and explanation for this post.

## Saturday, 26 September 2020

### The range a sum cannot fall within

Throughout this post, the variables A, B, and R are used, with R defined as R = A + B, and A ≤ B. Arithmetic in this post is unsigned and modulo 2k. Note that A ≤ B is not a restriction on the input, it is a choice to label the smaller input as A and the larger input as B. Addition is commutative, so this choice can be made without loss of generality.

## R < A || R ≥ B

The sum is less than A iff the addition wraps (1), otherwise it has to be at least B (2).

1. B cannot be so high that the addition can wrap all the way up to or past A. To make A + B add up to A, B would have had to be 2k, which is one beyond the maximum value it can be. R = A is possible only if B is zero, in which case R ≥ B holds instead.
2. Since A is at least zero, in the absence of wrapping there is no way to reduce the value below the inputs.

Perhaps that all looks obvious, but this has a useful application: if the carry-out of the addition is not available, it can be computed via carry = (x + y) < x, which is a relatively well-known trick. It does not matter which of x or y is the smaller or larger input, the sum cannot fall within the "forbidden zone" between them. The occasionally seen carry = (x + y) < max(x, y) adds an unnecessary complication.

## R < (A & B) || R ≥ (A | B)

This is a stronger statement, because A & B is usually smaller than A and A | B is usually greater than B.

If no wrapping occurs, then R ≥ (A | B). This can be seen for example by splitting the addition into a XOR and adding the carries separately, (A + B) = (A ^ B) + (A & B) * 2, while bitwise OR can be decomposed similarly into (A | B) = (A ^ B) + (A & B)(see below). Since there is no wrapping (by assumption), (A & B) * 2 ≥ (A & B) and therefore (A + B) ≥ (A | B). Or, with less algebra: addition sometimes produces a zero where the bitwise OR produces a one, but then addition compensates doubly for it by carrying into the next position.

For the case in which wrapping occurs I will take a bit-by-bit view. In order to wrap, the carry out of bit k-1 must be 1. In order for the sum to be greater than or equal to A & B, bit k-1 of the sum must be greater than or equal to bit k-1 of A & B. That combination means that the carry into bit k-1 of the sum must have been 1 as well. Furthermore, bit k-1 of the sum can't be greater than bit k-1 of A & B, at most it can be equal, which means bit k-2 must be examined as well. The same argument applies to bit k-2 and so on, until finally for the least-significant bit it becomes impossible for it to be carried into, so the whole thing falls down: by contradiction, A + B must be less than A & B when the sum wraps.

## What about (A | B) = (A ^ B) + (A & B) though?

The more obvious version is (A | B) = (A ^ B) | (A & B), compensating for the bits reset by the XOR by ORing exactly those bits back in. Adding them back in also works, because the set bits in A ^ B and A & B are disjoint: a bit being set in the XOR means that exactly one of the input bits was set, which makes their AND zero.

## Monday, 3 August 2020

### Why does AND distribute over XOR

AND distributes over XOR, unsurprisingly both from the left and right, that is:

x & y ^ z & y == (x ^ z) & y
x & y ^ x & z == x & (y ^ z)
a & c ^ a & d ^ b & c ^ b & d == (a ^ b) & (c ^ d)
A somewhat popular explanation for why is,
Conjunction and exclusive or form the multiplication and addition operations of a field GF(2), and as in any field they obey the distributive law.

Which is true and a useful way to think about it, but it is also the type of backwards explanation that relies on a concept that is more advanced than the thing which is being explained.

## Diagrams with crossing lines

Let's represent an expression such as a & c ^ a & d ^ b & c ^ b & d by putting the variables on the left of every AND along the top of a grid, and the variables on the right of every AND along the side. Then for example the grid cell on the intersection between the column of a and the row of c corresponds to the term a & c. Further, let's draw lines for variables that are True, in this example all variables are True:

The overall expression a & c ^ a & d ^ b & c ^ b & d counts the number of crossings, modulo 2. Rather than counting the crossings one by one, the number of crossings could be computed by counting how many variables along the top are True, how many along the side are True, and taking the product, again modulo 2. A sum modulo 2 is XOR and a product modulo 2 is AND, so this gives the equivalent expression (a ^ b) & (c ^ d).

The simpler cases x & y ^ z & y and x & y ^ x & z correspond to 1x2 and 2x1 diagrams.

## Diagrams with bites taken out of them

Such a diagram with a section of it missing can be dealt with by completing the grid and subtracting the difference. For example the unwieldy a & e ^ a & f ^ a & g ^ a & h ^ b & e ^ b & f ^ b & g ^ b & h ^ c & e ^ c & f ^ d & e ^ d & f (shown in the diagram below) is "incomplete", it misses the 2x2 square that corresponds to (c ^ d) & (g ^ h). Completing the grid and subtracting the difference gives ((a ^ b ^ c ^ d) & (e ^ f ^ g ^ h)) ^ ((c ^ d) & (g ^ h)), which is correct.

This all has a clear connection to the FOIL method and its generalizations, after all conjunction and exclusive or form the multiplication and addition operations of a field GF(2).

The same diagrams also show why AND distributes over OR (the normal, inclusive, OR), which could alternatively be explained in terms of the Boolean semiring.

## Defining increment

Just to avoid any confusion, the operation that this post is about is adding 1 (one) to a value: $$\text{increment}(x) = x + 1$$ Specifically, performing that operation in the domain of bit-vectors.

Incrementing is very closely related to negating. After all, -x = ~x + 1 and therefore x + 1 = -~x, though putting it that way feel oddly reversed to me.

## Bit-string notation

In bit-string notation (useful for analysing compositions of operations at the bit level), increment can be represented as: $$a01^k + 1 = a10^k$$

An "English" interpretation of that form is that an increment carries through the trailing set bits, turning them to zero, and then carries into the right-most unset bit, setting it.

That "do something special with the right-most unset bit" aspect of increment is the basis for various right-most bit manipulations, some of which were implemented in AMD Trailing Bit Manipulation (TBM) (which has been discontinued).

For example, the right-most unset bit in x can be set using x | (x + 1), which has a nice symmetry with the more widely known trick for unsetting the right-most set bit, x & (x - 1).

## Increment by XOR

As was the case with negation, there is a way to define increment in terms of XOR. The bits that flip during an increment are all the trailing set bits and the right-most unset bit, the TBM instruction for which is BLCMSK. While that probably does not seem very useful yet, the fact that x ^ (x + 1) takes the form of some number of leading zeroes followed by some number of trailing ones turns out to be useful.

Suppose one wants to increment a bit-reversed integer, a possible (and commonly seen) approach is looping of the bits from top the bottom and implementing the "carry through the ones, into the first zero" logic by hand. However, if the non-reversed value was also available (let's call it i), the bit-reversed increment could be implemented by calculating the number of ones in the mask as tzcnt(i + 1) + 1 (or popcnt(i ^ (i + 1))) and forming a mask with that number of ones located at the desired place within an integer:

// i   = normal counter
// rev = bit-reversed counter
// N   = 1 << number_of_bits
int maskLen = tzcnt(i + 1) + 1;
rev ^= N - (N >> maskLen);
That may still not seem useful, but this enables an implementation of the bit-reversal permutation (not a bit-reversal itself, but the permutation that results from bit-reversing the indices). The bit-reversal permutation is sometimes used to re-order the result of a non-auto-sorting Fast Fourier Transform algorithm into the "natural" order. For example,
// X = array of data
// N = length of X, power of two
for (uint32_t i = 0, rev = 0; i < N; ++i)
{
if (i < rev)
swap(X[i], X[rev]);
int maskLen = tzcnt(i + 1) + 1;
rev ^= N - (N >> maskLen);
}
This makes no special effort to be cache-efficient.

## Thursday, 10 October 2019

### Square root of bitwise NOT

The square root of bitwise NOT, if it exists, would be some function f such that f(f(x)) = not x, or in other words, f²(x) = not x. It is similar in concept to the √NOT gate in Quantum Computing, but in a different domain which makes the solution very different.

Before trying to find any specific f, it may be interesting to wonder what properties it would have to have (and lack).

• f must be bijective, because its square is bijective.
• is an involution but f cannot be an involution, because its square would then be the identity.
• f viewed as a permutation (which can be done, because it has to be bijective) must be a derangement, if it had any fixed point then that would also be a fixed point in and the not function does not have a fixed point.

## Does f exist?

In general, a permutation has a square root if and only if the number of cycles of same even length is even. The not function, being an involution, can only consist of swaps and fixed points, and we already knew it has no fixed points so it must consist of only swaps. A swap is a cycle of length 2, so an even length. Since the not function operates on k bits, the size of its domain is a power of two, 2k. That almost always guarantees an even number of swaps, except when k = 1. So, the not function on a single bit has no square root, but for more than 1 bit there are solutions.

## f for even k

For 2 bits, the not function is the permutation (0 3) (1 2). An even number of even-length cycles, as predicted. The square root can be found by interleaving the cycles, giving (0 1 3 2) or (1 0 2 3). In bits, the first looks like:

 in out 00 01 01 11 10 00 11 10

Which corresponds to swapping the bits and then inverting the lsb, the other variant corresponds to inverting the lsb first and then swapping the bits.

That solution can be applied directly to other even numbers of bits, swapping the even and odd bits and then inverting the even bits, but the square root is not unique and there are multiple variants. The solution can be generalized a bit, combining a step that inverts half of the bits with a permutation that brings each half of the bits into the positions that are inverted when it is applied twice, so that half the bits are inverted the first time and the other half of the bits are inverted the second time. For example for 32 bits, there is a nice solution in x86 assembly:

bswap eax
xor eax, 0xFFFF


## f for odd k

Odd k makes things less easy. Consider k=3, so (0 7) (1 6) (2 5) (3 4). There are different ways to pair up and interleave the cycles, leading to several distinct square roots:

1. (0 1 7 6) (2 3 5 4)
2. (0 2 7 5) (1 3 6 4)
3. (0 3 7 4) (1 2 6 5)
4. etc..

 in 1 2 3 000 001 010 011 001 111 011 010 010 011 111 110 011 101 110 111 100 010 001 000 101 100 000 001 110 000 100 101 111 110 101 100

These correspond to slightly tricky functions, for example the first one has as its three from lsb to msb: the msb but inverted, the parity of the input, and finally the lsb. The other ones also incorporate the parity of the input in some way.

## Wednesday, 3 October 2018

### abs and its "extra" result

The abs function has, in its usual (most useful) formulation, one extra value in its codomain than just "all non-negative values". That extra value is the most negative integer, which satisfies abs(x) == x despite being negative. Even accepting that the absolute value of the most negative integer is itself, it may still seem strange (for an operation that is supposed to have such a nice symmetry) that the size of the codomain is not exactly half of the size of the domain.

That there is an "extra" value in the codomain, and that it is specifically the most negative integer, may be more intuitively obvious when the action of abs on the number line circle is depicted as "folding" the circle symmetrically in half across the center and through zero (around which abs is supposed to be symmetric), folding the negative numbers onto the corresponding positive numbers:

Clearly both zero and the most negative integer (which is also on the "folding line") stay in place in such a folding operation and remain part of the resulting half-circle. That there is an "extra" value in the codomain is the usual fencepost effect: the resulting half-circle is half the size of the original circle in some sense, but the "folding line" cuts through two points that have now become endpoints.

By the way the "ones' complement alternative" to the usual abs, let's call it OnesAbs(x) = x < 0 ? ~x : x (there is a nice branch-free formulation too) does have a codomain with a size exactly half of the size of its domain. The possible results are exactly the non-negative values. It has to pay for that by, well, not being the usual abs. The "folding line" for OnesAbs runs between points, avoiding the fencepost issue:

## Tuesday, 21 August 2018

### Signed wrapping is meaningful and algebraically nice

In this post I defend wrapping, a bit more opinionated than my other posts. As usual I'm writing from the perspective that signed and unsigned integer types are a thin transparent wrapper around bit vectors, of course I am aware that they are often not used that way. That difference between their use and their actual nature is probably the source of the problems.

## Signed wrapping is not wrong

It is often said that when signed wraparound occurs, the result is simply wrong. That is an especially narrow view to take, probably inspired by treating fixed-size bit vector arithmetic as if it is arithmetic in ℤ, which it is not. Bit vector arithmetic can be viewed as arithmetic in ℤ so long as no "overflow" occurs, but violating that condition does not make the result wrong, it makes the interpretation wrong.

## Signed wrapping is meaningful

The wrapping works exactly the same as unsigned wrapping, it corresponds to taking the lowest k bits of the arbitrary precision result. Such a truncation therefore gives you exactly k meaningful bits, it's just a slice of the result. Some upper bits may be lost, they can be calculated if you need them. If the whole result is meaningful, then part of it is too, namely at least under the interpretation of being "part of the result".

An other well known example of benign wrapping is the calculation of the average of two non-negative signed integers. While (a + b) / 2 gives inconvenient results when the addition "overflows", (uint)(a + b) / 2 (using unsigned division) or (a + b) >>> 1 (unsigned right shift as in Java) are correct even when the addition of two positive values results in a negative value. An other way to look at it is that there is no unsigned wrapping. Nominally the integers being added here are signed but that doesn't really matter. Casting the inputs to unsigned before adding them is a no-op that can be performed mentally.

Wrapping can also sometimes be cancelled with more wrapping. For example, taking an absolute value with wrapping and casting the result to an unsigned type of the same width results in the actual absolute value without the funny int.MinValue edge case:

(uint)abs(int.MinValue) =
(uint)abs(-2147483648) =
(uint)(-2147483648) =
2147483648

This is not what Math.Abs in C# does, it throws, perhaps inspired by its signed return type. On the other hand, Java's Math.abs gets this right and leaves the reinterpretation up to the consumer of the result, of course in Java there is no uint32 to cast to but you can still treat that result as if it is unsigned. Such "manual reinterpretation" is in general central to integer arithmetic, it's really about the bits, not the "default meaning" of those bits.

The principle of cancelling wrapping also has some interesting data structure applications. For example, in a Fenwick tree or Summed Area Table, the required internal integer width is the desired integer width of any range/area-sum query that you actually want to make. So a SAT over signed bytes can use an internal width of 16 bits as long as you restrict queries to an area of 256 cells or fewer, since 256 * -128 = -215 which still fits a signed 16 bit word.

An other nice case of cancelled wrapping is strength reductions like A * 255 = (A << 8) - A. It is usually not necessary to do that manually, but that's not the point, the point is that the wrapping is not "destructive". The overall expression wraps only iff A * 255 wraps and even then it has exactly the same result. There are cases in which the left shift experience "signed wrapping" but A * 255 does not (for example, in 32 bits, A = 0x00800000), in those cases the subtraction also wraps and brings the result back to being "unwrapped". That is not a coincidence nor an instance of two wrongs making a right, it's a result of the intermediate wrapped result being meaningful and wrapping being algebraically nice.

## Signed wrapping is not inherent

Signed and unsigned integers are two different ways to interpret bit vectors. Almost all operations have no specific signed or unsigned version, only a generic version that does both. There is no such thing as signed addition or unsigned addition, addition is just addition. Operations that are actually different are:

• Comparisons except equality
• Division and remainder
• Right shift, maybe, but arithmetic right shift and logical right shift can both be reasonably applied in both signed and unsigned contexts
• Widening conversion
• Widening multiplication
One thing almost all of these have in common is that they cannot overflow, except division of the smallest integer by negative one. By the way I regard that particular quirk of division as a mistake since it introduces an asymmetry between dividing by negative one and multiplying by negative one.

The result is that the operations that can "overflow" are neither signed nor unsigned, and therefore do not overflow specifically in either of those ways. If they can be said to overflow at all, when and how they do so depends on how they are being viewed by an outside entity, not on the operation itself.

The distinction between unsigned and signed wrapping is equivalent to imagining a "border" on the ring of integers (not the mathematical Ring of Integers) either between 0 and -1 (unsigned) or between signed-smallest and signed-highest numbers, but there is no border. Crossing either of the imaginary borders does not mean nearly as much as many people think it means.

## Signed wrapping is algebraically nice

A property that wrapping arithmetic shares with arbitrary precision integer arithmetic, but not with trapping arithmetic, is that it obeys a good number of desirable algebraic laws. The root cause of this is that ℤ/ℤ2k is a ring, and trapping arithmetic is infested with implicit conditional exceptions. Signed arithmetic can largely be described by ℤ/ℤ2k, like unsigned arithmetic, since it is mostly a reinterpretation of unsigned arithmetic. That description does not cover all operations or properties, but it covers the most important aspects.

Here is a small selection of laws that apply to wrapping arithmetic but not to trapping arithmetic:

• -(-A) = A
• A + -A = 0
• A - B = A + -B
• A + (B + C) = (A + B) + C
• A * (B + C) = A * B + A * C
• A * -B = -A * B = -(A * B)
• A * (B * C) = (A * B) * C
• A * 15 = A * 16 - A
• A * multiplicative_inverse(A) = 1 (iff A is odd, this is something not found in ℤ which has only two trivially invertible numbers, so sometimes wrapping gives you a new useful property)
Some laws also apply to trapping arithmetic:
• A + 0 = A
• A - A = 0
• A * 0 = 0
• A * 1 = A
• A * -1 = -A
• -(-(-A)) = -A

The presence of all the implicit exceptional control flow makes the code very hard to reason about, for humans as well as compilers.

Compilers react to that by not optimizing as much as they otherwise would, since they are forced to preserve the exception behaviour. Almost anything written in the source code must actually happen, and in the same order as originally written, just to preserve exceptions that are not even supposed to ever actually be triggered. The consequences of that are often seen in Swift, where code using the &+ operator is optimized quite well (including auto-vectorization) and code using the unadorned + operator can be noticeably slower.

Humans .. probably don't truly want trapping arithmetic to begin with, what they want is to have their code checked for unintended wrapping. Wrapping is not a bug by itself, but unintended wrapping is. So while canceling a "bare" double negation is not algebraically justified in trapping arithmetic, a programmer will do it anyway since the goal is not to do trapping arithmetic, but removing bad edge cases. Statically checking for unintended wrapping would be a more complete solution, no longer relying on being lucky enough to dynamically encounter every edge case. Arbitrary precision integers would just remove most edge cases altogether, though it would rely heavily on range propagation for performance, making it a bit fragile.

But anyway, wrapping is not so bad. Just often unintended.