Sunday, 3 May 2020

Information on incrementation

Defining increment

Just to avoid any confusion, the operation that this post is about is adding 1 (one) to a value: $$\text{increment}(x) = x + 1$$ Specifically, performing that operation in the domain of bit-vectors.

Incrementing is very closely related to negating. After all, -x = ~x + 1 and therefore x + 1 = -~x, though putting it that way feels oddly reversed to me.

Bit-string notation

In bit-string notation (useful for analysing compositions of operations at the bit level), increment can be represented as: $$a01^k + 1 = a10^k$$

An "English" interpretation of that form is that an increment carries through the trailing set bits, turning them to zero, and then carries into the right-most unset bit, setting it.

That "do something special with the right-most unset bit" aspect of increment is the basis for various right-most bit manipulations, some of which were implemented in AMD Trailing Bit Manipulation (TBM) (which has been discontinued).

For example, the right-most unset bit in x can be set using x | (x + 1), which has a nice symmetry with the more widely known trick for unsetting the right-most set bit, x & (x - 1).

Increment by XOR

As was the case with negation, there is a way to define increment in terms of XOR. The bits that flip during an increment are all the trailing set bits and the right-most unset bit, the TBM instruction for which is BLCMSK. While that probably does not seem very useful yet, the fact that x ^ (x + 1) takes the form of some number of leading zeroes followed by some number of trailing ones turns out to be useful.

Suppose one wants to increment a bit-reversed integer. A possible (and commonly seen) approach is looping over the bits from top to bottom and implementing the "carry through the ones, into the first zero" logic by hand. However, if the non-reversed value was also available (let's call it i), the bit-reversed increment could be implemented by calculating the number of ones in the mask as tzcnt(i + 1) + 1 (or popcnt(i ^ (i + 1))) and forming a mask with that number of ones located at the desired place within an integer:

// i   = normal counter
// rev = bit-reversed counter
// N   = 1 << number_of_bits
int maskLen = tzcnt(i + 1) + 1;
rev ^= N - (N >> maskLen);
That may still not seem useful, but this enables an implementation of the bit-reversal permutation (not a bit-reversal itself, but the permutation that results from bit-reversing the indices). The bit-reversal permutation is sometimes used to re-order the result of a non-auto-sorting Fast Fourier Transform algorithm into the "natural" order. For example,
// X = array of data
// N = length of X, power of two
for (uint32_t i = 0, rev = 0; i < N; ++i)
{
    if (i < rev)
        swap(X[i], X[rev]);
    int maskLen = tzcnt(i + 1) + 1;
    rev ^= N - (N >> maskLen);
}
This makes no special effort to be cache-efficient.

Thursday, 10 October 2019

Square root of bitwise NOT

The square root of bitwise NOT, if it exists, would be some function f such that f(f(x)) = not x, or in other words, f²(x) = not x. It is similar in concept to the √NOT gate in Quantum Computing, but in a different domain which makes the solution very different.

Before trying to find any specific f, it may be interesting to wonder what properties it would have to have (and lack).

  • f must be bijective, because its square is bijective.
  • not is an involution, but f cannot be an involution, because its square would then be the identity (and not is not the identity).
  • f viewed as a permutation (which can be done, because it has to be bijective) must be a derangement: if f had any fixed point, then that would also be a fixed point of f² = not, and the not function does not have a fixed point.

Does f exist?

In general, a permutation has a square root if and only if, for every even cycle length, the number of cycles of that length is even. The not function, being an involution, can only consist of swaps and fixed points, and we already knew it has no fixed points, so it must consist of only swaps. A swap is a cycle of length 2, so an even length. Since the not function operates on k bits, the size of its domain is a power of two, 2^k. That almost always guarantees an even number of swaps, except when k = 1. So, the not function on a single bit has no square root, but for more than 1 bit there are solutions.

f for even k

For 2 bits, the not function is the permutation (0 3) (1 2). An even number of even-length cycles, as predicted. The square root can be found by interleaving the cycles, giving (0 1 3 2) or (1 0 2 3). In bits, the first looks like:

in  out
00  01
01  11
10  00
11  10

This corresponds to swapping the two bits and then inverting the lsb; the other variant corresponds to inverting the lsb first and then swapping the bits.

That solution can be applied directly to other even numbers of bits, swapping the even and odd bits and then inverting the even bits, but the square root is not unique and there are multiple variants. The solution can be generalized a bit, combining a step that inverts half of the bits with a permutation that brings each half of the bits into the positions that are inverted when it is applied twice, so that half the bits are inverted the first time and the other half of the bits are inverted the second time. For example for 32 bits, there is a nice solution in x86 assembly:

bswap eax
xor eax, 0xFFFF
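
As a quick sanity check, here is a small C# sketch of the same construction (mine, not from the original post), with the byte swap written out so it needs no intrinsics; applying it twice should give bitwise NOT:

static uint Bswap(uint x)
{
    // reverse the byte order, which in particular exchanges the low and high halves
    return (x >> 24) | ((x >> 8) & 0x0000FF00) | ((x << 8) & 0x00FF0000) | (x << 24);
}

static uint F(uint x)
{
    return Bswap(x) ^ 0xFFFFu;   // permute, then invert one half
}

static bool CheckSquareRootOfNot()
{
    // spot-check f(f(x)) == ~x on a few million inputs
    // (an exhaustive check over all 2^32 inputs also works, it just takes a while)
    for (uint x = 0; x < 4000000; x++)
        if (F(F(x)) != ~x)
            return false;
    return true;
}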

f for odd k

Odd k makes things less easy. Consider k=3, so (0 7) (1 6) (2 5) (3 4). There are different ways to pair up and interleave the cycles, leading to several distinct square roots:

  1. (0 1 7 6) (2 3 5 4)
  2. (0 2 7 5) (1 3 6 4)
  3. (0 3 7 4) (1 2 6 5)
  4. etc..

in    1    2    3
000  001  010  011
001  111  011  010
010  011  111  110
011  101  110  111
100  010  001  000
101  100  000  001
110  000  100  101
111  110  101  100

These correspond to slightly tricky functions. For example, the first one has as its three output bits, from lsb to msb: the msb of the input but inverted, the parity of the input, and finally the lsb of the input. The other ones also incorporate the parity of the input in some way.
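
As a concrete illustration, here is a small C# sketch (my own, names made up) of that first variant for 3-bit values kept in the low bits of a uint; applying it twice gives bitwise NOT modulo 8:

static uint SqrtNot3(uint x)
{
    uint low  = (~x >> 2) & 1;                         // inverted msb of the input
    uint mid  = ((x ^ (x >> 1) ^ (x >> 2)) & 1) << 1;  // parity of the input
    uint high = (x & 1) << 2;                          // lsb of the input
    return high | mid | low;
}
// for every x in 0..7: SqrtNot3(SqrtNot3(x)) == (~x & 7)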

Tuesday, 21 August 2018

Signed wrapping is meaningful and algebraically nice

In this post I defend wrapping, in a way that is a bit more opinionated than my other posts. As usual I'm writing from the perspective that signed and unsigned integer types are a thin transparent wrapper around bit vectors; of course I am aware that they are often not used that way. That difference between their use and their actual nature is probably the source of the problems.

Signed wrapping is not wrong

It is often said that when signed wraparound occurs, the result is simply wrong. That is an especially narrow view to take, probably inspired by treating fixed-size bit vector arithmetic as if it is arithmetic in ℤ, which it is not. Bit vector arithmetic can be viewed as arithmetic in ℤ so long as no "overflow" occurs, but violating that condition does not make the result wrong, it makes the interpretation wrong.

Signed wrapping is meaningful

Signed wrapping works exactly the same as unsigned wrapping: it corresponds to taking the lowest k bits of the arbitrary precision result. Such a truncation therefore gives you exactly k meaningful bits, it's just a slice of the result. Some upper bits may be lost, but they can be calculated if you need them. If the whole result is meaningful, then part of it is too, at least under the interpretation of being "part of the result".

Another well-known example of benign wrapping is the calculation of the average of two non-negative signed integers. While (a + b) / 2 gives inconvenient results when the addition "overflows", (uint)(a + b) / 2 (using unsigned division) or (a + b) >>> 1 (unsigned right shift as in Java) are correct even when the addition of two positive values results in a negative value. Another way to look at it is that there is no unsigned wrapping. Nominally the integers being added here are signed but that doesn't really matter. Casting the inputs to unsigned before adding them is a no-op that can be performed mentally.
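
For concreteness, a minimal C# sketch of that average (assuming the default unchecked arithmetic):

static int Average(int a, int b)
{
    // a and b are assumed non-negative; the sum may wrap into the sign bit,
    // but reinterpreting it as unsigned and shifting right repairs that
    return unchecked((int)((uint)(a + b) >> 1));
}
// Average(int.MaxValue, int.MaxValue) == int.MaxValue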

Wrapping can also sometimes be cancelled with more wrapping. For example, taking an absolute value with wrapping and casting the result to an unsigned type of the same width results in the actual absolute value without the funny int.MinValue edge case:

(uint)abs(int.MinValue) = 
(uint)abs(-2147483648) =
(uint)(-2147483648) =
2147483648

This is not what Math.Abs in C# does: it throws, perhaps inspired by its signed return type. On the other hand, Java's Math.abs gets this right and leaves the reinterpretation up to the consumer of the result; of course in Java there is no uint32 to cast to, but you can still treat that result as if it is unsigned. Such "manual reinterpretation" is in general central to integer arithmetic, it's really about the bits, not the "default meaning" of those bits.

The principle of cancelling wrapping also has some interesting data structure applications. For example, in a Fenwick tree or Summed Area Table, the required internal integer width is the desired integer width of any range/area-sum query that you actually want to make. So a SAT over signed bytes can use an internal width of 16 bits as long as you restrict queries to an area of 256 cells or fewer, since 256 * -128 = -2^15, which still fits a signed 16 bit word.

Another nice case of cancelled wrapping is strength reductions like A * 255 = (A << 8) - A. It is usually not necessary to do that manually, but that's not the point; the point is that the wrapping is not "destructive". The overall expression wraps iff A * 255 wraps, and even then it has exactly the same result. There are cases in which the left shift experiences "signed wrapping" but A * 255 does not (for example, in 32 bits, A = 0x00800000); in those cases the subtraction also wraps and brings the result back to being "unwrapped". That is not a coincidence nor an instance of two wrongs making a right, it's a result of the intermediate wrapped result being meaningful and wrapping being algebraically nice.
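
A quick C# sketch of that example value, with the wrapping made explicit via unchecked:

static void StrengthReductionDemo()
{
    unchecked
    {
        int A = 0x00800000;
        int viaShift = (A << 8) - A;   // A << 8 wraps to int.MinValue here
        int direct = A * 255;          // no wrapping at all
        Console.WriteLine(viaShift == direct);   // True: the subtraction wraps back
    }
}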

Signed wrapping is not inherent

Signed and unsigned integers are two different ways to interpret bit vectors. Almost all operations have no specific signed or unsigned version, only a generic version that does both. There is no such thing as signed addition or unsigned addition, addition is just addition. Operations that are actually different are:

  • Comparisons except equality
  • Division and remainder
  • Right shift, maybe, but arithmetic right shift and logical right shift can both be reasonably applied in both signed and unsigned contexts
  • Widening conversion
  • Widening multiplication
One thing almost all of these have in common is that they cannot overflow, except division of the smallest integer by negative one. By the way I regard that particular quirk of division as a mistake since it introduces an asymmetry between dividing by negative one and multiplying by negative one.

The result is that the operations that can "overflow" are neither signed nor unsigned, and therefore do not overflow specifically in either of those ways. If they can be said to overflow at all, when and how they do so depends on how they are being viewed by an outside entity, not on the operation itself.

The distinction between unsigned and signed wrapping is equivalent to imagining a "border" on the ring of integers (not the mathematical Ring of Integers), either between 0 and -1 (unsigned) or between the smallest and the largest signed numbers (signed), but there is no border. Crossing either of the imaginary borders does not mean nearly as much as many people think it means.

Signed wrapping is algebraically nice

A property that wrapping arithmetic shares with arbitrary precision integer arithmetic, but not with trapping arithmetic, is that it obeys a good number of desirable algebraic laws. The root cause of this is that ℤ/2^kℤ is a ring, and trapping arithmetic is infested with implicit conditional exceptions. Signed arithmetic can largely be described by ℤ/2^kℤ, like unsigned arithmetic, since it is mostly a reinterpretation of unsigned arithmetic. That description does not cover all operations or properties, but it covers the most important aspects.

Here is a small selection of laws that apply to wrapping arithmetic but not to trapping arithmetic:

  • -(-A) = A
  • A + -A = 0
  • A - B = A + -B
  • A + (B + C) = (A + B) + C
  • A * (B + C) = A * B + A * C
  • A * -B = -A * B = -(A * B)
  • A * (B * C) = (A * B) * C
  • A * 15 = A * 16 - A
  • A * multiplicative_inverse(A) = 1 (iff A is odd, this is something not found in ℤ which has only two trivially invertible numbers, so sometimes wrapping gives you a new useful property)
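
As an aside on that last law, the inverse of an odd A modulo 2^32 can be computed with a short Newton-Raphson iteration. This is a standard technique rather than something from this post; a C# sketch:

static uint MultiplicativeInverse(uint a)
{
    // a must be odd; x = a is correct to 3 bits already (odd squares are 1 mod 8),
    // and each iteration doubles the number of correct low bits: 3, 6, 12, 24, 48
    uint x = a;
    for (int i = 0; i < 4; i++)
        x *= 2 - a * x;
    return x;   // a * x == 1 with wrapping multiplication
}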
Some laws also apply to trapping arithmetic:
  • A + 0 = A
  • A - A = 0
  • A * 0 = 0
  • A * 1 = A
  • A * -1 = -A
  • -(-(-A)) = -A

The presence of all the implicit exceptional control flow makes the code very hard to reason about, for humans as well as compilers.

Compilers react to that by not optimizing as much as they otherwise would, since they are forced to preserve the exception behaviour. Almost anything written in the source code must actually happen, and in the same order as originally written, just to preserve exceptions that are not even supposed to ever actually be triggered. The consequences of that are often seen in Swift, where code using the &+ operator is optimized quite well (including auto-vectorization) and code using the unadorned + operator can be noticeably slower.

Humans... probably don't truly want trapping arithmetic to begin with; what they want is to have their code checked for unintended wrapping. Wrapping is not a bug by itself, but unintended wrapping is. So while cancelling a "bare" double negation is not algebraically justified in trapping arithmetic, a programmer will do it anyway, since the goal is not to do trapping arithmetic but to remove bad edge cases. Statically checking for unintended wrapping would be a more complete solution, no longer relying on being lucky enough to dynamically encounter every edge case. Arbitrary precision integers would just remove most edge cases altogether, though they would rely heavily on range propagation for performance, making them a bit fragile.

But anyway, wrapping is not so bad. Just often unintended.

Thursday, 2 August 2018

Implementing Euclidean division

While implementing various kinds of division in haroldbot, I had to look up/work out how to implement different kinds of signed division in terms of unsigned division. The common truncated division (written as /s in this post and in haroldbot, /t in some other places) is the natural result of using your intuition from ℝ and writing the definition based on signs and absolute values, ensuring that the division only happens between non-negative numbers (making its meaning unambiguous) and that the result is an integer: $$\DeclareMathOperator{\sign}{sign} D /_s d = \sign(d)\cdot\sign(D)\cdot\left\lfloor\cfrac{\left|D\right|}{\left|d\right|}\right\rfloor$$ That definition leads to a plot like this, showing division by 3 as an example:

Of course the absolute values and sign functions create symmetry around the origin, and that seems like a reasonable symmetry to have. But that little plateau around the origin often makes the mirror at the origin a kind of barrier that you can run into, leading to the well-documented downsides of truncated division.

The alternative floored division and Euclidean division have a different symmetry, which does not lead to that little plateau, instead the staircase pattern simply continues:

The point of symmetry, marked by the red cross, is at (-0.5, -0.5). Flipping around -0.5 may remind you of bitwise complement, especially if you have read my earlier post visualizing addition, subtraction and bitwise complement, and mirroring around -0.5 is no more than a conditional complement. So Euclidean division may be implemented with positive division as: $$\DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\xor}{\bigoplus} D /_e d = \sign(d)\cdot(\sgn(D)\xor\left\lfloor\cfrac{D\xor\sgn(D)}{\left|d\right|}\right\rfloor)$$ Where the sgn function is -1 for negative numbers and 0 otherwise, and the circled plus is XOR. XORing with the sgn is a conditional complement, with the inner XOR being responsible for the horizontal component of the symmetry and the outer XOR being responsible for the vertical component.
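
A C# sketch of that formula (my own translation, assuming a non-zero divisor and the default unchecked arithmetic):

static int DivE(int D, int d)
{
    // same int.MinValue / -1 caveat as ordinary division
    int sgnD = D >> 31;                              // -1 if D is negative, 0 otherwise
    uint absd = (uint)((d ^ (d >> 31)) - (d >> 31)); // |d|, also fine for d = int.MinValue
    uint q = (uint)(D ^ sgnD) / absd;                // inner conditional complement, then the floor
    int r = (int)q ^ sgnD;                           // outer conditional complement
    return d < 0 ? -r : r;                           // the sign(d) factor
}
// the matching Euclidean remainder is D - d * DivE(D, d), always in [0, |d|)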

It would have been even nicer if the symmetry of the divisor also worked that way, but unfortunately that doesn't quite work out. For the divisor, the offset introduced by mirroring around -0.5 would affect the size of the steps of the staircase instead of just their position.

The /e and %e operators are available in haroldbot, though like all forms of division the general case is really too hard, even for the circuit-SAT based truth checker (the BDD engine stands no chance at all).

Saturday, 16 December 2017

Notes on negation

The well known formulas

Most readers will be familiar with -x = ~x + 1 = ~(x - 1). These are often just stated without justification, or even an explanation for why they are equivalent. There are some algebraic tricks, but I don't think they explain much, so I'll use the rings from visualizing addition, subtraction and bitwise complement. ~x + 1, in terms of such a ring, means "flip it, then draw a CCW arrow on it with a length of one tick". ~(x - 1) means "draw a CW arrow with a length of one tick, then flip". Picking CCW first is arbitrary, but the point is that the direction is reversed because flipping the ring also flips the arrow if it is drawn first, but not if it is drawn second. Equivalent to drawing an arrow, you may rotate the ring around its center.

So they're equivalent, but why do they negate? The same effect also explains
a - b = ~(~a + b), which when you substitute a = 0 almost directly gives -b = ~(b - 1). Or using the difference between one's complement and proper negation as I pointed out in that visualization post: the axis of flipping is offset by half a tick, so the effect of flipping introduces a difference of 1 which can be removed by rotating by a tick.

Bit-string notation

I first saw this notation in The Art of Computer Programming v4A, but it probably predates it. It provides a more "structural" view of negation: $$-(a10^k) =\; {\sim} (a10^k - 1) =\; {\sim} (a01^k) = ({\sim} a)10^k$$ Here juxtaposition is concatenation, and exponentiation is repetition and is done before concatenation. a is an arbitrary bit string that may be infinitely long. It does not really deal with the negation of zero, since the input is presumed to end in 10^k, but the negation of zero is not very interesting anyway.

What this notation shows is that negation can be thought of as complementing everything to the left of the rightmost set bit, a property that is frequently useful when working with the rightmost bit. A mask of the rightmost set bit and everything to the right of it can be found with
x ^ (x - 1) or, on a modern x86 processor, blsmsk. That leads to negation by XOR: $$-x = x\oplus {\sim}\text{blsmsk}(x)$$ which is sort of cheating since ~blsmsk(x) = x ^ ~(x - 1) = x ^ -x, so this just says that
-x = x ^ x ^ -x. It may still be useful occasionally, for example when a value of "known odd-ness" is being negated and then XORed with something, the negation can be merged into the XOR.

Negation by MUX

Using that mask from blsmsk, negation can be written as $$-x = \text{mux}(\text{blsmsk}(x), {\sim} x, x)$$ which combines with bit-level commutativity in some fun ways:

  • (~x + 1) + (x - 1) = mux(blsmsk(x), ~x, x) + mux(blsmsk(x), x, ~x) = ~x + x = -1
  • (~x + 1) | (x - 1) = ~x | x = -1
  • (~x + 1) ^ (x - 1) = ~x ^ x = -1
  • (~x + 1) & (x - 1) = ~x & x = 0
All of these have simpler explanations that don't involve bit-level commutativity, by rewriting them back in terms of negation. But I thought it was nice that it was possible this way too, because it makes it seem as though a +1 and a -1 on both sides of an OR, XOR or AND cancel out, which in general they definitely do not.

The formula that I've been using as an example for the proof-finder on haroldbot.nl/how.html,
(a & (a ^ a - 1)) | (~a & ~(a ^ a - 1)) == -a, is actually a negation-by-MUX, written using mux(m, x, y) = y & m | x & ~m.
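
In code form, using the same mux convention, a small sketch (helper names are mine):

static uint Blsmsk(uint x)
{
    return x ^ (x - 1);   // rightmost set bit and everything below it
}

static uint Mux(uint m, uint x, uint y)
{
    return (y & m) | (x & ~m);
}
// for any x: Mux(Blsmsk(x), ~x, x) == 0 - x
// i.e. keep x inside the blsmsk region, complement everything above it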

Friday, 1 December 2017

Bit-level commutativity

By bit-level commutativity I mean that a binary operator has the property that swapping any subset of bits between the left and right operands does not change the result. The subset may be any old thing, so in general I will call an operator o bit-level commutative if it satisfies the following property $$\forall m,a,b: a \circ b = \text{mux}(m, a, b) \circ \text{mux}(m, b, a)$$ For example, by setting m = b we get a ⊗ b = (a & b) ⊗ (a | b), sort of "bitwise sorting" the operands, with zeroes moved to the left operand and ones moved to the right operand (if possible).

Anyway, obviously AND, OR and XOR (and their complemented versions) are all bit-level commutative, indeed any purely bitwise operation (expressible as a vectorized function that takes two booleans as input) that is commutative is necessarily also bit-level commutative, for obvious reasons. Interestingly, addition is also bit-level commutative, which may be less obvious (at least in a recent coding competition, it seemed that people struggled with this). It may help to consider addition on a slightly more digit-by-digit level: $$ a + b = \sum_i 2^i (a_i + b_i)$$ It should be clear from the bit-level "exploded" sum that the individual bits a_i and b_i can be either swapped or not, independently for any i. This should get more obvious the more you think about what representing a number in a positional numeral system even means in the first place: it was always a sum, so adding two numbers is like taking the sum of two "big" sums, of course it does not matter which of the big sums any particular contribution comes from.
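
A brute-force check of that claim, as a C# sketch (helper names are mine):

static uint Mux(uint m, uint x, uint y)
{
    return (y & m) | (x & ~m);
}

static bool CheckBitLevelCommutativityOfAddition()
{
    var rng = new Random(1);
    var buf = new byte[12];
    for (int i = 0; i < 1000000; i++)
    {
        rng.NextBytes(buf);
        uint a = BitConverter.ToUInt32(buf, 0);
        uint b = BitConverter.ToUInt32(buf, 4);
        uint m = BitConverter.ToUInt32(buf, 8);
        // swap an arbitrary subset of bits between the operands; the sum must not change
        if (a + b != Mux(m, a, b) + Mux(m, b, a))
            return false;
    }
    return true;
}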

Alternatively, the old a + b = (a ^ b) + 2(a & b) (ie computing bit-level sums and then adding the carries separately) can explain it: both XOR and AND are bit-level commutative, so the whole expression is, too.

Anyway, a consequence is that a + b = (a & b) + (a | b), which I have more commonly seen derived as:

a + b = (a ^ b) + 2(a & b)  // add carries separately
      = (a | b) - (a & b) + 2(a & b) // see below
      = (a | b) + (a & b)
Where (a ^ b) = (a | b) - (a & b) can be explained as XOR being like OR, except that unlike OR it is 0 when both operands are set, so just subtract that case out. I always like having two (or more!) explanations from completely different directions like that.

Multiplication (including carryless multiplication and OR-multiplication) is of course not bit-level commutative. For example if one operand is zero and the other is odd and not 1, then the lowest bit could be swapped to make neither operand zero, and a non-zero result could be produced that way. Operations such as comparison and (by extension) min and max are obviously not bit-level commutative.

There is probably more to this, I may add some stuff later.

Sunday, 6 August 2017

Visualizing addition, subtraction and bitwise complement

A relatively well-known relation between addition and subtraction, besides the basic relations a - b = a + (-b) and a - b = -(-a + b), is a - b = ~(~a + b). But I suspect most people have simply accepted that as fact, or perhaps proved it from the 2's complement definition of negation. haroldbot can do the latter, though not as succinctly as I hoped.

But it also has a nice one-dimensional geometric interpretation, really analogous to the way a - b = -(-a + b) looks in ℤ.

As negation mirrors ℤ around the origin, complement mirrors the space of two's complement signed bitvectors around the "space" between -1 and 0. Clearly addition in the mirrored space corresponds to subtraction in the unmirrored space, so the obvious way to subtract is mirror, add, and mirror back. That's precisely what -(-a + b) does in ℤ and what ~(~a + b) does for bitvectors. An observant reader may notice that I conveniently disregarded the finiteness of the number line of fixed-size bitvectors; that's actually not a problem, but the visualization gets a bit trickier.

What is in a way more surprising is that a - b = -(-a + b) works for bitvectors, since negation does not neatly mirror the whole number line when you're talking about two's complement negation. It's around the origin again instead of around a "space", but the most negative number is unaffected.

When we remember that this number line is really a number ring (in the circular sense), that starts to make sense again. To complete this picture, you can think about holding a ring in your hands, flipping it over while holding it at two diametrically opposite points - zero and the most negative number. Of course this visualization also works for complement, just hold the ring at slightly different places: between negative one and zero, and between the minimum and maximum (which are adjacent, you could think of it as where the ring closes). There are images below, but you should probably only look at them if visualization failed to appear in your mind naturally - if you already have one, your own image is probably easier to think about.

But why

OK, all this spatial insight is fun (maybe), but what was it actually good for? I've found that thinking about the complement operation this way helps me to relate it to addition-like arithmetic operations (add, subtract, min, max, compare, etc) since they're all simple operations with "arrows" around that ring that we just flipped in our minds.

So it helps to make sense of various "mutants" of De Morgan's Law, such as:

  • ~x < ~y == x > y
  • ~x > ~y == x < y
  • ~min(~x, ~y) == max(x, y)
  • ~max(~x, ~y) == min(x, y)
  • ~avg_up(~x, ~y) == avg_down(x, y) where avg_down is the average rounding down, see also VirtualDub: Weighted averaging in SSE (part 2)
  • ~avg_down(~x, ~y) == avg_up(x, y)
  • ~paddsb(~x, y) == psubsb(x, y) (signed saturation)
  • ~psubsb(~x, y) == paddsb(x, y) (signed saturation)
  • ~paddusb(~x, y) == psubusb(x, y) (unsigned saturation)
  • ~psubusb(~x, y) == paddusb(x, y) (unsigned saturation)

A similar visualization works for the signed/unsigned "conversion" x ^ msb == x + msb == x - msb (msb is a mask with only the most significant bit set), which corresponds to rotating the ring 180 degrees. This may help when thinking about things such as:

  • x <s y == (x ^ msb) <u (y ^ msb)
  • x <u y == (x ^ msb) <s (y ^ msb)
  • max_s(x, y) == max_u(x ^ msb, y ^ msb) ^ msb
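
For example, the first two identities give a way to perform an unsigned comparison using only a signed one (or vice versa); a small C# sketch:

static bool UnsignedLess(int x, int y)
{
    // x <u y  ==  (x ^ msb) <s (y ^ msb), with msb = int.MinValue (only the top bit set)
    return (x ^ int.MinValue) < (y ^ int.MinValue);
}
// UnsignedLess(0, -1) is true: reinterpreted as unsigned, 0 < 0xFFFFFFFF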

The relationship between the different kinds of min/max can be summed up by a nice commutative diagram:

Hope it helps, for me this sort of thing has come in handy occasionally when writing SSE code.

Here are the images for two's complement negation:

and for plain one's complement:

This is in the orientation that I usually use when I think about these operations this way, but there is no particular meaning to going counter-clockwise with 0 at/near the bottom.

Thursday, 24 November 2016

Parallel prefix/suffix operations

After a "brief" hiatus..

Parallel prefix/suffix operations are a family of operations that follow either this template:

x = x ⊕ (x >> 1);
x = x ⊕ (x >> 2);
x = x ⊕ (x >> 4);
x = x ⊕ (x >> 8);
x = x ⊕ (x >> 16);
Or this template:
x = x ⊕ (x << 1);
x = x ⊕ (x << 2);
x = x ⊕ (x << 4);
x = x ⊕ (x << 8);
x = x ⊕ (x << 16);
Where ⊕ is typically OR or XOR. I've never seen ⊕=AND show up naturally, but it might be useful for something.

There is some disagreement (stemming from the usual "which end of an integer is the front" question) about which of them should be called the "parallel prefix" and which the "parallel suffix", for the purpose of this post I'll be calling the first one (with the shifts to the right) "parallel prefix". Either way, these operations are bit-level analogues to the more well known prefix sum, prefix max, prefix product, etc. The word "parallel" in the name refers to the bit-level parallelism, which has the same structure as the simple (not work-efficient) parallel prefix sum algorithm.

Some members of the family

The best-known member of parallel prefix/suffix family is the parallel prefix with ⊕ = XOR (PP-XOR), which converts Gray code back to normally-ordered integers and produces the parity of an integer in the lowest bit.
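
Written out for 32 bits, following the first template above (a C# sketch):

static uint GrayToBinary(uint gray)
{
    // parallel prefix with XOR: every output bit is the XOR of all input bits
    // at and above it, so bit 0 ends up holding the parity of the whole input
    gray ^= gray >> 1;
    gray ^= gray >> 2;
    gray ^= gray >> 4;
    gray ^= gray >> 8;
    gray ^= gray >> 16;
    return gray;
}
// GrayToBinary(x ^ (x >> 1)) == x, since x ^ (x >> 1) is the Gray code of x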

The parallel suffix with ⊕ = XOR (PS-XOR) is a carryless multiplication by -1, which isn't very interesting by itself (probably), but gives a hint about the algebraic structure that these operations give rise to.

The parallel prefix with ⊕ = OR (PP-OR) often shows up whenever bitwise operations are interacting with ranges/intervals, since PP-OR(low ^ high) gives a mask of the bits that are not constrained by the interval. I have used this operation (though I did not name it then) several times in my series on propagating bounds through bitwise operations.
This operation lends itself to some optimizations (I put "performance" in the title, which I admit I've mostly ignored), for example on x64 you could implement it as

   lzcnt rax, rax  ; input in rax
   sbb rdx, rdx
   not rdx
   shrx rax, rdx, rax
Or:
   lzcnt rax, rax
   mov edx, 64
   sub edx, eax
   or rax, -1
   bzhi rax, rax, rdx
They each have their pros and cons, and hopefully there's a better way to do it, but I couldn't really find any. I made sure to avoid the infamous false dependency that popcnt, tzcnt and lzcnt all share on any Intel processor that implements them to date. Probably the biggest problem here is that both versions require BMI2; that can be avoided, eg (suggested by Jasper Neumann)
   xor edx, edx    // provide a zero register for cmov
   bsr ecx, eax
   mov eax, -1
   not ecx         // flags not modified
   cmovz eax, edx
   shr eax, cl

Properties/structure

To start with the obvious, the prefix and suffix versions are exactly each other's mirror image. So I'm going to look just at the suffix part of the family; for the prefix part, just mirror everything.

PS-XOR(x) is clmul(x, -1)

Every step is a clmul by 1 + 2^k, and if you clmul those constants together you get -1 (the two's complement -1, not the -1 in the ring formed by XOR and clmul over bitvectors of length k, which would just be 1 again); or, from a more intuitive angle, what the PS-XOR is supposed to do in the first place is XOR each bit into all higher bits. So it inherits the properties of clmul by -1, such as invertibility (-1 is odd). The inverse of PS-XOR is x ^ (x << 1).
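
In code, PS-XOR and its claimed inverse look like this (a C# sketch, following the second template above):

static uint PsXor(uint x)
{
    // parallel suffix with XOR, equivalent to clmul(x, -1)
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    return x;
}

static uint PsXorInverse(uint x)
{
    return x ^ (x << 1);   // clmul(x, 3), and clmul(3, -1) == 1
}
// PsXorInverse(PsXor(x)) == x and PsXor(PsXorInverse(x)) == x for all x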

It also inherits that PS-XOR(x) ^ PS-XOR(y) == PS-XOR(x ^ y) from the distributivity of clmul.

Since clmul is commutative and associative, the steps in PS-XOR can be reordered.

PS-OR(x) is or_mul(x, -1)

This isn't as nice as the XOR case (XOR and clmul form a ring structure), since OR doesn't form a group because it has no negation (this is just a fancy way of saying that you can't un-OR numbers, in the way you can un-ADD by subtraction or un-XOR with XOR again) and so it doesn't extend to a ring, but it can be extended into a semiring by defining the following multiplication:

static uint or_mul(uint a, uint b)
{
    uint r = 0;
    for (int i = 0; i < 32; i++)
    {
        if ((a & 1) == 1)
            r |= b; // the only difference is here
        a >>= 1;
        b <<= 1;
    }
    return r;
}
It's not fancy math notation but you get the idea.

This (with OR) forms a commutative semiring (proof "left as an exercise to the reader"), so it has some nice properties:

  • or_mul(a, b) == or_mul(b, a) commutativity
  • or_mul(a, or_mul(b, c)) == or_mul(or_mul(a, b), c) associativity
  • or_mul(a, b | c) == or_mul(a, b) | or_mul(a, c) left distributivity
  • or_mul(a | b, c) == or_mul(a, c) | or_mul(b, c) right distributivity
But, of course, no multiplicative inverses. Except when multiplying by 1, but that doesn't count since it's the multiplicative identity.

PS-OR(x) == or_mul(x, -1), and the individual steps of the PS-OR are of the form x = or_mul(x, 1 + 2^k). The or_mul of all those constants together is, unsurprisingly, -1 (though this notation is now slightly odd since this was all a semiring; I mean the element that has all bits set).

And again, PS-OR inherits PS-OR(x) | PS-OR(y) = PS-OR(x | y) from distributivity.

And again, since or_mul is commutative and associative, the steps in PS-OR can be reordered.

PS-OR(x) is also x | -x

This one doesn't have a nice mirrored equivalent, just the obvious one where you insert a bit-reversal before and after. It also doesn't have an obvious relation to or_mul. As for why it's the case: given that (in string notation) -(a10^k) = (~a)10^k, we get a10^k | -(a10^k) = (~a|a)10^k, a string of ones followed by k zeroes; so it copies the lowest set bit into all higher bits, just as PS-OR does.

Sunday, 13 April 2014

A bitwise relational domain

This is the thing that I was referring to in my previous post where I said it didn't work out as well as I'd hoped. That's still the case. But it's not a complete failure, it's just not as good as I had hoped it might be, and it's still interesting (and, as far as I know, new).

The basic idea is really the same one as in Miné's A Few Graph-Based Relational Numerical Abstract Domains, namely a matrix where the element i,j says something about the relation between the variables i and j, and there's a closure operator that very closely resembles the Floyd–Warshall algorithm (or matrix multiplication, which comes down to the same thing). Of course saying that the difference of two variables must be in some set isn't great for bitwise operations, so my first thought was to change it to an xor, with Miné's bitfield domain (the one I've written several other posts about) as the basis. For the closure operator, the multiplication is represented by the bitwise-xor from the bitfield domain, and the addition by the bitwise-and from the bitfield domain. The rest is similarly easy.

The problem with that idea is that it tends to find almost no relations. It can represent some non-trivial relations, for example that one variable is (partially) equal to another variable xor-ed with a constant, but it's just not enough.

The problem is that xor just doesn't remember enough. So the next idea I had was to abandon xor, and go with an operator that doesn't throw away anything, and use 4-tuples to represent that set instead of 2-tuples. Due to lack of inspiration, I named the elements of the tuple (a, b, c, d), and they work like this:

a: x = 0, y = 0
b: x = 0, y = 1
c: x = 1, y = 0
d: x = 1, y = 1
Meaning that if, say, bit 0 in a is set in the element at (x,y), then variables x and y can be even simultaneously. On the diagonal of the matrix, where x=y, b and c must both be zero, because a variable cannot be unequal to itself.

The "multiplication" operator for the closure works like this:

a[i,j] &= (a[i,k] & a[k,j]) | (b[i,k] & c[k,j])
b[i,j] &= (a[i,k] & b[k,j]) | (b[i,k] & d[k,j])
c[i,j] &= (d[i,k] & c[k,j]) | (c[i,k] & a[k,j])
d[i,j] &= (d[i,k] & d[k,j]) | (c[i,k] & b[k,j])
For all k. This essentially "chains" i,k with k,j to get an implied constraint on i,j.
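
A minimal C# sketch of that closure step (my own code, with made-up names; whether a single Floyd–Warshall-style pass is enough or the loop has to be repeated until nothing changes is not something I'll claim here):

struct Rel
{
    public uint a, b, c, d;   // see the table above for what each field means
}

static void Close(Rel[,] m, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                Rel ik = m[i, k], kj = m[k, j];
                // chain the (i,k) and (k,j) relations into an implied (i,j) constraint
                m[i, j].a &= (ik.a & kj.a) | (ik.b & kj.c);
                m[i, j].b &= (ik.a & kj.b) | (ik.b & kj.d);
                m[i, j].c &= (ik.d & kj.c) | (ik.c & kj.a);
                m[i, j].d &= (ik.d & kj.d) | (ik.c & kj.b);
            }
}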

To show how powerful this is, here is what happens after asserting that z = x & y

     x                 y                 z
x    (-1, 0, 0, -1)    (-1, -1, -1, -1)  (-1, -1, 0, -1)
y    (-1, -1, -1, -1)  (-1, 0, 0, -1)    (-1, -1, 0, -1)
z    (-1, 0, -1, -1)   (-1, 0, -1, -1)   (-1, 0, 0, -1)
And now some cool things can happen. For example, if you assert that z is odd, the closure operation will propagate that back to both x and y. Or if you assert that w = x | z, it will deduce that w == x.

There is a symmetry between the upper triangle and the lower triangle. They're not the same, but an element (a, b, c, d) is mirrored by an element (a, c, b, d). One of those triangles can easily be left out, to save some space. Elements that are (-1, -1, -1, -1) could also be left out, but that doesn't typically save many elements - if something non-relational is known about a variable (that is, there is an element on the diagonal that isn't (-1, 0, 0, -1)), then everything in the same row and everything in the same column will have that information in it as well. But an element (i,j) exactly equal to (a[i,i] & a[j,j], a[i,i] & d[j,j], d[i,i] & a[j,j], d[i,i] & d[j,j]) can be left out. That removes all elements that describe a trivial relation of the kind that says that the combination of x and y must be in the Cartesian product of the sets that x and y can be in, which is completely redundant.

But it does have problems. If you xor (or add or subtract) two unconstrained variables, nothing happens. That sort of 3-variable relation isn't handled at all. This led me to explore a 3D variant of this domain (with 8-tuples), which is really great if you only consider its expressive power, but it likes to take a large amount of space, and its closure operator is very inefficient.

The algorithms I presented earlier for addition in the bitfield domain still work in this relational domain, and with just a couple of simple changes they can make use of some relational information. They want to know something about p = x ^ y and g = x & y, relational information can be available for both of them. For example, if it is known that x & y == 0 (indicated by a 4-tuple (a, b, c, 0)), g will be known to be 0, all carries will be known to be 0, and two relations will be made that look exactly the same as if you had asserted z = x | y. Of course that only works if x & y == 0 is known before the addition.

Assertions such as x & y == 0 (more generally: x ⊗ y is in (z, o) where (z, o) is an element from the unrelational bitfield domain) can be done directly (without inventing a temporary to hold x ⊗ y and asserting on it), and actually that's simpler and far more efficient.

Saturday, 28 September 2013

Determining the cardinality of a set described by a mask and bounds

In other words, calculating this \begin{equation} | \left\{ x | (x \& m) = v \land x \ge a \land x \le b \right\} | \end{equation}

A BDD (which haroldbot uses to solve this) has no trouble with that at all, but the structure of the problem is so nice that I thought it should be possible to do better. And it is, though the solution I'm about to present is probably far from optimal. I don't know how to significantly improve on it, but I just get a very non-scientific gut-feeling that there should be a fundamentally better way to do it.

The cardinality of the set without the bounds (only the mask) is obviously trivial to compute (as 1 << popcnt(~m)), and the bounds divide that set in three parts: items that are lower than the lower bound ("left part"), items that are higher than the upper bound ("right part"), and items that are in the actual set of interest ("middle part"). The main idea it is built on is that it's relatively easy to solve the problem if it didn't have a lower bound and the upper bound was a power of two, with a formula roughly similar to the one above. Using that, all the powers of two that fit before the lower bound can be counted from high to low, giving the cardinality of the "left part". The same can be done for the "right part", actually in exactly the same way, by complementing the mask and the upper bound. Obviously the cardinality of the "middle part" can be computed from this.

And here's the code. It doesn't like it when the cardinality is 2^32, and watch out for weird corner cases such as when the lower bound is bigger than the upper bound (why would you even do that?). It usually works, that's about as much as I can say - I didn't prove it correct.

static uint Cardinality(uint m, uint v, uint a, uint b)
{
    // count x such that a <= x <= b && (x & m) == v
    // split in three parts:
    // left  = 0 <= x < a
    // right = b < x <= -1
    // the piece in the middle is (1 << popcnt(~m)) - left - right
    uint left = 0;
    uint x = 0;
    for (int i = 32 - 1; i >= 0; i--)
    {
        uint mask = 1u << i;
        if (x + mask <= a)
        {
            x += mask;
            uint mask2 = 0 - mask;
            if ((x - 1 & m & mask2) == (v & mask2))
            {                        
                uint amount = 1u << popcnt(~(m | mask2));
                left += amount;
            }
        }
    }
    uint right = 0;
    uint y = 0;
    for (int i = 32 - 1; i >= 0; i--)
    {
        uint mask = 1u << i;
        if (y + mask <= ~b)
        {
            y += mask;
            uint mask2 = 0 - mask;
            if ((y - 1 & m & mask2) == (~v & m & mask2))
            {
                uint amount = 1u << popcnt(~(m | mask2));
                right += amount;
            }
        }
    }
    uint res = (uint)((1UL << popcnt(~m)) - (left + right));
    return res;
}
The loops can be merged of course, but for clarity they're separate here.

If you have any improvements, please let me know.

Saturday, 10 August 2013

Announcing haroldbot

haroldbot was an ircbot (hence the name) that solved some bitmath problems. The title is actually a lie - haroldbot has been around for a while now. But now it finally got its own website.
haroldbot.nl
Check it out, if you work with bits you will probably find this useful.

Thursday, 30 May 2013

Carryless multiplicative inverse

Note: this post is neither about the normal multiplicative inverse, nor the modular multiplicative inverse. This other post has information about the modular multiplicative inverse, which might be what you were looking for.

Mathematicians may call carryless multiplication "multiplication in GF(2^n)", but that doesn't explain how it works - recall the shift-and-add algorithm for multiplication:
static uint mul(uint a, uint b)
{
    uint r = 0;
    while (b != 0)
    {
        if ((a & 1) != 0)
            r += b;
        a >>= 1;
        b <<= 1;
    }
    return r;
}
Carryless multiplication is a very simple variation on that: do the addition without carries. That's just a XOR.
static uint cl_mul(uint a, uint b)
{
    uint r = 0;
    while (b != 0)
    {
        if ((a & 1) != 0)
            r ^= b;      // carryless addition is xor
        a >>= 1;
        b <<= 1;
    }
    return r;
}
It has some applications in complicated cryptography related algorithms, but it also seems like this should be an interesting and powerful operation when working with bits, and it may well be, but I know of almost no uses for it (besides, Intel's implementation used to be so slow that it often wouldn't have helped, though Intel made it over twice as fast in Haswell). But anyway, let's just start with its basic properties: like normal multiplication, it's commutative and associative. It's also distributive, but over xor instead of over addition. None of this is very surprising.

As an aside, using associativity, it can be shown that the parallel suffix with XOR (which does have some known uses in bitmath, for example in implementing compress_right in software), code shown below, is equivalent to a carryless multiplication by -1.
// parallel suffix with XOR
x ^= x << 1;
x ^= x << 2;
x ^= x << 4;
x ^= x << 8;
x ^= x << 16;
Every step is clearly a carryless multiplication, by 3, 5, 17, 257, and 65537 respectively. So it's equivalent to:
clmul(clmul(clmul(clmul(clmul(x, 3), 5), 17), 257), 65537) which can be rearranged (using associativity) to:
clmul(x, clmul(clmul(clmul(clmul(3, 5), 17), 257), 65537)) which works out to clmul(x, -1). Of course it was supposed to work out that way, because every bit of the result should be the XOR of all bits up to (and including) that bit, but it's nice that it also follows easily from a basic property. Incidentally if you have a full-width carryless multiplication, multiplying by -1 also computes the parallel prefix with XOR in the upper bits (the upper bit of the low word, which is the parity of the input, is shared by the suffix and the prefix.)

Carryless multiplication also shares another property with normal multiplication: there are multiplicative inverses modulo 2^n (and also modulo other numbers, but 2^n is of particular interest since we're working in that by default anyway). Again there are only inverses for odd numbers, and it's equally obvious (as for normal multiplication) why that should be so - an even multiplier will throw out at least one high order bit. First, here's an example of carrylessly multiplying x by -1 and then carrylessly multiplying that by 3.
x = {d}{c}{b}{a}    // the letters are bits
y = cl_mul(x, -1) = {d^c^b^a}{c^b^a}{b^a}{a}
z = cl_mulinv(-1) = 0011
cl_mul(y, z) = {d^c^b^a ^ c^b^a}{c^b^a ^ b^a}{b^a ^ a}{a} = {d}{c}{b}{a}
Ok, so that worked out well, and it also gives part of the answer to exercise 3 in chapter 5 of Hacker's Delight (about whether parallel prefix/suffix with XOR is invertible and how), because a carryless multiplication by -1 is the same as the parallel suffix with XOR. A carryless multiplication of y by 3 is of course just y ^ (y << 1).

But back to actually computing the inverse. The inverse had better be odd, so bit 0 is already known, and for all the other bits follow these steps
  1. if the remainder is 1, stop
  2. if bit k is 0, go to step 5
  3. set bit k in the inverse
  4. xor the remainder with input << k
  5. increment k and go to step 1
Step 4 always resets the offending bit because the input had to be odd, so it's obvious that the remainder always ends up being 1 eventually, and so the algorithm always terminates. Moreover, even in the worst case it only has to process every bit but one, and continuing after the remainder becomes 1 simply does nothing, so step 1 could read "if k is 32" (or some other number, depending on how many bits your ints are wide), which is easier to unroll and better suited for a hardware implementation (not that I've seen this operation implemented in hardware anywhere).
For example, in C# it could look like this:
static uint clmulinv(uint x)
{
    uint inv = 1;
    uint rem = x;
    for (int i = 1; i < 32; i++)
    {
        if (((rem >> i) & 1) != 0)
        {
            rem ^= x << i;
            inv |= 1u << i;
        }
    }
    return inv;
}

A variation of the algorithm to find a multiplicative inverse modulo a power of two (see inv here) also works, which is useful when clmul is fast:
static uint clmulinv(uint d)
{
    uint x = 1;
    for (int i = 0; i < 5; i++)
    {
        x = clmul(x, clmul(x, d));
    }
    return x;
}

The first iteration sets x to d, that can be done immediately to skip an iteration.

Some sample inverses
1  0x00000001
3  0xFFFFFFFF
5  0x55555555
7  0xDB6DB6DB
9  0x49249249
11 0x72E5CB97
13 0xD3A74E9D
15 0x33333333

The definition of clmul at the start of the post was meant to be just that, a definition; a faster way to emulate it is this:

static uint clmul(uint a, uint b)
{
    uint r = 0;
    do
    {
        r ^= a * (b & (0 - b));
        b &= b - 1;
        r ^= a * (b & (0 - b));
        b &= b - 1;
        r ^= a * (b & (0 - b));
        b &= b - 1;
        r ^= a * (b & (0 - b));
        b &= b - 1;
    } while (b != 0);
    return r;
}
This works by extracting a bit from b and multiplying by it (which just shifts a left), then resetting that bit. This can be unrolled safely, since once b == 0 no further changes are made to r. The 0 - b thing is due to an unfortunate misfeature of C#: negating an unsigned integer converts it to a long.

A similar trick works for the inverse:

static uint clmulinv(uint x)
{
    uint inv = 1;
    uint rem = x - 1;
    while (rem != 0)
    {
        uint m = rem & (0 - rem);
        rem ^= x * m;
        inv += m;
    }
    return inv;
}



By the way, why is this post so popular? Please let me know in the comments down below.

Thursday, 11 April 2013

Improving bounds when some bits have a known value

This problem is closely related to the series of problems discussed in calculating the lower bound of the bitwise OR of two bounded variables (and some of the posts after that one), and the algorithm is very closely related, too. The question in this post is: suppose some bits may be known to be zero and some may be known to be one, is there a better lower/upper bound than the given one, and if so, what is it? That is, calculate \begin{equation} \min _{x \in [a, b] \wedge (x | \sim z) = x \wedge (x \& \sim o) = 0 } x \end{equation} and \begin{equation} \max _{x \in [a, b] \wedge (x | \sim z) = x \wedge (x \& \sim o) = 0 } x \end{equation} where z is a bitmask containing the bits that are allowed to be 0, and o is a bitmask containing the bits that are allowed to be 1.

The idea behind the algorithms is to do a binary search (the one-sided variant) over the numbers that the masks allow, for the lowest value bigger than or equal to the original lower bound (or smaller than or equal to the original upper bound, for the new upper bound). Just as in the case of propagating bounds through XOR, it may take more than one step, so there aren't many shortcuts. I called them both "reduce" even though ReduceMin actually increases the value, because their purpose is to reduce the range [min, max].
static uint ReduceMin(uint min, uint z, uint o)
{
    uint mask = z & o;   // note: mask is a subset of r
    uint r = o;
    uint m = 0x80000000 >> nlz(mask);
    while (m != 0)
    {
        // reset the bit if it can be freely chosen
        uint newr = r ^ (m & mask);
        if (newr >= min)
            // keep the change if still within bounds
            r = newr;
        m >>= 1;
    }
    return r;
}
static uint ReduceMax(uint max, uint z, uint o)
{
    uint mask = z & o;
    uint r = ~z;
    uint m = 0x80000000 >> nlz(mask);
    while (m != 0)
    {
        // set the bit if it can be freely chosen
        uint newr = r | (m & mask);
        if (newr <= max)
            // keep the change if still within bounds
            r = newr;
        m >>= 1;
    }
    return r;
}

There is one shortcut (that I know of): using nlz on every iteration, thereby skipping iterations where the current bit isn't even changed. With the implementation of nlz I was working with, that wasn't worth it, so whether it's actually a real shortcut or not is up for debate.

Occasionally the new lower bound can be higher than the new upper bound, that means the set of values was actually empty. If you were working with clockwise intervals, that changes to "if the new bounds aren't ordered the same way as the old ones" - ie if the interval was proper and the new one isn't or vice versa, the set is empty.

Thursday, 29 November 2012

Tesseral arithmetic - useful snippets

This post doesn't introduce anything new, and is, in my opinion, boring. Feel free to skip.

My previous post didn't have too many useful snippets in it (mainly useful techniques to make your own snippets), and I thought I could improve on that. This post is not a good read in isolation - it's probably a good idea to read my previous post first, if you haven't already.

Tesseral addition (see previous post) was nice, but very often you only need to increment/decrement one dimension of a coordinate (for example when iterating over a portion of a Z-ordered grid in reading order), equivalent to adding/subtracting (1, 0) or (0, 1) to/from a coordinate. Since only one part of the coordinate changes, only about half as much code is necessary. Also, since the thing being added to the coordinate is a constant, one of the masking operations can be merged with it.
static uint IncX(uint z)
{
    uint xsum = (z | 0xAAAAAAAA) + 1;
    return (xsum & 0x55555555) | (z & 0xAAAAAAAA);
}

static uint IncY(uint z)
{
    uint ysum = (z | 0x55555555) + 2;
    return (ysum & 0xAAAAAAAA) | (z & 0x55555555);
}

static uint DecX(uint z)
{
    uint xsum = (z & 0x55555555) - 1;
    return (xsum & 0x55555555) | (z & 0xAAAAAAAA);
}

static uint DecY(uint z)
{
    uint ysum = (z & 0xAAAAAAAA) - 2;
    return (ysum & 0xAAAAAAAA) | (z & 0x55555555);
}

My previous post only had TesseralMin, not the corresponding TesseralMax, so here you go:
public static uint TesseralMax(uint z, uint w)
{
    uint xdiff = (z & 0x55555555) - (w & 0x55555555);
    uint ydiff = (z >> 1 & 0x55555555) - (w >> 1 & 0x55555555);
    uint maskx = (uint)((int)xdiff >> 31);
    uint masky = (uint)((int)ydiff >> 31);
    uint xmin = (~maskx & z) | (maskx & w);
    uint ymin = (~masky & z) | (masky & w);
    return (xmin & 0x55555555) | (ymin & 0xAAAAAAAA);
}
Note that the only difference is that the mask and the complemented mask have switched places.

This TesseralMax and the TesseralMin from the previous post can be combined with the increments and decrements (and with full tesseral addition, but that's less frequently useful) to form saturating increments and decrements, useful for sampling around a position on a Z-ordered grid without getting out of bounds.
static uint IncXSat(uint z, uint xmax)
{
    uint xsum = ((z | 0xAAAAAAAA) + 1) & 0x55555555;
    uint xdiff = xsum - xmax;
    uint maskx = (uint)((int)xdiff << 1 >> 31);
    uint xsat = (maskx & xsum) | (~maskx & xmax);
    return xsat | (z & 0xAAAAAAAA);
}

static uint IncYSat(uint z, uint ymax)
{
    uint ysum = ((z | 0x55555555) + 2) & 0xAAAAAAAA;
    uint ydiff = ysum - ymax;
    uint masky = (uint)((int)ydiff >> 31);
    uint ysat = (masky & ysum) | (~masky & ymax);
    return ysat | (z & 0x55555555);
}

static uint DecXSat(uint z, uint xmin)
{
    uint xsum = ((z & 0x55555555) - 1) & 0x55555555;
    uint xdiff = xsum - xmin;
    uint maskx = (uint)((int)xdiff << 1 >> 31);
    uint xsat = (~maskx & xsum) | (maskx & xmin);
    return xsat | (z & 0xAAAAAAAA);
}

static uint DecYSat(uint z, uint ymin)
{
    uint ysum = ((z & 0xAAAAAAAA) - 2) & 0xAAAAAAAA;
    uint ydiff = ysum - ymin;
    uint masky = (uint)((int)ydiff >> 31);
    uint ysat = (~masky & ysum) | (masky & ymin);
    return ysat | (z & 0x55555555);
}
Merging them this way is nice, because only "half" of a TesseralMin or TesseralMax is necessary that way. On the other hand, they do have the overflow problem again, though that usually won't be a problem.

Next time, back to "stuff with bounds".

Tuesday, 27 November 2012

Tesseral arithmetic

Introductions are boring, feel free to skip to the interesting stuff

Tesseral arithmetic is a type of arithmetic that operates on interleaved coordinates. That may not seem very useful, so first, when would you want to do that?

The Z-order curve is a space-filling curve (also known as Morton order, Morton coordinates, etc) that is closely related to quad trees (and octrees) and (in some contexts) improves the locality of reference when working with multidimensional data.

In essence, it maps multidimensional coordinates to single-dimensional coordinates, which can be used to address memory, and it does so in a way that sometimes leads to better locality of reference than concatenating the parts of a coordinate into a longer one. The trick is to interleave the bits. While that is not the best (ie. optimal locality of reference) mapping, it's interesting that it works so well for such a simple trick.

But where it really gets interesting is when you have interleaved coordinates and you want to do math with them. You could unpack them, do your math, and then repack, but if you follow the previous link you can see that while unpacking and packing are simple and fast relative to the mappings of other space-filling curves, unpacking and packing would add a lot of overhead to what would otherwise be simple math.

That's where tesseral arithmetic comes in.

Bitwise AND, OR and XOR still work the same way, because the bits of the result only depend on the corresponding bits in the inputs. Shifts are simple - the shift count must be multiplied by two. So for example x ^ (x << 1) becomes x ^ (x << 2) in tesseral arithmetic.

Addition is more trouble. The carries in normal addition propagate into bits they shouldn't be affecting in tesseral arithmetic. But consider what would happen if the bit pairs at odd positions would each sum to 1. A carry coming into an odd position would always be passed on, and no extra carries would be generated from odd positions. So if the bits at odd positions are just right, the bits at the even positions are summed tesserally, with the carry moving two places instead of one. Obviously this extends to the odd bits as well, when the bits at even positions are just right. This actually makes tesseral addition quite simple:
static uint TesseralAdd(uint z, uint w)
{
    uint xsum = (z | 0xAAAAAAAA) + (w & 0x55555555);
    uint ysum = (z | 0x55555555) + (w & 0xAAAAAAAA);
    return (xsum & 0x55555555) | (ysum & 0xAAAAAAAA);
}
Unsurprisingly, the same principle applies to subtraction. In subtraction, borrows are passed on unmodified through a pair of bits if they sum to zero, or in other words, if both are zero. In a way that's conceptually even simpler than addition.
static uint TesseralSubtract(uint z, uint w)
{
    uint xdiff = (z & 0x55555555) - (w & 0x55555555);
    uint ydiff = (z & 0xAAAAAAAA) - (w & 0xAAAAAAAA);
    return (xdiff & 0x55555555) | (ydiff & 0xAAAAAAAA);
}
But multiplication isn't that nice. The problem is that multiplication is basically built out of a lot of shifts and additions (it's not implemented that way in hardware anymore) and the additions aren't tesseral, nor can they be made tesseral.
Unless, of course, we implement multiplication in software:
static uint TesseralMultiply(uint z, uint w)
{
    uint x = z & 0x55555555;
    uint y = w & 0x55555555;
    uint xres = 0;
    while (x != 0)
    {
        if ((x & 1) != 0)
            xres = (xres | 0xAAAAAAAA) + y;
        y <<= 2;
        x >>= 2;
    }

    x = z & 0xAAAAAAAA;
    y = w & 0xAAAAAAAA;
    uint yres = 0;
    while (x != 0)
    {
        if ((x & 2) != 0)
            yres = (yres | 0x55555555) + y;
        y <<= 2;
        x >>= 2;
    }

    return (xres & 0x55555555) | (yres & 0xAAAAAAAA);
}
But that doesn't achieve the goal of being faster than unpacking, doing math, and repacking. If anyone has a better idea, please let me know.

So OK, no tricks for multiplication or division. But we're not done. As I hinted in my previous post, many bitwise tricks extend to tesseral arithmetic. For example, taking the absolute value of both parts of the coordinate simultaneously, using the same trick as in my previous post (working with the signbit). The basic principle is simple: replace all operations by their tesseral counterparts, then look for simplifications and other improvements.
static uint TesseralAbs(uint z)
{
    uint maskx = (uint)((int)z << 1 >> 31);
    uint masky = (uint)((int)z >> 31);

    // this is a simplified tesseral addition (followed by a xor)
    uint xabs = (z & 0x55555555) + maskx ^ maskx;
    uint yabs = (z & 0xAAAAAAAA) + masky ^ masky;

    return (xabs & 0x55555555) | (yabs & 0xAAAAAAAA);
}
The mask is known to be either all ones or all zeroes. It may seem at first as though that means we'd have to OR it with something to make the "in between" bits sum to one, but when the mask is zero there are no carries to pass on anyway. So the OR can be skipped.

But calculating absolute values of coordinates doesn't happen that often. So let's calculate an element-wise minimum, using the same basic principle as before, replace normal operators by tesseral operators. This time however, a substantial improvement over the non-tesseral version is possible.
static uint TesseralMin(uint z, uint w)
{
    // these are tesseral subtractions, of course
    uint xdiff = (z & 0x55555555) - (w & 0x55555555);
    uint ydiff = (z >> 1 & 0x55555555) - (w >> 1 & 0x55555555);

    uint maskx = (uint)((int)xdiff >> 31);
    uint masky = (uint)((int)ydiff >> 31);

    uint xmin = (maskx & z) | (~maskx & w);
    uint ymin = (masky & z) | (~masky & w);

    return (xmin & 0x55555555) | (ymin & 0xAAAAAAAA);
}
And there's something very nice about how that worked out. In the normal min, there was a problem with overflow. That doesn't happen here, because for xdiff there was an extra bit anyway, and for ydiff that extra bit could easily be arranged by shifting right by 1. That makes the comparison unsigned, though, because the "extra bit" is zero, not a sign-extended bit.
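For completeness, here is a mirrored sketch of the corresponding maximum (my addition; the masks are computed exactly as in TesseralMin, only the selection is flipped):
static uint TesseralMax(uint z, uint w)
{
    uint xdiff = (z & 0x55555555) - (w & 0x55555555);
    uint ydiff = (z >> 1 & 0x55555555) - (w >> 1 & 0x55555555);

    uint maskx = (uint)((int)xdiff >> 31);
    uint masky = (uint)((int)ydiff >> 31);

    // same masks as in TesseralMin, but select the other operand
    uint xmax = (maskx & w) | (~maskx & z);
    uint ymax = (masky & w) | (~masky & z);

    return (xmax & 0x55555555) | (ymax & 0xAAAAAAAA);
}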

So that's it for this post. Many other bitwise tricks can be extended to tesseral math, using the same basic principle. And of course this all generalizes to higher dimensions as well.
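To give an idea of what the higher-dimensional version looks like, here is a sketch of mine (not from the original post) of tesseral addition for three interleaved coordinates: the masks select every third bit, and forcing the two in-between bits of one operand to ones makes carries skip two positions at a time.
const uint MaskX3 = 0x49249249;  // bits 0, 3, 6, ...
const uint MaskY3 = 0x92492492;  // bits 1, 4, 7, ...
const uint MaskZ3 = 0x24924924;  // bits 2, 5, 8, ...

static uint TesseralAdd3(uint a, uint b)
{
    uint xsum = (a | ~MaskX3) + (b & MaskX3);
    uint ysum = (a | ~MaskY3) + (b & MaskY3);
    uint zsum = (a | ~MaskZ3) + (b & MaskZ3);
    return (xsum & MaskX3) | (ysum & MaskY3) | (zsum & MaskZ3);
}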

In the next post, I'll have some more useful snippets for tesseral arithmetic.

There are some other references for this type of arithmetic or its generalizations, for example The Art of Computer Programming volume 4A which calls this "working with fragmented fields" and Morton-order Matrices Deserve Compilers’ Support which calls this the "algebra of dilated integers".

By the way, I originally wrote this post thanks to (or maybe due to?) this article, which I found by searching for how to do coordinate arithmetic in a quad tree with Morton order. That's where the title comes from. Unfortunately the article didn't really say how to actually do it, so I worked it out (though the algebra of dilated integers had been explored before, I did not know it went by that name) and posted it for the benefit of other people who perhaps traversed the same steps up to that point.

Sunday, 18 November 2012

The basics of working with the signbit

this is a filler (in that it is much easier than the usual material), but it seems like most readers only read the fillers anyway

When I write signbit, I mean the upper bit in a bit string that is interpreted as a two's complement signed integer.

Central to working with the signbit is the idea that signed shift right aka arithmetic shift right copies the signbit to other bits, and specifically, a signed shift right by 31 (or 63 or in general, one less than the size of your numbers) broadcasts the signbit to all other bits.

Perhaps the most obvious thing you can do with that is broadcasting an arbitrary bit to all other bits. Simply shift that bit into the signbit, and then shift right by 31:
static int broadcastbit(int value, int bitindex)
{
    // put the target bit in the sign
    int temp = value << (31 - bitindex);
    // copy it to all bits
    return temp >> 31;
}
In C, this is problematic. Letting a signed left shift overflow (which could easily happen here) is undefined behaviour (UB), and right-shifting a negative signed integer is implementation-defined. But this is C# code (the source of this page will tell you so), where it's perfectly well-defined. And anyway, this is the kind of UB that is safe to use in practice; the expected thing happens when you combine a sane compiler with a typical platform (say, MSVC on x86). But, of course, purists won't like it, and on platforms without arithmetic right shift it's probably not going to work.

That actually applies to most of this blog, I suppose.

On to other tricks. This one is slightly harder to grasp, but more useful: calculating the absolute value of an integer without branching. First, the simple to understand version.
static int abs(int value)
{
    // make a mask that is all ones if negative, or all zeroes if non-negative
    int mask = value >> 31;
    // select -value if negative, or value if non-negative
    return (mask & -value) | (~mask & value);
}
That's just the usual branchless selection between two things.

The better way to do this has to do with how negation works. The negation of a number x is ~x + 1 (first definition) or ~(x - 1) (second definition). Those definitions are, of course, equivalent. The trick (and you may have seen this coming), is to make the complement and the increment/decrement conditional based on the mask.
static int abs(int value)
{
    // make a mask that is all ones if negative, or all zeroes if non-negative
    int mask = value >> 31;
    // conditionally complement, then subtract -1 (first definition)
    return (value ^ mask) - mask;
    // or, equivalently: conditionally add -1, then complement (second definition)
    // return (value + mask) ^ mask;
}
I've heard that the version of abs using the first definition is patented. That probably doesn't hold up (there will be a mountain of prior art and it's an obvious trick that anyone could derive), and no one's going to find out you're using it much less sue you for it, but you could use the version using the second definition just to be on the safe side.

One good thing about the simple version of abs is that it's using a generic branchless selection. That means you're not limited to choosing between value and -value, you can select anything. For example, you can subtract two numbers and use the sign of the difference to select the (unsigned) smallest one. That doesn't always work. The subtraction must not overflow, otherwise it selects the wrong one. The problem goes away if the inputs are smaller than ints, for example if they are bytes.
static byte min(byte x, byte y)
{
    int difference = x - y;
    // make a mask that is all ones if x < y, or all zeroes if x >= y
    int mask = difference >> 31;
    // select x if x < y, or y if x >= y
    return (byte)((mask & x) | (~mask & y));
    // alternative: use arithmetic to select the minimum
    // return (byte)(y + (difference & mask));
}
The weird mixing of signed and unsigned may be confusing. Try to think of numbers as pure bit strings and only look at the type when an operator depends on it. That's closer to what actually happens in a computer, and it's less confusing that way.

The problem also goes away if you can use the carry flag instead of the signbit, because then the flag lives in a separate place instead of in a bit of the result, so it doesn't "eat into the range of values". But high-level languages are too good for the carry flag or something like that, and don't let you use it. So here's min in x86 assembly:
    ; inputs are in eax and edx, result in eax
    sub eax, edx
    sbb ecx, ecx    ; makes ecx all ones if carry (ie. if eax < edx)
    and eax, ecx
    add eax, edx
Whether this or the more usual branchless version with cmov is faster depends on the processor.

And that has nothing to do with the signbit anymore, I know.

These tricks, and many others, also extend to tesseral arithmetic, which I'll cover in my next post, which isn't a filler.

Sunday, 16 September 2012

Calculating the lower and upper bound of the bitwise OR of two variables that are bounded and may have bits known to be zero

This new problem clearly is related to two of my previous posts. But this time, there is slightly more information. It may look like a contrived, purely theoretical, problem, but it actually has applications in abstract interpretation. Static knowledge about the values that variables could have at runtime often takes the form of a range and a number that the variable is known to be a multiple of, which is most commonly a power of two.

The lower bound will be \begin{equation} \min _{x \in [a, b] \wedge m\backslash x, y \in [c, d] \wedge n\backslash y} x | y \end{equation} And the upper bound will be \begin{equation} \max _{x \in [a, b] \wedge m\backslash x, y \in [c, d] \wedge n\backslash y} x | y \end{equation} Where m\x means "x is divisible by m".

So how can we calculate them faster than direct evaluation? I don't know, and to my knowledge, no one else does either. But if sound (ie. only overapproximating) but non-tight bounds are OK, then there is a way. Part of the trick is constraining m and n to be powers of two. It's safe to use m = m & -m. That should look familiar - it's extracting the rightmost bit of m. Another explanation of "the rightmost bit of m" is "the highest power of two that divides m". That doesn't rule out any values of x that were valid before, so it's a sound approximation.

Strangely, for minOR, if the bounds are pre-rounded to their corresponding powers of two, there is absolutely no difference in the code whatsoever. It is possible to set a bit that is known to be zero in that bound, but that can only happen if that bit is one in the other bound anyway, so it doesn't affect the result. The other case, setting a bit that is not known to be zero, is the same as it would be with only the range information.
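Concretely, pre-rounding a bound to the nearest multiple of a power-of-two m could look like the sketch below (my code, not from the post; it assumes m has already been reduced with m = m & -m):
static uint RoundLowerUp(uint a, uint m)
{
    // round a up to the next multiple of m (m is a power of two)
    return (a + m - 1) & (0 - m);
}
static uint RoundUpperDown(uint b, uint m)
{
    // round b down to the previous multiple of m (m is a power of two)
    return b & (0 - m);
}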

maxOR is a problem though. In maxOR, bits at the right are set which may be known to be zero. Some of those bits may have to be reset. But how many? To avoid resetting too many bits, we have to round the result down to a multiple of min(m, n). That's clearly sound - if a bit can't be one in either x or y, obviously it can't be one in the result. But it turns out not to be tight - for example for [8, 9] 1\x and [0, 8] 4\y, it computes 0b1111, even though the last two bits can only be 0b00 or 0b01 (y does not contribute to these bits, and the range of x is so small that those bits only take those two values), so the tight upper bound is 0b1101. If that's acceptable, the code would be
static uint maxOR(uint a, uint b, uint c, uint d, uint m, uint n)
{
    uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint resettable = b & d & (resettableb | resettabled);
    uint target = resettable == 0 ? 0 : 1u << bsr(resettable);
    uint targetb = target & resettableb;
    uint targetd = target & resettabled & ~resettableb;
    uint newb = b | (targetb == 0 ? 0 : targetb - 1);
    uint newd = d | (targetd == 0 ? 0 : targetd - 1);
    uint mask = (m | n) & (0 - (m | n));
    return (newb | newd) & (0 - mask);
}
Which also uses a sneaky way of getting min(m, n) - by ORing them and then taking the rightmost bit. Because why not.
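Plugging the example above into that code confirms the non-tight result:
// x in [8, 9] with 1\x, y in [0, 8] with 4\y
uint upper = maxOR(8, 9, 0, 8, 1, 4);   // 15 = 0b1111, while the tight bound would be 13 = 0b1101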

I haven't (yet?) found a nice way to calculate the tight upper bound. Even if I do, that still leaves things non-tight when the old m or n were not powers of two.

Friday, 14 September 2012

Calculating the lower and upper bounds of the bitwise AND of two bounded variables

This post is closely related to the previous post and the post before it, so I strongly suggest you read those two first.

It's the same idea as before, but with bitwise AND instead of OR. That leads to some interesting symmetries. First, the definitions. The lower bound will be \begin{equation} \min _{x \in [a, b], y \in [c, d]} x \& y \end{equation} And the upper bound will be \begin{equation} \max _{x \in [a, b], y \in [c, d]} x \& y \end{equation} The algorithms given by Warren are
unsigned minAND(unsigned a, unsigned b, 
                unsigned c, unsigned d) {
   unsigned m, temp; 
 
   m = 0x80000000; 
   while (m != 0) {
      if (~a & ~c & m) {
         temp = (a | m) & -m; 
         if (temp <= b) {a = temp; break;} 
         temp = (c | m) & -m; 
         if (temp <= d) {c = temp; break;} 
      } 
      m = m >> 1; 
   } 
   return a & c; 
}
unsigned maxAND(unsigned a, unsigned b, 
                unsigned c, unsigned d) {
   unsigned m, temp; 
 
   m = 0x80000000; 
   while (m != 0) {
      if (b & ~d & m) {
         temp = (b & ~m) | (m - 1); 
         if (temp >= a) {b = temp; break;} 
      } 
      else if (~b & d & m) {
         temp = (d & ~m) | (m - 1); 
         if (temp >= c) {d = temp; break;} 
      } 
      m = m >> 1; 
   } 
   return b & d; 
}
Obviously, they follow the same basic idea: try to set a bit so you can reset the bits to the right of it in the lower bound, or try to reset a bit so you can set the bits to the right of it in the upper bound. The same reasoning about starting at 0x80000000 >> nlz(~a & ~c) or 0x80000000 >> nlz(b ^ d) applies, and the same reasoning about "bits at and to the right of the highest set bit of a ^ b" applies as well. I'll skip the "sparse loops" this time; they're nice enough but mainly instructive, and repeating the same idea twice doesn't make it twice as instructive. So straight to the loopless algorithms:
static uint minAND(uint a, uint b, uint c, uint d)
{
    uint settablea = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint settablec = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint settable = ~a & ~c & (settablea | settablec);
    uint target = settable == 0 ? 0 : 1u << bsr(settable);
    uint targeta = target & settablea;
    uint targetc = target & settablec & ~settablea;
    uint newa = a & (targeta == 0 ? 0xFFFFFFFF : 0-targeta);
    uint newc = c & (targetc == 0 ? 0xFFFFFFFF : 0-targetc);
    return newa & newc;
}
static uint maxAND(uint a, uint b, uint c, uint d)
{
    uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint candidatebitsb = b & ~d & resettableb;
    uint candidatebitsd = ~b & d & resettabled;
    uint candidatebits = candidatebitsb | candidatebitsd;
    uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);
    uint targetb = target & b;
    uint targetd = target & d & ~b;
    uint newb = b | (targetb == 0 ? 0 : targetb - 1);
    uint newd = d | (targetd == 0 ? 0 : targetd - 1);
    return newb & newd;
}
Symmetry everywhere. But not really anything new to explain.

Next post, something new to explain.

Calculating the upper bound of the bitwise OR of two bounded variables

This post is closely related to the previous one, so I strongly suggest you read that one first.

The only difference with the previous post, is that this time, we're interested in the upper bound instead of the lower bound. In other words, evaluate
\begin{equation} \max _{x \in [a, b], y \in [c, d]} x | y \end{equation} The algorithm given by Warren in Hacker's Delight is
unsigned maxOR(unsigned a, unsigned b, 
               unsigned c, unsigned d) {
   unsigned m, temp; 
 
   m = 0x80000000; 
   while (m != 0) {
      if (b & d & m) {
         temp = (b - m) | (m - 1); 
         if (temp >= a) {b = temp; break;} 
         temp = (d - m) | (m - 1); 
         if (temp >= c) {d = temp; break;} 
      } 
      m = m >> 1; 
   } 
   return b | d; 
}
And it's really the same sort of idea as the algorithm to calculate the minimum, except this time we're looking for a place where both b and d are one, so we can try to reset that bit and set all the bits to the right of it.

Warren notes that m can start at 0x80000000 >> nlz(b & d), and once again the same principle holds: it's enough to only look at those bits which are one in b & d, and they can be visited from high to low with bsr.
static uint maxOR(uint a, uint b, uint c, uint d)
{
    uint bits = b & d;
    while (bits != 0)
    {
        uint m = 1u << bsr(bits);

        uint temp;
        temp = (b - m) | (m - 1);
        if (temp >= a) { b = temp; break; }
        temp = (d - m) | (m - 1);
        if (temp >= c) { d = temp; break; }

        bits ^= m;
    }
    return b | d;
}
And also, again, we can use that the bit we're looking for in b must be at or to the right of the leftmost bit in a ^ b (c ^ d for d), and that the selected bit doesn't actually have to be changed.
static uint maxOR(uint a, uint b, uint c, uint d)
{
    uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint candidatebits = b & d & (resettableb | resettabled);
    uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);
    uint targetb = target & resettableb;
    uint targetd = target & resettabled & ~resettableb;
    uint newb = b | (targetb == 0 ? 0 : targetb - 1);
    uint newd = d | (targetd == 0 ? 0 : targetd - 1);
    return newb | newd;
}
Most of the code should be obvious after a moment's thought, but something interesting and non-symmetric happens for targetd. There, I had to make sure that a change is not made to both bounds (that would invalidate the whole idea of "being able to make the change without affecting that bit in the result"). In minOR that happened automatically, because it looked at positions where the bits were different, so the two targets couldn't both be non-zero. Here, one of the bounds has to be explicitly prioritized over the other.

Next post, maybe the same sort of thing but for bitwise AND. Then again, maybe not. I'll see what I can come up with.
edit: bitwise AND it is.

Thursday, 13 September 2012

Calculating the lower bound of the bitwise OR of two bounded variables

What does that even mean?

Suppose you have the variables x in [a, b] and y in [c, d]. The question then is: what is the lowest possible value of x | y where x and y are both in their corresponding ranges. In other words, evaluate
\begin{equation} \min _{x \in [a, b], y \in [c, d]} x | y \end{equation} At a maximum of 2^64 iterations, direct evaluation is clearly not an option for 32-bit integers.
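(For testing at small bit widths, say 8-bit bounds, a brute-force reference is still practical; this little sketch is mine, not part of the post.)
static uint minORBruteForce(uint a, uint b, uint c, uint d)
{
    // only intended for small ranges, e.g. bounds that fit in 8 bits
    uint min = uint.MaxValue;
    for (uint x = a; x <= b; x++)
        for (uint y = c; y <= d; y++)
            min = Math.Min(min, x | y);
    return min;
}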

Fortunately, there is an algorithm with complexity linear in the number of bits, given by Warren in Hacker's Delight, Propagating Bounds through Logical Operations, which the license permits me to show here:
unsigned minOR(unsigned a, unsigned b, 
               unsigned c, unsigned d) {
   unsigned m, temp; 
 
   m = 0x80000000; 
   while (m != 0) {
      if (~a & c & m) {
         temp = (a | m) & -m; 
         if (temp <= b) {a = temp; break;} 
      } 
      else if (a & ~c & m) {
         temp = (c | m) & -m; 
         if (temp <= d) {c = temp; break;} 
      } 
      m = m >> 1; 
   } 
   return a | c; 
}

So let's break down what it's doing. It starts at the MSB, and then it searches for either the highest bit that is zero in a and one in c such that changing a to have that bit set and all bits to the right of it unset would not make the new a higher than b, or the highest bit that is zero in c and one in a such that changing c to have that bit set and all bits to the right of it unset would not make the new c higher than d, whichever one comes first.

That's literally easier to code than to explain, and I haven't even explained yet why it works.
Suppose the highest such bit is found in a. Setting that bit in a does not affect the value of a | c; after all, that bit must have been set in c already, so it was already set in a | c, too. However, resetting the bits to the right of that bit can lower a | c. Notice that it is pointless to continue looking at lower bits - in a there are no more bits to reset, and for c there are no more bits that have the corresponding bit in a set.

Warren notes that m could start at 0x80000000 >> nlz(a ^ c) (where nlz is the "number of leading zeros" function), meaning it starts looking at the first bit that is different in a and c. But we can do better. Not only can we start at the first bit which is different in a and c, we could look at only those bits. That requires frequent invocation of the nlz function (or bsr, bit scan reverse, giving the index of the leftmost bit), but it maps to a fast instruction on many platforms.
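The post treats nlz and bsr as given; one possible definition in C# (my assumption - these days System.Numerics.BitOperations provides the intrinsics, which did not exist when this was written) would be:
static int nlz(uint x)
{
    // number of leading zeros, 32 for x == 0
    return System.Numerics.BitOperations.LeadingZeroCount(x);
}
static int bsr(uint x)
{
    // index of the highest set bit, only meaningful for x != 0
    return 31 - System.Numerics.BitOperations.LeadingZeroCount(x);
}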
uint minOR(uint a, uint b, uint c, uint d)
{
    uint bits = a ^ c;
    while (bits != 0)
    {
        // get the highest bit
        uint m = 1u << (nlz(bits) ^ 31);
        // remove the bit
        bits ^= m;
        if ((a & m) == 0)
        {
            uint temp = (a | m) & -m;
            if (temp <= b) { a = temp; break; }
        }
        else
        {
            uint temp = (c | m) & -m;
            if (temp <= d) { c = temp; break; }
        }
    }
    return a | c;
}
One interesting consequence of looking only at the bits that are different is that the second if disappears - the case where the bits are equal is ruled out by looking only at the different bits in the first place.

But that is not all. The bit positions at which the <= operators could return true are precisely all those at and to the right of one important point: the highest set bit in a ^ b (or c ^ d for the other bound). Why? Well, the upper bounds are not lower than the lower bounds, so the first bit at which they differ must be the first position at which the lower bound has a zero where the upper bound has a one. Setting that bit to one and all bits to the right of it to zero in the lower bound is clearly valid (ie. it doesn't make it higher than the upper bound), but whether that bit can actually be set depends on the other lower bound as well.

What that means in practical terms, is that the value of m that first passes the tests is directly computable. No loops required. Also, because the test to check whether the new bound is still less than or equal to the upper bound isn't necessary anymore (by construction, that test always passes), the bit doesn't even have to be set anymore - without the test the new value isn't really needed, and the entire idea was that setting that bit would not change the result, so setting it is pointless.
uint minOR(uint a, uint b, uint c, uint d)
{
    uint settablea = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint settablec = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint candidatebitsa = (~a & c) & settablea;
    uint candidatebitsc = (a & ~c) & settablec;
    uint candidatebits = candidatebitsa | candidatebitsc;

    uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);
    uint targeta = c & target;
    uint targetc = a & target;

    uint newa = a & ~(targeta == 0 ? 0 : targeta - 1);
    uint newc = c & ~(targetc == 0 ? 0 : targetc - 1);
    return newa | newc;
}
Sadly, there are an awful lot of conditionals in there, which could be branches. But they could also be conditional moves. And on x86 at least, both bsr and lzcnt set a nice condition flag if the input was zero, so it's really not too bad in practice. It is, in my opinion, a pity that there aren't more instructions to deal with leftmost bits, while instructions that deal with the rightmost bit are being added. They are nice, I will admit, but the rightmost bit could already be dealt with efficiently, while the leftmost bit is somewhat problematic.

Next post, the same thing but for the upper bound. This post is the start of a series of posts that address the propagation of intervals through bitwise operations.