Thursday, 2 August 2018

Implementing Euclidean division

While implementing various kinds of division in haroldbot, I had to look up/work out how to implement different kinds of signed division in terms of unsigned division. The common truncated division (written as /s in this post and in haroldbot, /t in some other places) is natural result of using your intuition from ℝ and writing the definition based on signs and absolute values, ensuring that the division only happens between non-negative numbers (making its meaning unambiguous) and that the result is an integer: $$\DeclareMathOperator{\sign}{sign} D /_s d = \sign(d)\cdot\sign(D)\cdot\left\lfloor\cfrac{\left|D\right|}{\left|d\right|}\right\rfloor$$ That definition leads to a plot like this, showing division by 3 as an example:

Of course the absolute values and sign functions create symmetry around the origin, and that seems like a reasonable symmetry to have. But that little plateau around the origin often makes the mirror at the origin a kind of barrier that you can run into, leading to the well-documented downsides of truncated division.

The alternative floored division and Euclidean division have a different symmetry, which does not lead to that little plateau, instead the staircase pattern simply continues:

The point of symmetry, marked by the red cross, is at (-0.5, -0.5). Flipping around -0.5 may remind you of bitwise complement, especially if you have read my earlier post visualizing addition, subtraction and bitwise complement, and mirroring around -0.5 is no more than a conditional complement. So Euclidean division may be implemented with positive division as: $$\DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\xor}{\bigoplus} D /_e d = \sign(d)\cdot(\sgn(D)\xor\left\lfloor\cfrac{D\xor\sgn(D)}{\left|d\right|}\right\rfloor)$$ Where the sgn function is -1 for negative numbers and 0 otherwise, and the circled plus is XOR. XORing with the sgn is a conditional complement, with the inner XOR being responsible for the horizontal component of the symmetry and the outer XOR being responsible for the vertical component.

It would have been even nicer if the symmetry of the divisor also worked that way, but unfortunately that doesn't quite work out. For the divisor, the offset introduced by mirroring around -0.5 would affect the size of the steps of the staircase instead of just their position.

The /e and %e operators should be live in haroldbot soon, though like all forms of division the general case is really too hard, even for the circuit-SAT based truth checker (the BDD engine stands no chance at all).

Saturday, 16 December 2017

Notes on negation

The well known formulas

Most readers will be familiar with -x = ~x + 1 = ~(x - 1). These are often just stated without justification, or even an explanation for why they are equivalent. There are some algebraic tricks, but I don't think they explain much, so I'll use the rings from visualizing addition, subtraction and bitwise complement. ~x + 1, in terms of such a ring, means "flip it, then draw a CCW arrow on it with a length of one tick". ~(x - 1) means "draw a CW arrow with a length of one tick, then flip". Picking CCW first is arbitrary, but the point is that the direction is reversed because flipping the ring also flips the arrow if it is drawn first, but not if it is drawn second. Equivalent to drawing an arrow, you may rotate the ring around its center.

So they're equivalent, but why do they negate. The same effect also explains
a - b = ~(~a + b), which when you substitute a = 0 almost directly gives -b = ~(b - 1). Or using the difference between one's complement and proper negation as I pointed out in that visualization post: the axis of flipping is offset by half a tick, so the effect of flipping introduces a difference of 1 which can be removed by rotating by a tick.

Bit-string notation

I first saw this notation in The Art of Computer Programming v4A, but it probably predates it. It provides a more "structural" view of negation: $$-(a10^k) =\; {\sim} (a10^k - 1) =\; {\sim} (a01^k) = ({\sim} a)10^k$$ Here juxtaposition is concatenation, and exponentiation is repetition and is done before concatenation. a is an arbitrary bit string that may be infinitely long. It does not really deal with the negation of zero, since the input is presumed to end in 10k, but the negation of zero is not very interesting anyway.

What this notation shows is that negation can be thought of as complementing everything to the left of the rightmost set bit, a property that is frequently useful when working with the rightmost bit. A mask of the rightmost set bit and everything to the right of it can be found with
x ^ (x - 1) or, on a modern x86 processor, blsmsk. That leads to negation by XOR: $$-x = x\oplus {\sim}\text{blsmsk}(x)$$ which is sort of cheating since ~blsmsk(x) = x ^ ~(x - 1) = x ^ -x, so this said that
-x = x ^ x ^ -x. It may still be useful occasionally, for example when a value of "known odd-ness" is being negated and then XORed with something, the negation can be merged into the XOR.

Negation by MUX

Using that mask from blsmsk, negation can be written as $$-x = \text{mux}(\text{blsmsk}(x), {\sim} x, x)$$ which combines with bit-level commutativity in some fun ways:

  • (~x + 1) + (x - 1) = mux(blsmsk(x), ~x, x) + mux(blsmsk(x), x, ~x) = ~x + x = -1
  • (~x + 1) | (x - 1) = ~x | x = -1
  • (~x + 1) ^ (x - 1) = ~x ^ x = -1
  • (~x + 1) & (x - 1) = ~x & x = 0
All of these have simpler explanations that don't involve bit-level commutativity, by rewriting them back in terms of negation. But I thought it was nice that it was possible this way too, because it makes it seem as though a +1 and a -1 on both sides of an OR, XOR or AND cancel out, which in general they definitely do not.

The formula that I've been using as an example for the proof-finder on,
(a & (a ^ a - 1)) | (~a & ~(a ^ a - 1)) == -a, is actually a negation-by-MUX, written using mux(m, x, y) = y & m | x & ~m.

Friday, 1 December 2017

Bit-level commutativity

By bit-level commutativity I mean that a binary operator has the property that swapping any subset of bits between the left and right operands does not change the result. The subset may be any old thing, so in general I will call an operator o bit-level commutative if it satisfies the following property $$\forall m,a,b: a \circ b = \text{mux}(m, a, b) \circ \text{mux}(m, b, a)$$ For example, by setting m = b we get a ⊗ b = (a & b) ⊗ (a | b), sort of "bitwise sorting" the operands, with zeroes moved to the left operand and ones moved to the right operand (if possible).

Anyway, obviously AND, OR and XOR (and their complemented versions) are all bit-level commutative, indeed any purely bitwise operation (expressible as a vectorized function that takes two booleans as input) that is commutative is necessarily also bit-level commutative, for obvious reasons. Interestingly, addition is also bit-level commutative, which may be less obvious (at least in a recent coding competition, it seemed that people struggled with this). It may help to consider addition on a slightly more digit-by-digit level: $$ a + b = \sum_i 2^i (a_i + b_i)$$ It should be clear from the bit-level "exploded" sum, that the individual bits ai and bi can be either swapped or not, independently for any i. This should get more obvious the more you think about what representing a number in a positional numeral system even means in the first place: it was always a sum, so adding two numbers is like taking the sum of two "big" sums, of course it does not matter which of the big sums any particular contribution comes from.

Alternatively, the old a + b = (a ^ b) + 2(a & b) (ie computing bit-level sums and then adding the carries separately) can explain it: both XOR and AND are bit-level commutative, so the whole expression is, too.

Anyway, a consequence is that a + b = (a & b) + (a | b), which I have more commonly seen derived as:

a + b = (a ^ b) + 2(a & b)  // add carries separately
      = (a | b) - (a & b) + 2(a & b) // see below
      = (a | b) + (a & b)
Where (a ^ b) = (a | b) - (a & b) can be explained as XOR being like OR, except that unlike OR it is 0 when both operands are set, so just subtract that case out. I always like having two (or more!) explanations from completely different directions like that.

Multiplication (including carryless multiplication and OR-multiplication) is of course not bit-level commutative. For example if one operand is zero and the other is odd and not 1, then the lowest bit could be swapped to make neither operand zero, and a non-zero result could be produced that way. Operations such as comparison and (by extension) min and max are obviously not bit-level commutative.

There is probably more to this, I may add some stuff later.

Sunday, 6 August 2017

Visualizing addition, subtraction and bitwise complement

A relatively well-known relation between addition and subtraction, besides the basic relations a - b = a + (-b) and a - b = -(-a + b), is a - b = ~(~a + b). But I suspect most people have simply accepted that as fact, or perhaps proved it from the 2's complement definition of negation. haroldbot can do the latter, though not as succinctly as I hoped.

But it also has a nice one-dimensional geometric interpretation, really analogous to the way a - b = -(-a + b) looks in ℤ.

As negation mirrors ℤ around the origin, complement mirrors the space of two's complement signed bitvectors around the "space" between -1 and 0. Clearly addition in the mirrored space corresponds to subtraction in the unmirrored space, so the obvious way to subtract is mirror, add, and mirror back. That's precisely what -(-a + b) does in ℤ and what ~(~a + b) does for bitvectors. An observant reader may notice that I convenient disregarded the finiteness of the number line of fixed-size bitvectors, that's actually not a problem but the visualization gets a bit trickier.

What is in a way more surprising is that a - b = -(-a + b) works for bitvectors, since negation does not neatly mirror the whole number line when you're talking about two's complement negation. It's around the origin again instead of around a "space", but the most negative number is unaffected.

When we remember that this number line is really a number ring (in the circular sense), that starts to make sense again. To complete this picture, you can think about holding a ring in your hands, flipping it over while holding it at two diametrically opposite points - zero and the most negative number. Of course this visualization also works for complement, just hold the ring at slightly different places: between negative one and zero, and between the minimum and maximum (which are adjacent, you could think of it as where the ring closes). There are images below, but you should probably only look at them if visualization failed to appear in your mind naturally - if you already have one, your own image is probably easier to think about.

But why

OK all this spatial insight is fun (maybe), but what was it actually good for. I've found that thinking about the complement operation this way helps me to relate it to addition-like arithmetic operations (add, subtract, min, max, compare, etc) since they're all simple operations with "arrows" around that ring that we just flipped in our minds.

So it helps to make sense of various "mutants" of De Morgan's Law, such as:

  • ~x < ~y == x > y
  • ~x > ~y == x < y
  • ~min(~x, ~y) == max(x, y)
  • ~max(~x, ~y) == min(x, y)
  • ~avg_up(~x, ~y) == avg_down(x, y) where avg_down is the average rounding down, see also VirtualDub: Weighted averaging in SSE (part 2)
  • ~avg_down(~x, ~y) == avg_up(x, y)
  • ~paddsb(~x, y) == psubsb(x, y) (signed saturation)
  • ~psubsb(~x, y) == paddsb(x, y) (signed saturation)
  • ~paddusb(~x, y) == psubusb(x, y) (unsigned saturation)
  • ~psubusb(~x, y) == paddusb(x, y) (unsigned saturation)

A similar visualization works for the signed/unsigned "conversion" x ^ msb == x + msb == x - msb (msb is a mask with only the most significant bit set), which corresponds to rotating the ring 180 degrees. This may help when thinking about things such as:

  • x <s y == (x ^ msb) <u (y ^ msb)
  • x <u y == (x ^ msb) <s (y ^ msb)
  • max_s(x, y) == max_u(x ^ msb, y ^ msb) ^ msb

The relationship between the different kinds of min/max can be summed up by a nice commutative diagram:

Hope it helps, for me this sort of thing has come in handy occasionally when writing SSE code.

Here are the images two's complement negation:

and for plain one's complement:

This is in the orientation that I usually use when I think about these operations this way, but there is not particular meaning to going counter-clockwise with 0 at/near the bottom.

Thursday, 24 November 2016

Parallel prefix/suffix operations

After a "brief" hiatus..

Parallel prefix/suffix operations is a family of operations that follow either this template:

x = x ⊕ (x >> 1);
x = x ⊕ (x >> 2);
x = x ⊕ (x >> 4);
x = x ⊕ (x >> 8);
x = x ⊕ (x >> 16);
Or this template:
x = x ⊕ (x << 1);
x = x ⊕ (x << 2);
x = x ⊕ (x << 4);
x = x ⊕ (x << 8);
x = x ⊕ (x << 16);
Where ⊕ is typically OR or XOR. I've never seen ⊕=AND show up naturally, but it might be useful for something.

There is some disagreement (stemming from the usual "which end of an integer is the front" question) about which of them should be called the "parallel prefix" and which the "parallel suffix", for the purpose of this post I'll be calling the first one (with the shifts to the right) "parallel prefix". Either way, these operations are bit-level analogues to the more well known prefix sum, prefix max, prefix product, etc. The word "parallel" in the name refers to the bit-level parallelism, which has the same structure as the simple (not work-efficient) parallel prefix sum algorithm.

Some members of the family

The best-known member of parallel prefix/suffix family is the parallel prefix with ⊕ = XOR (PP-XOR), which converts Gray code back to normally-ordered integers and produces the parity of an integer in the lowest bit.

The parallel suffix with ⊕ = XOR (PS-XOR) is a carryless multiplication by -1, which isn't very interesting by itself (probably), but gives a hint about the algebraic structure that these operations give rise to.

The parallel prefix with ⊕ = OR (PP-OR) often shows up whenever bitwise operations are interacting with ranges/intervals, since PP-OR(low ^ high) gives a mask of the bits that are not constrained by the interval. I have used this operation (though I did not name it then) several times in my series on propagating bounds through bitwise operations.
This operation lends itself to some optimizations (I put "performance" in the title, which I admit I've mostly ignored), for example on x64 you could implement it as

   lzcnt rax, rax  ; input in rax
   sbb rdx, rdx
   not rdx
   shrx rax, rdx, rax
   lzcnt rax, rax
   mov edx, 64
   sub edx, eax
   or rax, -1
   bzhi rax, rax, rdx
The each have their pros and cons, and hopefully there's a better way to do it, but I couldn't really find any. I made sure to avoid the infamous false dependency that popcnt, tzcnt and lzcnt all share on any Intel processor that implements them to date. Probably the biggest problem here is that both versions require BMI2, that can be avoided, eg (suggested by Jasper Neumann)
   xor edx, edx    // provide a zero register for cmov
   bsr ecx, eax
   mov eax, -1
   not ecx         // flags not modified
   cmovz eax, edx
   shr eax, cl


To start with the obvious, the prefix and suffix versions are exactly each others mirror image. So I'm going to look just at the suffix part of the family, for the prefix part just mirror everything.

PS-XOR(x) is clmul(x, -1)

Since every step is a clmul by 1 + 2k and if you clmul those constants together you get -1 (the two's complement -1, not the -1 in the ring formed by XOR and clmul over bitvectors of length k, which would just be 1 again), or from a more intuitive angle, what the PS-XOR is supposed to do in the first place is XOR each bit into all higher bits. So it inherits the properties, such as inversibility (-1 is odd). The inverse of PS-XOR is x ^ (x << 1).

It also inherits that PS-XOR(x) ^ PS-XOR(y) == PS-XOR(x ^ y) from the distributivity of clmul.

Since clmul is commutative and associative, the steps in PS-XOR can be reordered.

PS-OR(x) is or_mul(x, -1)

This isn't as nice (XOR and clmul form a ring structure) since OR doesn't form a group because it has no negation (this is just a fancy way of saying that you can't un-OR numbers, in the way you can un-ADD by subtraction or un-XOR with XOR again) and so it doesn't extended to a ring, but it can be extended into a semiring by defining the following multiplication:

static uint or_mul(uint a, uint b)
    uint r = 0;
    for (int i = 0; i < 32; i++)
        if ((a & 1) == 1)
            r |= b; // the only difference is here
        a >>= 1;
        b <<= 1;
    return r;
It's not fancy math notation but you get the idea.

This (with OR) forming a commutative semiring (proof "left as an exercise to the reader"), it has some nice properties:

  • or_mul(a, b) == or_mul(b, a) commutivity
  • or_mul(a, or_mul(b, c)) == or_mul(or_mul(a, b), c) associativity
  • or_mul(a, b | c) == or_mul(a, b) | or_mul(a, c) left distributivity
  • or_mul(a | b, c) == or_mul(a, c) | or_mul(b, c) right distributivity
But, of course, no multiplicative inverses. Except when multiplying by 1, but that doesn't count since it's the multiplicative identity.

PS-OR(x) == or_mul(x, -1), and the individual steps of the PS-OR are of the form x = or_mul(x, 1 + 2k). The or_mul of all those constants together is, unsurprisingly, -1 (though this notation is now slightly odd since this was all a semiring, I mean the element that has all bits set).

And again, PS-OR inherits PS-OR(x) | PS-OR(y) = PS-OR(x | y) from distributivity.

And again, since or_mul is commutative and associative, the steps in PS-OR can be reordered.

PS-OR(x) is also x | -x

This one doesn't have a nice mirrored equivalent, just the obvious one where you insert a bit-reversal before and after. It also doesn't have an obvious relation to or_mul. As for why it's the case, given that (in string notation) -(a10k) = (~a)10k; a10k | -(a10k) = (~a|a)10k = 110k, so it copies the lowest set bit into all higher bits, just as PS-OR does.

Sunday, 13 April 2014

A bitwise relational domain

This is the thing that I was referring to in my previous post where I said it didn't work out as well as I'd hoped. That's still the case. But it's not a complete failure, it's just not as good as I had hoped it might be, and it's still interesting (and, as far as I know, new).

The basic idea is really the same one as in Miné's A Few Graph-Based Relational Numerical Abstract Domains, namely a matrix where the element i,j says something about the relation between the variables i and j, and there's a closure operator that very closely resembles the Floyd–Warshall algorithm (or matrix multiplication, which comes down to the same thing). Of course saying that the difference of two variables must be in some set isn't great for bitwise operations, so my first thought was to change it to a xor, with Miné's bitfield domain (the one I've written several other posts about) as the Basis. For the closure operator, the multiplication is represented by the bitwise-xor from the bitfield domain, addition is bitwise-and from the bitfield domain. The rest is similarly easy.

The problem with that idea is that it tends finds almost no relations. It can represent some non-trivial relations, for example that one variable is (partially) equal to and other xor-ed with a constant, but it's just not enough.

The problem is that xor just doesn't remember enough. So the next idea I had was to abandon xor, and go with an operator that doesn't throw away anything, and use 4-tuples to represent that set instead of 2-tuples. Due to lack of inspiration, I named the elements of the tuple (a, b, c, d), and they work like this:

ax = 0, y = 0
bx = 0, y = 1
cx = 1, y = 0
dx = 1, y = 1
Meaning that if, say, bit 0 in a is set in the element at (x,y) then variables x and y can be even simultaneously. On the diagonal of the matrix, where x=y, b and c must both be zero, because a variable can not be unequal from itself.

The "multiplication" operator for the closure works like this:

aij &= aik & akj | bik & ckj
bij &= aik & bkj | bik & dkj
cij &= dik & ckj | cik & akj
dij &= dik & dkj | cik & bkj
For all k. This essentially "chains" i,k with k,j to get an implied constraint on i,j.

To show how powerful this is, here is what happens after asserting that z = x & y

And now some cool things can happen. For example, if you assert that z is odd, the closure operation will propagate that back to both x and y. Or if you assert that w = x | z, it will deduce that w == x.

There is a symmetry between the upper triangle and the lower triangle. They're not the same, but an element (a, b, c, d) is mirrored by an element (a, c, b, d). One of those triangles can easily be left out, to save some space. Elements that are (-1, -1, -1, -1) could also be left out, but that doesn't typically save many elements - if something non-relational is known about a variable (that is, there is an element on the diagonal that isn't (-1, 0, 0, -1)), then everything in the same row and everything in the same column will have that information in it as well. But an element (i,j) exactly equal to (ai,i & aj,j, ai,i & dj,j, di,i & aj,j, di,i & dj,j) can be left out. That removes all elements that describe a trivial relation of the kind that says that the combination of x and y must be in the Cartesian product of the sets that x and y can be in, which is completely redundant.

But it does have problems. If you xor (or add or subtract) two unconstrained variables, nothing happens. That sort of 3-variable relation isn't handled at all. This led me to explore a 3D variant of this domain (with 8-tuples), which is really great if you only consider its expressive power, but it likes to take a large amount of space, and its closure operator is very inefficient.

The algorithms I presented earlier for addition in the bitfield domain still work in this relational domain, and with just a couple of simple changes they can make use of some relational information. They want to know something about p = x ^ y and g = x & y, relational information can be available for both of them. For example, if it is known that x & y == 0 (indicated by a 4-tuple (a, b, c, 0)), g will be known to be 0, all carries will be known to be 0, and two relations will be made that look exactly the same as if you had asserted z = x | y. Of course that only works if x & y == 0 is known before the addition.

Assertions such as x & y == 0 (more generally: x ⊗ y is in (z, o) where (z, o) is an element from the unrelational bitfield domain) can be done directly (without inventing a temporary to hold x ⊗ y and asserting on it), and actually that's simpler and far more efficient.

Tuesday, 4 February 2014

Addition in the bitfield domain, alternative algorithm

This is obviously a continuation of this post, I could have published this one first but it didn't seem that interesting to me at the time. But it's simpler to understand, it's (as far as I know) "new" in the sense that it hasn't been written down anywhere public, and the thing I really wanted to post about didn't really work out as well as I had hoped.. so here it is.

The idea behind this algorithm is to take the software version of a Kogge-Stone adder, and replace all the variables by z/o pairs and all the operators by their "lifted" equivalents. So just take this:

uint p = x ^ y;
uint g = x & y;

g |= p & (g << 1);
p &= p << 1;

g |= p & (g << 2);
p &= p << 2;

g |= p & (g << 4);
p &= p << 4;

g |= p & (g << 8);
p &= p << 8;

g |= p & (g << 16);

uint result = x ^ y ^ (g << 1);
Do the replacement:
// p = x ^ y
uint pz = xz & yz | xo & yo;
uint po = xz & yo | xo & yz;
// g = x & y
uint gz = xz | yz;
uint go = xo & yo;
// g |= p & (g << 1)
gz = (gz << 1 | 1 | pz) & gz;
go = (go << 1) & po | go;
// p &= p << 1
pz = pz << 1 | 1 | pz;
po = po << 1 & po;
// g |= p & (g << 2)
gz = (~(~gz << 2) | pz) & gz;
go = (go << 2) & po | go;
// p &= p << 2
pz = ~(~pz << 2) | pz;
po = po << 2 & po;
// g |= p & (g << 4)
gz = (~(~gz << 4) | pz) & gz;
go = (go << 4) & po | go;
// p &= p << 4
pz = ~(~pz << 4) | pz;
po = po << 4 & po;
// g |= p & (g << 8)
gz = (~(~gz << 8) | pz) & gz;
go = (go << 8) & po | go;
// p &= p << 8
pz = ~(~pz << 8) | pz;
po = po << 8 & po;
// g |= p & (g << 16)
gz = (~(~gz << 16) | pz) & gz;
go = (go << 16) & po | go;
// g <<= 1
gz = ~(~gz << 1);
go = go << 1;
// z = x ^ y
uint zz = xz & yz | xo & yo;
uint zo = xz & yo | xo & yz;
// result = z ^ g
uint resz = zz & gz | zo & go;
uint reso = zz & go | zo & gz;

It got longer, but not actually more complicated - it's still the same algorithm, and you could even write it exactly the same way if you used operator overloading.

This algorithm doesn't need a special case for "empty" inputs, because the two lifted XORs at the end preserve emptiness (though the lifted left shifts in the calculation of g do not).