tag:blogger.com,1999:blog-14659869424355382082017-07-20T08:55:23.137-07:00Bits, Math and PerformanceBits and math come together to form interesting algorithms, with a small focus on performance.Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comBlogger20125tag:blogger.com,1999:blog-1465986942435538208.post-92215643066118768772016-11-24T10:10:00.001-08:002016-12-13T00:37:17.976-08:00Parallel prefix/suffix operations<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><p>After a "brief" hiatus..</p><p>Parallel prefix/suffix operations is a family of operations that follow either this template: <pre class="brush:csharp">x = x ⊕ (x >> 1);<br />x = x ⊕ (x >> 2);<br />x = x ⊕ (x >> 4);<br />x = x ⊕ (x >> 8);<br />x = x ⊕ (x >> 16);</pre>Or this template: <pre class="brush:csharp">x = x ⊕ (x << 1);<br />x = x ⊕ (x << 2);<br />x = x ⊕ (x << 4);<br />x = x ⊕ (x << 8);<br />x = x ⊕ (x << 16);</pre>Where ⊕ is typically OR or XOR. I've never seen ⊕=AND show up naturally, but it might be useful for something.</p><p>There is some disagreement (stemming from the usual "which end of an integer is the front" question) about which of them should be called the "parallel prefix" and which the "parallel suffix", for the purpose of this post I'll be calling the first one (with the shifts to the right) "parallel prefix". Either way, these operations are bit-level analogues to the more well known prefix sum, prefix max, prefix product, etc. The word "parallel" in the name refers to the bit-level parallelism, which has the same structure as the simple (not work-efficient) <a href="http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html">parallel prefix sum algorithm</a>.</p><h3>Some members of the family</h3><p>The best-known member of parallel prefix/suffix family is the parallel prefix with ⊕ = XOR (PP-XOR), which converts Gray code back to normally-ordered integers and produces the parity of an integer in the lowest bit.</p><p>The parallel <i>suffix</i> with ⊕ = XOR (PS-XOR) is a carryless multiplication by -1, which isn't very interesting by itself (probably), but gives a hint about the algebraic structure that these operations give rise to.</p><p>The parallel prefix with ⊕ = OR (PP-OR) often shows up whenever bitwise operations are interacting with ranges/intervals, since <code>PP-OR(low ^ high)</code> gives a mask of the bits that are not constrained by the interval. I have used this operation (though I did not name it then) several times in my series on <a href="http://bitmath.blogspot.nl/2012/09/calculating-lower-bound-of-bitwise-or.html">propagating bounds through bitwise operations</a>.<br/>This operation lends itself to some optimizations (I put "performance" in the title, which I admit I've mostly ignored), for example on x64 you could implement it as <pre> lzcnt rax, rax ; input in rax<br /> sbb rdx, rdx<br /> not rdx<br /> shrx rax, rdx, rax</pre>Or: <pre> lzcnt rax, rax<br /> mov edx, 64<br /> sub edx, eax<br /> or rax, -1<br /> bzhi rax, rax, rdx</pre>The each have their pros and cons, and hopefully there's a better way to do it, but I couldn't really find any. I made sure to avoid the infamous false dependency that <code>popcnt</code>, <code>tzcnt</code> and <code>lzcnt</code> all share on any Intel processor that implements them to date. Probably the biggest problem here is that both versions require BMI2, that can be avoided, eg (suggested by Jasper Neumann) <pre> xor edx, edx // provide a zero register for cmov<br /> bsr ecx, eax<br /> mov eax, -1<br /> not ecx // flags not modified<br /> cmovz eax, edx<br /> shr eax, cl</pre></p><h3>Properties/structure</h3><p>To start with the obvious, the prefix and suffix versions are exactly each others mirror image. So I'm going to look just at the suffix part of the family, for the prefix part just mirror everything.</p><h4>PS-XOR(x) is clmul(x, -1)</h4><p>Since every step is a clmul by <code>1 + 2<sup>k</sup></code> and if you clmul those constants together you get -1 (the two's complement -1, not the -1 in the ring formed by XOR and clmul over bitvectors of length k, which would just be 1 again), or from a more intuitive angle, what the PS-XOR is supposed to do in the first place is XOR each bit into all higher bits. So it inherits the properties, such as inversibility (-1 is odd). The inverse of PS-XOR is <code>x ^ (x << 1)</code>.</p><p>It also inherits that <code>PS-XOR(x) ^ PS-XOR(y) == PS-XOR(x ^ y)</code> from the distributivity of clmul.</p><p>Since clmul is commutative and associative, the steps in PS-XOR can be reordered.</p><h4>PS-OR(x) is or_mul(x, -1)</h4><p>This isn't as nice (XOR and clmul form a ring structure) since OR doesn't form a group because it has no negation (this is just a fancy way of saying that you can't un-OR numbers, in the way you can un-ADD by subtraction or un-XOR with XOR again) and so it doesn't extended to a ring, but it can be extended into a semiring by defining the following multiplication: <pre class="brush:csharp">static uint or_mul(uint a, uint b)<br />{<br /> uint r = 0;<br /> for (int i = 0; i < 32; i++)<br /> {<br /> if ((a & 1) == 1)<br /> r |= b; // the only difference is here<br /> a >>= 1;<br /> b <<= 1;<br /> }<br /> return r;<br />}</pre>It's not fancy math notation but you get the idea.</p><p>This (with OR) forming a commutative semiring (proof "left as an exercise to the reader"), it has some nice properties: <ul><li><code>or_mul(a, b) == or_mul(b, a)</code> commutivity</li><li><code>or_mul(a, or_mul(b, c)) == or_mul(or_mul(a, b), c)</code> associativity</li><li><code>or_mul(a, b | c) == or_mul(a, b) | or_mul(a, c)</code> left distributivity</li><li><code>or_mul(a | b, c) == or_mul(a, c) | or_mul(b, c)</code> right distributivity</li></ul>But, of course, no multiplicative inverses. Except when multiplying by 1, but that doesn't count since it's the multiplicative identity.</p><p><code>PS-OR(x) == or_mul(x, -1)</code>, and the individual steps of the PS-OR are of the form <code>x = or_mul(x, 1 + 2<sup>k</sup>)</code>. The or_mul of all those constants together is, unsurprisingly, -1 (though this notation is now slightly odd since this was all a semiring, I mean the element that has all bits set).</p><p>And again, PS-OR inherits <code>PS-OR(x) | PS-OR(y) = PS-OR(x | y)</code> from distributivity.</p><p>And again, since or_mul is commutative and associative, the steps in PS-OR can be reordered.</p><h4>PS-OR(x) is also <code>x | -x</code></h4><p>This one doesn't have a nice mirrored equivalent, just the obvious one where you insert a bit-reversal before and after. It also doesn't have an obvious relation to or_mul. As for why it's the case, given that (in string notation) -(a10<sup>k</sup>) = (~a)10<sup>k</sup>; a10<sup>k</sup> | -(a10<sup>k</sup>) = (~a|a)10<sup>k</sup> = 1<sup>∞</sup>10<sup>k</sup>, so it copies the lowest set bit into all higher bits, just as PS-OR does.</p></span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-32212003611256852252014-04-13T10:27:00.001-07:002014-04-13T10:42:24.707-07:00A bitwise relational domain<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><p>This is the thing that I was referring to in my <a href="http://bitmath.blogspot.com/2014/02/addition-in-bitfield-domain-alternative.html">previous post</a> where I said it didn't work out as well as I'd hoped. That's still the case. But it's not a complete failure, it's just not as good as I had hoped it might be, and it's still interesting (and, as far as I know, new).</p> <p>The basic idea is really the same one as in Miné's <a href="http://www.di.ens.fr/~mine/publi/article-mine-sas02.pdf">A Few Graph-Based Relational Numerical Abstract Domains</a>, namely a matrix where the element <code>i,j</code> says something about the relation between the variables <code>i</code> and <code>j</code>, and there's a closure operator that very closely resembles the Floyd–Warshall algorithm (or matrix multiplication, which comes down to the same thing). Of course saying that the difference of two variables must be in some set isn't great for bitwise operations, so my first thought was to change it to a xor, with Miné's <a href="http://www.di.ens.fr/~mine/publi/article-mine-wing12.pdf">bitfield domain</a> (the one I've written several other posts about) as the Basis. For the closure operator, the multiplication is represented by the bitwise-xor from the bitfield domain, addition is bitwise-and from the bitfield domain. The rest is similarly easy.</p> <p>The problem with that idea is that it tends finds almost no relations. It can represent some non-trivial relations, for example that one variable is (partially) equal to and other xor-ed with a constant, but it's just not enough.</p> <p>The problem is that xor just doesn't remember enough. So the next idea I had was to abandon xor, and go with an operator that doesn't throw away anything, and use 4-tuples to represent that set instead of 2-tuples. Due to lack of inspiration, I named the elements of the tuple (a, b, c, d), and they work like this: <table><tr><td width="25%">a</td><td>x = 0, y = 0</td></tr><tr><td>b</td><td>x = 0, y = 1</td></tr><tr><td>c</td><td>x = 1, y = 0</td></tr><tr><td>d</td><td>x = 1, y = 1</td></tr></table>Meaning that if, say, bit 0 in a is set in the element at (x,y) then variables x and y can be even simultaneously. On the diagonal of the matrix, where x=y, b and c must both be zero, because a variable can not be unequal from itself.</p> <p>The "multiplication" operator for the closure works like this: <pre>aij &= aik & akj | bik & ckj<br />bij &= aik & bkj | bik & dkj<br />cij &= dik & ckj | cik & akj<br />dij &= dik & dkj | cik & bkj</pre>For all <code>k</code>. This essentially "chains" <code>i,k</code> with <code>k,j</code> to get an implied constraint on <code>i,j</code>.</p> <p>To show how powerful this is, here is what happens after asserting that <code>z = x & y</code><style>#t_00 { border: 1px solid black; border-collapse: collapse; } #t_00 tr { border: 1px solid black; } #t_00 td { border: 1px solid black; } </style><table id="t_00"><tr><td></td><td><center>x</center></td><td><center>y</center></td><td><center>z</center></td></tr><tr><td>x</td><td>-1,0,0,-1</td><td>-1,-1,-1,-1</td><td>-1,-1,0,-1</td></tr><tr><td>y</td><td>-1,-1,-1,-1</td><td>-1,0,0,-1</td><td>-1,-1,0,-1</td></tr><tr><td>z</td><td>-1,0,-1,-1</td><td>-1,0,-1,-1</td><td>-1,0,0,-1</td></tr></table>And now some cool things can happen. For example, if you assert that <code>z</code> is odd, the closure operation will propagate that back to both <code>x</code> and <code>y</code>. Or if you assert that <code>w = x | z</code>, it will deduce that <code>w == x</code>.</p> <p>There is a symmetry between the upper triangle and the lower triangle. They're not the same, but an element (a, b, c, d) is mirrored by an element (a, c, b, d). One of those triangles can easily be left out, to save some space. Elements that are (-1, -1, -1, -1) could also be left out, but that doesn't typically save many elements - if something non-relational is known about a variable (that is, there is an element on the diagonal that isn't (-1, 0, 0, -1)), then everything in the same row and everything in the same column will have that information in it as well. But an element <code>(i,j)</code> exactly equal to <code>(a<sub>i,i</sub> & a<sub>j,j</sub>, a<sub>i,i</sub> & d<sub>j,j</sub>, d<sub>i,i</sub> & a<sub>j,j</sub>, d<sub>i,i</sub> & d<sub>j,j</sub>)</code> can be left out. That removes all elements that describe a trivial relation of the kind that says that the combination of x and y must be in the Cartesian product of the sets that x and y can be in, which is completely redundant.</p> <p>But it does have problems. If you xor (or add or subtract) two unconstrained variables, nothing happens. That sort of 3-variable relation isn't handled at all. This led me to explore a 3D variant of this domain (with 8-tuples), which is really great if you only consider its expressive power, but it likes to take a large amount of space, and its closure operator is very inefficient.</p> <p>The algorithms I <a href="http://bitmath.blogspot.com/2013/08/addition-in-bitfield-domain.html">presented</a> <a href="http://bitmath.blogspot.com/2014/02/addition-in-bitfield-domain-alternative.html">earlier</a> for addition in the bitfield domain still work in this relational domain, and with just a couple of simple changes they can make use of some relational information. They want to know something about <code>p = x ^ y</code> and <code>g = x & y</code>, relational information can be available for both of them. For example, if it is known that <code>x & y == 0</code> (indicated by a 4-tuple (a, b, c, 0)), <code>g</code> will be known to be 0, all carries will be known to be 0, and two relations will be made that look exactly the same as if you had asserted <code>z = x | y</code>. Of course that only works if <code>x & y == 0</code> is known <i>before</i> the addition.</p> <p>Assertions such as <code>x & y == 0</code> (more generally: x ⊗ y is in (z, o) where (z, o) is an element from the unrelational bitfield domain) can be done directly (without inventing a temporary to hold x ⊗ y and asserting on it), and actually that's simpler and far more efficient.</p> </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-37904348099526440692014-02-04T01:55:00.000-08:002014-02-04T01:56:18.134-08:00Addition in the bitfield domain, alternative algorithm<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><p>This is obviously a continuation of <a href="http://bitmath.blogspot.com/2013/08/addition-in-bitfield-domain.html">this post</a>, I could have published this one first but it didn't seem that interesting to me at the time. But it's simpler to understand, it's (as far as I know) "new" in the sense that it hasn't been written down anywhere public, and the thing I really wanted to post about didn't really work out as well as I had hoped.. so here it is.</p><p>The idea behind this algorithm is to take the software version of a <a href="http://en.wikipedia.org/wiki/Kogge-Stone_Adder">Kogge-Stone adder</a>, and replace all the variables by z/o pairs and all the operators by their "lifted" equivalents. So just take this: <pre class="brush:csharp">uint p = x ^ y;<br />uint g = x & y;<br /><br />g |= p & (g << 1);<br />p &= p << 1;<br /><br />g |= p & (g << 2);<br />p &= p << 2;<br /><br />g |= p & (g << 4);<br />p &= p << 4;<br /><br />g |= p & (g << 8);<br />p &= p << 8;<br /><br />g |= p & (g << 16);<br /><br />uint result = x ^ y ^ (g << 1);</pre>Do the replacement: <pre class="brush:csharp">// p = x ^ y<br />uint pz = xz & yz | xo & yo;<br />uint po = xz & yo | xo & yz;<br />// g = x & y<br />uint gz = xz | yz;<br />uint go = xo & yo;<br />// g |= p & (g << 1)<br />gz = (gz << 1 | 1 | pz) & gz;<br />go = (go << 1) & po | go;<br />// p &= p << 1<br />pz = pz << 1 | 1 | pz;<br />po = po << 1 & po;<br />// g |= p & (g << 2)<br />gz = (~(~gz << 2) | pz) & gz;<br />go = (go << 2) & po | go;<br />// p &= p << 2<br />pz = ~(~pz << 2) | pz;<br />po = po << 2 & po;<br />// g |= p & (g << 4)<br />gz = (~(~gz << 4) | pz) & gz;<br />go = (go << 4) & po | go;<br />// p &= p << 4<br />pz = ~(~pz << 4) | pz;<br />po = po << 4 & po;<br />// g |= p & (g << 8)<br />gz = (~(~gz << 8) | pz) & gz;<br />go = (go << 8) & po | go;<br />// p &= p << 8<br />pz = ~(~pz << 8) | pz;<br />po = po << 8 & po;<br />// g |= p & (g << 16)<br />gz = (~(~gz << 16) | pz) & gz;<br />go = (go << 16) & po | go;<br />// g <<= 1<br />gz = ~(~gz << 1);<br />go = go << 1;<br />// z = x ^ y<br />uint zz = xz & yz | xo & yo;<br />uint zo = xz & yo | xo & yz;<br />// result = z ^ g<br />uint resz = zz & gz | zo & go;<br />uint reso = zz & go | zo & gz;</pre></p><p>It got longer, but not actually more complicated - it's still the same algorithm, and you could even write it exactly the same way if you used operator overloading.</p><p>This algorithm doesn't need a special case for "empty" inputs, because the two lifted XORs at the end preserve emptiness (though the lifted left shifts in the calculation of g do not).</p></span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-31849968245946718292013-09-28T07:10:00.000-07:002013-09-28T07:14:56.953-07:00Determining the cardinality of a set described by a mask and bounds<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script type="text/javascript">MathJax.Hub.Config({ "HTML-CSS": { scale: 175 } }); </script><span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><p>In other words, calculating this \begin{equation} | \left\{ x | (x \& m) = v \land x \ge a \land x \le b \right\} | \end{equation}</p><p>A <a href="http://en.wikipedia.org/wiki/Binary_decision_diagram">BDD</a> (which haroldbot uses to solve this) has no trouble with that at all (form the query like <code>unsigned solve x & m = v && x >= a && x <= b</code>), but the structure the problem is so nice that I thought it should be possible to do better. And it is, though the solution I'm about to present is probably far from optimal. I don't know how to significantly improve on it, but I just get a very non-scientific gut-feeling that there should be a fundamentally better way to do it.</p><p>The cardinality of the set without the bounds (only the mask) is obviously trivial to compute (as <code>1 << popcnt(~m)</code>), and the bounds divide that set in three parts: items that are lower than the lower bound ("left part"), items that are higher than the upper bound ("right part")), and items that are actual set of interest ("middle part"). The main idea it is built on is that it's relatively easy to solve the problem if it didn't have a lower bound and the upper bound was a power of two, with a formula roughly similar to the one above. Using that, all the powers of two that fit before lower bound can be counted from high to low, giving the cardinality of the "left part". The same can be done for the "right part", actually in exactly the same way, by complementing the mask and the upper bound. Obviously the cardinality of the "middle part" can be computed from this.</p><p>And here's the code. It doesn't like it when the cardinality is 2<sup>32</sup>, and watch out for weird corner cases such as when the lower bound is bigger than the upper bound (why would you even do that?). It usually works, that's about as much as I can say - I didn't prove it correct. <pre class="brush: csharp">static uint Cardinality(uint m, uint v, uint a, uint b)<br />{<br /> // count x such that a <= x <= b && (x & m) == v<br /> // split in three parts:<br /> // left = 0 <= x < a<br /> // right = b < x <= -1<br /> // the piece in the middle is (1 << popcnt(~m)) - left - right<br /> uint left = 0;<br /> uint x = 0;<br /> for (int i = 32 - 1; i >= 0; i--)<br /> {<br /> uint mask = 1u << i;<br /> if (x + mask <= a)<br /> {<br /> x += mask;<br /> uint mask2 = 0 - mask;<br /> if ((x - 1 & m & mask2) == (v & mask2))<br /> { <br /> uint amount = 1u << popcnt(~(m | mask2));<br /> left += amount;<br /> }<br /> }<br /> }<br /> uint right = 0;<br /> uint y = 0;<br /> for (int i = 32 - 1; i >= 0; i--)<br /> {<br /> uint mask = 1u << i;<br /> if (y + mask <= ~b)<br /> {<br /> y += mask;<br /> uint mask2 = 0 - mask;<br /> if ((y - 1 & m & mask2) == (~v & m & mask2))<br /> {<br /> uint amount = 1u << popcnt(~(m | mask2));<br /> right += amount;<br /> }<br /> }<br /> }<br /> uint res = (uint)((1UL << popcnt(~m)) - (left + right));<br /> return res;<br />}</pre>The loops can be merged of course, but for clarity they're separate here.</p><p>If you have any improvements, please let me know.</p></span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-21592148575471653172013-08-22T12:13:00.004-07:002013-08-22T12:13:54.288-07:00Addition in the bitfield domain<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><p>Bitfield domains are abstract domains that track the possible values of every bit independently. For example, they can express things like "all values where the third bit is set".</p><p>There are two obvious "simplest useful" bitfield domains: <ol><li>The first one is probably the most obvious one: a bitvector <code>m</code> that has a 1 at every position that has a known value, and a bitvector <code>v</code> with the values of the known bits (ones at unknown positions are ignored). This domain turns out to be largely inconvenient, for several reasons.</li><li>The second one is perhaps slightly less intuitive: the one defined in <a href="http://www.di.ens.fr/~mine/publi/article-mine-wing12.pdf">Miné-WING12, section 2.8</a>, namely a bitvector <code>z</code> that has a 1 at every position that <i>can</i> be zero, and a bitvector <code>o</code> with a 1 for every position that can be one. This domain turns out to be pretty good, but the "<code>o</code>" naming is sometimes confusing.</li></ol></p><p>So, what's so bad about the first domain?</p><p>Firstly, this: <table cellspacing="10"><tr><td/><td>Empty set</td><td>Universal set</td></tr><tr><td>first</td><td><span style="color:red">not representable</span></td><td><code>m = 0 (r)</code></td></tr><tr><td>second</td><td><code>(z | o) != -1 (r)</code></td><td><code>(z & o) == -1</code></td></tr></table>The entries marked (r) have redundant representations, which is usually not a problem, but it can be a little annoying at times. This table means that in general, they're not just two different representations of the same thing. Still, every value of the first domain can be converted to the second domain by <code>z = ~m | ~v; o = ~m | (m & v)</code>, and most values of the second domain can be converted to the first domain by <code>m = z ^ o; v = o</code> or <code>m = z ^ o; v = ~z</code>, though that converts empty sets into non-empty sets.</p><p>But that's not all. The other problem with the first domain is the considerably more annoying implementation of some abstract operations. For example <dl><dt>bitwise AND (first domain)</dt><dd><code>m = (a.m & b.m) | (a.m & ~a.v) | (b.m & ~b.v);<br/>v = a.v & b.v;</code></dd><dt>bitwise AND (second domain)</dt><dd><code>z = a.z | b.z;<br/>o = a.o & b.o;</code></dd><dt>bitwise OR (first domain)</dt><dd><code>m = (a.m & b.m) | (a.m & a.v) | (b.m & b.v);<br/>v = a.v | b.v;</code></dd><dt>bitwise OR (second domain)</dt><dd><code>z = a.z & b.z;<br/>o = a.o | b.o;</code></dd><dt>Union (first domain)</dt><dd><code>m = a.m & b.m & (~a.v ^ b.v);<br/>v = a.v;</code></dd><dt>Union (second domain)</dt><dd><code>z = a.z | b.z;<br/>o = a.o | b.o;</code></dd><dt>Intersect (first domain)</dt><dd><code>if ((a.v ^ b.v) & a.m & b.m) Empty<br/>m = a.m | b.m;<br/>v = a.v | b.v;</code></dd><dt>Intersect (second domain)</dt><dd><code>z = a.z & b.z;<br/>o = a.o & b.o;</code></dd></dl>But then, sometimes it doesn't matter at all, for example <dl><dt>left shift by a constant (first domain)</dt><dd><code>m = ~(~m << n);<br/>v = v << n;</code></dd><dt>left shift by a constant (second domain)</dt><dd><code>z = ~(~z << n);<br/>o = o << n;</code></dd></dl>And rarely, the first domain is clearly better <dl><dt>bitwise XOR (first domain)</dt><dd><code>m = a.m & b.m;<br/>v = a.v ^ b.v;</code></dd><dt>bitwise XOR (second domain)</dt><dd><code>z = (a.z & b.z) | (a.o & b.o);<br/>o = (a.o & b.z) | (a.z & b.o);</code></dd></dl>And then there's bitwise complement, which is special. In the first domain, it only costs a single "not". In the second domain, you just swap the <code>z</code> with the </code>o</code>. So which domain is better for bitwise complement? It depends. If you really have to <i>do</i> the swap, that's probably worse (hey Intel and AMD, please implement <code>xchg</code> with register renaming like you do with <code>fxch</code>). But you may be able to do a "virtual swap", where you just swap the meanings of the fields instead of their contents. That's free, but it's usually not applicable. However, you can use it to define subtraction in terms of addition at no extra cost, in the worst case by swapping a couple of things manually.</p><p>Incidentally, one interesting thing about the first domain is that the operations for <code>v</code> is the same as the concrete operation that you're abstracting. That's because whatever garbage happens to be in the unknown bits, by definition they can not affect bits that are known in the result, so the whole "bit is unknown"-business can be ignored for <code>v</code> and instead all the difficulty is moved to the computation of <code>m</code>.</p><p>That's all very nice, but the interesting part is:</p><h1>Addition, as promised.</h1><p>All abstract operations up till now were very simple. Addition, however, is a little tricky. First I'll describe the addition algorithm for the first domain .. and then I won't describe one for the second domain (it can easily be defined in terms of the addition algorithm for the first domain though).</p><p>It's based on the concepts of a <a href="http://en.wikipedia.org/wiki/Carry-lookahead_adder">carry look-ahead adder</a>, with "propagates" and "generates" information, but of course it's not actually a CLA because the input isn't fully known and the result is a mask of where the result of the addition would be known.</p><p>The basic idea, that the entire algorithm is based around, is that if a carry is known somewhere, a whole stretch of positions that would propagate can be selected from the point of the known carry upwards. That's seems very easy: <code>(p + g & ~p) - g</code> but that's not quite right and most of the algorithm have to do with fixing that up.</p><p>For example, if there is more than one generating bit in the same run of propagators, then things go south. But that can be fixed, just select only the rightmost ends of all runs of generators:<br/><code>g & ~(g << 1)</code></p><p>That actually doesn't even solve the problem though. The runs of generators are broken now, but a single run of propagators could still span all of them. So instead of checking whether <code>p + g</code> has bits that weren't in <code>p</code>, check whether it has bits there are not in <code>p</code> or that are <i>both</i> in <code>p</code> <i>and</i> in <code>g</code> (the new <code>g</code>).</p><p>The fixups just keep piling up really, so I'm almost certain that at least a couple of operations can be shaved off with some further cleverness. Anyway, here's what I have now: <pre class="brush:csharp">av &= am;<br />bv &= bm;<br />uint an = ~av & am; // not set in a<br />uint bn = ~bv & bm; // not set in b<br />uint g0 = an & bn; // generates a 0-carry<br />uint g1 = av & bv; // generates a 1-carry<br />uint p = an ^ bn; // propagates a carry<br />uint pg1 = av | bv; // propagates or generates a 1-carry<br />uint g0l = ~g0 & ((g0 << 1) | 1); // the bit directly to the left of every run of 1s<br />uint g1f = g1 & ~(g1 << 1); // the lowest bit of every run of 1s<br />uint nc = (p + g0l & ~p) - g0l | g0; // bits that have a carry-out of 0<br />uint m1 = ~pg1 | (pg1 & g1f);<br />uint c = (pg1 + g1f & m1) - g1f; // bits that have a carry-out of 1<br />uint cm = (c | nc) << 1 | 1; // bits that have a known carry-in<br />uint m = cm & am & bm; // result is known if carry-in and both inputs are known<br />uint v = av + bv;</pre></p><p>Subtraction (for both discussed bitfield domains) can be implemented using the identity<br/><code>a - b = ~(~a + b)</code>, either by actually doing the complements (which are trivial don't lose any information) or by rewriting a couple of things in the addition algorithm.</p></span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-72980651341267922382013-08-10T07:50:00.000-07:002015-07-10T11:25:30.758-07:00Announcing haroldbot<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">haroldbot <i>was</i> an ircbot (hence the name) that solves some bitmath problems. The title is actually a lie - haroldbot has been around for a while now. But now it finally got its own website.<br/><span style="font-size: 150%"><a href="http://haroldbot.nl">haroldbot.nl</a></span><br/>Check it out, if you work with bits you will probably find this useful. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-51070387291933529082013-05-30T09:54:00.000-07:002017-02-26T10:20:18.263-08:00Carryless multiplicative inverse<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">Note: this post is neither about the normal multiplicative inverse, nor the modular multiplicative inverse. <a href="http://bitmath.blogspot.com/2012/09/divisibility-and-modular-multiplication.html">This other post</a> has information about the modular multiplicative inverse, which might be what you were looking for.<br/><br/>Mathematicians may call carryless multiplication "multiplication in GF(2^n)", but that doesn't explain how it works - recall the shift-and-add algorithm for multiplication: <pre class="brush: csharp">static uint mul(uint a, uint b)<br />{<br /> uint r = 0;<br /> while (b != 0)<br /> {<br /> if ((a & 1) != 0)<br /> r += b;<br /> a >>= 1;<br /> b <<= 1;<br /> }<br /> return r;<br />}</pre>Carryless multiplication is a very simple variation on that: do the addition without carries. That's just a XOR. <pre class="brush: csharp">static uint cl_mul(uint a, uint b)<br />{<br /> uint r = 0;<br /> while (b != 0)<br /> {<br /> if ((a & 1) != 0)<br /> r ^= b; // carryless addition is xor<br /> a >>= 1;<br /> b <<= 1;<br /> }<br /> return r;<br />}</pre>It has some applications in complicated cryptography related algorithms, but it also seems like this should be an interesting and powerful operation when working with bits, and it may well be, but I know of almost no uses for it (<del>besides, Intel's implementation is so slow that it often wouldn't help</del> Intel made it over twice as fast in Haswell). But anyway, let's just start with its basic properties: like normal multiplication, it's commutative and associative. It's also distributive, but over xor instead of over addition. None of this is very surprising.<br/><br/>As an aside, using associativity, it can be shown that the parallel suffix with XOR (which does have some known uses in bitmath, for example in implementing <a href="http://programming.sirrida.de/bit_perm.html#c_e">compress_right</a> in software), code shown below, is equivalent to a carryless multiplication by -1. <pre>// parallel suffix with XOR<br />x ^= x << 1;<br />x ^= x << 2;<br />x ^= x << 4;<br />x ^= x << 8;<br />x ^= x << 16;</pre>Every step is clearly a carryless multiplication, by 3, 5, 17, 257, and 65537 respectively. So it's equivalent to:<br/><code>clmul(clmul(clmul(clmul(clmul(x, 3), 5), 17), 257), 65537)</code> which can be rearranged (using associativity) to:<br/><code>clmul(x, clmul(clmul(clmul(clmul(3, 5), 17), 257), 65537))</code> which works out to <code>clmul(x, -1)</code>. Of course it was supposed to work out that way, because every bit of the result should be the XOR of all bits up to (and including) that bit, but it's nice that it also follows easily from a basic property. Incidentally if you have a full-width carryless multiplication, multiplying by -1 also computes the parallel <i>prefix</i> with XOR in the upper bits (the upper bit of the low word, which is the parity of the input, is shared by the suffix and the prefix.)<br/><br/>Carryless multiplication also shares an other property with normal multiplication: there are multiplicative inverses modulo 2<sup>n</sup> (and also modulo other numbers, but 2<sup>n</sup> is of particular interest since we're working in that by default anyway). Again there are only inverses for odd numbers, and it's equally obvious (as for normal multiplication) why that should be so - an even multiplier will throw out at least one high order bit. First, here's an example of carrlessly multiplying <code>x</code> by -1 and then carrylessly multiplying that by 3.<br/><pre>x = {d}{c}{b}{a} // the letters are bits<br />y = cl_mul(x, -1) = {d^c^b^a}{c^b^a}{b^a}{a}<br />z = cl_mulinv(-1) = 0011<br />cl_mul(y, z) = {d^c^b^a ^ c^b^a}{c^b^a ^ b^a}{b^a ^ a}{a} = {d}{c}{b}{a}</pre>Ok, so that worked out well, and it also gives part the answer to exercise 3 in chapter 5 of Hacker's Delight (about whether parallel prefix/suffix with XOR is invertible and how) because a carryless multiplication by -1 is the same as the parallel suffix with XOR. A carryless multiplication of <code>y</code> by 3 is of course just <code>y ^ (y << 1)</code>.<br/><br/>But back to actually computing the inverse. The inverse had better be odd, so bit 0 is already known, and for all the other bits follow these steps <ol><li>if the remainder is 1, stop</li><li>if bit <code>k</code> is 0, go to step 5</li><li>set bit <code>k</code> in the inverse</li><li>xor the remainder with <code>input << k</code></li><li>increment k and go to step 1</li></ol>Step 4 always resets the offending bit because the input had to be odd, so it's obvious that the remainder always ends up being 1 eventually, and so the algorithm always terminates. Moreover, even in the worst case it only has to process every bit but one, and continuing after the remainder becomes 1 simply does nothing, so step 1 could read "if <code>k</code> is 32" (or some other number, depending on how many bits your ints are wide), which is easier to unroll and better suited for a hardware implementation (not that I've seen this operation implemented in hardware anywhere).<br/>For example, in C# it could look like this: <pre class="brush: csharp">static uint clmulinv(uint x)<br />{<br /> uint inv = 1;<br /> uint rem = x;<br /> for (int i = 1; i < 32; i++)<br /> {<br /> if (((rem >> i) & 1) != 0)<br /> {<br /> rem ^= x << i;<br /> inv |= 1u << i;<br /> }<br /> }<br /> return inv;<br />}</pre><br/>A variation of the algorithm to find a multiplicative inverse modulo a power of two (see <tt>inv</tt> <a href="http://bitmath.blogspot.com/2012/09/divisibility-and-modular-multiplication.html">here</a>) also works, which is useful when clmul is fast: <pre class="brush: csharp">static uint clmulinv(uint d)<br />{<br /> uint x = 1;<br /> for (int i = 0; i < 5; i++)<br /> {<br /> x = clmul(x, clmul(x, d));<br /> }<br /> return x;<br />}</pre><br/><br/>Some sample inverses <pre>1 0x00000001<br />3 0xFFFFFFFF<br />5 0x55555555<br />7 0xDB6DB6DB<br />9 0x49249249<br />11 0x72E5CB97<br />13 0xD3A74E9D<br />15 0x33333333</pre></span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com2tag:blogger.com,1999:blog-1465986942435538208.post-25656963896824613342013-04-11T07:45:00.000-07:002013-04-11T09:18:49.563-07:00Improving bounds when some bits have a known value<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script type="text/javascript">MathJax.Hub.Config({ "HTML-CSS": { scale: 175 } }); </script><span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">This problem is closely related the series of problems discussed in <a href="http://bitmath.blogspot.com/2012/09/calculating-lower-bound-of-bitwise-or.html">calculating the lower bound of the bitwise OR of two bounded variables</a> (and some of the posts after that one), and the algorithm is very closely related, too. The question in this post is, suppose some bits may be known to be zero and some may be known to be one, is there a better lower/upper bound than the given one, and if so, what is it? That is, calculate \begin{equation} \min _{x \in [a, b] \wedge (x | \sim z) = x \wedge (x \& \sim o) = 0 } x \end{equation} and \begin{equation} \max _{x \in [a, b] \wedge (x | \sim z) = x \wedge (x \& \sim o) = 0 } x \end{equation} where <code>z</code> is a bitmask containing the bits that are allowed to be 0, and <code>o</code> is a bitmask containing the bits that are allowed to be 1.<br/><br/>The idea behind the algorithms is to do a binary search (the one-sided variant) over the numbers that the masks allow, for the lowest value bigger than or equal to the original lower bound (or smaller than or equal to the original upper bound, for the new upper bound). Just as in the case of propagating bounds through XOR, it may take more than one step, so there aren't many shortcuts. I called them both "reduce" even though <code>ReduceMin</code> actually increases the value, because their purpose is to reduce the range <code>[min, max]</code>. <pre class="brush: csharp">static uint ReduceMin(uint min, uint z, uint o)<br />{<br /> uint mask = z & o; // note: mask is a subset of r<br /> uint r = o;<br /> uint m = 0x80000000 >> nlz(mask);<br /> while (m != 0)<br /> {<br /> // reset the bit if it can be freely chosen<br /> uint newr = r ^ (m & mask);<br /> if (newr >= min)<br /> // keep the change if still within bounds<br /> r = newr;<br /> m >>= 1;<br /> }<br /> return r;<br />}</pre><pre class="brush: csharp">static uint ReduceMax(uint max, uint z, uint o)<br />{<br /> uint mask = z & o;<br /> uint r = ~z;<br /> uint m = 0x80000000 >> nlz(mask);<br /> while (m != 0)<br /> {<br /> // set the bit if it can be freely chosen<br /> uint newr = r | (m & mask);<br /> if (newr <= max)<br /> // keep the change if still within bounds<br /> r = newr;<br /> m >>= 1;<br /> }<br /> return r;<br />}</pre><br/>There is one shortcut (that I know of): using <code>nlz</code> on every iteration, thereby skipping iterations where the current bit isn't even changed. With the implementation of <code>nlz</code> I was working with, that wasn't worth it, so whether it's actually a real shortcut or not is up for debate.<br/><br/>Occasionally the new lower bound can be higher than the new upper bound, that means the set of values was actually empty. If you were working with clockwise intervals, that changes to "if the new bounds aren't ordered the same way as the old ones" - ie if the interval was proper and the new one isn't or vice versa, the set is empty. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-85314416871476424592012-11-29T03:47:00.001-08:002013-04-11T07:46:44.024-07:00Tesseral arithmetic - useful snippets<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><small>This post doesn't introduce anything new, and is, in my opinion, boring. Feel free to skip.</small><br/><br/>My <a href="http://bitmath.blogspot.com/2012/11/tesseral-arithmetic.html">previous post</a> didn't have too many useful snippets in it (mainly useful techniques to make your own snippets), and I thought I could improve on that. This post is not a good read in isolation - it's probably a good idea to read my previous post first, if you haven't already.<br/><br/>Tesseral addition (see previous post) was nice, but very often you only need to increment/decrement one dimension of a coordinate (for example when iterating over a portion of a Z-ordered grid in reading order), equivalent to adding/subtracting <code>(1, 0)</code> or <code>(0, 1)</code> to/from a coordinate. Since only one part of the coordinate changes, only about half as much code is necessary. Also, since the thing being added to the coordinate is a constant, one of the masking operations can be merged with it. <pre class="brush: csharp">static uint IncX(uint z)<br />{<br /> uint xsum = (z | 0xAAAAAAAA) + 1;<br /> return (xsum & 0x55555555) | (z & 0xAAAAAAAA);<br />}<br /><br />static uint IncY(uint z)<br />{<br /> uint ysum = (z | 0x55555555) + 2;<br /> return (ysum & 0xAAAAAAAA) | (z & 0x55555555);<br />}<br /><br />static uint DecX(uint z)<br />{<br /> uint xsum = (z & 0x55555555) - 1;<br /> return (xsum & 0x55555555) | (z & 0xAAAAAAAA);<br />}<br /><br />static uint DecY(uint z)<br />{<br /> uint ysum = (z & 0xAAAAAAAA) - 2;<br /> return (ysum & 0xAAAAAAAA) | (z & 0x55555555);<br />}</pre><br/>My previous post only had <code>TesseralMin</code>, not the corresponding <code>TesseralMax</code>, so here you go: <pre class="brush: csharp">public static uint TesseralMax(uint z, uint w)<br />{<br /> uint xdiff = (z & 0x55555555) - (w & 0x55555555);<br /> uint ydiff = (z >> 1 & 0x55555555) - (w >> 1 & 0x55555555);<br /> uint maskx = (uint)((int)xdiff >> 31);<br /> uint masky = (uint)((int)ydiff >> 31);<br /> uint xmin = (~maskx & z) | (maskx & w);<br /> uint ymin = (~masky & z) | (masky & w);<br /> return new T((xmin & 0x55555555) | (ymin & 0xAAAAAAAA));<br />}</pre>Note that the only difference is that the mask and the complemented mask have switched places.<br/><br/>This <code>TesseralMax</code> and the <code>TesseralMin</code> from the previous post can be combined with the increments and decrements (and with full tesseral addition, but that's less frequently useful) to form saturating increments and decrements, useful for sampling around a position on a Z-ordered grid without getting out of bounds. <pre class="brush: csharp">static uint IncXSat(uint z, uint xmax)<br />{<br /> uint xsum = ((z | 0xAAAAAAAA) + 1) & 0x55555555;<br /> uint xdiff = xsum - xmax;<br /> uint maskx = (uint)((int)xdiff << 1 >> 31);<br /> uint xsat = (maskx & xsum) | (~maskx & xmax);<br /> return xsat | (z & 0xAAAAAAAA);<br />}<br /><br />static uint IncYSat(uint z, uint ymax)<br />{<br /> uint ysum = ((z | 0x55555555) + 2) & 0xAAAAAAAA;<br /> uint ydiff = ysum - ymax;<br /> uint masky = (uint)((int)ydiff >> 31);<br /> uint ysat = (masky & ysum) | (~masky & ymax);<br /> return ysat | (z & 0x55555555);<br />}<br /><br />static uint DecXSat(uint z, uint xmin)<br />{<br /> uint xsum = ((z & 0x55555555) - 1) & 0x55555555;<br /> uint xdiff = xsum - xmin;<br /> uint maskx = (uint)((int)xdiff << 1 >> 31);<br /> uint xsat = (~maskx & xsum) | (maskx & xmin);<br /> return xsat | (z & 0xAAAAAAAA);<br />}<br /><br />static uint DecYSat(uint z, uint ymin)<br />{<br /> uint ysum = ((z & 0xAAAAAAAA) - 2) & 0xAAAAAAAA;<br /> uint ydiff = ysum - ymin;<br /> uint masky = (uint)((int)ydiff >> 31);<br /> uint ysat = (~masky & ysum) | (masky & ymin);<br /> return ysat | (z & 0x55555555);<br />}</pre>Merging them this way is nice, because only "half" of a <code>TesseralMin</code> or <code>TesseralMax</code> is necessary that way. On the other hand, they do have the overflow problem again, though that usually won't be a problem.<br/><br/><a href="http://bitmath.blogspot.com/2013/04/improving-bounds-when-some-bits-have.html">Next time</a>, back to "stuff with bounds". </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-14331192293311816402012-11-27T02:26:00.001-08:002013-03-03T07:38:19.978-08:00Tesseral arithmetic<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><small>Introductions are boring, feel free to skip to the <a href="#interesting">interesting stuff</a></small><br/><br/><a href="http://www.geog.ubc.ca/courses/klink/gis.notes/ncgia/u37.html#SEC37.4.4">Tesseral arithmetic</a> is a type of arithmetic that operates on interleaved coordinates. That may not seem very useful, so first, when would you want to do that?<br/><br/>The <a href="http://en.wikipedia.org/wiki/Z-order_curve">Z-order curve</a> is a space-filling curve (also known as Morton order, Morton coordinates, etc) that is closely related to quad trees (and octrees) and (in some contexts) improves the locality of reference when working with multidimensional data.<br/><br/>In essence, it maps multidimensional coordinates to single-dimensional coordinates, which can be used to address memory, and it does so in a way that sometimes leads to better locality of reference than concatenating the parts of a coordinate into a longer one. The trick is to <a href="http://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN">interleave the bits</a>. While that is not the best (ie. optimal locality of reference) mapping, but it's interesting that it works so well for such a simple trick.<br/><br/>But where it really gets interesting is when you have interleaved coordinates and you want to do math with them. You could unpack them, do your math, and then repack, but if you follow the previous link you can see that while unpacking and packing are simple and fast relative to the mappings of other space-filling curves, unpacking and packing would add a lot of overhead to what would otherwise be simple math.<br/><br/>That's where tesseral arithmetic comes in.<br/><br/>Bitwise AND, OR and XOR still work the same way, because the bits of the result only depend on the corresponding bits in the inputs. Shifts are simple - the shift count must be multiplied by two. So for example <code>x ^ (x << 1)</code> becomes <code>x ^ (x << 2)</code> in tesseral arithmetic.<br/><br/><a name="interesting"></a>Addition is more trouble. The carries in normal addition propagate into bits they shouldn't be affecting in tesseral arithmetic. But consider what would happen if the bit pairs at odd positions would each sum to 1. A carry coming into an odd position would always be passed on, and no extra carries would be generated from odd positions. So if the bits at odd positions are just right, the bits at the even positions are summed tesserally, with the carry moving two places instead of one. Obviously this extends to the odd bits as well, when the bits at even positions are just right. This actually makes tesseral addition quite simple: <pre class="brush: csharp">static uint TesseralAdd(uint z, uint w)<br />{<br /> uint xsum = (z | 0xAAAAAAAA) + (w & 0x55555555);<br /> uint ysum = (z | 0x55555555) + (w & 0xAAAAAAAA);<br /> return (xsum & 0x55555555) | (ysum & 0xAAAAAAAA);<br />}</pre>Unsurprisingly, the same principle applies to subtraction. In subtraction, borrows are passed on unmodified through a pair of bits if they sum to zero, or in other words, if both are zero. In a way that's conceptually even simpler than addition. <pre class="brush: csharp">static uint TesseralSubtract(uint z, uint w)<br />{<br /> uint xdiff = (z & 0x55555555) - (w & 0x55555555);<br /> uint ydiff = (z & 0xAAAAAAAA) - (w & 0xAAAAAAAA);<br /> return (xdiff & 0x55555555) | (ydiff & 0xAAAAAAAA);<br />}</pre>But multiplication isn't that nice. The problem is that multiplication is basically build out of a lot of shifts and additions (it's not implemented that way in hardware anymore) and the additions aren't tesseral nor can they be made tesseral.<br/>Unless, of course, we implement multiplication in software: <pre class="brush: csharp">static uint TesseralMultiply(uint z, uint w)<br />{<br /> uint x = z & 0x55555555;<br /> uint y = w & 0x55555555;<br /> uint xres = 0;<br /> while (x != 0)<br /> {<br /> if ((x & 1) != 0)<br /> xres = (xres | 0xAAAAAAAA) + y;<br /> y <<= 2;<br /> x >>= 2;<br /> }<br /><br /> x = z & 0xAAAAAAAA;<br /> y = w & 0xAAAAAAAA;<br /> uint yres = 0;<br /> while (x != 0)<br /> {<br /> if ((x & 2) != 0)<br /> yres = (yres | 0x55555555) + y;<br /> y <<= 2;<br /> x >>= 2;<br /> }<br /><br /> return (xres & 0x55555555) | (yres & 0xAAAAAAAA);<br />}</pre>But that doesn't achieve the goal of being faster than unpacking, doing math, and repacking. If anyone has a better idea, please let me know.<br/><br/>So ok, no tricks multiplication or division. But we're not done. As I hinted in my previous post, many bitwise tricks extend to tesseral arithmetic. For example, taking the absolute value of both parts of the coordinate simultaneously, using the same trick as in my <a href="http://bitmath.blogspot.nl/2012/11/the-basics-of-working-with-signbit.html">previous post (working with the signbit)</a>. The basic principle is simple: replace all operations by their tesseral counterparts. Then look for simplifications and other improvements. <pre class="brush: csharp">static uint TesseralAbs(uint z)<br />{<br /> uint maskx = (uint)((int)z << 1 >> 31);<br /> uint masky = (uint)((int)z >> 31);<br /><br /> // this is a simplified tesseral addition (followed by a xor)<br /> uint xabs = (z & 0x55555555) + maskx ^ maskx;<br /> uint yabs = (z & 0xAAAAAAAA) + masky ^ masky;<br /><br /> return (xabs & 0x55555555) | (yabs & 0xAAAAAAAA);<br />}</pre>The mask is known to be either all ones or all zeroes. It may seem at first as though that means we'd have to OR it with something to make the "in between" bits sum to one, but when the mask is zero there are no carries to pass on anyway. So the OR can be skipped.<br/><br/>But calculating absolute values of coordinates doesn't happen that often. So let's calculate an element-wise minimum, using the same basic principle as before, replace normal operators by tesseral operators. This time however, a substantial improvement over the non-tesseral version is possible. <pre class="brush: csharp">static uint TesseralMin(uint z, uint w)<br />{<br /> // these are tesseral subtractions, of course<br /> uint xdiff = (z & 0x55555555) - (w & 0x55555555);<br /> uint ydiff = (z >> 1 & 0x55555555) - (w >> 1 & 0x55555555);<br /><br /> uint maskx = (uint)((int)xdiff >> 31);<br /> uint masky = (uint)((int)ydiff >> 31);<br /><br /> uint xmin = (maskx & z) | (~maskx & w);<br /> uint ymin = (masky & z) | (~masky & w);<br /><br /> return (xmin & 0x55555555) | (ymin & 0xAAAAAAAA);<br />}</pre>And there's something very nice about how that worked out. In the normal <code>min</code>, there was a problem with overflow. That doesn't happen here, because for <code>xdiff</code> there was an extra bit anyway, and for <code>ydiff</code> that extra bit could easily be arranged by shifting right by 1. That makes the comparison unsigned, though, because the "extra bit" is zero, not a sign-extended bit.<br/><br/>So that's it for this post. Many other bitwise tricks can be extended to tesseral math, using the same basic principle. And of course this all generalizes to higher dimensions as well.<br/><br/>In the <a href="http://bitmath.blogspot.com/2012/11/tesseral-arithmetic-useful-snippets.html">next post</a>, I'll have some more useful snippets for tesseral arithmetic. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-44703465905904604972012-11-18T02:36:00.000-08:002013-05-22T08:38:03.571-07:00The basics of working with the signbit<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><small>this is a filler (in that it is much easier than the usual material), but it seems like most readers only read the fillers anyway</small><br/><br/>When I write signbit, I mean the upper bit in a bit string that is interpreted as a <a href="http://en.wikipedia.org/wiki/Two%27s_complement">two's complement</a> signed integer.<br/><br/>Central to working with the signbit is the idea that signed shift right aka <a href="http://en.wikipedia.org/wiki/Arithmetic_shift">arithmetic shift right</a> copies the signbit to other bits, and specifically, a signed shift right by 31 (or 63 or in general, one less than the size of your numbers) broadcasts the signbit to all other bits.<br/><br/>Perhaps the most obvious thing you can do with that is broadcasting an <i>arbitrary</i> bit to all other bits. Simply shift that bit into the signbit, and then shift right by 31: <pre class="brush: csharp">static int broadcastbit(int value, int bitindex)<br />{<br /> // put the target bit in the sign<br /> int temp = value << (31 - bitindex);<br /> // copy it to all bits<br /> return temp >> 31;<br />}</pre>In C, that's undefined behaviour (UB). Letting a left shift overflow (which could easily happen here) is UB, and signed right shift is UB in any case. But this is C# code (the source of this page will tell you so) where it's perfectly well-defined. And anyway, this is the kind of UB that is safe to use; the expected thing happens when you combine a sane compiler with a typical platform (say, MSVC on x86). But, of course, purists won't like it and on platforms without arithmetic right shift it's probably not going to work.<br/><br/>That actually applies to most of this blog, I suppose.<br/><br/>On to other tricks. This one is slightly harder to grasp, but more useful: calculating the absolute value of an integer without branching. First, the simple to understand version. <pre class="brush: csharp">static int abs(int value)<br />{<br /> // make a mask that is all ones if negative, or all zeroes if non-negative<br /> int mask = value >> 31;<br /> // select -value if negative, or value if non-negative<br /> return (mask & -value) | (~mask & value);<br />}</pre>That's just the usual branchless selection between two things.<br/><br/>The better way to do this has to do with how negation works. The negation of a number <code>x</code> is <code>~x + 1</code> (first definition) or <code>~(x - 1)</code> (second definition). Those definitions are, of course, equivalent. The trick (and you may have seen this coming), is to make the complement and the increment/decrement conditional based on the mask. <pre class="brush: csharp">static int abs(int value)<br />{<br /> // make a mask that is all ones if negative, or all zeroes if non-negative<br /> int mask = value >> 31;<br /> // conditionally complement and subtract -1 (first definition)<br /> return (value ^ mask) - mask;<br /> // conditionally add -1 and complement (second definition)<br /> return (value + mask) ^ mask;<br />}</pre>I've heard that the version of <code>abs</code> using the first definition is patented. That probably doesn't hold up (there will be a mountain of prior art and it's an obvious trick that anyone could derive), and no one's going to find out you're using it much less sue you for it, but you could use the version using the second definition just to be on the safe side.<br/><br/>One good thing about the simple version of <code>abs</code> is that it's using a generic branchless selection. That means you're not limited to choosing between <code>value</code> and <code>-value</code>, you can select <i>anything</i>. For example, you can subtract two numbers and use the sign of the difference to select the (unsigned) smallest one. That doesn't always work. The subtraction must not overflow, otherwise it selects the wrong one. The problem goes away if the inputs are smaller than <code>int</code>s, for example if they are bytes. <pre class="brush: csharp">static byte min(byte x, byte y)<br />{<br /> int difference = x - y;<br /> // make a mask that is all ones if x < y, or all zeroes if x >= y<br /> int mask = difference >> 31;<br /> // select x if x < y, or y if x >= y<br /> return (byte)((mask & x) | (~mask & y));<br /> // alternative: use arithmetic to select the minimum<br /> return (byte)(y + (difference & mask));<br />}</pre>The weird mixing of signed and unsigned may be confusing. Try to think of numbers as pure bit strings and only look at the type when an operator depends on it. That's closer to what actually happens in a computer, and it's less confusing that way.<br/><br/>The problem also goes away if you can use the carry flag instead of the signbit, because then you're not using a bit of the result to hold a flag but a separate thing, and thus doesn't "eat into the range of values". But high level languages are too good for the carry flag or something like that, and don't enable you to use it. So here's <code>min</code> in x86 assembly: <pre> ; inputs are in eax and edx, result in eax<br /> sub eax, edx<br /> sbb ecx, ecx ; makes ecx all ones if carry (ie. if eax < edx)<br /> and eax, ecx<br /> add eax, edx<br /></pre>Whether this or the more usual branchless version with <code>cmov</code> is faster depends on the processor.<br/><br/>And that has nothing to do with the signbit anymore, I know.<br/><br/>These tricks, and many others, also extend to <a href="http://www.geog.ubc.ca/courses/klink/gis.notes/ncgia/u37.html#SEC37.4.4">tesseral arithmetic</a>, which I'll cover in my <a href="http://bitmath.blogspot.nl/2012/11/tesseral-arithmetic.html">next post</a>, which isn't a filler. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-43115745310204040242012-10-01T03:38:00.000-07:002013-03-03T07:46:57.747-08:00Calculating the lower and upper bounds of the bitwise XOR of two bounded variables<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script type="text/javascript">MathJax.Hub.Config({ "HTML-CSS": { scale: 175 } }); </script><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">Again, the same sort of deal as before. This time with the eXclusive OR (XOR). So the goal is to calculate \begin{equation} \min _{x \in [a, b], y \in [c, d]} x \oplus y \end{equation} And \begin{equation} \max _{x \in [a, b], y \in [c, d]} x \oplus y \end{equation} See how I <i>didn't</i> abuse the notation so badly for once?<br/><br/>Anyway, let's get to it. I'll be honest here, I couldn't solve this one. So no clever bithacks this time. Disappointing, I know. But I still have something relevant to write, so first I'll reproduce Warren's solution from Hacker's Delight and then there will be something relevant. <pre class="brush: cpp">unsigned minXOR(unsigned a, unsigned b, <br /> unsigned c, unsigned d) {<br /> unsigned m, temp;<br /> <br /> m = 0x80000000;<br /> while (m != 0) {<br /> if (~a & c & m) {<br /> temp = (a | m) & -m;<br /> if (temp <= b) a = temp;<br /> } <br /> else if (a & ~c & m) {<br /> temp = (c | m) & -m;<br /> if (temp <= d) c = temp;<br /> }<br /> m = m >> 1;<br /> }<br /> return a ^ c;<br />}<br /><br />unsigned maxXOR(unsigned a, unsigned b, <br /> unsigned c, unsigned d) {<br /> unsigned m, temp;<br /> <br /> m = 0x80000000;<br /> while (m != 0) {<br /> if (b & d & m) {<br /> temp = (b - m) | (m - 1);<br /> if (temp >= a) b = temp;<br /> temp = (d - m) | (m - 1); <br /> if (temp >= c) d = temp;<br /> }<br /> m = m >> 1;<br /> }<br /> return b ^ d;<br />}</pre><br/>This time, there is no <code>break</code>. That's because after handling one bit, there may still be more opportunities to decrease (or increase) the value even further. Worse, these opportunities are not necessarily the same as they were before the upper bits were handled. Handling a bit can create more opportunities and at the same time changes the bound such that further changes may not be allowed at all. That's why there is no clever bithack - you <i>have</i> to iterate in some way. Of course you can still skip to a relevant bit in one go, but it's still a loop.<br/><br/>So the "something relevant" I promised is not about how to do it better. Instead, it's about how to extend it to signed values. Warren notes that this can't be done using algebraic equivalences, but that doesn't mean it can't be done.<br/><br/>What has to change to make a signed <code>minXOR</code>? Firstly, the comparison of the potential new lower bound to the corresponding upper bound must be signed, otherwise it would even declare the old bound to be invalid if it crosses zero. And secondly, the signbit needs special treatment, treating it the same as other bits would make the bound go the wrong way (eg setting it as in <code>minXOR</code> doesn't increase a bound, it decreases it, and obviously you can't decrease the lower bound). The same argument about the signbit holds for the other operators, but the functions that preprocess their inputs so they become signed (see Hacker's Delight, Table 4-1) already make sure that that situation never occurs. <pre class="brush: cpp">int32_t minXORs_internal(int32_t a, int32_t b, int32_t c, int32_t d)<br />{<br /> int32_t temp;<br /> int32_t m = 0x40000000;<br /> while (m != 0)<br /> {<br /> if (~a & c & m)<br /> {<br /> temp = (a | m) & -m;<br /> if (temp <= b) a = temp;<br /> }<br /> else if (a & ~c & m)<br /> {<br /> temp = (c | m) & -m;<br /> if (temp <= d) c = temp;<br /> }<br /> m = m >> 1;<br /> }<br /> return a ^ c;<br />}<br /><br />int32_t minXORs(int32_t a, int32_t b, int32_t c, int32_t d)<br />{<br /> int32_t r1;<br /> if (a < 0 && b < 0 && c < 0 && d >= 0) r1 = minXORs_internal(a, b, 0, d);<br /> else if (a < 0 && b >= 0 && c < 0 && d < 0) r1 = minXORs_internal(0, b, c, d);<br /> else if (a < 0 && b >= 0 && c < 0 && d >= 0) r1 = <br /> min(minXORs_internal(0, b, c, d),<br /> minXORs_internal(a, b, 0, d));<br /> else r1 = minXORs_internal(a, b, c, d);<br /><br /> return r1;<br />}</pre><br/>What's going on there is that if both lower bounds have their sign set, it tries to avoid cancelling the sign out with the other bound. Having the sign set is obviously always lower than having the sign not set. In the third case, there are two ways to avoid cancelling the sign, which must both be checked to see which way gives the lowest result. For all other cases, the sign is either forced to cancel, or just not set in the first place, so it's all handled by <code>minXORs_internal</code>.<br/><br/>The code for <code>maxXORs</code> is very similar, except of course it doesn't try to <i>avoid</i> cancelling signs, it tried to <i>make</i> them cancel.<br/><br/><a href="http://bitmath.blogspot.com/2012/11/the-basics-of-working-with-signbit.html">Next post</a>, something simpler and more general, so probably more usefull. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-44531270980777091532012-09-16T11:37:00.000-07:002012-10-01T03:39:18.483-07:00Calculating the lower and upper bound of the bitwise OR of two variables that are bounded and may have bits known to be zero<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script type="text/javascript">MathJax.Hub.Config({ "HTML-CSS": { scale: 175 } }); </script><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">This new problem clearly is related to two of my <a href="http://bitmath.blogspot.com/2012/09/calculating-lower-bound-of-bitwise-or.html">previous</a> <a href="http://bitmath.blogspot.com/2012/09/calculating-upper-bound-of-bitwise-or.html">posts</a>. But this time, there is slightly more information. It may look like a contrived, purely theoretical, problem, but it actually has applications in abstract interpretation. Static knowledge about the values that variables could have at runtime often takes the form of a range and a number that the variable is known to be a multiple of, which is most commonly a power of two.<br/><br/>The lower bound will be \begin{equation} \min _{x \in [a, b] \wedge m\backslash x, y \in [c, d] \wedge n\backslash y} x | y \end{equation} And the upper bound will be \begin{equation} \max _{x \in [a, b] \wedge m\backslash x, y \in [c, d] \wedge n\backslash y} x | y \end{equation} Where <code>m\x</code> means "<code>x</code> is divisible by <code>m</code>".<br/><br/>So how can we calculate them faster than direct evaluation? I don't know, and to my knowledge, no one else does either. But if sound (ie only <i>over</i>approximating) but non-tight bounds are OK, then there is a way. Part of the trick is constraining <code>m</code> and <code>n</code> to be powers of two. It's safe to use <code>m = m & -m</code>. That should look familiar - it's extracting the rightmost bit of <code>m</code>. An other explanation of "the rightmost bit of <code>m</code>" is "the highest power of two that divides <code>m</code>". That doesn't rule out any values of <code>x</code> that were valid before, so it's a sound approximation.<br/><br/>Strangely, for <code>minOR</code>, if the bounds are pre-rounded to their corresponding powers of two, there is absolutely no difference in the code whatsoever. It is possible to set a bit that is known to be zero in that bound, but that can only happen if that bit is one in the other bound anyway, so it doesn't affect the result. The other case, setting a bit that is not known to be zero, is the same as it would be with only the range information.<br/><br/><code>maxOR</code> is a problem though. In <code>maxOR</code>, bits at the right are set which may be known to be zero. Some of those bits may have to be reset. But how many? To avoid resetting too many bits, we have to round the result down to a multiple of <code>min(m, n)</code>. That's clearly sound - if a bit can't be one in both <code>x</code> and <code>n</code>, obviously it can't be one in the result. But it turns out not to be tight - for example for <code>[8, 9] 1\x</code> and <code>[0, 8] 4\y</code>, it computes 0b1111, even though the last two bits can only be 0b00 or 0b01 (<code>y</code> does not contribute to these bits, and the range of <code>x</code> is so small that the bits only have those values) so the tight upper bound is 0b1101. If that's acceptable, the code would be <pre class="brush: csharp">static uint maxOR(uint a, uint b, uint c, uint d, uint m, uint n)<br />{<br /> uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);<br /> uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);<br /> uint resettable = b & d & (resettableb | resettabled);<br /> uint target = resettable == 0 ? 0 : 1u << bsr(resettable);<br /> uint targetb = target & resettableb;<br /> uint targetd = target & resettabled & ~resettableb;<br /> uint newb = b | (targetb == 0 ? 0 : targetb - 1);<br /> uint newd = d | (targetd == 0 ? 0 : targetd - 1);<br /> uint mask = (m | n) & (0 - (m | n));<br /> return (newb | newd) & (0 - mask);<br />}</pre>Which also uses a sneaky way of getting <code>min(m, n)</code> - by ORing them and then taking the rightmost bit. Because why not.</br><br/>I haven't (yet?) found a nice way to calculate the tight upper bound. Even if I do, that still leaves things non-tight when the old <code>m</code> or <code>n</code> were not powers of two.<br/><br/><a href="http://bitmath.blogspot.com/2012/10/calculating-lower-and-upper-bounds-of.html">Next post</a>, xor, which has some unique difficulties. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-18239386064042961472012-09-14T11:03:00.000-07:002012-09-18T09:14:22.503-07:00Calculating the lower and upper bounds of the bitwise AND of two bounded variables<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script type="text/javascript">MathJax.Hub.Config({ "HTML-CSS": { scale: 175 } }); </script><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">This post is the closely related the <a href="http://bitmath.blogspot.com/2012/09/calculating-upper-bound-of-bitwise-or.html">previous post</a> and the <a href="http://bitmath.blogspot.com/2012/09/calculating-lower-bound-of-bitwise-or.html">post before it</a>, so I strongly suggest you read those two first.<br><br/>It's the same idea as before, but with bitwise AND instead of OR. That leads to some interesting symmetries. First, the definitions. The lower bound will be \begin{equation} \min _{x \in [a, b], y \in [c, d]} x \& y \end{equation} And the upper bound will be \begin{equation} \max _{x \in [a, b], y \in [c, d]} x \& y \end{equation} The algorithms given by Warren are <pre class="brush: cpp">unsigned minAND(unsigned a, unsigned b, <br /> unsigned c, unsigned d) {<br /> unsigned m, temp; <br /> <br /> m = 0x80000000; <br /> while (m != 0) {<br /> if (~a & ~c & m) {<br /> temp = (a | m) & -m; <br /> if (temp <= b) {a = temp; break;} <br /> temp = (c | m) & -m; <br /> if (temp <= d) {c = temp; break;} <br /> } <br /> m = m >> 1; <br /> } <br /> return a & c; <br />}</pre><pre class="brush: cpp">unsigned maxAND(unsigned a, unsigned b, <br /> unsigned c, unsigned d) {<br /> unsigned m, temp; <br /> <br /> m = 0x80000000; <br /> while (m != 0) {<br /> if (b & ~d & m) {<br /> temp = (b & ~m) | (m - 1); <br /> if (temp >= a) {b = temp; break;} <br /> } <br /> else if (~b & d & m) {<br /> temp = (d & ~m) | (m - 1); <br /> if (temp >= c) {d = temp; break;} <br /> } <br /> m = m >> 1; <br /> } <br /> return b & d; <br />}</pre>Obviously, the follow the same basic idea. Try to set a bit so you can reset the bits to the right of it in the lower bound, or try to reset a bit so you can set the bits to the right of it in the upper bound. The same reasoning about starting at <code>0x80000000 >> nlz(~a & ~c)</code> or <code>0x80000000 >> nlz(b ^ d)</code> applies, and the same reasoning about "bits at and to the right of <code>a ^ b</code>" applies as well. I'll skip the "sparse loops" this time, they're nice enough but mainly instructive, and repeating the same idea twice doesn't make it twice as instructive. So straight to the loopless algorithms: <pre class="brush: csharp">static uint minAND(uint a, uint b, uint c, uint d)<br />{<br /> uint settablea = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);<br /> uint settablec = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);<br /> uint settable = ~a & ~c & (settablea | settablec);<br /> uint target = settable == 0 ? 0 : 1u << bsr(settable);<br /> uint targeta = target & settablea;<br /> uint targetc = target & settablec & ~settablea;<br /> uint newa = a & (targeta == 0 ? 0xFFFFFFFF : 0-targeta);<br /> uint newc = c & (targetc == 0 ? 0xFFFFFFFF : 0-targetc);<br /> return newa & newc;<br />}</pre><pre class="brush: csharp">static uint maxAND(uint a, uint b, uint c, uint d)<br />{<br /> uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);<br /> uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);<br /> uint candidatebitsb = b & ~d & resettableb;<br /> uint candidatebitsd = ~b & d & resettabled;<br /> uint candidatebits = candidatebitsb | candidatebitsd;<br /> uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);<br /> uint targetb = target & b;<br /> uint targetd = target & d & ~b;<br /> uint newb = b | (targetb == 0 ? 0 : targetb - 1);<br /> uint newd = d | (targetd == 0 ? 0 : targetd - 1);<br /> return newb & newd;<br /></pre>Symmetry everywhere. But not really anything to new to explain.<br/><br/><a href="http://bitmath.blogspot.com/2012/09/calculating-lower-and-upper-bound-of.html">Next post</a>, something new to explain. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-82416886179764428172012-09-14T05:49:00.000-07:002012-09-18T10:15:23.358-07:00Calculating the upper bound of the bitwise OR of two bounded variables<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script type="text/javascript">MathJax.Hub.Config({ "HTML-CSS": { scale: 175 } }); </script><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">This post is the closely related the <a href="http://bitmath.blogspot.com/2012/09/calculating-lower-bound-of-bitwise-or.html">previous one</a>, so I strongly suggest you read that one first.<br/><br/>The only difference with the previous post, is that this time, we're interested in the upper bound instead of the lower bound. In other words, evaluate<br/>\begin{equation} \max _{x \in [a, b], y \in [c, d]} x | y \end{equation} The algorithm given by Warren in Hackers Delight is <pre class="brush: cpp">unsigned maxOR(unsigned a, unsigned b, <br /> unsigned c, unsigned d) {<br /> unsigned m, temp; <br /> <br /> m = 0x80000000; <br /> while (m != 0) {<br /> if (b & d & m) {<br /> temp = (b - m) | (m - 1); <br /> if (temp >= a) {b = temp; break;} <br /> temp = (d - m) | (m - 1); <br /> if (temp >= c) {d = temp; break;} <br /> } <br /> m = m >> 1; <br /> } <br /> return b | d; <br />}</pre>And it's really the same sort of idea as the algorithm to calculate the minimum, except this time we're looking for a place where both <code>b</code> and <code>d</code> are one, so we can try to reset that bit and set all the bits to the right of it.<br/><br/>Warren notes that <code>m</code> can start at <code>0x80000000 >> nlz(b & d)</code>, and once again the same principle holds: it's enough to <i>only</i> look at those bits which are one in <code>b & d</code>, and they can be visited from high to low with <code>bsr</code><pre class="brush: csharp">static uint maxOR(uint a, uint b, uint c, uint d)<br />{<br /> while (bits != 0)<br /> {<br /> uint m = 1u << bsr(bits);<br /><br /> uint temp;<br /> temp = (b - m) | (m - 1);<br /> if (temp >= a) { b = temp; break; }<br /> temp = (d - m) | (m - 1);<br /> if (temp >= c) { d = temp; break; }<br /><br /> bits ^= m;<br /> }<br /> return b | d;<br />}</pre>And also, again, we can use that the bit we're looking for in <code>b</code> must be at or to the right of the leftmost bit in <code>a ^ b</code> (<code>c ^ d</code> for <code>d</code>), and that the selected bit doesn't actually have to be changed. <pre class="brush: csharp">static uint maxOR(uint a, uint b, uint c, uint d)<br />{<br /> uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);<br /> uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);<br /> uint candidatebits = b & d & (resettableb | resettabled);<br /> uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);<br /> uint targetb = target & resettableb;<br /> uint targetd = target & resettabled & ~resettableb;<br /> uint newb = b | (targetb == 0 ? 0 : targetb - 1);<br /> uint newd = d | (targetd == 0 ? 0 : targetd - 1);<br /> return newb | newd;<br />}</pre>Most of the code should be obvious after a moments thought, but something interesting and non-symmetric happens for <code>targetd</code>. There, I had to make sure that a change is not made to <i>both</i> bounds (that would invalidate the whole idea of "being able to make the change without affecting that bit in the result"). In <code>minOR</code> that happened automatically because it looked at positions where the bits were different, so both <code>target</code>s couldn't both be non-zero. Here, one of the bounds has to be explicitly prioritized before the other.<br/><br/><a href="http://bitmath.blogspot.com/2012/09/calculating-lower-and-upper-bounds-of.html">Next post</a>, maybe the same sort of thing but for bitwise AND. Then again, maybe not. I'll see what I can come up with.<br/>edit: bitwise AND it is. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-83582075440097110822012-09-13T16:19:00.001-07:002012-11-19T08:51:54.895-08:00Calculating the lower bound of the bitwise OR of two bounded variables<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script type="text/javascript">MathJax.Hub.Config({ "HTML-CSS": { scale: 175 } }); </script><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">What does that even mean?<br/><br/>Suppose you have the variables <code>x in [a, b]</code> and <code>y in [c, d]</code>. The question then is: what is the lowest possible value of <code>x | y</code> where <code>x</code> and <code>y</code> are both in their corresponding ranges. In other words, evaluate<br/>\begin{equation} \min _{x \in [a, b], y \in [c, d]} x | y \end{equation} At a maximum of 2<sup>64</sup> iterations, direct evaluation is clearly not an option for 32-bit integers.<br/><br/>Fortunately, there is an algorithm that has a complexity linear in the number of bits, given by Warren in Hackers Delight, Propagating Bounds through Logical Operations, which the <a href="http://www.hackersdelight.org/permissions.htm">license</a> permits me to show here: <pre class="brush: cpp">unsigned minOR(unsigned a, unsigned b, <br /> unsigned c, unsigned d) {<br /> unsigned m, temp; <br /> <br /> m = 0x80000000; <br /> while (m != 0) {<br /> if (~a & c & m) {<br /> temp = (a | m) & -m; <br /> if (temp <= b) {a = temp; break;} <br /> } <br /> else if (a & ~c & m) {<br /> temp = (c | m) & -m; <br /> if (temp <= d) {c = temp; break;} <br /> } <br /> m = m >> 1; <br /> } <br /> return a | c; <br />}</pre><br/>So let's break down what it's doing. It starts at the MSB, and then it searches for either the highest bit that is zero <code>a</code> and one in <code>c</code> such that changing <code>a</code> to have that bit set and all bits the right of it unset would not make the new <code>a</code> higher than <code>b</code>, or, the highest bit that is zero <code>c</code> and one in <code>a</code> such that changing <code>c</code> to have that bit set and all bits the right of it unset would not make the new <code>c</code> higher than <code>d</code>, whichever one comes first.<br/><br/>That's literally easier to code than to explain, and I haven't even explained yet <i>why</i> it works.<br/>Suppose the highest such bit is found in <code>a</code>. Setting that bit in <code>a</code> does not affect the value of <code>a | c</code>, after all, that bit must have been set in <code>c</code> already so it was already set in <code>a | c</code>, too. However, resetting the bits to the right of that bit however can lower <code>a | c</code>. Notice that it is pointless to continue looking at lower bits - in <code>a</code> there are no more bits to reset, and for <code>c</code> there are no more bits that have the corresponding bit in <code>a</code> set.<br/><br/>Warren notes that <code>m</code> could start at <code>0x80000000 >> nlz(a ^ c)</code> (where <code>nlz</code> is the "number of leading zeros" function), meaning it starts looking at the first bit that is different in <code>a</code> and <code>c</code>. But we can do better. Not only can we start at the first bit which is different in <code>a</code> and <code>c</code>, we could look at <i>only</i> those bits. That requires frequent invocation of the <code>nlz</code> function (or <code>bsr</code>, <b>b</b>it <b>s</b>can <b>r</b>everse, giving the index of the leftmost bit), but it maps to a fast instruction on many platforms. <pre class="brush: csharp">uint minOR(uint a, uint b, uint c, uint d)<br />{<br /> uint bits = a ^ c;<br /> while (bits != 0)<br /> {<br /> // get the highest bit<br /> uint m = 1u << (nlz(bits) ^ 31);<br /> // remove the bit<br /> bits ^= m;<br /> if ((a & m) == 0)<br /> {<br /> uint temp = (a | m) & -m;<br /> if (temp <= b) { a = temp; break; }<br /> }<br /> else<br /> {<br /> uint temp = (c | m) & -m;<br /> if (temp <= d) { c = temp; break; }<br /> }<br /> }<br /> return a | c;<br />}</pre>One interesting consequence of looking only at the bits that are different is that the second <code>if</code> disappears - the case where the bits are equal is ruled out by looking only at the different bits in the first place.<br/><br/>But that is not all. The bit positions at which the <= operators could return true, are precisely all those at and to the right of one important point: the highest set bit in <code>a ^ b</code> (or <code>c ^ d</code> for the other bound). Why? Well the upper bounds are not lower than the lower bounds, so the first bit at which they differ must be the first position at which the lower bound has a zero where the upper bound has a one. Setting that bit to one and all bits to the right to zero in the lower is clearly valid (ie doesn't make it higher than the upper bound), but whether that bit can actually be set depends on the other lower bound as well.<br/><br/>What that means in practical terms, is that the value of <code>m</code> that first passes the tests is directly computable. No loops required. Also, because the test to check whether the new bound is still less than or equal to the upper bound isn't necessary anymore (by construction, that test always passes), the bit doesn't even have to be set anymore - without the test the new value isn't really needed, and the entire idea was that setting that bit would not change the result, so setting it is pointless. <pre class="brush: csharp">uint minOR(uint a, uint b, uint c, uint d)<br />{<br /> uint settablea = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);<br /> uint settablec = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);<br /> uint candidatebitsa = (~a & c) & settablea;<br /> uint candidatebitsc = (a & ~c) & settablec;<br /> uint candidatebits = candidatebitsa | candidatebitsc;<br /><br /> uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);<br /> uint targeta = c & target;<br /> uint targetc = a & target;<br /><br /> uint newa = a & ~(targeta == 0 ? 0 : targeta - 1);<br /> uint newc = c & ~(targetc == 0 ? 0 : targetc - 1);<br /> return newa | newc;<br />}</pre>Sadly, there's an awful lot of conditionals in there, which could be branches. But they could also be conditional moves. And on x86 at least, both <code>bsr</code> and <code>lzcnt</code> set a nice condition flag if the input was zero, so it's really not too bad in practice. It is, in my opinion, a pity that there aren't more instruction to deal with leftmost bits, while instruction that deal with the rightmost bit are being added <a href="http://chessprogramming.wikispaces.com/BMI1">left</a> and <a href="http://chessprogramming.wikispaces.com/TBM">right</a>. They are nice, I will admit, but the rightmost bit could already be efficiently dealt with, while the leftmost bit is somewhat problematic.<br/><br/><a href="http://bitmath.blogspot.com/2012/09/calculating-upper-bound-of-bitwise-or.html">Next post</a>, well, I'm not sure yet. I'll work on something similar for the upper bound instead of the lower bound, but who knows.<br/>edit: As it turns out, the upper bound was fairly easy to do as well. </span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-19383374435840730322012-09-13T13:39:00.000-07:002014-01-29T09:52:46.379-08:00Divisibility and modular multiplication, even divisors<span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">As promised, I will now expand the <a href="http://bitmath.blogspot.com/2012/09/divisibility-and-modular-multiplication.html">divisibility testing by modular multiplication</a> algorithm to handle even divisors.<br/><br/>Recall that a number <code>y</code> that has a rightmost bit <a href="http://bitmath.blogspot.com/2012/09/the-basics-of-working-with-rightmost-bit.html">can be written as</a> <code>y = d * 2<sup>n</sup></code> where <code>d</code> is odd. A number <code>x</code> is divisible by <code>y = d * 2<sup>n</sup></code> iff it is divisible by <code>2<sup>n</sup></code> and by <code>d</code>. And both of those problems have already been solved in earlier posts, so: <pre class="brush: cpp">bool IsDivisibleBy(uint32_t x, uint32_t divisor)<br />{<br /> uint32_t poweroftwo = divisor & -divisor;<br /> uint32_t d = divisor >> bsf(divisor);<br /> return (x & (poweroftwo - 1)) == 0 && <br /> (d == 1 || IsDivisibleByOdd(x, d));<br />}</pre><br/>Pretty straightforward. Except perhaps the <code>d == 1</code> bit. Recall that <code>IsDivisibleByOdd</code> doesn't want the divisor to be one, so that case has to be avoided. And if <code>d</code> is one, that means the <code>divisor</code> was a power of two. It even works if <code>divisor</code> is one; <code>poweroftwo</code> would also be one, and <code>x & 0</code> is clearly always zero.<br/><br/>And <code>bsf</code> is not defined. The implementation would be strongly platform dependent, and not particularly enlightening.<br/><br/>Now, on to the performance part. Does this help? The answer is the same as last time - no, usually not. Except in some select cases, such as when you need to test a whole array for divisibility by the same number, which is not a compile-time constant.<br/><br/>Added on 29 January 2014: as Jasper Neumann pointed out, if right-rotate can be used, the part that tests whether the lower bits are zero can be merged with the other test, as described in, for example, Hacker's Delight chapter 10-16 (Test for Zero Remainder after Division by a Constant) and <a href="http://gmplib.org/~tege/divcnst-pldi94.pdf">gmplib.org/~tege/divcnst-pldi94.pdf</a>. <br/><br/>That concludes the divisibility series (for now, anyway). <a href="http://bitmath.blogspot.com/2012/09/calculating-lower-bound-of-bitwise-or.html">Next post</a>, something completely different.</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-27293161146499359392012-09-13T08:55:00.000-07:002014-04-11T14:33:07.100-07:00The basics of working with the rightmost bit<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;"><p>note: I rewrote this post because of its popularity.</p><p>In this post I will assume that the reader knows the bitwise operators and what they do (if not, see <a href="http://www.catonmat.net/blog/low-level-bit-hacks-you-absolutely-must-know/">Low Level Bit Hacks You Absolutely Must Know</a>), to avoid having to cover the basics.<p/> <p>The rightmost bit (not to be confused with the least-significant bit), also called "rightmost 1", is the lowest bit that is set (rightmost zero is the lowest bit that is not set). So zero, unlike all other numbers, doesn't have a rightmost bit. The rightmost bit is interesting because surprisingly many useful operations can be done on it.</p> <p>Here's a small selection of potentially useful basic "rightmost bit/zero operations": <center><table><tr><td><code>x - 1</code></td><td>Remove the rightmost bit and smear it to the right.<br/>(zero is interpreted as having a rightmost bit just beyond the msb)</td></tr><tr><td><code>x & (x - 1)</code></td><td>Remove the rightmost bit.</td></tr><tr><td><code>x & -x</code></td><td>Isolate the rightmost bit.</td></tr><tr><td><code>x | (x + 1)</code></td><td>Set the rightmost zero.</td></tr><tr><td><code>x | ~(x + 1)</code></td><td>Isolate the rightmost zero (as an inverted mask).</td></tr></table></center></p> <h1>How it works</h1> <p>Manipulation of the rightmost bit makes use of the properties of the carry/borrow process, namely that it propagates from the lsb to the msb, changes the bits it touches, and can be stopped. For example, the simplest operation operation, <code>x - 1</code> just runs a borrow through the zeroes on the right, changing all of them to ones, the borrow is stopped by the rightmost one (which is changed to a zero). Effectively it inverted the part of the number that includes the rightmost bit and spans to the right, and left the rest alone. ANDing that number with <code>x</code> (as in the "remove the rightmost bit" operation) then sets that rightmost part to all zeroes (because <code>x & ~x = 0</code>) and leaves the rest of the number alone (because <code>x & x = x</code>).</p> <p>Since <code>-x = ~(x - 1)</code>, clearly negation is a kind of "opposite" of subtracting 1; the rightmost part (including the rightmost 1) is <i>not</i> change, and the leftmost part <i>is</i> changed. So <code>x & -x</code> also gives the opposite thing of <code>x & (x - 1)</code>, namely just the rightmost bit (instead of everything else).</p> <p>The other two operations from the table can be derived by taking <code>~operation(~x)</code> and simplifying it: <pre>~(~x & (~x - 1)) =<br />// use the definition of subtraction: a - b = ~(~a + b)<br />~(~x & ~(x + 1)) =<br />// use De Morgan's law<br />x | (x + 1)</pre> <pre>~(~x & -~x) =<br />// use the definition of negation: -a = ~(a - 1)<br />~(~x & ~(~x - 1)) =<br />// use De Morgan's law<br />x | (~x - 1) =<br />// use the definition of subtraction: a - b = ~(~a + b)<br />x | ~(x + 1)</pre></p> <p>Using the same principles, more complicated operations can be constructed. For example (by chaining two operations on the rightmost bit), by first smearing the rightmost 1 to the right, the rightmost run of ones is now in a good position to get rid of it (by adding one and ANDing):<br/><code>(x | (x - 1)) + 1 & x</code></p> <p>The same can also be accomplished differently (no better, just different), by instead of smearing the rightmost bit to the right so that it can be affected by the +1, adding a big enough number - that number is, of course, the isolated rightmost bit, so we get this:<br/><code>x + (x & -x) & x</code></p> <p>A good overview of the basic rightmost-bit operations can be found <a href="http://programming.sirrida.de/programming.html#rightmost_bits">here</a>.</p> <a href="http://bitmath.blogspot.com/2012/09/divisibility-and-modular-multiplication_13.html">Next time</a>, why rightmost bits are relevant in testing divisibility by even numbers.</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-68735297439457798762012-09-13T03:46:00.000-07:002012-09-13T08:57:01.018-07:00Divisibility and modular multiplication<span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">An other post from <a href="http://www.codeproject.com/Messages/4336898/Divisibility-and-modular-multiplication.aspx">CodeProject</a> (with small changes). I promise I'll post something new next time.<br /><br />Typically, all computer math is done modulo 2<sup>32</sup> (or 2<sup>something</sup>, which the rest of my post trivially generalizes to). This leads to sometimes surprising effects, such as that the sum of <code>a</code> and <code>b</code>, while both positive, can end up being lower than <code>min(a, b)</code>, which is rather well-known and known as "overflow" (often treated like something bad, which it can be). Less well known is that it also means that some numbers have a <a href="https://en.wikipedia.org/wiki/Modular_multiplicative_inverse">multiplicative inverse</a>, ie a number <code>x<sup>-1</sup></code> such that <code>x<sup>-1</sup>x=1</code>. As mentioned in the wikipedia article, the numbers which have multiplicative inverses are precisely those coprime to the modulo. The modulo is a power of two, which means that a number has a multiplicative inverse iff it is odd.<br /><br />And it turns out to be actually useful, too. One application is, as you might guess from the title, divisibility testing.<br /><br />Multiplying a number by an odd number is reversible - you can take the multiplicative inverse of the odd number and then multiply by it to get the original number back. Put differently, the function <code>f(x) = x * k (for k odd)</code> is a bijection. <br /><br />Modular multiplication is associative, so a multiple of <code>k</code>, say <code>n * k</code> multiplied by <code>inv(k)</code> (the multiplicative inverse of <code>k</code>), is <code>n</code>, because</span><br /><pre>(n * k) * inv(k) =<br />// use associativity<br />n * (k * inv(k)) =<br />// use definition of multiplicative inverse<br />n * 1 =<br />// multiplicative identity<br />n</pre><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">That means that in <code>x * inv(k)</code>, the multiples of <code>k</code> "use up" the results from 0 through (2<sup>32</sup>-1)/<code>k</code>, they can't be used twice because it's a bijection, leaving just the numbers bigger than (2<sup>32</sup>-1)/<code>k</code> for non-multiples-of-k. Which suggest a very simple divisibility test:</span><br /><pre class="brush: csharp">static bool IsDivisibleByOdd(uint x, uint divisor)<br />{<br /> if ((divisor & 1) == 0)<br /> throw new ArgumentException("divisor must be odd");<br /> uint d_inv = inv(divisor);<br /> uint biggest = uint.MaxValue / divisor; // problem right here<br /> return (x * d_inv) <= biggest;<br />}<br /> <br />static uint inv(uint d)<br />{<br /> // see Hacker's Delight,<br /> // Computing the Multiplicative Inverse by Newton's Method<br /> // use extra iteration when extending to 64 bits<br /> uint x = (d * d) + d - 1;<br /> uint t = d * x;<br /> x *= 2 - t;<br /> t = d * x;<br /> x *= 2 - t;<br /> t = d * x;<br /> x *= 2 - t;<br /> return x;<br />}</pre><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">And there is a problem. It didn't avoid the division, it turned it into an other division. So this doesn't actually help - in fact it's slower than the usual way. The left hand side of the division is a constant though, so unlike in the usual <code>x % divisor == 0</code>, the slow operation (division and modulo are slow relative to multiplication) can be done at compile time. Unfortunately, that doesn't actually help - it's one of the most common optimizations that compilers do, so it already happened automatically.<br /><br />But. And there is a but. If you want to test for a lot of number whether they are divisible by some <i>odd number</i><sup>1</sup> that doesn't vary in that test, you only have one division, one <code>inv</code>, and the cost per item is only a multiplication and a test. What's more, no compiler I know of automatically does this for you, so this can actually save time.<br /><hr />1. Except 1, but why would you test for divisibility by 1? I will extend this method to even numbers <b>after</b> the <a href="http://bitmath.blogspot.com/2012/09/the-basics-of-working-with-rightmost-bit.html">next post</a>, which will cover some concept (bits again, I haven't forgotten about them) that are necessary for it.<br /></span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0tag:blogger.com,1999:blog-1465986942435538208.post-57902219466229576982012-09-12T14:09:00.003-07:002012-09-13T07:19:17.643-07:00Divisibility and digital roots<span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">To start off this blog, I'll start with a subject that I wrote about on <a href="http://www.codeproject.com/Messages/4329850/Digital-roots-and-divisibility.aspx">CodeProject</a>, divisibility and digital roots.<br /><br />It is well known that a number is divisible by 9 if and only if its digital root is 9. Less well known is that a similar trick kind applies to numbers other than 9, but doesn't really work out.<br /><br />In order to make this trick "work" (I'll get to why it sometimes doesn't) for number <code>k</code>, the digit at position <code>i</code> has to be multiplied by <code>base^i - [the biggest multiple of k <= base^i]</code> before adding it to the (modified) digital sum.<br /><br />For example for <code>k = 7, base = 10</code>, you'd multiply the ones position by 3, the tens position by 2, the hundreds position by 6, and so forth (3, 2, 6, 4, 5, 8, then it repeats). <br /><br />It <i>does</i> transform every multiple of 7 into a multiple of 7 (and every non-multiple-of-7 into a non-multiple-of-7), but it can be the same number, for example 14: 3 * 4 + 2 * 1 = 14, or it can even be a bigger number, for example 9.<br /><br />But we're programmers, so the base isn't 10. It can be 16. 6 * 6 = 36, so every (positive integer) power of 16 ends in a 6, which means that the nearest lower multiple of 5 is only 1 away. So for <code>k = 5</code>, it works out to a factor of 1 at every position.<br /><br />Even better, <code>16^n-1</code> is divisible by 15, so for base 16, <code>k = 15</code> works out well too, with a factor of 1 at every position. This leads to the following algorithm:</span><br /><pre class="brush: csharp">static bool IsMultipleOf15(int x)<br />{<br /> // lookup table to speed up last step<br /> const ulong lookuptable = 0x1000200040008001;<br /> int t = (x & 0x0F0F0F0F) + ((x & unchecked((int)0xF0F0F0F0)) >> 4);<br /> t = (t & 0x001F001F) + ((t & 0x1F001F00) >> 8);<br /> t = (t & 0x0000003F) + ((t & 0x003F0000) >> 16);<br /> t = (t & 0xF) + ((t & 0x70) >> 4);<br /> return ((lookuptable >> t) & 1) != 0;<br />}</pre><span style="font-family: "Helvetica Neue",Arial,Helvetica,sans-serif;">15, of course, has factors 3 and 5, so the same code works to test for divisibility by 3 or 5 just by changing the lookup table to <code>0x9249249249249249</code> or <code>0x1084210842108421</code>, respectively (the two of those ANDed together gives the lookup table for 15, of course). I haven't encountered a situation where this is useful; modulo by a constant is optimized by every sane compiler so this is never an optimization, just a curiosity (or perhaps something to torture interviewees with).<br /><br />In the <a href="http://bitmath.blogspot.com/2012/09/divisibility-and-modular-multiplication.html">next post</a>, I'll cover a divisibility testing algorithm that is actually useful.</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com0