tag:blogger.com,1999:blog-14659869424355382082023-10-21T13:05:22.347-07:00Bits, Math and Performance(?)50% bits, 50% math (allegedly)Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comBlogger37125tag:blogger.com,1999:blog-1465986942435538208.post-77691032379647309192023-09-23T16:33:00.005-07:002023-09-24T22:18:55.304-07:00Permuting bits with GF2P8AFFINEQB<style type="text/css">
.nobr {
white-space: nowrap;
}
blockquote {
background-color: #EEE;
clear: both;
}
img {
max-width: 100%;
max-height: 100%;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>It's no secret that GF2P8AFFINEQB can be tricky to think about, even in the restricted context of bit-permutations. Thinking about more than one step (such as more than one GF2P8AFFINEQB back-to-back, or GF2P8AFFINEQB flanked by byte-wise shuffles) is just too much. Or perhaps you can do it, tell me your secret.</p>
<p>A good way for mere mortals to reason about these kinds of permutations, I think, is to think in terms of the bits of the indices of the bits that are really being permuted. So we're 4 levels deep:</p>
<ol>
<li>The value whose bits are being permuted.</li>
<li>The bits that are being permuted.</li>
<li>The indices of those bits.</li>
<li>The bits of those indices.</li>
</ol>
<p>This can get a little confusing because a lot of the time the operation that will be performed on the bits of those indices is a permutation again, but they don't have to be, another classic example is that a rotation corresponds to add/subtracting a constant to the indices. Just keep in mind that we're 4 levels deep the entire time.</p>
<div style="margin: auto;"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2O3nYzf-SioFbHLCZEn6KMJjtr9hzSrrWgad5TShU6ZLebPiBUgmtkSKZweuPubnsPWtUkYngBvIH1EEWrEcMj0nQ0LMWaUPjy7CTiKd_j6O4uEcp2X1icA0PI4JK2eqGM7bJ1ZSGC4TqDNKnYFlRaPE7vyh06hZdXtS7-JWpXTo4XWnBEqeybucacCU/s1600/we%20need%20to%20go%20deeper.jpg"/><p>Actually we don't need to go deeper.</p></div>
<h2>The building blocks</h2>
<p>Assuming we have 512 bits to work with, the indices of those bits are 0..511: 9-bit numbers. We will split that into 3 groups of 3 bits, denoted <tt>a,b,c</tt> where <tt>a</tt> locates a QWORD in the 512-bit register, <tt>b</tt> locates a byte within that QWORD, and <tt>c</tt> locates a bit within that byte.</p>
<p>Here are some nice building blocks (given fairly arbitrary names):</p>
<ul>
<li><tt>P<sub>f</sub>(a,b,c) = a,b,f(c)</tt> aka "right GF2P8AFFINEQB", where <tt>f</tt> is any mapping from a 3-bit integer to a 3-bit integer. This building block can be implemented with <tt>_mm512_gf2p8affine_epi64_epi8(input, _mm512_set1_epi64(f_as_a_reversed_matrix), 0)</tt></li>
<li><tt>Q<sub>f</sub>(a,b,c) = a,f(c),~b</tt> aka "left GF2P8AFFINEQB", where <tt>~b</tt> is a 3-bit inversion, equivalent to <tt>7 - b</tt>. <tt>f</tt> can often be the identity mapping, swapping the second and third groups of bits is useful on its own (the "bonus" inversion can be annoying to deal with). This building block can be implemented with <tt>_mm512_gf2p8affine_epi64_epi8(_mm512_set1_epi64(f_as_a_matrix), input, 0)</tt></li>
<li><tt>S<sub>g</sub>(a,b,c) = g(a,b),c</tt> aka Shuffle, where <tt>g</tt> is any mapping from a 6-bit integer to a 6-bit integer. This building block can be implemented with <tt>_mm512_permutexvar_epi8(g_as_an_array, input)</tt>, but in some cases also with another instruction that you may prefer, depending on the mapping.</li>
</ul>
<p><tt>S</tt>, though it doesn't touch <tt>c</tt>, is quite powerful. As a couple of special cases that may be of interest, it can be used to swap <tt>a</tt> and <tt>b</tt>, invert <tt>a</tt> or <tt>b</tt>, or do a combined swap-and-invert.</p>
<p>We could further distinguish:</p>
<ul>
<li><tt>S64<sub>f</sub>(a,b,c) = f(a),b,c</tt> aka VPERMQ. This building block can be implemented with, you guessed it, VPERMQ.</li>
<li><tt>S8<sub>f</sub>(a,b,c) = a,f(b),c</tt> aka PSHUFB. This building block can be implemented with, you guessed it, PSHUFB. PSHUFB allows a bit more freedom than is used here, the mapping could be from 4-bit integers to 4-bit integers, but that's not nice to think about in this framework of 3 groups of 3 bits.</li>
</ul>
<h2>Building something with the blocks</h2>
<p>Let's say that we want to take a vector of 8 64-bit integers, and transpose it into a vector of 64 8-bit integers such that the k'th bit of the n'th uint64 ends up in the n'th bit of the k'th uint8. In terms of the bits of the indices of the bits (I swear it's not as confusing as it sounds) that means we want to build something that maps <tt>a,b,c</tt> to <tt>b,c,a</tt>. It's immediately clear that we need a <tt>Q</tt> operation at some point, since it's the only way to swap some other groups of bits into the 3rd position. But if we start with a <tt>Q</tt>, we get <tt>~b</tt> in the 3rd position while we need <tt>a</tt>. We can solve that by starting with an <tt>S</tt> that swaps <tt>a</tt> and <tt>b</tt> while also inverting <tt>a</tt> (I'm not going to bother defining what that looks like in terms of an index mapping function, just imagine that those functions are whatever they need to be in order to make it work):</p>
<div>
<tt>S<sub>f</sub>(a,b,c) = b,~a,c</tt><br/>
<tt>Q<sub>id</sub>(b,~a,c) = b,c,a</tt><br/>
</div>
<p>Which translates into code like this:</p>
<pre>__m512i Transpose8x64(__m512i x)
{
x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
56, 48, 40, 32, 24, 16, 8, 0,
57, 49, 41, 33, 25, 17, 9, 1,
58, 50, 42, 34, 26, 18, 10, 2,
59, 51, 43, 35, 27, 19, 11, 3,
60, 52, 44, 36, 28, 20, 12, 4,
61, 53, 45, 37, 29, 21, 13, 5,
62, 54, 46, 38, 30, 22, 14, 6,
63, 55, 47, 39, 31, 23, 15, 7), x);
__m512 idmatrix = _mm512_set1_epi64(0x8040201008040201);
x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
return x;
}</pre>
<p>Now let's say that we want to do the inverse of that, going back from <tt>b,c,a</tt> to <tt>a,b,c</tt>. Again it's clear that we need a <tt>Q</tt>, but we have some choice now. We could start by inverting the <tt>c</tt> in the middle first:</p>
<div>
<tt>S8<sub>f1</sub>(b,c,a) = b,~c,a</tt><br/>
<tt>Q<sub>id</sub>(b,~c,a) = b,a,c</tt><br/>
<tt>S<sub>f2</sub>(b,a,c) = a,b,c</tt><br/>
</div>
<p>Which translates into code like this:</p>
<pre>__m512i Transpose64x8(__m512i x)
{
x = _mm512_shuffle_epi8(x, _mm512_setr_epi8(
7, 6, 5, 4, 3, 2, 1, 0,
15, 14, 13, 12, 11, 10, 9, 8,
23, 22, 21, 20, 19, 18, 17, 16,
31, 30, 29, 28, 27, 26, 25, 24,
39, 38, 37, 36, 35, 34, 33, 32,
47, 46, 45, 44, 43, 42, 41, 40,
55, 54, 53, 52, 51, 50, 49, 48,
63, 62, 61, 60, 59, 58, 57, 56));
__m512 idmatrix = _mm512_set1_epi64(0x8040201008040201);
x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
0, 8, 16, 24, 32, 40, 48, 56,
1, 9, 17, 25, 33, 41, 49, 57,
2, 10, 18, 26, 34, 42, 50, 58,
3, 11, 19, 27, 35, 43, 51, 59,
4, 12, 20, 28, 36, 44, 52, 60,
5, 13, 21, 29, 37, 45, 53, 61,
6, 14, 22, 30, 38, 46, 54, 62,
7, 15, 23, 31, 39, 47, 55, 63), x);
return x;
}</pre>
<p>Or we could start with a <tt>Q</tt> to get the <tt>a</tt> out of the third position, then use an <tt>S</tt> to swap the first and second positions and a <tt>P</tt> to invert <tt>c</tt> (in any order).</p>
<div>
<tt>Q<sub>id</sub>(b,c,a) = b,a,~c</tt><br/>
<tt>S<sub>f1</sub>(b,a,~c) = a,b,~c</tt><br/>
<tt>P<sub>f2</sub>(a,b,~c) = a,b,c</tt><br/>
</div>
<p>Which translates into code like this:</p>
<pre>__m512i Transpose64x8(__m512i x)
{
__m512 idmatrix = _mm512_set1_epi64(0x8040201008040201);
x = _mm512_gf2p8affine_epi64_epi8(idmatrix, x, 0);
x = _mm512_permutexvar_epi8(_mm512_setr_epi8(
0, 8, 16, 24, 32, 40, 48, 56,
1, 9, 17, 25, 33, 41, 49, 57,
2, 10, 18, 26, 34, 42, 50, 58,
3, 11, 19, 27, 35, 43, 51, 59,
4, 12, 20, 28, 36, 44, 52, 60,
5, 13, 21, 29, 37, 45, 53, 61,
6, 14, 22, 30, 38, 46, 54, 62,
7, 15, 23, 31, 39, 47, 55, 63), x);
x = _mm512_gf2p8affine_epi64_epi8(x, idmatrix, 0);
return x;
}</pre>
<p>I will probably keep using a SAT solver to solve the masks (using the same techniques as in <a href="https://bitmath.blogspot.com/2023/04/not-transposing-16x16-bitmatrix.html">(Not) transposing a 16x16 bitmatrix</a>), but now at least I have a proper way to think about the <i>shape</i> of the solution, which makes it a lot easier to ask a SAT solver to fill in the specifics.</p>
<p>This framework could be extended with other bit-permutation operatations such as QWORD rotates, but that quickly becomes tricky to think about.</p>
</div>
Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-74502350382132905862023-07-02T02:26:00.002-07:002023-08-01T09:28:13.401-07:00Propagating bounds through bitwise operations<style type="text/css">
.nobr {
white-space: nowrap;
}
blockquote {
background-color: #EEE;
clear: both;
}
blockquote footer {
background-color: #F5F5F5;
clear: both;
}
blockquote:before {
color: #ccc;
content: open-quote;
font-size: 4em;
line-height: 0.1em;
margin-right: 0.25em;
vertical-align: -0.4em;
}
img {
max-width: 100%;
max-height: 100%;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>This post is meant as a replacement/recap of some work that I did over a decade ago on propagating bounds through bitwise operations, which was intended as an improvement over the implementations given in Hacker's Delight chapter 4, Arithmetic Bounds.</p>
<p>The goal is, given two variables <tt>x</tt> and <tt>y</tt>, with known bounds <tt class="nobr">a ≤ x ≤ b</tt>, <tt class="nobr">c ≤ y ≤ d</tt>, compute the bounds of <tt class="nobr">x | y</tt> and of <tt class="nobr">x & y</tt>. Thanks to De Morgan, we have the equations (most also listed in Hacker's Delight, except the last one)</p>
<ul>
<li><tt class="nobr">minAND(a, b, c, d) = ~maxOR(~b, ~a, ~d, ~c)</tt></li>
<li><tt class="nobr">maxAND(a, b, c, d) = ~minOR(~b, ~a, ~d, ~c)</tt></li>
<li><tt class="nobr">minXOR(a, b, c, d) = minAND(a, b, ~d, ~c) | minAND(~b, ~a, c, d)</tt></li>
<li><tt class="nobr">maxXOR(a, b, c, d) = maxOR(a, b, c, d) & ~minAND(a, b, c, d)</tt></li>
</ul>
<p>Everything can be written in terms of only <tt>minOR</tt> and <tt>maxOR</tt> and some basic operations.</p>
<h2><tt>maxOR</tt></h2>
<p>To compute the upper bound of the OR of <tt>x</tt> and <tt>y</tt>, what we need to do is find is the leftmost bit (henceforth the "target bit") such that it is both:</p>
<ol>
<li>set in both <tt>b</tt> and <tt>d</tt> (the upper bounds of <tt>x</tt> and <tt>y</tt>) and,</li>
<li>changing an upper bound (either one of them, doesn't matter, but never both) by resetting the target bit and setting the bits that are less significant, keeps it greater-or-equal than the corresponding lower bound.</li>
</ol>
<p>The explanation of why that works can be found in Hacker's Delight, along with a more of less direct transcription into code, but we can do better than a direct transcription.</p>
<p>Finding the leftmost bit that passes only the first condition would be easy, its the highest set bit in <tt class="nobr">b & d</tt>. The second condition is a bit more complex to handle, but still surprisingly easy thanks to one simple observation: the bits that can pass it, are precisely those bits that are at (or to the right of) the leftmost bit where the upper and lower bound differ. Imagine two numbers in binary, one being the lower bound and the other the upper bound. The number have some equal prefix (possibly zero bits long, up to <i>all</i> bits) and then if they differ, they must differ by a bit in the upper bound being 1 while the corresponding bit in the lower bound is 0. Lowering the upper bound by resetting that bit while setting all bits the right of it, cannot make it lower than the lower bound.</p>
<p>For one of the inputs, say <tt>x</tt>, the position at which that second condition start being <i>false</i> (looking at that bit and to the left of it) can be computed directly with <tt class="nobr">64 - lzcnt(a ^ b)</tt>. We actually need the maximum of that across both pairs of bounds, but there's no need to compute that for both bounds and then take the maximum, we can use this to let the <tt>lzcnt</tt> find the maximum automatically: <tt class="nobr">64 - lzcnt((a ^ b) | (c ^ d))</tt>.</p>
<p><a href="https://www.felixcloutier.com/x86/bzhi.html"><tt>bzhi(m, k)</tt></a> is an operation that resets the bits in <tt>m</tt> starting at index <tt>k</tt>. It can be emulated by shifting or masking, but an advantage of <tt>bzhi</tt> is that it is well defined for any relevant <tt>k</tt>, including when <tt>k</tt> is equal to the size of the integer in bits. <tt>bzhi</tt> is not strictly required here, but it is more convenient than "classic" bitwise operations, and available on most x64 processors today<sup><a href="#note1">[1]</a></sup>. Using <tt>bzhi</tt>, it's simple to take the position calculated in the previous paragraph and reset all the bits in <tt class="nobr">b & d</tt> that do not pass the second condition: <tt class="nobr">bzhi(b & d, 64 - lzcnt((a ^ b) | (c ^ d)))</tt>.</p>
<p>With that bitmask in hand, all we need to do is apply it to one of the upper bounds. We can skip the "reset the target bit" part, since that bit will be set in the <i>other</i> upper bound and therefore also in the result. It also does not matter which upper bound is changed, regardless of which bound we were conceptually changing. Let's pick <tt>b</tt> for no particular reason. Then in total, the implementation could be:</p>
<pre>uint64_t maxOR(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
uint64_t index = 64 - _lzcnt_u64((a ^ b) | (c ^ d));
uint64_t candidates = _bzhi_u64(b & d, index);
if (candidates) {
uint64_t target = highestSetBit(candidates);
b |= target - 1;
}
return b | d;
}</pre>
<p>For the <tt>highestSetBit</tt> function you can choose any way you like to isolate the highest set bit in an integer.</p>
<h2><tt>minOR</tt></h2>
<p>Computing the lower bound of <tt class="nobr">x | y</tt> surprisingly seems to be more complex. The basic principles are similar, but this time bits are being reset in one of the lower bounds, and it <i>does</i> matter in which lower bound that happens. The computation of the mask of candidate bits also "splits" into separate candidates for each lower bound, unless there's some trick that I've missed. This whole "splitting" thing cannot be avoided by defining <tt>minOR</tt> in terms of <tt>maxAND</tt> either, because the same things happen there. But it's not too bad, a little bit of extra arithmetic. Anyway, let's see some code.</p>
<pre>uint64_t minOR(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
uint64_t candidatesa = _bzhi_u64(~a & c, 64 - _lzcnt_u64(a ^ b));
uint64_t candidatesc = _bzhi_u64(a & ~c, 64 - _lzcnt_u64(c ^ d));
uint64_t target = highestSetBit(candidatesa | candidatesc);
if (a & target) {
c &= -target;
}
if (c & target) {
a &= -target;
}
return a | c;
}</pre>
<p>A Fun Fact here is that the target bit cannot be set in both bounds, opposite to what happens in <tt>maxOR</tt> where the target bit is always set in both bounds. You may be tempted to turn the second <tt>if</tt> into <tt>else if</tt>, but in my tests it was quite important that the <tt>if</tt>s are compiled into conditional moves rather than branches (which of the lower bounds the target bit is found in was essentially random), and using <tt>else if</tt> here apparently discourages compilers (MSVC at least) from using conditional moves.</p>
<p><tt>candidatesa | candidatesc</tt> can be zero, although that is very rare, at least in my usage of the function. As written, the code assumes that <tt>highestSetBit</tt> deals with that gracefully by returning zero if its input is zero. Branching here is (unlike in the two <tt>if</tt>s at the end of <tt>minOR</tt>) not a big deal since this case is so rare (and therefore predictable).</p>
<h2>Conclusion</h2>
<p>In casually benchmarking these functions, I found them to be a bit faster than the ones that I came up with over a decade ago, and significantly faster than the ones from Hacker's Delight. That basic conclusion probably translates to different scenarios, but the exact ratios will vary a lot based on how predictable the branches are in that case, on your CPU, and on arbitrary codegen decisions made by your compiler.</p>
<p>In any case these new versions look nicer to me.</p>
<p>There are probably much simpler solutions if the bounds were stored in bit-reversed form, but that doesn't seem convenient.</p>
<p>Someone on a certain link aggregation site asked about signed integers. As Hacker's Delight explains via a table, things can go wrong if one (or both) bounds cross the negative/positive boundary - but the solution in those cases is still easy to compute. The way I see it, the basic problem is that a signed bound that crosses the negative/positive boundary effectively encodes two different unsigned intervals, one starting at zero and one ending at the greatest unsigned integer, and the basic unsigned <tt>minOR</tt> and so on cannot (by themselves) handle those "split" intervals.</p>
<hr>
<p id="note1">[1] Sadly not all, some low-end Intel processors have AVX disabled, which apparently is done by disabling the entire VEX encoding and it takes out BMI2 as collateral damage.</p>
</div>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-61235177564640760542023-06-12T09:02:00.002-07:002023-06-12T09:07:19.991-07:00Some ways to check whether an unsigned sum wraps<style type="text/css">
.nobr {
white-space: nowrap;
}
blockquote {
background-color: #EEE;
clear: both;
}
img {
max-width: 100%;
max-height: 100%;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>When computing <tt class="nobr">x + y</tt>, does the sum wrap? There are various ways to find out, some of them well known, some less. Some of these are probably totally unknown, in some cases deservedly so.</p>
<p>This is not meant to be an exhaustive list.</p>
<ul>
<li><h3><tt class="nobr">~x < y</tt></h3>
<p>A cute trick in case you don't want to compute the sum, for whatever reason.</p>
<p>Basically a variation of how <a href="https://wiki.sei.cmu.edu/confluence/display/c/INT30-C.+Ensure+that+unsigned+integer+operations+do+not+wrap">MISRA C</a> recommends checking for wrapping if you use the precondition test since <tt class="nobr">~x = UINT_MAX - x</tt>.</p>
</li>
<li><h3><tt class="nobr">(x + y) < x</tt></h3>
<h3><tt class="nobr">(x + y) < y</tt></h3>
<p>The Classic™. Useful for being cheap to compute: <tt class="nobr">x + y</tt> is often needed anyway, in which case this effectively only costs a comparison. Also recommended by MISRA C, in case you want to express the "has the addition wrapped"-test as a postcondition test.</p>
</li>
<li><h3><tt class="nobr">avg_down(x, y) <s 0</tt></h3>
<h3><tt class="nobr">avg_down(x, y) & 0x80000000 // adjust to integer size</tt></h3>
<p><tt class="nobr"><s</tt> is signed-less-than. Performed on unsigned integers here, so be it.</p>
<p><tt class="nobr">avg_down</tt> is the unsigned average rounded down. <tt class="nobr">avg_up</tt> is the unsigned average rounded up.</p>
<p>Since <tt class="nobr">avg_down</tt> is the sum (the full sum, without wrapping) shifted right by 1, what would have been the carry out of the top of the sum becomes the top bit of the average. So, checking the top bit of <tt class="nobr">avg_down(x, y)</tt> is equivalent to checking the carry out of <tt class="nobr">x + y</tt>.</p>
<p>Can be converted into <tt class="nobr">avg_up(~x, ~y) >=s 0</tt> through the equivalence <tt class="nobr">avg_down(x, y) = ~avg_up(~x, ~y)</tt>.</p>
</li>
<li><h3><tt class="nobr">(x + y) < min(x, y)</tt></h3>
<h3><tt class="nobr">(x + y) < max(x, y)</tt></h3>
<h3><tt class="nobr">(x + y) < avg(x, y)</tt></h3>
<h3><tt class="nobr">(x + y) < (x | y)</tt></h3>
<h3><tt class="nobr">(x + y) < (x & y)</tt></h3>
<h3><tt class="nobr">~(x | y) < (x & y)</tt></h3>
<p>Variants of The Classic™. They all work for <a href="https://bitmath.blogspot.com/2022/07/bit-level-commutativity-and-sum-gap.html">essentially the same reason</a>: addition is commutative, including at the bit-level. So if we have <tt class="nobr">(x + y) < x</tt>, then we also have <tt class="nobr">(x + y) < y</tt>, and together they imply that instead of putting <tt>x</tt> or <tt>y</tt> on the right hand side of the comparison, we could arbitrarily select one of them, or anything between them too. Bit-level commutativity takes care of the bottom three variants in a similar way.</p>
<p>Wait, is that a signed or unsigned <tt>min</tt>? Does <tt>avg</tt> round up or down or either way depending on the phase of the moon? It doesn't matter, all of those variants work and more.</p>
</li>
<li><h3><tt class="nobr">(x + y) != addus(x, y)</tt></h3>
<h3><tt class="nobr">(x + y) < addus(x, y)</tt></h3>
<p><tt class="nobr">addus</tt> is addition with unsigned saturation, meaning that instead of wrapping the result would be <tt>UINT_MAX</tt>.</p>
<p>When are normal addition and addition with unsigned saturation different? Precisely when one wraps and the other saturates. Wrapping addition cannot "wrap all the way back" to <tt>UINT_MAX</tt>, the highest result <i>when the addition wraps</i> is <tt class="nobr">UINT_MAX + UINT_MAX = UINT_MAX - 1</tt>.</p>
<p>When the normal sum and saturating sum are different, the normal sum must be the smaller of the two (it certainly couldn't be greater than <tt>UINT_MAX</tt>), hence the second variant.</p>
</li>
<li><h3><tt class="nobr">subus(y, ~x) != 0</tt></h3>
<h3><tt class="nobr">subus(x, ~y) != 0</tt></h3>
<h3><tt class="nobr">addus(~x, ~y) != UINT_MAX</tt></h3>
<p><tt class="nobr">subus</tt> is subtraction with unsigned saturation.</p>
<p>Strange variants of <tt class="nobr">~x < y</tt>. Since <tt class="nobr">subus(a, b)</tt> will be zero when <tt class="nobr">a <= b</tt>, it will be non-zero when <tt class="nobr">b < a</tt>, therefore <tt class="nobr">subus(y, ~x) != 0</tt> is equivalent to <tt class="nobr">~x < y</tt>.</p>
<p><tt class="nobr">subus(a, b) = ~addus(~a, b)</tt> lets us turn the <tt>subus</tt> variant into the <tt>addus</tt> variant.</p>
</li>
<li><h3><tt class="nobr">(x + y) < subus(y, ~x)</tt></h3>
<p>Looks like a cursed hybrid of <tt class="nobr">(x + y) < avg(x, y)</tt> and <tt class="nobr">subus(y, ~x) != 0</tt>, but the mechanism is (at least the way I see it) different from both of them.</p>
<p><tt class="nobr">subus(y, ~x)</tt> will be zero when <tt class="nobr">~x >= y</tt>, which is exactly when the sum <tt class="nobr">x + y</tt> would not wrap. <tt class="nobr">x + y</tt> certainly cannot be unsigned-less-than zero, so overall the condition <tt class="nobr">(x + y) < subus(y, ~x)</tt> must be false (which is good, it's supposed to be false when <tt class="nobr">x + y</tt> would not wrap).</p>
<p>In the other case, when <tt class="nobr">~x < y</tt>, we know that <tt class="nobr">x + y</tt> will wrap and <tt class="nobr">subus(y, ~x)</tt> won't be zero (and therefore cannot saturate). Perhaps there is a nicer way to show what happens, but at least under those conditions (predictable wrapping and no saturation) it is easy to do algebra:</p>
<ul style="list-style-type:none">
<li><tt class="nobr">(x + y) < subus(y, ~x)</tt></li>
<li><tt class="nobr">x + y - 2<sup>k</sup> < y - (2<sup>k</sup> - 1 - x)</tt></li>
<li><tt class="nobr">x + y - 2<sup>k</sup> < y - 2<sup>k</sup> + 1 + x</tt></li>
<li><tt class="nobr">x + y < y + 1 + x</tt></li>
<li><tt class="nobr">0 < 1</tt></li>
</ul>
<p>So the overall condition <tt class="nobr">(x + y) < subus(y, ~x)</tt> is true <i>IFF</i> <tt class="nobr">x + y</tt> wraps.</p>
</li>
<li><h3><tt class="nobr">~x < avg_up(~x, y)</tt></h3>
<p>Similar to <tt class="nobr">~x < y</tt>, but stranger. Averaging <tt>y</tt> with <tt class="nobr">~x</tt> cannot take a low <tt>y</tt> to above <tt class="nobr">~x</tt>, nor a high <tt>y</tt> to below <tt class="nobr">~x</tt>. The direction of rounding is important: <tt class="nobr">avg_down(~x, y)</tt> could take an <tt>y</tt> that's just one higher than <tt class="nobr">~x</tt> down to <tt class="nobr">~x</tt> itself, making it no longer higher than <tt class="nobr">~x</tt>. <tt class="nobr">avg_up(~x, y)</tt> cannot do that thanks to rounding up.</p>
</li>
</ul>
</div>
Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-17437326884631113852023-05-22T02:48:00.004-07:002023-05-22T23:48:28.925-07:00grevmul<style type="text/css">
.nobr {
white-space: nowrap;
}
blockquote {
background-color: #EEE;
clear: both;
}
blockquote footer {
background-color: #F5F5F5;
clear: both;
}
blockquote:before {
color: #ccc;
content: open-quote;
font-size: 4em;
line-height: 0.1em;
margin-right: 0.25em;
vertical-align: -0.4em;
}
img {
max-width: 100%;
max-height: 100%;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p><tt>grev</tt> (<a href="http://programming.sirrida.de/bit_perm.html#general_reverse_bits">generalized bit-reverse</a>) is an operation that implements bit-permutations corresponding to XOR-ing the indices by some value. It has been proposed to be part of the Zbp extension of RISC-V, with this reference implementation (source: <a href="https://github.com/riscv/riscv-bitmanip/releases/tag/v0.93">release v0.93</a>)</p>
<pre>uint32_t grev32(uint32_t rs1, uint32_t rs2)
{
uint32_t x = rs1;
int shamt = rs2 & 31;
if (shamt & 1) x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1);
if (shamt & 2) x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2);
if (shamt & 4) x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4);
if (shamt & 8) x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8);
if (shamt & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
return x;
}</pre>
<p><tt>grev</tt> looks in some ways similar to bit-shifts and rotates: the left and right operands have distinct roles with the right operand being a mask of <tt>k</tt> bits if the left operand has 2<sup>k</sup> bits<sup><a href="#note1">[1]</a></sup>.</p>
<p>Carry-less multiplication normally has a left-shift in it, <tt>grevmul</tt> is what you get when that left-shift is replaced with <tt>grev</tt>.</p>
<pre>uint32_t grevmul32(uint32_t x, uint32_t y)
{
uint32_t r = 0;
for (int k = 0; k < 32; k++) {
if (y & (1 << k))
r ^= grev32(x, k);
}
return x;
}</pre>
<p><tt>grevmul</tt> is, at its core, very similar to <tt>clmul</tt>: take single-bit products (logical AND) of every bit of the left operand with every bit of the right operand, then do some XOR-reduction. The difference is in which partial products are grouped together. For <tt>clmul</tt>, the partial products that contribute to bit <tt>k</tt> of the result are pairs with indices <tt>i,j</tt> such that <tt>i + j = k</tt>. For <tt>grevmul</tt>, it's the pairs with indices such that <tt>i ^ j = k</tt>. This goes back to <tt>grev</tt> permuting the bits by XOR-ing their indices by some value, and that value is <tt>k</tt> here.</p>
<p>Now that <tt>grevmul</tt> has been defined, let's look at some of its properties, comparing it to <tt>clmul</tt> and plain old <tt>imul</tt>.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-3z1b{border-color:#000000;text-align:right;vertical-align:top}
.tg .tg-73oq{border-color:#000000;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr>
<th class="tg-73oq"></th>
<th class="tg-3z1b">grevmul</th>
<th class="tg-3z1b">clmul</th>
<th class="tg-3z1b">imul</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-73oq">zero<sup><a href="#note2">[2]</a></sup></td>
<td class="tg-3z1b">0</td>
<td class="tg-3z1b">0</td>
<td class="tg-3z1b">0</td>
</tr>
<tr>
<td class="tg-73oq">identity</td>
<td class="tg-3z1b">1</td>
<td class="tg-3z1b">1</td>
<td class="tg-3z1b">1</td>
</tr>
<tr>
<td class="tg-73oq">commutative</td>
<td class="tg-3z1b">yes</td>
<td class="tg-3z1b">yes</td>
<td class="tg-3z1b">yes</td>
</tr>
<tr>
<td class="tg-73oq">associative</td>
<td class="tg-3z1b">yes</td>
<td class="tg-3z1b">yes</td>
<td class="tg-3z1b">yes</td>
</tr>
<tr>
<td class="tg-73oq">distributes over</td>
<td class="tg-3z1b">xor</td>
<td class="tg-3z1b">xor</td>
<td class="tg-3z1b">addition</td>
</tr>
<tr>
<td class="tg-73oq">op(x, 1 << k) is</td>
<td class="tg-3z1b">grev(x, k)</td>
<td class="tg-3z1b">x << k</td>
<td class="tg-3z1b">x << k</td>
</tr>
<tr>
<td class="tg-73oq">x has inverse if<br></td>
<td class="tg-3z1b">popcnt(x) & 1</td>
<td class="tg-3z1b">x & 1</td>
<td class="tg-3z1b">x & 1</td>
</tr>
<tr>
<td class="tg-73oq">op(x, x) is</td>
<td class="tg-3z1b">popcnt(x) & 1</td>
<td class="tg-3z1b"><a href="https://wunkolo.github.io/post/2020/05/pclmulqdq-tricks/#bit-spread">pdep(x, 0x55555555)</a><br></td>
<td class="tg-3z1b">x²</td>
</tr>
</tbody>
</table>
<h2 id="inverse">What is the "grevmul inverse" of <tt>x</tt>?</h2>
<p>Time for some algebra. Looking just at the table above, and forgetting the actual definition of <tt>grevmul</tt>, can we say something about the solutions of <tt>grevmul(x, y) == 1</tt>? Surprisingly, yes.</p>
<p>Assuming we have some <tt>x</tt> with odd parity (numbers with even parity do not have inverses, so let's ignore them for now), we know that <tt>grevmul(x, x) == 1</tt>. The <a href="https://proofwiki.org/wiki/Inverse_in_Monoid_is_Unique">inverse in a monoid is unique</a> so <tt>x</tt> is not just some inverse of <tt>x</tt>, it is <i>the</i> (unique) inverse of <tt>x</tt>.</p>
<p>Since the "addition operator" is XOR (for which negation is the identity function), this is a non-trivial example of a ring in which <tt class="nobr">x = -x = x<sup>-1</sup></tt>, when <tt class="nobr">x<sup>-1</sup></tt> exists. Strange, isn't it?</p>
<p>We also have that <tt>f(x) = grevmul(x, c)</tt> (for appropriate choices of <tt>c</tt>) is a (non-trivial) involution, so it may be a contenter for the "middle operation" of an <a href="http://marc-b-reynolds.github.io/math/2019/08/20/InvFinalizer.html">involutary bit finalizer</a>, but probably useless without an efficient implementation.</p>
<p>I was going to write about implementing <tt>grevmul</tt> by an 8-bit constant with two <tt>GF2P8AFFINEQB</tt>s but I've had enough for now, maybe later.</p>
<hr>
<p id="note1">[1] The right operand of a shift is often called the shift <i>count</i>, but it can also be interpreted as a mask indicating some subset of shift-by-2<sup>i</sup> operations to perform. That interpretation is useful for example when implementing a shift-by-variable operation on a machine that only has a shift-by-constant instruction, following the same pattern as the reference implementation of <tt>grev32</tt>.</p>
<p id="note2">[2] This looks like a joke, but I mean that the numeric value 0 acts as the zero element of the corresponding semigroup. </p>
</div>
Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-64198561125188993572023-04-12T17:11:00.006-07:002023-04-13T01:09:21.252-07:00(Not) transposing a 16x16 bitmatrix<style type="text/css">
.nobr {
white-space: nowrap;
}
blockquote {
background-color: #EEE;
clear: both;
}
blockquote footer {
background-color: #F5F5F5;
clear: both;
}
blockquote:before {
color: #ccc;
content: open-quote;
font-size: 4em;
line-height: 0.1em;
margin-right: 0.25em;
vertical-align: -0.4em;
}
img {
max-width: 100%;
max-height: 100%;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>Inverting a 16-element permutation may done like this:</p>
<pre>for (int i = 0; i < 16; i++)
inv[perm[i]] = i;</pre>
<p>Computing a histogram of 16 nibbles may done like this:</p>
<pre>for (int i = 0; i < 16; i++)
hist[data[i]] += 1;</pre>
<p>These different-sounding but already similar-looking tasks have something in common: they can be both be built around a 16x16 bitmatrix transpose. That sounds silly, why would anyone want to first construct a 16x16 bitmatrix, transpose it, and then do yet more processing to turn the resulting bitmatrix back into an array of numbers?</p>
<p>Because it turns out to be an efficiently-implementable operation, on some modern processors anyway.</p>
<p>If you know anything about the <a href="https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0">off-label application of <tt>GF2P8AFFINEQB</tt></a>, you may already suspect that it will be involved somehow (merely left-<tt>GF2P8AFFINEQB</tt>-ing by the identity matrix already results in some sort of 8x8 transpose, just horizontally mirrored), and it will be, but that's not the whole story.</p>
<p>First I will show not only how to do it with <tt>GF2P8AFFINEQB</tt>, but also how to find that solution programmatically using a SAT solver. There is nothing that fundamentally prevents a human from finding a solution by hand, but it seems difficult. Using a SAT solver to find a solution <i>ex nihilo</i> (requiring it to find both a sequence of instructions and their operands) is not that easy either (though that technique also exists). Thankfully, Geoff Langdale suggested a promising sequence of instructions:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">nibble value, so now we have 16 bitfields with 1 bit set. <br><br>At this point, what we really want is a 16x16 transpose, not 4 8x8 transposes, but I'm pretty sure we can fake it by using VPERMB to redistribute our bytes (probably first grouping all top 8 bytes into the first 128b ...</p>— Geoff Langdale (@geofflangdale) <a href="https://twitter.com/geofflangdale/status/1611662757665050625?ref_src=twsrc%5Etfw">January 7, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>The problem we have now (which the SAT solver will solve) is, under the constraint that for all <tt>X</tt>, <tt>f(X) = PERMB(GF2P8AFFINE(B, PERMB(X, A)), C)</tt> computes the transpose of <tt>X</tt>, what is a possible valuation of the variables <tt>A, B, C</tt>. Note that the variables in the SAT problem correspond to constants in the resulting code, and the variable in the resulting code (<tt>X</tt>) is quantified out of the problem.</p>
<p>If you know a bit about SAT solving, that "for all <tt>X</tt>" sounds like trouble, requiring either creating a set of constraints for every possible value of <tt>X</tt> (henceforth, concrete values of <tt>X</tt> will be known as "examples"), or some advanced technique such as CEGIS to dynamically discover a smaller set of examples to base the constraints on. Luckily, since we are dealing with a bit-permutation, there are simple and small sets of examples that together sufficiently constrain the problem. For a 16-bit permutation, this set of values could be used:</p>
<ul>
<li><tt>1010101010101010</tt></li>
<li><tt>1100110011001100</tt></li>
<li><tt>1111000011110000</tt></li>
<li><tt>1111111100000000</tt></li>
</ul>
<p>For a 256-bit permutation, a similar pattern can be used, where each of the examples has 256 bits and there would be 8 of them. Note that if you read the <i>columns</i> of the values, they list out the indices of the corresponding columns, which is no coincidence. Using that set of examples to constrain the problem with, essentially means that we assert that <tt>f</tt> when applied to the sequence 0..n-1 must result in the desired permutation. The way that I actually implemented this puts a column into one "abstract bit", so that it represents the index of the bit all in one place instead of spread out.</p>
<p>Implementing a "left <tt>GF2P8AFFINEQB</tt>" (multiplying a constant matrix on the left by a variable matrix on the right) in <a href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">CNF</a>, operating on "abstract bits" (8 variables each), is relatively straight forward. Every (abstract) bit of the result is the XOR of the AND of some (abstract) bits, writing that down is mostly a chore, but there is one interesting aspect: the XOR can be turned into an OR, since we know that we're multiplying by a permutation matrix. In CNF, OR is simpler than XOR, and easier for the solver to reason through.</p>
<p><tt>VPERMB</tt> is more difficult to implement, given that the permutation operand is a variable (if it was a constant, we could just permute the abstract bits without generating any new constraints). To make it easier, I represent the permutation operand as a 32x32 permutation matrix, letting me create a bunch of simple ternary constraints of the form <tt>(¬P(i, j) ∨ ¬A(j) ∨ R(i)) ∧ (¬P(i, j) ∨ A(j) ∨ ¬R(i))</tt> (read: if <tt>P(i, j)</tt>, then <tt>A(j)</tt> must be equal to <tt>R(i)</tt>). The same thing can be used to implement <tt>VPSHUFB</tt>, with additional constraints on the permutation matrix (to prevent cross-slice movement).</p>
<p>Running <a href= "https://gitlab.com/haroldaptroot/transposesolver16x16/-/blob/main/TransposeSolver16x16.cpp">that code</a>, at least on my PC at this time<a href="#note1"><sup>[1]</sup></a>, results in (with some whitespace manually added):</p>
<pre>__m256i t0 = _mm256_permutexvar_epi8(_mm256_setr_epi8(
14, 12, 10, 8, 6, 4, 2, 0,
30, 28, 26, 24, 22, 20, 18, 16,
15, 13, 11, 9, 7, 5, 3, 1,
31, 29, 27, 25, 23, 21, 19, 17), input);
__m256i t1 = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x1080084004200201), t0, 0);
__m256i t2 = _mm256_shuffle_epi8(t1, _mm256_setr_epi8(
0, 8, 1, 9, 3, 11, 5, 13,
7, 15, 2, 10, 4, 12, 6, 14,
0, 8, 1, 9, 3, 11, 5, 13,
7, 15, 2, 10, 4, 12, 6, 14));</pre>
<p>So that's it. That's the answer<a href="#note2"><sup>[2]</sup></a>. If you want to transpose a 16x16 bitmatrix, on a modern PC (this code requires AVX512_VBMI and AVX512_GFNI<a href="#note3"><sup>[3]</sup></a>), it's fairly easy and cheap, it's just not so easy to find this solution to begin with.</p>
<p>Using this transpose to invert a 16-element permutation is pretty easy, for example using <tt>_mm256_sllv_epi16</tt> to construct the matrix and <tt>_mm256_popcnt_epi16(_mm256_sub_epi16(t2, _mm256_set1_epi16(1)))</tt> (sadly there is no SIMD version of <tt>TZCNT</tt> .. yet) to convert the bit-masks back into indices. It may be tempting to try to use a mirrored matrix and leading-zero count, which AVX512 does offer, but it only offers the DWORD and QWORD versions <tt>VPLZCNTD/Q</tt>.</p>
<p>Making a histogram is even simpler, using only <tt>_mm256_popcnt_epi16(t2)</tt> to convert the matrix into counts. </p>
<h2>And for my next trick, I will now <i>not</i> transpose the matrix</h2>
<p>What if we didn't transpose that matrix. Does that even make sense? Well, at least for the two applications that I focused on, what we really need is not so much the transpose of the matrix, but any matrix such that:</p>
<ol>
<li>Every bit of the original matrix occurs exactly once in the result.</li>
<li>Each row of the result contains all bits from a particular column.</li>
<li>The permutation within each row is "regular" enough that we can work with it. We don't need this when making a histogram (as Geoff already noted in one of his tweets).</li>
</ol>
<p>There is no particular requirement on the order of the rows, any row-permutation we end up with is easy to undo.</p>
<p>The first two constraints leave plenty of options open, but the last constraint is quite vague. Too vague for me to do something such as searching for the best not-quite-transpose, so I don't promise to have found it. But here is <i>a</i> solution: rotate every row by its index, then rotate every column by its index.</p>
<p>At least, that's the starting point. Rotating the columns requires 3 rounds of blending a vector with cross-slice-permuted copy of that vector, and a <tt>VPERMQ</tt> sandwiched by two <tt>VPSHUFB</tt>s to rotate the last 8 columns by 8. That's a lot of cross-slice permuting, most of it can be avoided by modifying the overall permutation slightly:</p>
<ol>
<li>Exchange the off-diagonal quadrants.</li>
<li>Rotate each row by its index.</li>
<li>For each quadrant individually, rotate each column by its index.</li>
</ol>
<p>Here is some attempt at illustrating that process, feel free to <a href="#afterimages">skip past it</a></p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfdPsR2jD09_YFXo49BXxK6TKn2yEnEiZXOQ8YniaiIXlIsd-hAK43pmt7OMl1iKLkYO9WrnBsP-GJGm-O6RxKvpxxHNby8w7OFpD1Sl9wBmKmpTwJSr5KMb8YKq8omaY9rtoLV4OW0ZPBf761pk17DnglJ2GwmzVhYVE1iRbz9pP0PdGCSoKUxBIG/s1600/M1.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="513" data-original-width="513" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfdPsR2jD09_YFXo49BXxK6TKn2yEnEiZXOQ8YniaiIXlIsd-hAK43pmt7OMl1iKLkYO9WrnBsP-GJGm-O6RxKvpxxHNby8w7OFpD1Sl9wBmKmpTwJSr5KMb8YKq8omaY9rtoLV4OW0ZPBf761pk17DnglJ2GwmzVhYVE1iRbz9pP0PdGCSoKUxBIG/s1600/M1.png"/></a></div><div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRoWeFTX8UNGGnz6DyeDuEv-AgW4iDwLXn4di-QRBGh71XIcJwgI1m5YFe5SfvSlzVW77uJ-A632DMwtEEFv11JXySdl7DF4aAbRCApsaHipP5pVeDKBet6hylsCiOnp8d7gmZl9cdYC2AZGuz6FZS07hdKYiyr7CpaheQ1NXsHDQAuJ1a746P0I6Y/s1600/M2.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="513" data-original-width="513" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRoWeFTX8UNGGnz6DyeDuEv-AgW4iDwLXn4di-QRBGh71XIcJwgI1m5YFe5SfvSlzVW77uJ-A632DMwtEEFv11JXySdl7DF4aAbRCApsaHipP5pVeDKBet6hylsCiOnp8d7gmZl9cdYC2AZGuz6FZS07hdKYiyr7CpaheQ1NXsHDQAuJ1a746P0I6Y/s1600/M2.png"/></a></div><div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt7g_o74IX00R9j3PWjdT3BftUpayzauVbymZHhKd0aiLT8lVlMYM7Jf1vwXLb5NGTeU0wunQyjLFq5dtXioNBW_SNp2bnYCJPmaEvwRh4IFyR_Nk3JiUGd3dMWwb9497zUrAxU9xTD7pEcst7THrCjDPlcgPiuygcBF0uVoNEQYF89fGVrzWfsa8r/s1600/M3.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="513" data-original-width="513" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt7g_o74IX00R9j3PWjdT3BftUpayzauVbymZHhKd0aiLT8lVlMYM7Jf1vwXLb5NGTeU0wunQyjLFq5dtXioNBW_SNp2bnYCJPmaEvwRh4IFyR_Nk3JiUGd3dMWwb9497zUrAxU9xTD7pEcst7THrCjDPlcgPiuygcBF0uVoNEQYF89fGVrzWfsa8r/s1600/M3.png"/></a></div><div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5scRDLbvJwPVxF5nZFK6vbGINKnWmcskh8Y1jhfBmWKyQK0VVZpjle2P8b6nsLQDZjUsExjOTYgyTD5vhFpxyFtUAZHDjvJUvDqR0yNbdZP6WtOlKgux-iVbADLoN2XTzUsEmmZ-9cWXsMw8y-ybNRWYTsDhfA0HlQuNJ8-j6ttWWC-joUZduOayq/s1600/M4.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="513" data-original-width="513" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5scRDLbvJwPVxF5nZFK6vbGINKnWmcskh8Y1jhfBmWKyQK0VVZpjle2P8b6nsLQDZjUsExjOTYgyTD5vhFpxyFtUAZHDjvJUvDqR0yNbdZP6WtOlKgux-iVbADLoN2XTzUsEmmZ-9cWXsMw8y-ybNRWYTsDhfA0HlQuNJ8-j6ttWWC-joUZduOayq/s1600/M4.png"/></a></div>
<p id="aferimages">These three steps are implementable in AVX2:</p>
<ol>
<li>Exchanging the off-diagonal quadrants can be done by gathering the quadrants into QWORDs, permuting them, and shuffling the QWORDs back into quadrants.</li>
<li>Rotating the rows can be done with <tt>VPMULLW</tt> (used as a variable shift-left), <tt>VPMULHUW</tt> (used as a variable shift-right), and <tt>VPOR</tt>.</li>
<li>Rotating the columns can be done by conditionally rotating the columns with odd indices by 1, conditionally rotating the columns that have the second bit of their index set by 2, and conditionally rotating the columns that have the third bit of their index set by 4. The rotations can be done using <tt>VPALIGNR</tt><a href="#note4"><sup>[4]</sup></a>, the conditionality can be implemented with blending, but since this needs to be bit-granular blend, it cannot be performed using <tt>VPBLENDVB</tt>.</li>
</ol>
<p>In total, here is how I <i>don't</i> transpose a 16x16 matrix with AVX2, hopefully there is a better way:</p>
<pre>__m256i nottranspose16x16(__m256i x)
{
// exchange off-diagonal quadrants
x = _mm256_shuffle_epi8(x, _mm256_setr_epi8(
0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15,
0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15));
x = _mm256_permutex_epi64(x, _MM_SHUFFLE(3, 1, 2, 0));
x = _mm256_shuffle_epi8(x, _mm256_setr_epi8(
0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15,
0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15));
// rotate every row by its y coordinate
__m256i shifts = _mm256_setr_epi16(
1 << 0, 1 << 1, 1 << 2, 1 << 3,
1 << 4, 1 << 5, 1 << 6, 1 << 7,
1 << 8, 1 << 9, 1 << 10, 1 << 11,
1 << 12, 1 << 13, 1 << 14, 1 << 15);
__m256i sll = _mm256_mullo_epi16(x, shifts);
__m256i srl = _mm256_mulhi_epu16(x, shifts);
x = _mm256_or_si256(sll, srl);
// within each quadrant independently,
// rotate every column by its x coordinate
__m256i x0, x1, m;
// rotate by 4
m = _mm256_set1_epi8(0x0F);
x0 = _mm256_and_si256(x, m);
x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 8));
x = _mm256_or_si256(x0, x1);
// rotate by 2
m = _mm256_set1_epi8(0x33);
x0 = _mm256_and_si256(x, m);
x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 4));
x = _mm256_or_si256(x0, x1);
// rotate by 1
m = _mm256_set1_epi8(0x55);
x0 = _mm256_and_si256(x, m);
x1 = _mm256_andnot_si256(m, _mm256_alignr_epi8(x, x, 2));
x = _mm256_or_si256(x0, x1);
return x;
}</pre>
<p>Using that not-transpose to invert a 16-element permutation takes some extra steps that, without AVX512, are about as annoying as not-transposing the matrix was.</p>
<ul>
<li>Constructing the matrix is more difficult. AVX2 has shift-by-variable, but not for 16-bit element.<a href="#note5"><sup>[5]</sup></a> There are various work-arounds, such as using DWORDs and then narrowing, of course (boring). Another (funnier) option is to duplicate every byte, add 0xF878 to every word, then use <tt>VPSHUFB</tt> in lookup-table-mode to index into a table of powers of two. Having added 0x78 to every low byte of every word, that byte will mapped to zero if it was 8 or higher, or otherwise two to the power of that byte. The high byte, having 0xF8 added to it, will be mapped to 0 if it was below 8, or otherwise to two to the power of that byte minus 8. As wild as that sounds, it is pretty fast, costing only 5 cheap instructions (whereas widening to DWORDs, shifting, and narrowing, would be worse than it sounds). Perhaps there is a better way.</li>
<li>Converting masks back into indices is more difficult due to the lack of trailing zero count, leading zero count, or even popcount. What AVX2 does have, is .. <tt>VPSHUFB</tt> again. We can multiply by an order-4 de Bruijn sequence and use <tt>VPSHUFB</tt> to map the results to the indices of the set bits.</li>
<li>Then we have indices, but since the rows and columns were somewhat arbitrarily permuted, they must still be mapped back into something that makes sense. Fortunately that's no big deal, a modular subtraction (or addition, same thing really) cancels out the row-rotations, and yet another <tt>VPSHUFB</tt> cancels out the strange order that the rows are in. Fun detail: the constants that are subtracted and the permutation are both <tt>0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9</tt>.</li>
</ul>
<p>All put together:</p>
<pre>void invert_permutation_avx2(uint8_t *p, uint8_t *inv)
{
__m256i v = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i*)p));
// indexes to masks
v = _mm256_or_si256(v, _mm256_slli_epi64(v, 8));
v = _mm256_add_epi8(v, _mm256_set1_epi16(0xF878));
__m256i m = _mm256_shuffle_epi8(_mm256_setr_epi8(
1, 2, 4, 8, 16, 32, 64, 128,
1, 2, 4, 8, 16, 32, 64, 128,
1, 2, 4, 8, 16, 32, 64, 128,
1, 2, 4, 8, 16, 32, 64, 128), v);
// ???
m = nottranspose16x16(m);
// masks to indexes
__m256i deBruijn = _mm256_and_si256(_mm256_mulhi_epu16(m, _mm256_set1_epi16(0x9AF0)), _mm256_set1_epi16(0x000F));
__m128i i = _mm_packs_epi16(_mm256_castsi256_si128(deBruijn), _mm256_extracti128_si256(deBruijn, 1));
i = _mm_shuffle_epi8(_mm_setr_epi8(
0, 1, 2, 5, 3, 9, 6, 11, 15, 4, 8, 10, 14, 7, 13, 12), i);
// un-mess-up the indexes
i = _mm_sub_epi8(i, _mm_setr_epi8(0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9));
i = _mm_and_si128(i, _mm_set1_epi8(0x0F));
i = _mm_shuffle_epi8(i, _mm_setr_epi8(0, 7, 6, 5, 4, 3, 2, 1, 8, 15, 14, 13, 12, 11, 10, 9));
_mm_storeu_si128((__m128i*)inv, i);
}</pre>
<p>To make a histogram, emulate <tt>VPOPCNTW</tt> <a href="http://0x80.pl/articles/sse-popcount.html">using, you guessed it, <tt>PSHUFB</tt></a>.</p>
<h2>The end</h2>
<p>This post is, I think, one of the many examples of how AVX512 can be an enormous improvement compared to AVX2 even when <i>not</i> using 512-bit vectors. Every step of every problem had a simple solution in AVX512 (even if it was not always easy to find it). With AVX2, everything felt "only barely possible".</p>
<p>"As complicated as it is, is this actually faster than scalar code?" Yes actually, but feel free to benchmark it yourself. The AVX2 version being somewhat more efficient than scalar code is not really the point of this post anyway. The AVX512 version is nice and efficient, I'm showing an AVX2 version mostly to show how hard it is to create it.<a href="#note6"><sup>[6]</sup></a></p>
<p>Transposing larger matrices with AVX512 can be done by first doing some quadrant-swapping (also used at the start of the not-transpose) until the bits that need to end up together in one 512-bit block are all in there, and then a <tt>VPERMB, VGF2P8AFFINEQB, VPERMB</tt> sequence with the right constants (which can be found using the techniques that I described) can put the bits in their final positions. But well, I <a href="https://gist.github.com/IJzerbaard/61d145c55143d56409a158350271e127">already did that</a>, so there you go.</p>
<p>A proper transpose can be done in AVX2 of course, for example using 4 rounds of quadrant-swapping. Implementations of that already exist so I thought that would be boring to talk about, but there is an interesting aspect to that technique that is often not mentioned: every round of quadrant-swapping can be seen as <a href="http://programming.sirrida.de/bit_perm.html#bit_index">exchanging two bits of the indices</a>. Swapping the big 8x8 quadrants swaps bits 3 and 7 of the indices, transposing the 2x2 submatrices swaps bits 0 and 4 of the indices. From that point of view, it's easy to see that the order in which the four steps are performed does not matter - no matter the order, the lower nibble of the index is swapped with the higher nibble of the index.</p>
<hr>
<p id="note1">[1] While MiniSAT (which this program uses as its SAT solver) is a "deterministic solver" in the sense of definitely finding a satifying valuation if there is one, it is not deterministic in the sense of guaranteeing that the same satisfying valuation is found every time the solver is run on the same input.</p>
<p id="note2">[2] Not the unique answer, there are multiple solutions.</p>
<p id="note3">[3] But not 512-bit vectors.</p>
<p id="note4">[4] Nice! It's not common to see a 256-bit <tt>VPALIGNR</tt> being useful, due to it not being the natural widening of 128-bit <tt>PALIGNR</tt>, but acting more like two <tt>PALIGNR</tt>s side-by-side (with the same shifting distance).</p>
<p id="note5">[5] Intel, why do you keep doing this.</p>
<p id="note6">[6] Also as an excuse to use <tt>PSHUFB</tt> for everything.</p>
</div>
Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-40075877436931549402023-04-09T16:54:00.003-07:002023-04-12T00:13:20.656-07:00Column-oriented row reduction<style type="text/css">
.nobr {
white-space: nowrap;
}
blockquote {
background-color: #EEE;
clear: both;
}
blockquote footer {
background-color: #F5F5F5;
clear: both;
}
blockquote:before {
color: #ccc;
content: open-quote;
font-size: 4em;
line-height: 0.1em;
margin-right: 0.25em;
vertical-align: -0.4em;
}
img {
max-width: 100%;
max-height: 100%;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<blockquote cite="https://en.wikipedia.org/wiki/Gaussian_elimination">In mathematics, Gaussian elimination, also known as row reduction, is an algorithm for solving systems of linear equations.
<footer>-- The Wikipedia entry of <a href="https://en.wikipedia.org/wiki/Gaussian_elimination">Gaussian elimination</a></footer></blockquote>
<p>The usual implementation of row reduction of a bit-matrix<a href="#note1"><sup>[1]</sup></a> takes an array of bit-packed rows, and applies row operations to the matrix in a row-wise manner. As a reminder, it works something like this:</p>
<ol>
<li>Find a row with a non-zero element<a href="#note2"><sup>[2]</sup></a> in the current column.</li>
<li>Swap it into position.</li>
<li>XOR that row into any other row that has a non-zero element in the current column.</li>
<li>Go to step 1 for the next column, if there is one.</li>
</ol>
<p>For the application that I had mind when writing this blog post<a href="#note3"><sup>[3]</sup></a>, step 2 is not necessary and will be skipped.<a href="#note4"><sup>[4]</sup></a></p>
<p>Note that at the single-bit-level, the XOR-logic of step 3 comes down to <tt>if M(i, j) and M(j, k) then flip M(i, k)</tt>. That is, in order for the XOR to flip <tt>M(i, k)</tt>, the corresponding row must have a 1 in the current column (condition A), and the row that was selected to pivot on must have a 1 in the column <tt>k</tt> (condition B).</p>
<h2>Turning it into a column-oriented algorithm</h2>
<p>In the row-oriented algorithm, condition A is handled by conditionally XOR-ing into only those rows that have a 1 in the current column, and condition B is handled by the XOR itself (which is a conditional bit-flip afterall). A column-oriented algorithm would do it the other way around, using XOR to implement condition B, and skipping columns for which condition A is false:</p>
<ol>
<li>Find a row that hasn't already been pivoted on, with a non-zero element in the current column.</li>
<li><i>Don't</i> swap it into position.</li>
<li>XOR each column that has a 1 in the row that we're pivoting on, with a mask formed by taking the current column and resetting the bit that corresponds to the pivot row.</li>
<li>If there are any rows left that haven't been pivoted on, go to step 1 for the next column.</li>
</ol>
<p>Step 1 may sound somewhat annoying, but it is really simple: <tt>AND</tt> the current column with a mask of the rows that have already been pivoted on, <tt>had</tt>, and use the old Isolate Lowest Set Bit<a href="#note5"><sup>[5]</sup></a> trick <tt>x & -x</tt> to extract a mask corresponding to the first row that passes both conditions.</p>
<p>Here is an example of what an implementation might look like (shown in C#, or maybe this is Java code with a couple of capitalization errors)</p>
<pre>static void reduceRows(long[] M)
{
long had = 0;
for (int i = 0; i < M.Length; i++)
{
long r = lowestSetBit(M[i] & ~had);
if (r == 0)
continue;
long mask = M[i] & ~r;
for (int j = 0; j < M.Length; j++)
{
if ((M[j] & r) != 0)
M[j] ^= mask;
}
had |= r;
}
}
static long lowestSetBit(long x)
{
return x & -x;
}</pre>
<h2>The stuff that goes at the end of the post</h2>
<p>While the column-oriented approach makes the step "search for a row that has a 1 in the current column" easy, transposing a matrix just to be able to do this does not seem worthwhile to me.</p>
<p>Despite the <tt>if</tt> inside the inner loops of both the row-oriented and the column-oriented algorithm, both of them can be accellerated using SIMD. The conditional code can easily be rewritten into arithmetic-based masking<a href="#note6"><sup>[6]</sup></a> or "real" masked operations (in eg AVX512). "But don't compilers autovectorize these days?" <a href="https://godbolt.org/z/eb3nzzqnr">Yes they do, but not always in a good way</a>, note that both GCC and Clang both used a masked store, which is worse than writing back some unchanged values with a normal store (especially on AMD processors, and also especially considering that the stored values will be loaded again soon). Rumour has it that there are Reasons™ for that quirk, but that doesn't improve the outcome.</p>
<hr>
<p id="note1">[1] Of course this blog post is not about floating point matrices, this is the bitmath blog.</p>
<p id="note2">[2] Sometimes known as 1 (one).</p>
<p id="note3">[3] XOR-clause simplification</p>
<p id="note4">[4] Row-swaps are particularly expensive to apply to a column-major bit-matrix, so perhaps this is just an excuse to justify the column-based approach. You decide.</p>
<p id="note5">[5] AKA BLSI, which AMD describes as "Isolate Lowest Set Bit", which makes sense, because well, that is what the operation does. Meanwhile Intel writes "Extract Lowest Set Isolated Bit" (isolated bit? what?) in their manuals.</p>
<p id="note6">[6] Arithmetic-based masking can also be used in scalar code.</p>
</div>
Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-85533500201869628532023-01-03T08:02:00.002-08:002023-04-30T18:30:29.102-07:00Weighted popcnt<style type="text/css">
.nobr {
white-space: nowrap;
}
blockquote {
background-color: #EEE;
clear: both;
}
img {
max-width: 100%;
max-height: 100%;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>Since I showed the code below on Twitter, and some people understood it and some didn't, I suppose I should explain how it works and what the more general technique is.</p>
<pre>// sum of indexes of set bits
int A073642(uint64_t n)
{
return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
(__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
(__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
(__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
(__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
(__popcnt64(n & 0xFFFFFFFF00000000) << 5);
}</pre>
<p>A sum of several multi-digit numbers, such as <tt>12 + 34</tt>, can be rewritten to make the place-value notation explicit, giving <tt>1*10 + 2*1 + 3*10 + 4*1</tt>. Then the distributive property of multiplication can be used to group digits of equal place-value together, giving <tt>(1 + 3)*10 + (2 + 4)*1</tt>. I don't bring up something so basic as an insult to the reader, the basis of that possibly-intimidating piece of code really is this basic.</p>
<p>The masks 0xAAAAAAAAAAAAAAAA, 0xCCCCCCCCCCCCCCCC, etc, represent nothing more than the numbers 0..63, each written vertically in binary, starting with the least significant bit. Each column of bits contains its own index. AND-ing the masks with <tt>n</tt> leaves, in each column, either zero (if the corresponding bit of <tt>n</tt> was zero) or the index of that column.</p>
<p>The first <tt>popcnt</tt> then adds together all the digits of the "filtered" indexes with a place-value of 1, the second <tt>popcnt</tt> adds together the digits with a place-value of 2, and so on. The result of each <tt>popcnt</tt> is shifted to match the corresponding place-value, and all of them are added together.</p>
<h2>Generalizing the technique</h2>
<p>The numbers in the columns did not have to be 0..63, the columns can contain arbitrary numbers, even negative numbers. This gives some sort of weighted <tt>popcnt</tt>: rather than each bit having a weight of 1, we are free to choose an arbitrary weight for each bit.</p>
<p>In general, starting with some arbitrary numbers, interpret them as a binary matrix. Transpose that matrix. Every row of the transposed matrix is the mask that goes into the corresponding <tt>popcnt</tt>-step.</p>
<p>Some simplifications may be possible:</p>
<ul>
<li>If a row of the matrix is zero, the corresponding step can be left out.</li>
<li>If two or more rows are equal, their <tt>popcnt</tt>-steps can be merged into one with a weight that is the sum of the weights of the steps that are merged.</li>
<li>If a row is a linear combination of some set of other rows, the <tt>popcnt</tt> corresponding to that row can be computed as a linear combination of the <tt>popcnt</tt>s corresponding to those other rows.</li>
<li>If a row has exactly one bit set, its step can be implemented with an AND and a shift, without <tt>popcnt</tt>, by shifting the bit to its target position.</li>
<li>If <i>every</i> row has exactly one bit set, we're really dealing with a bit-permutation and there are better ways to implement them.</li>
</ul>
<p>For example, let's construct "the sum of squares of indexes of set bits" (with 1-indexed indexes) aka <a href="https://oeis.org/A003995">oeis.org/A003995</a>. Transposing the squares 1,2,4,9..4096 gives these masks:</p>
<pre>0x5555555555555555
0x0000000000000000
0x2222222222222222
0x1414141414141414
0x0d580d580d580d58
0x0335566003355660
0x00f332d555a66780
0x555a5b6666387800
0x66639c78783f8000
0x787c1f807fc00000
0x7f801fff80000000
0x7fffe00000000000
0x8000000000000000
</pre>
<p>The second mask is zero because a square cannot be congruent to 2 or 3 modulo 4. The last mask has exactly one bit set, so its step can be implemented without <tt>popcnt</tt>. Putting it together:</p>
<pre>int A003995(uint64_t n)
{
return __popcnt64(n & 0x5555555555555555) +
(__popcnt64(n & 0x2222222222222222) << 2) +
(__popcnt64(n & 0x1414141414141414) << 3) +
(__popcnt64(n & 0x0d580d580d580d58) << 4) +
(__popcnt64(n & 0x0335566003355660) << 5) +
(__popcnt64(n & 0x00f332d555a66780) << 6) +
(__popcnt64(n & 0x555a5b6666387800) << 7) +
(__popcnt64(n & 0x66639c78783f8000) << 8) +
(__popcnt64(n & 0x787c1f807fc00000) << 9) +
(__popcnt64(n & 0x7f801fff80000000) << 10) +
(__popcnt64(n & 0x7fffe00000000000) << 11) +
((n & 0x8000000000000000) >> 51);
}</pre>
<p>Aside from implementing strange sequences from the OEIS, possible uses of this technique may include</p>
<ul>
<li>Summing the face-values of a hand of cards. Unfortunately this cannot take "special combinations" into account, unlike <a href="https://jonathanhsiao.com/blog/evaluating-poker-hands-with-bit-math">other techniques</a>.</li>
<li>A basic evaluation of a board state in Reversi/Othello, based on weights for each captured square that differ by their position.</li>
<li>Determining the price for a pizza based on a set of selected toppings, I don't know, I'm grasping at straws here.</li>
</ul>
<hr>
<h2>Addendum</h2>
<p>What if the input is small, and the weights are also small? Here is a different trick that is applicable in some of those cases, depending on whether the temporary result fits in 64 bits. The trick this time is much simpler, or at least sounds much simpler: for bit <tt>i</tt> of weight <tt>w[i]</tt>, make <tt>w[i]</tt> copies of bit <tt>i</tt>, then <tt>popcnt</tt> everything.</p>
<p>A common trick to make <tt>k</tt> copies of only one bit is <tt>(bit << k) - bit</tt>. Assuming that <a href="https://www.felixcloutier.com/x86/pdep">pdep</a> exists and is efficient, that trick can be generalized to making different numbers of copies of different bits. The simplest version of that trick would sacrifice one bit of padding per input bit, which may be acceptable depending on whether that all still fits in 64 bits. For example A073642 with a 10-bit input would work:</p>
<pre>// requires: n is a 10-bit number
int A073642_small(uint32_t n)
{
uint64_t top = _pdep_u64(n, 0x0040100808104225);
uint64_t bot = _pdep_u64(n, 0x000020101020844b);
return __popcnt64(top - bot);
}</pre>
<p>That can be extended to an 11-bit input like this:</p>
<pre>// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
uint64_t top = _pdep_u64(n >> 1, 0x0040100808104225);
uint64_t bot = _pdep_u64(n >> 1, 0x000020101020844b);
return __popcnt64(top - bot | top);
}</pre>
<p>Or like this</p>
<pre>// requires: n is an 11-bit number
int A073642_small(uint32_t n)
{
uint64_t top = _pdep_u64(n, 0x0040100808104225) >> 1;
uint64_t bot = _pdep_u64(n, 0x008020101020844b) >> 1;
return __popcnt64(top - bot);
}</pre>
<p><tt>pdep</tt> cannot move bits to the right of their original position, in some cases (if there is no space for padding bits) you may need to hack around that, as in the example above. Rotates and right-shift are good candidates to do that with, and in general you may gather the bits with non-zero weights with <tt>pext</tt>.</p>
<p>This approach relies strongly on being able to efficiently implement the step "make <tt>w[i]</tt> copies of bit <tt>i</tt>", it is probably not possible to do that efficiently using only the plain old standard integer operations. Also, <tt>pdep</tt> is not efficient on all processors that support it, unfortunately making the CPUID feature flag for BMI2 useless for deciding which implementation to use.</p>
</div>
Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-21990154374045465592022-07-22T07:04:00.003-07:002022-07-22T07:04:40.523-07:00Bit-level commutativity and "sum gap" revisited<style type="text/css">
.nobr {
white-space: nowrap;
}
blockquote {
background-color: #EEE;
clear: both;
}
img {
max-width: 100%;
max-height: 100%;
}
</style>
<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/javascript">
MathJax.Hub.Config({
"HTML-CSS": {
scale: 100
}
});
</script>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>In an earlier post, <a href="https://bitmath.blogspot.com/2017/12/bit-level-commutativity.html">bit-level commutativity</a>, it was shown that addition is not only invariant under swapping the operands, but also under swapping individual bits between the operands, provided that they are of equal place-value. For example the least significant bits of the operands of a sum could be exchanged without changing the value of the sum. Bit-level commutativity of addition implies that <tt class="nobr">A + B = (A & B) + (A | B)</tt>.</p>
<p>In <a href="https://bitmath.blogspot.com/2020/09/the-range-sum-cannot-fall-within.html">the range a sum cannot fall within</a>, it was shown that the "gap" which a sum cannot fall within can be widened from:
$$A + B \begin{cases}
\ge \max(A,B) & \text{ if addition does not carry} \\
\le \min(A,B) & \text{ if addition does carry}
\end{cases}\tag{Eq1}$$
To:
$$A + B \begin{cases}
\ge A\:|\:B & \text{ if addition does not carry} \\
\le A\:\&\:B & \text{ if addition does carry}
\end{cases}\tag{Eq2}$$
This makes the gap wider because <tt>A & B</tt> is often less than <tt>min(A, B)</tt>, and <tt>A | B</tt> is often greater than <tt>max(A, B)</tt>.
</p>
<p>To start with, let's assume that
$$A + B \begin{cases}
\ge A & \text{ if addition does not carry} \\
\le A & \text{ if addition does carry}
\end{cases}\tag{Eq3}$$
Since addition is commutative, the roles of <tt>A</tt> and <tt>B</tt> could be exchanged without affecting the truth of equation Eq3, so this must also hold:
$$A + B \begin{cases}
\ge B & \text{ if addition does not carry} \\
\le B & \text{ if addition does carry}
\end{cases}\tag{Eq4}$$
If <tt>A + B</tt> is greater than or equal to both <tt>A</tt> and <tt>B</tt>, then it must also be greater than and equal to <tt>max(A, B)</tt>, and a similar argument applies to the upper bound. Therefore, Eq1 follows from Eq3.
</p>
<p>Substituting <tt>A</tt> and <tt>B</tt> with <tt>A & B</tt> and <tt>A | B</tt> (respectively) in Eq3 and Eq4 gives
$$(A\:\&\:B) + (A\:|\:B) \begin{cases}
\ge (A\:\&\:B) & \text{ if addition does not carry} \\
\le (A\:\&\:B) & \text{ if addition does carry}
\end{cases}\tag{Eq5}$$
And
$$(A\:\&\:B) + (A\:|\:B) \begin{cases}
\ge (A\:|\:B) & \text{ if addition does not carry} \\
\le (A\:|\:B) & \text{ if addition does carry}
\end{cases}\tag{Eq6}$$
Picking the "does carry" case from Eq5 and the "does not carry" case from Eq6, and using that <tt class="nobr">A + B = (A & B) + (A | B)</tt>, gives Eq2.
</p>
</div>
Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-57376692920487784482021-07-10T09:43:00.004-07:002021-11-17T07:42:52.013-08:00Integer promotion does not help performance<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>There is a rule in the C language which roughly says that arithmetic operations on short integer types implicitly convert their operands to normal-sized integers, and also give their result as a normal-sized integer. For example in C: </p>
<blockquote style="background: #f9f9f9; border-left: 10px solid #ccc; padding: 0.5em 10px;">If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.</blockquote>
<p>Various other languages have similar rules. For example C#, the specification of which is not as jargon-infested as the C specifiction:</p>
<blockquote style="background: #f9f9f9; border-left: 10px solid #ccc; padding: 0.5em 10px;">In the case of integral types, those operators (except the ++ and -- operators) are defined for the int, uint, long, and ulong types. When operands are of other integral types (sbyte, byte, short, ushort, or char), their values are converted to the int type, which is also the result type of an operation.</blockquote>
<p>There may be various reasons to include such a rule in a programming language (and some reasons <i>not</i> to), but one that is commonly mentioned is that "the CPU prefers to work at its native word size, other sizes are slower". There is just one problem with that: it is based on an incorrect assumption about how the compiler will need to implement "narrow arithmetic".</p>
<p>To give that supposed reason the biggest benefit that I can fairly give it, I will be using MIPS for the assembly examples. MIPS completely lacks narrow arithmetic operations.</p>
<h2>Implementing narrow arithmetic as a programmer</h2>
<p>Narrow arithmetic is often required, even though various languages make it a bit cumbersome. C# and Java both demand that you explicitly convert the result back to a narrow type. Despite that, code that needs to perform several steps of narrow arithmetic is usually <i>not</i> littered with casts. The usual pattern is to do the arithmetic without intermediate casts, then only in the end use one cast just to make the compiler happy. In C, even that final cast is not necessary.</p>
<p>For example, let's reverse the bits of a byte in C#. This code was written by Igor Ostrovsky, in his blog post <a href="http://igoro.com/archive/programming-job-interview-challenge/">Programming job interview challenge</a>. It's not a unique or special case, and I don't mean that negatively: it's good code that anyone proficient could have written, a job well done. Code that senselessly casts back to <tt>byte</tt> after every step is also sometimes seen, perhaps because in that case, the author does not really understand what they are doing.</p>
<pre>// Reverses bits in a byte
static byte Reverse(byte b)
{
int rev = (b >> 4) | ((b & 0xf) << 4);
rev = ((rev & 0xcc) >> 2) | ((rev & 0x33) << 2);
rev = ((rev & 0xaa) >> 1) | ((rev & 0x55) << 1);
return (byte)rev;
}</pre>
<p>Morally, all of the operations in this function are really narrow operations, but C# cannot express that. A special property of this code is that none of the intermediate results exceed the limits of a <tt>byte</tt>, so in a language without integer promotion it could be written in much the same way, but without going through <tt>int</tt> for the intermediate results.</p>
<h2>Implementing narrow arithmetic as a compiler</h2>
<p>The central misconception that (I think) gave rise to the myth that integer promotion helps performance, is the assumption that without integer promotion, the compiler must implement narrow operations by inserting an explicit narrowing operation after every arithmetic operation. But that's not true, a compiler for a language that lacks integer promotion can use the same approach that programmers use to implement narrow arithmetic in languages that do have integer promotion. For example, what if two bytes were added together (from memory, and storing the result in memory) in a hypothetical language that lacks integer promotion, and what if that code was compiled for MIPS? The assumption is that it will cost an additional operation to get rid of the "trash bits", but it does not:</p>
<pre> lbu $2,0($4)
lbu $3,0($5)
addu $2,$2,$3
sb $2,0($4)</pre>
<p>The <tt>sb</tt> instruction does not care about any "trash" in the upper 24 bits, those bits simply won't be stored. This is not a cherry-picked case. Even if there were more arithmetic operations, in most cases the "trash" in the upper bits could safely be left there, being mostly isolated from the bits of interest by the fact that carries only propagate from the least significant bit up, never down. For example, let's throw in a multiplication by 42 and a shift-left by 3 as well for good measure:</p>
<pre> lbu $6,0($4)
li $3,42
lbu $2,0($5)
mul $5,$3,$6
addu $2,$5,$2
sll $2,$2,3
sb $2,0($4)</pre>
<p>What is true, is that before some operations, the trash in the upper bits must be cleared. For example before a division, shift-right, or comparison. That is not an exhaustive list, but the list of operations that require the upper bits to be clean is shorter (and have less "total frequency") than the list of operations that do not require that, see for example <a href="https://stackoverflow.com/q/34377711/555045">which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?</a> "Before some operations" is not the same thing as "after every operation", but that still sounds like an additional cost. However, the trash-clearing operations that a compiler for a language that lacks integer promotion would have to insert are not <i>additional</i> operations: they are the same ones that a programmer would write explicitly in a language with integer promotion.</p>
<p>It may be possible to construct a contrived case in which a human would know that the high bits of an integer are clean, while a compiler would struggle to infer that. For example, a compiler may have more trouble reasoning about bits that were cancelled by an XOR, or worse, by a multiplication. Such cases are not likely to be the reason behind the myth. A more likely reason is that many programmers are not as familiar with basic integer arithmetic as they perhaps ought to be.</p>
<h2>So integer promotion is useless?</h2>
<p>Integer promotion may prevent accidental use of narrow operations where wide operations were intended, whether that is worth it is another question. All I wanted to say with this post, is that "the CPU prefers to work at its native word size" is a bogus argument. Even when it is true, it is irrelevant.</p>
</div>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-42880317106887453982021-06-09T21:50:00.004-07:002021-11-17T07:43:06.567-08:00Partial sums of blsi and blsmsk<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p><a href="https://www.felixcloutier.com/x86/blsi">blsi</a> is an x86 operation which extracts the rightmost set bit from a number, it can be implemented efficiently in terms of more familiar operations as <tt>i&-i</tt>. <a href="https://www.felixcloutier.com/x86/blsmsk">blsmsk</a> is a closely related operation which extracts the rightmost set bit and also "smears" it right, setting the bits to the right of that bit as well. <tt>blsmsk</tt> can be implemented as <tt>i^(i-1)</tt>. The smearing makes the result of <tt>blsmsk</tt> almost (but not quite) twice as high as the result of <tt>blsi</tt> for the same input: <tt>blsmsk(i) = floor(blsi(i) * (2 - ε))</tt>.</p>
<p>The partial sums of <tt>blsi</tt> and <tt>blsmsk</tt> can be defined as <tt>b(n) = sum(i=1, n, blsi(i))</tt> and <tt>a(n) = sum(i=1, n, blsmsk(i))</tt> respectively. These sequences are on OEIS as <a href="https://oeis.org/A006520">A006520</a> and <a href="https://oeis.org/A080277">A080277</a>. Direct evaluation of those definitions would be inefficient, is there a better way?</p>
<h3>Unrecursing the recursive definition</h3>
<p>Let's look at the partial sums of <tt>blsmsk</tt> first. Its OEIS entry suggests the recursion below, which is already significantly better than the naive summation:</p>
<p><tt>a(1) = 1, a(2*n) = 2*a(n) + 2*n, a(2*n+1) = 2*a(n) + 2*n + 1</tt></p>
<p>A "more code-ish"/"less mathy" way to implement that could be like this:</p>
<pre>int PartialSumBLSMSK(int i) {
if (i == 1)
return 1;
return (PartialSumBLSMSK(i >> 1) << 1) + i;
}</pre>
<p>Let's understand what this actually computes, and then find an other way to do it. The overal big picture of the recursion is that <tt>n</tt> is being shifted right on the way "down", and the results of the recursive calls are being shifted left on the way "up", in a way that cancels each other. So in total, what happens is that a bunch of "copies" of <tt>n</tt> is added up, except that at the kth step of the recursion, the kth bit of <tt>n</tt> is reset.</p>
<p>This non-tail recursion can be turned into tail-recursion using a standard trick: turning the "sum on the way"-logic into "sum on the way down", by passing two accumulators in extra arguments, one to keep track of the sum, and an other to keep track of how much to multiply <tt>i</tt> by:</p>
<pre>int PartialSumBLSMSKTail(int i, int accum_a, int accum_m) {
if (i <= 1)
return accum_a + i * accum_m;
return PartialSumBLSMSKTail(i >> 1, accum_a + i * accum_m, accum_m * 2);
}</pre>
<p>Such a tail-recursive function is then simple to transform into a loop. Shockingly, GCC (even quite old versions) manages to compile <a href="https://godbolt.org/z/jTjz4vbG6">the original non-tail recursive function into a loop as well</a> without much help (just changing the left-shift into the equivalent multiplication), although some details differ.</p>
<p>Anyway, let's put in a value and see what happens. For example, if the input was <tt>11101001</tt>, then the following numbers would be added up:</p>
<pre>11101001 (reset bit 0)
11101000 (reset bit 1, it's already 0)
11101000 (reset bit 2)
11101000 (reset bit 3)
11100000 (reset bit 4)
11100000 (reset bit 5)
11000000 (reset bit 6)
10000000 (base case)</pre>
<p>Look at the columns of the matrix of numbers above, the column for bit 3 has four ones in it, the column for bit 5 has six ones in it. If bit <tt>k</tt> is set bit in <tt>n</tt>, there is a pattern that that bit is set in <tt>k+1</tt> rows.</p>
<h3>Using the pattern</h3>
<p>Essentially what that pattern means, is that <tt>a(n)</tt> can be expressed as the dot-product between <tt>n</tt> viewed as a vector of bits (weighted according to their position) (𝓷), and a constant vector (𝓬) with entries 1, 2, 3, 4, etc, up to the size of the integer. For example for <tt>n=5</tt>, 𝓷 would be (1, 0, 4), and the dot-product with 𝓬 would be 13. A dot-product like that can be implemented with some bitwise trickery, by using bit-slicing. The trick there is that instead of multiplying the entries of 𝓷 by the entries of 𝓬 directly, we multiply the entries of 𝓷 by the least-significant bits of the entries of 𝓬, and then seperately multiply it by all the second bits of the entries of 𝓬, and so on. Multiplying every entry of 𝓷 at once by a bit of an entry of 𝓬 can be implemented using just a bitwise-AND operation.</p>
<p>Although this trick lends itself well to any vector 𝓬, I will use 0,1,2,3.. and add an extra <tt>n</tt> separately (this corresponds to factoring out the +1 that appears at the end of the recursive definition), because that way part of the code can be reused directly by the solution of the partial sums of <tt>blsi</tt> (and also because it looks nicer). The masks that correspond to the chosen vector 𝓬 are easy to compute: each <i>column</i> across the masks is an entry of that vector. In this case, for 32bit integers:</p>
<pre>c0 10101010101010101010101010101010
c1 11001100110011001100110011001100
c2 11110000111100001111000011110000
c3 11111111000000001111111100000000
c4 11111111111111110000000000000000</pre>
<p>The whole function could look like this:</p>
<pre>int PartialSumBLSMSK(int n)
{
int sum = n;
sum += (n & 0xAAAAAAAA);
sum += (n & 0xCCCCCCCC) << 1;
sum += (n & 0xF0F0F0F0) << 2;
sum += (n & 0xFF00FF00) << 3;
sum += (n & 0xFFFF0000) << 4;
return sum;
}</pre>
<p><tt>PartialSumBLSI</tt> works almost the same way, with its recursive formula being <tt style="white-space:nowrap;">b(1) = 1, b(2n) = 2b(n) + n, b(2n+1) = 2b(n) + n + 1</tt>. The +1 can be factored out as before, and the other part (<tt>n</tt> instead of <tt>2*n</tt>) is exactly half of what it was before. Dividing 𝓬 by half seems like a problem, but it can be done implicitly by shifting the bit-slices of the product to the right by 1 bit. There are no problems with bits being lost that way, because the least significant bit is always zero in this case (𝓬 has zero as its first element).</p>
<pre>int PartialSumBLSI(int n)
{
int sum = n;
sum += (n & 0xAAAAAAAA) >> 1;
sum += (n & 0xCCCCCCCC);
sum += (n & 0xF0F0F0F0) << 1;
sum += (n & 0xFF00FF00) << 2;
sum += (n & 0xFFFF0000) << 3;
return sum;
}</pre>
<h3>Wrapping up</h3>
<p>The particular set of constants I used is very useful and appears in more tricks, such as <a href="https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/">collecting indexes of set bits</a>. They are the bitwise complements of a set of masks that Knuth (in The Art of Computer Programming volume 4, section 7.1.3) calls "magic masks", labeled µ<sub>k</sub>.</p>
<p>This post was inspired by <a href="https://stackoverflow.com/q/67854074/555045">this question on Stack Overflow</a> and partially based on my own answer and the answer of Eric Postpischil, without which I probably would not have come up with any of this, although I used a used a different derivation and explanation for this post.</p>
</div>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-48282986168984613092020-09-26T06:27:00.006-07:002021-11-17T07:43:19.426-08:00The range a sum cannot fall within<style type="text/css">
.nobr {
white-space: nowrap;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>Throughout this post, the variables <tt>A</tt>, <tt>B</tt>, and <tt>R</tt> are used, with R defined as <tt>R = A + B</tt>, and <tt class="nobr">A ≤ B</tt>. Arithmetic in this post is unsigned and modulo 2<sup>k</sup>. Note that <tt>A ≤ B</tt> is not a restriction on the input, it is a choice to label the smaller input as <tt>A</tt> and the larger input as <tt>B</tt>. Addition is commutative, so this choice can be made without loss of generality.</p>
<h2><tt>R < A || R ≥ B</tt></h2>
<p>The sum is less than A <i>iff</i> the addition wraps (1), otherwise it has to be at least B (2).</p>
<ol>
<li><tt>B</tt> cannot be so high that the addition can wrap all the way up to or past <tt>A</tt>. To make <tt>A + B</tt> add up to <tt>A</tt>, <tt>B</tt> would have had to be 2<sup>k</sup>, which is one beyond the maximum value it can be. <tt class="nobr">R = A</tt> is possible only if <tt>B</tt> is zero, in which case <tt>R ≥ B</tt> holds instead.</li>
<li>Since <tt>A</tt> is at least zero, in the absence of wrapping there is no way to reduce the value below the inputs.</li>
</ol>
<p>Perhaps that all looks obvious, but this has a useful application: if the carry-out of the addition is not available, it can be computed via <tt class="nobr">carry = (x + y) < x</tt>, which is a relatively well-known trick. It does not matter which of <tt>x</tt> or <tt>y</tt> is the smaller or larger input, the sum cannot fall within the "forbidden zone" between them. The occasionally seen <tt class="nobr">carry = (x + y) < max(x, y)</tt> adds an unnecessary complication.</p>
<h2><tt>R < (A & B) || R ≥ (A | B)</tt></h2>
<p>This is a stronger statement, because <tt class="nobr">A & B</tt> is usually smaller than <tt>A</tt> and <tt class="nobr">A | B</tt> is usually greater than <tt>B</tt>.</p>
<p>If no wrapping occurs, then <tt class="nobr">R ≥ (A | B)</tt>. This can be seen for example by splitting the addition into a XOR and adding the carries separately, <tt class="nobr">(A + B) = (A ^ B) + (A & B) * 2</tt>, while bitwise OR can be decomposed similarly into <tt class="nobr">(A | B) = (A ^ B) + (A & B)</tt><sup>(see below)</sup>. Since there is no wrapping (by assumption), <tt class="nobr">(A & B) * 2 ≥ (A & B)</tt> and therefore <tt class="nobr">(A + B) ≥ (A | B)</tt>. Or, with less algebra: addition sometimes produces a zero where the bitwise OR produces a one, but then addition compensates doubly for it by carrying into the next position.</p>
<p>For the case in which wrapping occurs I will take a bit-by-bit view. In order to wrap, the carry out of bit <tt class="nobr">k-1</tt> must be 1. In order for the sum to be greater than or equal to <tt class="nobr">A & B</tt>, bit <tt class="nobr">k-1</tt> of the sum must be greater than or equal to bit <tt class="nobr">k-1</tt> of <tt class="nobr">A & B</tt>. That combination means that the carry <i>into</i> bit <tt class="nobr">k-1</tt> of the sum must have been 1 as well. Furthermore, bit <tt class="nobr">k-1</tt> of the sum can't be greater than bit <tt class="nobr">k-1</tt> of <tt class="nobr">A & B</tt>, at most it can be equal, which means bit <tt class="nobr">k-2</tt> must be examined as well. The same argument applies to bit <tt class="nobr">k-2</tt> and so on, until finally for the least-significant bit it becomes impossible for it to be carried into, so the whole thing falls down: by contradiction, <tt class="nobr">A + B</tt> must be less than <tt class="nobr">A & B</tt> when the sum wraps.</p>
<h2>What about <tt class="nobr">(A | B) = (A ^ B) + (A & B)</tt> though?</h2>
<p>The more obvious version is <tt class="nobr">(A | B) = (A ^ B) | (A & B)</tt>, compensating for the bits reset by the XOR by ORing exactly those bits back in. Adding them back in also works, because the set bits in <tt class="nobr">A ^ B</tt> and <tt class="nobr">A & B</tt> are <i>disjoint</i>: a bit being set in the XOR means that exactly one of the input bits was set, which makes their AND zero.</p>
</div>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-86301127161752818652020-08-03T06:22:00.001-07:002020-08-03T06:24:30.687-07:00Why does AND distribute over XOR<style>
blockquote {
background: #f9f9f9;
border-left: 10px solid #ccc;
margin: 1.5em 10px;
padding: 0.5em 10px;
quotes: "\201C""\201D""\2018""\2019";
}
blockquote:before {
color: #ccc;
content: open-quote;
font-size: 4em;
line-height: 0.1em;
margin-right: 0.25em;
vertical-align: -0.4em;
}
blockquote p {
display: inline;
}
q {
background: #f9f9f9;
}
</style>
<div style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">
<p>
AND distributes over XOR, unsurprisingly both from the left and right, that is:
</p>
<pre>x & y ^ z & y == (x ^ z) & y
x & y ^ x & z == x & (y ^ z)
a & c ^ a & d ^ b & c ^ b & d == (a ^ b) & (c ^ d)</pre>
A somewhat popular explanation for <i>why</i> is,
<blockquote cite="https://en.wikipedia.org/wiki/Exclusive_or#Properties">Conjunction and exclusive or form the multiplication and addition operations of a field GF(2), and as in any field they obey the distributive law.
<footer><a href="https://en.wikipedia.org/wiki/Exclusive_or#Properties">Wikipedia: Exclusive or#Properties</a></footer>
</blockquote>
<p>
Which is true and a useful way to think about it, but it is also the type of backwards explanation that relies on a concept that is more advanced than the thing which is being explained.
</p>
<h2>Diagrams with crossing lines</h2>
<p>
Let's represent an expression such as <tt>a & c ^ a & d ^ b & c ^ b & d</tt> by putting the variables on the left of every AND along the top of a grid, and the variables on the right of every AND along the side. Then for example the grid cell on the intersection between the column of <tt>a</tt> and the row of <tt>c</tt> corresponds to the term <tt>a & c</tt>. Further, let's draw lines for variables that are True, in this example all variables are True:
</p>
<img src="https://i.imgur.com/rvAKOhI.png" width="250"/>
<p>
The overall expression <tt>a & c ^ a & d ^ b & c ^ b & d</tt> counts the number of crossings, modulo 2. Rather than counting the crossings one by one, the number of crossings could be computed by counting how many variables along the top are True, how many along the side are True, and taking the product, again modulo 2. A sum modulo 2 is XOR and a product modulo 2 is AND, so this gives the equivalent expression <tt>(a ^ b) & (c ^ d)</tt>.
</p>
<p>
The simpler cases <tt>x & y ^ z & y</tt> and <tt>x & y ^ x & z</tt> correspond to 1x2 and 2x1 diagrams.
</p>
<h2>Diagrams with bites taken out of them</h2>
<p>
Such a diagram with a section of it missing can be dealt with by completing the grid and subtracting the difference. For example the unwieldy <tt>a & e ^ a & f ^ a & g ^ a & h ^ b & e ^ b & f ^ b & g ^ b & h ^ c & e ^ c & f ^ d & e ^ d & f</tt> (shown in the diagram below) is "incomplete", it misses the 2x2 square that corresponds to <tt>(c ^ d) & (g ^ h)</tt>. Completing the grid and subtracting the difference gives <tt>((a ^ b ^ c ^ d) & (e ^ f ^ g ^ h)) ^ ((c ^ d) & (g ^ h))</tt>, which <a href="http://haroldbot.nl/?q=a+%26+e+%5E+a+%26+f+%5E+a+%26+g+%5E+a+%26+h+%5E+b+%26+e+%5E+b+%26+f+%5E+b+%26+g+%5E+b+%26+h+%5E+c+%26+e+%5E+c+%26+f+%5E+d+%26+e+%5E+d+%26+f+%3D%3D+%28%28a+%5E+b+%5E+c+%5E+d%29+%26+%28e+%5E+f+%5E+g+%5E+h%29%29+%5E+%28%28c+%5E+d%29+%26+%28g+%5E+h%29%29">is correct</a>.
</p>
<img src="https://i.imgur.com/SfIxt6C.png" width="250"/>
<p>
This all has a clear connection to the FOIL method and its generalizations, after all <q>conjunction and exclusive or form the multiplication and addition operations of a field GF(2)</q>.
</p>
<p>
The same diagrams also show why AND distributes over OR (the normal, inclusive, OR), which could alternatively be explained in terms of the <a href="https://en.wikipedia.org/wiki/Two-element_Boolean_algebra">Boolean semiring</a>.
</p>
</div>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-36711275019684038902020-05-03T16:28:00.000-07:002020-05-03T16:42:18.175-07:00Information on incrementation<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/javascript">
MathJax.Hub.Config({
"HTML-CSS": {
scale: 100
}
});
</script>
<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<h2>Defining <tt>increment</tt></h2>
<p>Just to avoid any confusion, the operation that this post is about is adding 1 (one) to a value: $$\text{increment}(x) = x + 1$$
Specifically, performing that operation in the domain of bit-vectors.</p>
<p>Incrementing is very closely related to <a href="https://bitmath.blogspot.com/2017/12/notes-on-negation.html">negating</a>. After all, <tt style="white-space: nowrap">-x = ~x + 1</tt> and therefore <tt style="white-space: nowrap">x + 1 = -~x</tt>, though putting it that way feel oddly reversed to me.</p>
<h2>Bit-string notation</h2>
<p>In bit-string notation (useful for analysing compositions of operations at the bit level), increment can be represented as: $$a01^k + 1 = a10^k$$</p>
<p>An "English" interpretation of that form is that an increment carries through the trailing set bits, turning them to zero, and then carries into the right-most unset bit, setting it.</p>
<p>That "do something special with the right-most unset bit" aspect of increment is the basis for various <a href="http://programming.sirrida.de/programming.html#rightmost_bits">right-most bit manipulations</a>, some of which were implemented in <a href="https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_(Trailing_Bit_Manipulation)">AMD Trailing Bit Manipulation (TBM)</a> (which has been discontinued).</p>
<p>For example, the right-most unset bit in <tt>x</tt> can be set using <tt style="white-space: nowrap">x | (x + 1)</tt>, which has a nice symmetry with the more widely known trick for unsetting the right-most set bit, <tt style="white-space: nowrap">x & (x - 1)</tt>.</p>
<h2>Increment by XOR</h2>
<p>As was the case with negation, there is a way to define increment in terms of XOR.
The bits that flip during an increment are all the trailing set bits and the right-most unset bit, the TBM instruction for which is <tt>BLCMSK</tt>.
While that probably does not seem very useful yet, the fact that <tt style="white-space: nowrap">x ^ (x + 1)</tt> takes the form of some number of leading zeroes followed by some number of trailing ones turns out to be useful.</p>
<p>Suppose one wants to increment a bit-reversed integer, a possible (and commonly seen) approach is looping of the bits from top the bottom and implementing the "carry through the ones, into the first zero" logic by hand. However, if the non-reversed value was <i>also</i> available (let's call it <tt>i</tt>), the bit-reversed increment could be implemented by calculating the number of ones in the mask as <tt style="white-space: nowrap">tzcnt(i + 1) + 1</tt> (or <tt style="white-space: nowrap">popcnt(i ^ (i + 1))</tt>) and forming a mask with that number of ones located at the desired place within an integer:
<pre>// i = normal counter
// rev = bit-reversed counter
// N = 1 << number_of_bits
int maskLen = tzcnt(i + 1) + 1;
rev ^= N - (N >> maskLen);</pre>
That may still not seem useful, but this enables an implementation of the <a href="https://en.wikipedia.org/wiki/Bit-reversal_permutation">bit-reversal permutation</a> (not a bit-reversal itself, but the permutation that results from bit-reversing the indices). The bit-reversal permutation is sometimes used to re-order the result of a non-auto-sorting Fast Fourier Transform algorithm into the "natural" order. For example,
<pre>// X = array of data
// N = length of X, power of two
for (uint32_t i = 0, rev = 0; i < N; ++i)
{
if (i < rev)
swap(X[i], X[rev]);
int maskLen = tzcnt(i + 1) + 1;
rev ^= N - (N >> maskLen);
}</pre>
This makes no special effort to be cache-efficient.</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-64281581720303151482019-10-10T07:43:00.000-07:002019-10-10T08:16:08.742-07:00Square root of bitwise NOT<p style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;">
<p>The square root of bitwise NOT, if it exists, would be some function <i>f</i> such that <i>f(f(x)) = not x</i>, or in other words, <i>f²(x) = not x</i>.
It is similar in concept to the <a href="https://en.wikipedia.org/wiki/Quantum_logic_gate#Square_root_of_NOT_gate_(%E2%88%9ANOT)">√NOT gate in Quantum Computing</a>, but in a different domain which makes the solution very different.</p>
<p>Before trying to find any specific <i>f</i>, it may be interesting to wonder what properties it would have to have (and lack).
<ul>
<li><i>f</i> must be bijective, because its square is bijective.</li>
<li><i>f²</i> is an involution but <i>f</i> cannot be an involution, because its square would then be the identity.</li>
<li><i>f</i> viewed as a permutation (which can be done, because it has to be bijective) must be a <a href="https://en.wikipedia.org/wiki/Derangement">derangement</a>, if it had any fixed point then that would also be a fixed point in <i>f²</i> and the <i>not</i> function does not have a fixed point.</li>
</ul>
</p>
<p><h2>Does <i>f</i> exist?</h2>
In general, a permutation has a square root if and only if the number of cycles of same even length is even. The <i>not</i> function, being an involution, can only consist of swaps and fixed points, and we already knew it has no fixed points so it must consist of only swaps. A swap is a cycle of length 2, so an even length. Since the <i>not</i> function operates on <i>k</i> bits, the size of its domain is a power of two, <i>2<sup>k</sup></i>. That almost always guarantees an even number of swaps, except when <i>k = 1</i>. So, the <i>not</i> function on a single bit has no square root, but for more than 1 bit there are solutions.</p>
<h2><i>f</i> for even k</h2>
<p>For 2 bits, the <i>not</i> function is the permutation <span>(0 3) (1 2)</span>. An even number of even-length cycles, as predicted. The square root can be found by interleaving the cycles, giving (0 1 3 2) or (1 0 2 3). In bits, the first looks like:</p>
<tt>
<table>
<tr><td>in</td><td>out</td></tr>
<tr><td>00</td><td>01</td></tr>
<tr><td>01</td><td>11</td></tr>
<tr><td>10</td><td>00</td></tr>
<tr><td>11</td><td>10</td></tr>
</table></tt>
<p>Which corresponds to swapping the bits and then inverting the lsb, the other variant corresponds to inverting the lsb first and then swapping the bits.</p>
<p>That solution can be applied directly to other even numbers of bits, swapping the even and odd bits and then inverting the even bits, but the square root is not unique and there are multiple variants. The solution can be generalized a bit, combining a step that inverts half of the bits with a permutation that brings each half of the bits into the positions that are inverted when it is applied twice, so that half the bits are inverted the first time and the <i>other</i> half of the bits are inverted the second time. For example for 32 bits, there is a nice solution in x86 assembly:
<pre style="background-color: #EBECE4">bswap eax
xor eax, 0xFFFF
</pre></p>
<h2><i>f</i> for odd k</h2>
<p>Odd k makes things less easy. Consider k=3, so (0 7) (1 6) (2 5) (3 4). There are different ways to pair up and interleave the cycles, leading to several distinct square roots:
<ol>
<li>(0 1 7 6) (2 3 5 4)</li>
<li>(0 2 7 5) (1 3 6 4)</li>
<li>(0 3 7 4) (1 2 6 5)</li>
<li>etc..</li>
</ol>
</p>
<tt>
<table>
<tr><td>in</td><td>1</td><td>2</td><td>3</td></tr>
<tr><td>000</td><td>001</td><td>010</td><td>011</td></tr>
<tr><td>001</td><td>111</td><td>011</td><td>010</td></tr>
<tr><td>010</td><td>011</td><td>111</td><td>110</td></tr>
<tr><td>011</td><td>101</td><td>110</td><td>111</td></tr>
<tr><td>100</td><td>010</td><td>001</td><td>000</td></tr>
<tr><td>101</td><td>100</td><td>000</td><td>001</td></tr>
<tr><td>110</td><td>000</td><td>100</td><td>101</td></tr>
<tr><td>111</td><td>110</td><td>101</td><td>100</td></tr>
</table></tt>
<p>These correspond to slightly tricky functions, for example the first one has as its three from lsb to msb: the msb but inverted, the parity of the input, and finally the lsb. The other ones also incorporate the parity of the input in some way.</p>
</p>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-60850388386975823762018-10-03T03:39:00.000-07:002018-10-03T03:52:47.090-07:00abs and its "extra" result<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/javascript">
MathJax.Hub.Config({
"HTML-CSS": {
scale: 100
}
});
</script>
<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<p>The <tt>abs</tt> function has, in its usual (most useful) formulation, one extra value in its codomain than just "all non-negative values". That extra value is the most negative integer, which satisfies <tt>abs(x) == x</tt> despite being negative. Even accepting that the absolute value of the most negative integer is itself, it may still seem strange (for an operation that is supposed to have such a nice symmetry) that the size of the codomain is not exactly half of the size of the domain.</p>
<p>That there is an "extra" value in the codomain, and that it is specifically the most negative integer, may be more intuitively obvious when the action of <tt>abs</tt> on the number <s>line</s> circle is depicted as "folding" the circle symmetrically in half across the center and through zero (around which <tt>abs</tt> is supposed to be symmetric), folding the negative numbers onto the corresponding positive numbers: </br>
<img border="0" src="https://1.bp.blogspot.com/-cymRZhXigG8/W7SWlsqpvjI/AAAAAAAAAJs/zHDYKevIwfcy1x81hv9zqB6aTdVb8MM_ACLcBGAs/s1600/number%2Bcircle%2Bmirror.png"/></p>
<p>Clearly both zero and the most negative integer (which is also on the "folding line") stay in place in such a folding operation and remain part of the resulting half-circle. That there is an "extra" value in the codomain is the usual fencepost effect: the resulting half-circle is half the size of the original circle in some sense, but the "folding line" cuts through two points that have now become endpoints.</p>
<p>By the way the "ones' complement alternative" to the usual <tt>abs</tt>, let's call it <tt>OnesAbs(x) = x < 0 ? ~x : x</tt> (there is a nice branch-free formulation too) <i>does</i> have a codomain with a size exactly half of the size of its domain. The possible results are exactly the non-negative values. It has to pay for that by, well, not being the usual <tt>abs</tt>. The "folding line" for <tt>OnesAbs</tt> runs <i>between</i> points, avoiding the fencepost issue:</br>
<img border="0" src="https://4.bp.blogspot.com/-rHHJNzlzsYE/W7SbubkyAJI/AAAAAAAAAJ4/l6IAS_yVEv4ZZSlpCBHiU4mVHKAy2zAtACLcBGAs/s1600/number%2Bcircle%2Bmirror%2Bcpl.png" /></p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-18537902208846599142018-08-21T13:06:00.000-07:002018-08-21T13:18:11.594-07:00Signed wrapping is meaningful and algebraically nice<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<div>
<ul>
<li><a href="#c1">Signed wrapping is not wrong</a></li>
<li><a href="#c2">Signed wrapping is meaningful</a></li>
<li><a href="#c3">Signed wrapping is not inherent</a></li>
<li><a href="#c4">Signed wrapping is algebraically nice</a></li>
</ul>
</div>
<p>In this post I defend wrapping, a bit more opinionated than my other posts.
As usual I'm writing from the perspective that signed and unsigned integer types are a thin transparent wrapper around bit vectors, of course I am aware that they are often not used that way. That difference between their use and their actual nature is probably the source of the problems.</p>
<a name="c1"></a>
<h2>Signed wrapping is not wrong</h2>
<p>It is often said that when signed wraparound occurs, the result is simply wrong.
That is an especially narrow view to take, probably inspired by treating fixed-size bit vector arithmetic as if it is arithmetic in ℤ, which it is not.
Bit vector arithmetic can be viewed as arithmetic in ℤ so long as no "overflow" occurs, but violating that condition does not make the result wrong, it makes the interpretation wrong.
</p>
<a name="c2"></a>
<h2>Signed wrapping is meaningful</h2>
<p>The wrapping works exactly the same as unsigned wrapping, it corresponds to taking the lowest k bits of the arbitrary precision result.
Such a truncation therefore gives you exactly k meaningful bits, it's just a slice of the result.
Some upper bits may be lost, they can be calculated if you need them.
If the whole result is meaningful, then part of it is too, namely <i>at least</i> under the interpretation of being "part of the result".</p>
<p>An other well known example of benign wrapping is the calculation of the average of two non-negative signed integers.
While <tt>(a + b) / 2</tt> gives inconvenient results when the addition "overflows", <tt>(uint)(a + b) / 2</tt> (using unsigned division) or <tt>(a + b) >>> 1</tt> (unsigned right shift as in Java) are correct even when the addition of two positive values results in a negative value. An other way to look at it is that there is no <i>unsigned</i> wrapping. Nominally the integers being added here are signed but that doesn't really matter. Casting the inputs to unsigned before adding them is a no-op that can be performed mentally.</p>
<p>Wrapping can also sometimes be cancelled with more wrapping.
For example, taking an absolute value with wrapping and casting the result to an unsigned type of the same width results in the actual absolute value without the funny <tt>int.MinValue</tt> edge case:</p>
<pre>(uint)abs(int.MinValue) =
(uint)abs(-2147483648) =
(uint)(-2147483648) =
2147483648</pre>
<p>This is <i>not</i> what <a href="https://docs.microsoft.com/en-us/dotnet/api/system.math.abs?view=netframework-4.7.2#System_Math_Abs_System_Int32_"><tt>Math.Abs</tt></a> in C# does, it throws, perhaps inspired by its signed return type.
On the other hand, Java's <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html#abs-int-"><tt>Math.abs</tt></a> gets this right and leaves the reinterpretation up to the consumer of the result, of course in Java there is no uint32 to cast to but you can still treat that result <i>as if</i> it is unsigned.
Such "manual reinterpretation" is in general central to integer arithmetic, it's really about the bits, not the "default meaning" of those bits.</p>
<p>The principle of cancelling wrapping also has some interesting data structure applications.
For example, in a Fenwick tree or Summed Area Table, the required internal integer width is the desired integer width of any range/area-sum query that you actually want to make.
So a SAT over signed bytes can use an internal width of 16 bits as long as you restrict queries to an area of 256 cells or fewer, since 256 * -128 = -2<sup>15</sup> which still fits a signed 16 bit word.</p>
<p>An other nice case of cancelled wrapping is strength reductions like <tt>A * 255 = (A << 8) - A</tt>.
It is usually not necessary to do that manually, but that's not the point, the point is that the wrapping is not "destructive".
The overall expression wraps only <i>iff</i> <tt>A * 255</tt> wraps and even then it has exactly the same result.
There are cases in which the left shift experience "signed wrapping" but <tt>A * 255</tt> does not (for example, in 32 bits, A = 0x00800000), in those cases the subtraction also wraps and brings the result back to being "unwrapped".
That is not a coincidence nor an instance of two wrongs making a right, it's a result of the intermediate wrapped result being meaningful and wrapping being algebraically nice.</p>
<a name="c3"></a>
<h2>Signed wrapping is not inherent</h2>
<p>Signed and unsigned integers are two different ways to interpret bit vectors.
Almost all operations have no specific signed or unsigned version, only a generic version that does both.
There is no such thing as signed addition or unsigned addition, addition is just addition.
Operations that are actually different are:
<ul>
<li>Comparisons except equality</li>
<li>Division and remainder</li>
<li>Right shift, maybe, but arithmetic right shift and logical right shift can both be reasonably applied in both signed and unsigned contexts</li>
<li>Widening conversion</li>
<li>Widening multiplication</li>
</ul>
One thing almost all of these have in common is that they cannot overflow, except division of the smallest integer by negative one.
By the way I regard that particular quirk of division as a mistake since it introduces an asymmetry between dividing by negative one and multiplying by negative one.</p>
<p>The result is that the operations that can "overflow" are neither signed nor unsigned, and therefore do not overflow specifically in either of those ways.
If they can be said to overflow at all, when and how they do so depends on how they are being viewed by an outside entity, not on the operation itself.</p>
<p>The distinction between unsigned and signed wrapping is equivalent to imagining a "border" on the <a href="http://bitmath.blogspot.com/2017/08/visualizing-addition-subtraction-and.html">ring of integers</a> (not the mathematical Ring of Integers) either between 0 and -1 (unsigned) or between signed-smallest and signed-highest numbers, but <i>there is no border</i>. Crossing either of the imaginary borders does not mean nearly as much as many people think it means.</p>
<a name="c4"></a>
<h2>Signed wrapping is algebraically nice</h2>
<p>A property that wrapping arithmetic shares with arbitrary precision integer arithmetic, but not with trapping arithmetic, is that it obeys a good number of desirable algebraic laws.
The root cause of this is that ℤ/ℤ2<sup>k</sup> is a <a href="https://en.wikipedia.org/wiki/Ring_(mathematics)">ring</a>, and trapping arithmetic is infested with implicit conditional exceptions.
Signed arithmetic can largely be described by ℤ/ℤ2<sup>k</sup>, like unsigned arithmetic, since it is mostly a reinterpretation of unsigned arithmetic.
That description does not cover all operations or properties, but it covers the most important aspects.</p>
<p>Here is a small selection of laws that apply to wrapping arithmetic but not to trapping arithmetic:
<ul>
<li>-(-A) = A</li>
<li>A + -A = 0</li>
<li>A - B = A + -B</li>
<li>A + (B + C) = (A + B) + C</li>
<li>A * (B + C) = A * B + A * C</li>
<li>A * -B = -A * B = -(A * B)</li>
<li>A * (B * C) = (A * B) * C</li>
<li>A * 15 = A * 16 - A</li>
<li>A * multiplicative_inverse(A) = 1 (iff A is odd, this is something not found in ℤ which has only two trivially invertible numbers, so sometimes wrapping gives you a new useful property)</li>
</ul>
Some laws also apply to trapping arithmetic:
<ul>
<li>A + 0 = A</li>
<li>A - A = 0</li>
<li>A * 0 = 0</li>
<li>A * 1 = A</li>
<li>A * -1 = -A</li>
<li>-(-(-A)) = -A</li>
</ul>
</p>
<p>The presence of all the implicit exceptional control flow makes the code very hard to reason about, for humans as well as compilers.</p>
<p>Compilers react to that by not optimizing as much as they otherwise would, since they are forced to preserve the exception behaviour.
Almost anything written in the source code must actually happen, and in the same order as originally written, just to preserve exceptions that are not even supposed to ever actually be triggered.
The consequences of that are often seen in Swift, where code using the <tt>&+</tt> operator is optimized quite well (including auto-vectorization) and code using the unadorned <tt>+</tt> operator can be noticeably slower.</p>
<p>Humans .. probably don't truly want trapping arithmetic to begin with, what they want is to have their code checked for unintended wrapping.
Wrapping is not a bug by itself, but <i>unintended</i> wrapping is.
So while canceling a "bare" double negation is not algebraically justified in trapping arithmetic, a programmer will do it anyway since the goal is not to do trapping arithmetic, but removing bad edge cases.
Statically checking for unintended wrapping would be a more complete solution, no longer relying on being lucky enough to dynamically encounter every edge case.
Arbitrary precision integers would just remove most edge cases altogether, though it would rely heavily on range propagation for performance, making it a bit fragile.</p>
<p>But anyway, wrapping is not so bad. Just often unintended.</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-44889289559536761212018-08-02T03:07:00.001-07:002019-01-07T13:28:56.721-08:00Implementing Euclidean division<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/javascript">
MathJax.Hub.Config({
"HTML-CSS": {
scale: 100
}
});
</script>
<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<p>While implementing various kinds of division in <a href="http://haroldbot.nl">haroldbot</a>, I had to look up/work out how to implement different kinds of signed division in terms of unsigned division. The common truncated division (written as <tt>/s</tt> in this post and in haroldbot, <tt>/t</tt> in some other places) is natural result of using your intuition from ℝ and writing the definition based on signs and absolute values, ensuring that the division only happens between non-negative numbers (making its meaning unambiguous) and that the result is an integer: $$\DeclareMathOperator{\sign}{sign} D /_s d = \sign(d)\cdot\sign(D)\cdot\left\lfloor\cfrac{\left|D\right|}{\left|d\right|}\right\rfloor$$ That definition leads to a plot like this, showing division by 3 as an example:<p><img src="https://4.bp.blogspot.com/-JL8nINlDDKM/W2K9YuzJUzI/AAAAAAAAAIw/DjpnpiTeR-ghkrwDcnHrsApkx69IHxIuwCPcBGAYYCw/s400/truncated_div_plot1.png" /><p/>
Of course the absolute values and sign functions create symmetry around the origin, and that seems like a reasonable symmetry to have. But that little plateau around the origin often makes the mirror at the origin a kind of barrier that you can run into, leading to the well-documented downsides of truncated division.</p>
<p>The alternative floored division and Euclidean division have a different symmetry, which does not lead to that little plateau, instead the staircase pattern simply continues:<p><img src="https://1.bp.blogspot.com/-hAmcPQ7ZxrU/W2K9W92ncWI/AAAAAAAAAIs/BpwkdHCVv7Yr8RauK4fT1ya6oS5YdW-6QCPcBGAYYCw/s400/euclidean_div_plot1.png" /></p>
The point of symmetry, marked by the red cross, is at (-0.5, -0.5). Flipping around -0.5 may remind you of bitwise complement, especially if you have read my earlier post <a href="http://bitmath.blogspot.com/2017/08/visualizing-addition-subtraction-and.html">visualizing addition, subtraction and bitwise complement</a>, and mirroring around -0.5 is no more than a conditional complement. So Euclidean division may be implemented with positive division as: $$\DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\xor}{\bigoplus} D /_e d = \sign(d)\cdot(\sgn(D)\xor\left\lfloor\cfrac{D\xor\sgn(D)}{\left|d\right|}\right\rfloor)$$ Where the <tt>sgn</tt> function is -1 for negative numbers and 0 otherwise, and the circled plus is XOR. XORing with the <tt>sgn</tt> is a conditional complement, with the inner XOR being responsible for the horizontal component of the symmetry and the outer XOR being responsible for the vertical component.</p>
<p>It would have been even nicer if the symmetry of the divisor also worked that way, but unfortunately that doesn't quite work out. For the divisor, the offset introduced by mirroring around -0.5 would affect the size of the steps of the staircase instead of just their position.</p>
<p>The <tt>/e</tt> and <tt>%e</tt> operators are available in haroldbot, though like all forms of division the general case is really too hard, even for the circuit-SAT based truth checker (the BDD engine stands no chance at all).</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-81380076835140745842017-12-16T11:04:00.000-08:002018-08-13T09:00:51.671-07:00Notes on negation<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/javascript">
MathJax.Hub.Config({
"HTML-CSS": {
scale: 100
}
});
</script>
<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<h2>The well known formulas</h2>
<p>Most readers will be familiar with <tt>-x = ~x + 1 = ~(x - 1)</tt>. These are often just stated without justification, or even an explanation for why they are equivalent. There are some algebraic tricks, but I don't think they explain much, so I'll use the rings from <a href="http://bitmath.blogspot.com/2017/08/visualizing-addition-subtraction-and.html">visualizing addition, subtraction and bitwise complement</a>. <tt>~x + 1</tt>, in terms of such a ring, means "flip it, then draw a CCW arrow on it with a length of one tick". <tt>~(x - 1)</tt> means "draw a CW arrow with a length of one tick, then flip". Picking CCW first is arbitrary, but the point is that the direction is reversed because flipping the ring also flips the arrow if it is drawn first, but not if it is drawn second. Equivalent to drawing an arrow, you may rotate the ring around its center.</p>
<p>So they're equivalent, but why do they negate. The same effect also explains<br/> <tt>a - b = ~(~a + b)</tt>, which when you substitute <tt>a = 0</tt> almost directly gives <tt>-b = ~(b - 1)</tt>. Or using the difference between one's complement and proper negation as I pointed out in that visualization post: the axis of flipping is offset by half a tick, so the effect of flipping introduces a difference of 1 which can be removed by rotating by a tick.</p>
<h2>Bit-string notation</h2>
<p>I first saw this notation in The Art of Computer Programming v4A, but it probably predates it. It provides a more "structural" view of negation:
$$-(a10^k) =\; {\sim} (a10^k - 1) =\; {\sim} (a01^k) = ({\sim} a)10^k$$
Here juxtaposition is concatenation, and exponentiation is repetition and is done before concatenation. <tt>a</tt> is an arbitrary bit string that may be infinitely long. It does not really deal with the negation of zero, since the input is presumed to end in 10<sup>k</sup>, but the negation of zero is not very interesting anyway.</p>
<p>What this notation shows is that negation can be thought of as complementing everything to the left of the rightmost set bit, a property that is frequently useful when <a href="http://bitmath.blogspot.com/2012/09/the-basics-of-working-with-rightmost-bit.html">working with the rightmost bit</a>. A mask of the rightmost set bit and everything to the right of it can be found with <br/><tt>x ^ (x - 1)</tt> or, on a modern x86 processor, <tt>blsmsk</tt>. That leads to negation by XOR:
$$-x = x\oplus {\sim}\text{blsmsk}(x)$$
which is sort of cheating since <tt>~blsmsk(x) = x ^ ~(x - 1) = x ^ -x</tt>, so this said that <br/><tt>-x = x ^ x ^ -x</tt>.
It may still be useful occasionally, for example when a value of "known odd-ness" is being negated and then XORed with something, the negation can be merged into the XOR.</p>
<h2>Negation by MUX</h2>
<p>Using that mask from <tt>blsmsk</tt>, negation can be written as
$$-x = \text{mux}(\text{blsmsk}(x), {\sim} x, x)$$
which combines with <a href="http://bitmath.blogspot.com/2017/12/bit-level-commutativity.html">bit-level commutativity</a> in some fun ways:
<ul>
<li><tt>(~x + 1) + (x - 1) = mux(blsmsk(x), ~x, x) + mux(blsmsk(x), x, ~x) = ~x + x = -1</tt></li>
<li><tt>(~x + 1) | (x - 1) = ~x | x = -1</tt></li>
<li><tt>(~x + 1) ^ (x - 1) = ~x ^ x = -1</tt></li>
<li><tt>(~x + 1) & (x - 1) = ~x & x = 0</tt></li>
</ul>
All of these have simpler explanations that don't involve bit-level commutativity, by rewriting them back in terms of negation. But I thought it was nice that it was possible this way too, because it makes it seem as though a +1 and a -1 on both sides of an OR, XOR or AND cancel out, which in general they definitely do not.</p>
<p>The formula that I've been using as an example for the proof-finder on <a href="http://haroldbot.nl/how.html">haroldbot.nl/how.html</a>, <br/><tt>(a & (a ^ a - 1)) | (~a & ~(a ^ a - 1)) == -a</tt>, is actually a negation-by-MUX, written using <tt>mux(m, x, y) = y & m | x & ~m</tt>.</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-29698167062253011102017-12-01T17:35:00.000-08:002017-12-01T17:39:53.568-08:00Bit-level commutativity<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/javascript">
MathJax.Hub.Config({
"HTML-CSS": {
scale: 100
}
});
</script>
<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<p>By bit-level commutativity I mean that a binary operator has the property that swapping any subset of bits between the left and right operands does not change the result.
The subset may be any old thing, so in general I will call an operator <tt>o</tt> bit-level commutative if it satisfies the following property
$$\forall m,a,b: a \circ b = \text{mux}(m, a, b) \circ \text{mux}(m, b, a)$$
For example, by setting <tt>m = b</tt> we get <tt>a ⊗ b = (a & b) ⊗ (a | b)</tt>, sort of "bitwise sorting" the operands, with zeroes moved to the left operand and ones moved to the right operand (if possible).</p>
<p>Anyway, obviously AND, OR and XOR (and their complemented versions) are all bit-level commutative, indeed any purely bitwise operation (expressible as a vectorized function that takes two booleans as input) that is commutative is necessarily also bit-level commutative, for obvious reasons. Interestingly, addition is also bit-level commutative, which may be less obvious (at least in a recent coding competition, it seemed that people struggled with this). It may help to consider addition on a slightly more digit-by-digit level:
$$ a + b = \sum_i 2^i (a_i + b_i)$$
It should be clear from the bit-level "exploded" sum, that the individual bits a<sub>i</sub> and b<sub>i</sub> can be either swapped or not, independently for any <tt>i</tt>. This should get more obvious the more you think about what representing a number in a positional numeral system even <i>means</i> in the first place: it was always a sum, so adding two numbers is like taking the sum of two "big" sums, of course it does not matter which of the big sums any particular contribution comes from.</p>
<p>Alternatively, the old <tt>a + b = (a ^ b) + 2(a & b)</tt> (ie computing bit-level sums and then adding the carries separately) can explain it: both XOR and AND are bit-level commutative, so the whole expression is, too.</p>
<p>Anyway, a consequence is that <tt>a + b = (a & b) + (a | b)</tt>, which I have more commonly seen derived as:
<pre>a + b = (a ^ b) + 2(a & b) // add carries separately
= (a | b) - (a & b) + 2(a & b) // see below
= (a | b) + (a & b)
</pre>
Where <tt>(a ^ b) = (a | b) - (a & b)</tt> can be explained as XOR being like OR, except that unlike OR it is 0 when both operands are set, so just subtract that case out. I always like having two (or more!) explanations from completely different directions like that.</p>
<p>Multiplication (including carryless multiplication and OR-multiplication) is of course <i>not</i> bit-level commutative. For example if one operand is zero and the other is odd and not 1, then the lowest bit could be swapped to make neither operand zero, and a non-zero result could be produced that way. Operations such as comparison and (by extension) min and max are obviously not bit-level commutative.</p>
<p>There is probably more to this, I may add some stuff later.</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-43768952816606238122017-08-06T21:47:00.002-07:002017-11-23T06:00:33.184-08:00Visualizing addition, subtraction and bitwise complement<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<p>A relatively well-known relation between addition and subtraction, besides the basic relations <tt>a - b = a + (-b)</tt> and <tt>a - b = -(-a + b)</tt>, is <tt>a - b = ~(~a + b)</tt>.
But I suspect most people have simply accepted that as fact, or perhaps proved it from the 2's complement definition of negation. haroldbot can <a href="http://haroldbot.nl/?q=~%28~a+%2B+b%29+%3D%3D+a+-+b">do the latter</a>, though not as succinctly as I hoped.</p>
<p>But it also has a nice one-dimensional geometric interpretation, really analogous to the way <tt>a - b = -(-a + b)</tt> looks in ℤ.</p>
<p>As negation mirrors ℤ around the origin, complement mirrors the space of two's complement signed bitvectors around the "space" between -1 and 0. Clearly addition in the mirrored space corresponds to subtraction in the unmirrored space, so the obvious way to subtract is mirror, add, and mirror back. That's precisely what <tt>-(-a + b)</tt> does in ℤ and what <tt>~(~a + b)</tt> does for bitvectors. An observant reader may notice that I convenient disregarded the finiteness of the number line of fixed-size bitvectors, that's actually not a problem but the visualization gets a bit trickier.</p>
<p>What is in a way more surprising is that <tt>a - b = -(-a + b)</tt> works for bitvectors, since negation does not neatly mirror the whole number line when you're talking about two's complement negation. It's around the origin again instead of around a "space", but the most negative number is unaffected.</p>
<p>When we remember that this number line is really a number <i>ring</i> (in the circular sense), that starts to make sense again. To complete this picture, you can think about holding a ring in your hands, flipping it over while holding it at two diametrically opposite points - zero and the most negative number. Of course this visualization also works for complement, just hold the ring at slightly different places: between negative one and zero, and between the minimum and maximum (which are adjacent, you could think of it as where the ring closes). There are images below, but you should probably only look at them if visualization failed to appear in your mind naturally - if you already have one, your own image is probably easier to think about.</p>
<h3>But why</h3>
<p>OK all this spatial insight is fun (maybe), but what was it actually good for. I've found that thinking about the complement operation this way helps me to relate it to addition-like arithmetic operations (add, subtract, min, max, compare, etc) since they're all simple operations with "arrows" around that ring that we just flipped in our minds.</p>
<p>So it helps to make sense of various "mutants" of De Morgan's Law, such as:
<ul>
<li><tt>~x < ~y == x > y</tt></li>
<li><tt>~x > ~y == x < y</tt></li>
<li><tt>~min(~x, ~y) == max(x, y)</tt></li>
<li><tt>~max(~x, ~y) == min(x, y)</tt></li>
<li><tt>~avg_up(~x, ~y) == avg_down(x, y)</tt> where <tt>avg_down</tt> is the average rounding down, see also <a href="http://www.virtualdub.org/blog/pivot/entry.php?id=222">VirtualDub: Weighted averaging in SSE (part 2)</a></li>
<li><tt>~avg_down(~x, ~y) == avg_up(x, y)</tt></li>
<li><tt>~paddsb(~x, y) == psubsb(x, y)</tt> (signed saturation)</li>
<li><tt>~psubsb(~x, y) == paddsb(x, y)</tt> (signed saturation)</li>
<li><tt>~paddusb(~x, y) == psubusb(x, y)</tt> (unsigned saturation)</li>
<li><tt>~psubusb(~x, y) == paddusb(x, y)</tt> (unsigned saturation)</li>
</ul></p>
<p>A similar visualization works for the signed/unsigned "conversion" <tt>x ^ msb == x + msb == x - msb</tt> (msb is a mask with only the most significant bit set), which corresponds to rotating the ring 180 degrees. This may help when thinking about things such as:
<ul>
<li><tt>x <s y == (x ^ msb) <u (y ^ msb)</tt></li>
<li><tt>x <u y == (x ^ msb) <s (y ^ msb)</tt></li>
<li><tt>max_s(x, y) == max_u(x ^ msb, y ^ msb) ^ msb</tt></li>
</ul></p>
<p>The relationship between the different kinds of min/max can be summed up by a nice commutative diagram:<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-KhcYIg7GsbU/WYio_fG-g-I/AAAAAAAAAG8/sFWMrm91KgQJW7W-7UhralOKpFjPqLQwgCPcBGAYYCw/s1600/maxmin_comm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-KhcYIg7GsbU/WYio_fG-g-I/AAAAAAAAAG8/sFWMrm91KgQJW7W-7UhralOKpFjPqLQwgCPcBGAYYCw/s320/maxmin_comm.png" width="320" height="219" data-original-width="649" data-original-height="445" /></a></div></p>
<p>Hope it helps, for me this sort of thing has come in handy occasionally when writing SSE code.</p>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<p>Here are the images two's complement negation:<br/><a href="https://3.bp.blogspot.com/-PWLAC7TrC3E/WhbS38HR9kI/AAAAAAAAAHg/txwO3m1JpgkdAX39kfOF1JrX6wxoOSAzwCLcBGAs/s1600/twoscompcirc.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-PWLAC7TrC3E/WhbS38HR9kI/AAAAAAAAAHg/txwO3m1JpgkdAX39kfOF1JrX6wxoOSAzwCLcBGAs/s320/twoscompcirc.png" width="320" height="318" data-original-width="434" data-original-height="431" /></a><br/>and for plain one's complement:<br/><a href="https://2.bp.blogspot.com/-edpbUKcGGIs/WhbTAtUXWQI/AAAAAAAAAHk/Kaun6rxhLVk9P8vjiUzjS1eBICpRX5WdACLcBGAs/s1600/complementcirc.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-edpbUKcGGIs/WhbTAtUXWQI/AAAAAAAAAHk/Kaun6rxhLVk9P8vjiUzjS1eBICpRX5WdACLcBGAs/s320/complementcirc.png" width="320" height="320" data-original-width="425" data-original-height="425" /></a><br/>This is in the orientation that I usually use when I think about these operations this way, but there is not particular meaning to going counter-clockwise with 0 at/near the bottom.</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-92215643066118768772016-11-24T10:10:00.001-08:002016-12-13T00:37:17.976-08:00Parallel prefix/suffix operations<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<p>After a "brief" hiatus..</p>
<p>Parallel prefix/suffix operations is a family of operations that follow either this template:
<pre class="brush:csharp">x = x ⊕ (x >> 1);
x = x ⊕ (x >> 2);
x = x ⊕ (x >> 4);
x = x ⊕ (x >> 8);
x = x ⊕ (x >> 16);</pre>
Or this template:
<pre class="brush:csharp">x = x ⊕ (x << 1);
x = x ⊕ (x << 2);
x = x ⊕ (x << 4);
x = x ⊕ (x << 8);
x = x ⊕ (x << 16);</pre>
Where ⊕ is typically OR or XOR. I've never seen ⊕=AND show up naturally, but it might be useful for something.</p>
<p>There is some disagreement (stemming from the usual "which end of an integer is the front" question) about which of them should be called the "parallel prefix" and which the "parallel suffix", for the purpose of this post I'll be calling the first one (with the shifts to the right) "parallel prefix". Either way, these operations are bit-level analogues to the more well known prefix sum, prefix max, prefix product, etc. The word "parallel" in the name refers to the bit-level parallelism, which has the same structure as the simple (not work-efficient) <a href="http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html">parallel prefix sum algorithm</a>.</p>
<h3>Some members of the family</h3>
<p>The best-known member of parallel prefix/suffix family is the parallel prefix with ⊕ = XOR (PP-XOR), which converts Gray code back to normally-ordered integers and produces the parity of an integer in the lowest bit.</p>
<p>The parallel <i>suffix</i> with ⊕ = XOR (PS-XOR) is a carryless multiplication by -1, which isn't very interesting by itself (probably), but gives a hint about the algebraic structure that these operations give rise to.</p>
<p>The parallel prefix with ⊕ = OR (PP-OR) often shows up whenever bitwise operations are interacting with ranges/intervals, since <code>PP-OR(low ^ high)</code> gives a mask of the bits that are not constrained by the interval. I have used this operation (though I did not name it then) several times in my series on <a href="http://bitmath.blogspot.nl/2012/09/calculating-lower-bound-of-bitwise-or.html">propagating bounds through bitwise operations</a>.<br/>
This operation lends itself to some optimizations (I put "performance" in the title, which I admit I've mostly ignored), for example on x64 you could implement it as
<pre> lzcnt rax, rax ; input in rax
sbb rdx, rdx
not rdx
shrx rax, rdx, rax</pre>
Or:
<pre> lzcnt rax, rax
mov edx, 64
sub edx, eax
or rax, -1
bzhi rax, rax, rdx</pre>
The each have their pros and cons, and hopefully there's a better way to do it, but I couldn't really find any. I made sure to avoid the infamous false dependency that <code>popcnt</code>, <code>tzcnt</code> and <code>lzcnt</code> all share on any Intel processor that implements them to date. Probably the biggest problem here is that both versions require BMI2, that can be avoided, eg (suggested by Jasper Neumann)
<pre> xor edx, edx // provide a zero register for cmov
bsr ecx, eax
mov eax, -1
not ecx // flags not modified
cmovz eax, edx
shr eax, cl</pre></p>
<h3>Properties/structure</h3>
<p>To start with the obvious, the prefix and suffix versions are exactly each others mirror image. So I'm going to look just at the suffix part of the family, for the prefix part just mirror everything.</p>
<h4>PS-XOR(x) is clmul(x, -1)</h4>
<p>Since every step is a clmul by <code>1 + 2<sup>k</sup></code> and if you clmul those constants together you get -1 (the two's complement -1, not the -1 in the ring formed by XOR and clmul over bitvectors of length k, which would just be 1 again), or from a more intuitive angle, what the PS-XOR is supposed to do in the first place is XOR each bit into all higher bits. So it inherits the properties, such as inversibility (-1 is odd). The inverse of PS-XOR is <code>x ^ (x << 1)</code>.</p>
<p>It also inherits that <code>PS-XOR(x) ^ PS-XOR(y) == PS-XOR(x ^ y)</code> from the distributivity of clmul.</p>
<p>Since clmul is commutative and associative, the steps in PS-XOR can be reordered.</p>
<h4>PS-OR(x) is or_mul(x, -1)</h4>
<p>This isn't as nice (XOR and clmul form a ring structure) since OR doesn't form a group because it has no negation (this is just a fancy way of saying that you can't un-OR numbers, in the way you can un-ADD by subtraction or un-XOR with XOR again) and so it doesn't extended to a ring, but it can be extended into a semiring by defining the following multiplication:
<pre class="brush:csharp">static uint or_mul(uint a, uint b)
{
uint r = 0;
for (int i = 0; i < 32; i++)
{
if ((a & 1) == 1)
r |= b; // the only difference is here
a >>= 1;
b <<= 1;
}
return r;
}</pre>It's not fancy math notation but you get the idea.</p>
<p>This (with OR) forming a commutative semiring (proof "left as an exercise to the reader"), it has some nice properties:
<ul>
<li><code>or_mul(a, b) == or_mul(b, a)</code> commutivity</li>
<li><code>or_mul(a, or_mul(b, c)) == or_mul(or_mul(a, b), c)</code> associativity</li>
<li><code>or_mul(a, b | c) == or_mul(a, b) | or_mul(a, c)</code> left distributivity</li>
<li><code>or_mul(a | b, c) == or_mul(a, c) | or_mul(b, c)</code> right distributivity</li>
</ul>
But, of course, no multiplicative inverses. Except when multiplying by 1, but that doesn't count since it's the multiplicative identity.</p>
<p><code>PS-OR(x) == or_mul(x, -1)</code>, and the individual steps of the PS-OR are of the form <code>x = or_mul(x, 1 + 2<sup>k</sup>)</code>. The or_mul of all those constants together is, unsurprisingly, -1 (though this notation is now slightly odd since this was all a semiring, I mean the element that has all bits set).</p>
<p>And again, PS-OR inherits <code>PS-OR(x) | PS-OR(y) = PS-OR(x | y)</code> from distributivity.</p>
<p>And again, since or_mul is commutative and associative, the steps in PS-OR can be reordered.</p>
<h4>PS-OR(x) is also <code>x | -x</code></h4>
<p>This one doesn't have a nice mirrored equivalent, just the obvious one where you insert a bit-reversal before and after. It also doesn't have an obvious relation to or_mul. As for why it's the case,
given that (in string notation) -(a10<sup>k</sup>) = (~a)10<sup>k</sup>; a10<sup>k</sup> | -(a10<sup>k</sup>) = (~a|a)10<sup>k</sup> = 1<sup>∞</sup>10<sup>k</sup>, so it copies the lowest set bit into all higher bits, just as PS-OR does.</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-32212003611256852252014-04-13T10:27:00.001-07:002014-04-13T10:42:24.707-07:00A bitwise relational domain<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<p>This is the thing that I was referring to in my <a href="http://bitmath.blogspot.com/2014/02/addition-in-bitfield-domain-alternative.html">previous post</a> where I said it didn't work out as well as I'd hoped. That's still the case. But it's not a complete failure, it's just not as good as I had hoped it might be, and it's still interesting (and, as far as I know, new).</p>
<p>The basic idea is really the same one as in Miné's <a href="http://www.di.ens.fr/~mine/publi/article-mine-sas02.pdf">A Few Graph-Based Relational Numerical Abstract Domains</a>, namely a matrix where the element <code>i,j</code> says something about the relation between the variables <code>i</code> and <code>j</code>, and there's a closure operator that very closely resembles the Floyd–Warshall algorithm (or matrix multiplication, which comes down to the same thing). Of course saying that the difference of two variables must be in some set isn't great for bitwise operations, so my first thought was to change it to a xor, with Miné's <a href="http://www.di.ens.fr/~mine/publi/article-mine-wing12.pdf">bitfield domain</a> (the one I've written several other posts about) as the Basis. For the closure operator, the multiplication is represented by the bitwise-xor from the bitfield domain, addition is bitwise-and from the bitfield domain. The rest is similarly easy.</p>
<p>The problem with that idea is that it tends finds almost no relations. It can represent some non-trivial relations, for example that one variable is (partially) equal to and other xor-ed with a constant, but it's just not enough.</p>
<p>The problem is that xor just doesn't remember enough. So the next idea I had was to abandon xor, and go with an operator that doesn't throw away anything, and use 4-tuples to represent that set instead of 2-tuples. Due to lack of inspiration, I named the elements of the tuple (a, b, c, d), and they work like this:
<table>
<tr><td width="25%">a</td><td>x = 0, y = 0</td></tr>
<tr><td>b</td><td>x = 0, y = 1</td></tr>
<tr><td>c</td><td>x = 1, y = 0</td></tr>
<tr><td>d</td><td>x = 1, y = 1</td></tr>
</table>
Meaning that if, say, bit 0 in a is set in the element at (x,y) then variables x and y can be even simultaneously. On the diagonal of the matrix, where x=y, b and c must both be zero, because a variable can not be unequal from itself.</p>
<p>The "multiplication" operator for the closure works like this:
<pre>aij &= aik & akj | bik & ckj
bij &= aik & bkj | bik & dkj
cij &= dik & ckj | cik & akj
dij &= dik & dkj | cik & bkj</pre>
For all <code>k</code>. This essentially "chains" <code>i,k</code> with <code>k,j</code> to get an implied constraint on <code>i,j</code>.</p>
<p>To show how powerful this is, here is what happens after asserting that <code>z = x & y</code>
<style>
#t_00 {
border: 1px solid black;
border-collapse: collapse;
}
#t_00 tr {
border: 1px solid black;
}
#t_00 td {
border: 1px solid black;
}
</style>
<table id="t_00">
<tr><td></td><td><center>x</center></td><td><center>y</center></td><td><center>z</center></td></tr>
<tr><td>x</td><td>-1,0,0,-1</td><td>-1,-1,-1,-1</td><td>-1,-1,0,-1</td></tr>
<tr><td>y</td><td>-1,-1,-1,-1</td><td>-1,0,0,-1</td><td>-1,-1,0,-1</td></tr>
<tr><td>z</td><td>-1,0,-1,-1</td><td>-1,0,-1,-1</td><td>-1,0,0,-1</td></tr>
</table>
And now some cool things can happen. For example, if you assert that <code>z</code> is odd, the closure operation will propagate that back to both <code>x</code> and <code>y</code>. Or if you assert that <code>w = x | z</code>, it will deduce that <code>w == x</code>.</p>
<p>There is a symmetry between the upper triangle and the lower triangle. They're not the same, but an element (a, b, c, d) is mirrored by an element (a, c, b, d). One of those triangles can easily be left out, to save some space. Elements that are (-1, -1, -1, -1) could also be left out, but that doesn't typically save many elements - if something non-relational is known about a variable (that is, there is an element on the diagonal that isn't (-1, 0, 0, -1)), then everything in the same row and everything in the same column will have that information in it as well. But an element <code>(i,j)</code> exactly equal to <code>(a<sub>i,i</sub> & a<sub>j,j</sub>, a<sub>i,i</sub> & d<sub>j,j</sub>, d<sub>i,i</sub> & a<sub>j,j</sub>, d<sub>i,i</sub> & d<sub>j,j</sub>)</code> can be left out. That removes all elements that describe a trivial relation of the kind that says that the combination of x and y must be in the Cartesian product of the sets that x and y can be in, which is completely redundant.</p>
<p>But it does have problems. If you xor (or add or subtract) two unconstrained variables, nothing happens. That sort of 3-variable relation isn't handled at all. This led me to explore a 3D variant of this domain (with 8-tuples), which is really great if you only consider its expressive power, but it likes to take a large amount of space, and its closure operator is very inefficient.</p>
<p>The algorithms I <a href="http://bitmath.blogspot.com/2013/08/addition-in-bitfield-domain.html">presented</a> <a href="http://bitmath.blogspot.com/2014/02/addition-in-bitfield-domain-alternative.html">earlier</a> for addition in the bitfield domain still work in this relational domain, and with just a couple of simple changes they can make use of some relational information. They want to know something about <code>p = x ^ y</code> and <code>g = x & y</code>, relational information can be available for both of them. For example, if it is known that <code>x & y == 0</code> (indicated by a 4-tuple (a, b, c, 0)), <code>g</code> will be known to be 0, all carries will be known to be 0, and two relations will be made that look exactly the same as if you had asserted <code>z = x | y</code>. Of course that only works if <code>x & y == 0</code> is known <i>before</i> the addition.</p>
<p>Assertions such as <code>x & y == 0</code> (more generally: x ⊗ y is in (z, o) where (z, o) is an element from the unrelational bitfield domain) can be done directly (without inventing a temporary to hold x ⊗ y and asserting on it), and actually that's simpler and far more efficient.</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-31849968245946718292013-09-28T07:10:00.000-07:002017-11-23T04:42:54.437-08:00Determining the cardinality of a set described by a mask and bounds<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/javascript">
MathJax.Hub.Config({
"HTML-CSS": {
scale: 175
}
});
</script>
<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
<p>In other words, calculating this
\begin{equation}
| \left\{ x | (x \& m) = v \land x \ge a \land x \le b \right\} |
\end{equation}</p>
<p>A <a href="http://en.wikipedia.org/wiki/Binary_decision_diagram">BDD</a> (which haroldbot uses to solve this) has no trouble with that at all, but the structure the problem is so nice that I thought it should be possible to do better. And it is, though the solution I'm about to present is probably far from optimal. I don't know how to significantly improve on it, but I just get a very non-scientific gut-feeling that there should be a fundamentally better way to do it.</p>
<p>The cardinality of the set without the bounds (only the mask) is obviously trivial to compute (as <code>1 << popcnt(~m)</code>), and the bounds divide that set in three parts: items that are lower than the lower bound ("left part"), items that are higher than the upper bound ("right part")), and items that are actual set of interest ("middle part"). The main idea it is built on is that it's relatively easy to solve the problem if it didn't have a lower bound and the upper bound was a power of two, with a formula roughly similar to the one above. Using that, all the powers of two that fit before lower bound can be counted from high to low, giving the cardinality of the "left part". The same can be done for the "right part", actually in exactly the same way, by complementing the mask and the upper bound. Obviously the cardinality of the "middle part" can be computed from this.</p>
<p>And here's the code. It doesn't like it when the cardinality is 2<sup>32</sup>, and watch out for weird corner cases such as when the lower bound is bigger than the upper bound (why would you even do that?). It usually works, that's about as much as I can say - I didn't prove it correct.
<pre class="brush: csharp">static uint Cardinality(uint m, uint v, uint a, uint b)
{
// count x such that a <= x <= b && (x & m) == v
// split in three parts:
// left = 0 <= x < a
// right = b < x <= -1
// the piece in the middle is (1 << popcnt(~m)) - left - right
uint left = 0;
uint x = 0;
for (int i = 32 - 1; i >= 0; i--)
{
uint mask = 1u << i;
if (x + mask <= a)
{
x += mask;
uint mask2 = 0 - mask;
if ((x - 1 & m & mask2) == (v & mask2))
{
uint amount = 1u << popcnt(~(m | mask2));
left += amount;
}
}
}
uint right = 0;
uint y = 0;
for (int i = 32 - 1; i >= 0; i--)
{
uint mask = 1u << i;
if (y + mask <= ~b)
{
y += mask;
uint mask2 = 0 - mask;
if ((y - 1 & m & mask2) == (~v & m & mask2))
{
uint amount = 1u << popcnt(~(m | mask2));
right += amount;
}
}
}
uint res = (uint)((1UL << popcnt(~m)) - (left + right));
return res;
}</pre>
The loops can be merged of course, but for clarity they're separate here.</p>
<p>If you have any improvements, please let me know.</p>
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-72980651341267922382013-08-10T07:50:00.000-07:002015-07-10T11:25:30.758-07:00Announcing haroldbot<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
haroldbot <i>was</i> an ircbot (hence the name) that solves some bitmath problems. The title is actually a lie - haroldbot has been around for a while now. But now it finally got its own website.<br/>
<span style="font-size: 150%"><a href="http://haroldbot.nl">haroldbot.nl</a></span><br/>
Check it out, if you work with bits you will probably find this useful.
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.comtag:blogger.com,1999:blog-1465986942435538208.post-51070387291933529082013-05-30T09:54:00.000-07:002017-12-24T12:19:15.307-08:00Carryless multiplicative inverse<span style="font-family: "Helvetica Neue", Arial, Helvetica, sans-serif;">
Note: this post is neither about the normal multiplicative inverse, nor the modular multiplicative inverse. <a href="http://bitmath.blogspot.com/2012/09/divisibility-and-modular-multiplication.html">This other post</a> has information about the modular multiplicative inverse, which might be what you were looking for.<br/><br/>
Mathematicians may call carryless multiplication "multiplication in GF(2^n)", but that doesn't explain how it works - recall the shift-and-add algorithm for multiplication:
<pre class="brush: csharp">static uint mul(uint a, uint b)
{
uint r = 0;
while (b != 0)
{
if ((a & 1) != 0)
r += b;
a >>= 1;
b <<= 1;
}
return r;
}</pre>
Carryless multiplication is a very simple variation on that: do the addition without carries. That's just a XOR.
<pre class="brush: csharp">static uint cl_mul(uint a, uint b)
{
uint r = 0;
while (b != 0)
{
if ((a & 1) != 0)
r ^= b; // carryless addition is xor
a >>= 1;
b <<= 1;
}
return r;
}</pre>
It has some applications in complicated cryptography related algorithms, but it also seems like this should be an interesting and powerful operation when working with bits, and it may well be, but I know of almost no uses for it (<del>besides, Intel's implementation is so slow that it often wouldn't help</del> Intel made it over twice as fast in Haswell). But anyway, let's just start with its basic properties: like normal multiplication, it's commutative and associative. It's also distributive, but over xor instead of over addition. None of this is very surprising.<br/>
<br/>
As an aside, using associativity, it can be shown that the parallel suffix with XOR (which does have some known uses in bitmath, for example in implementing <a href="http://programming.sirrida.de/bit_perm.html#c_e">compress_right</a> in software), code shown below, is equivalent to a carryless multiplication by -1.
<pre>// parallel suffix with XOR
x ^= x << 1;
x ^= x << 2;
x ^= x << 4;
x ^= x << 8;
x ^= x << 16;</pre>
Every step is clearly a carryless multiplication, by 3, 5, 17, 257, and 65537 respectively. So it's equivalent to:<br/><code>clmul(clmul(clmul(clmul(clmul(x, 3), 5), 17), 257), 65537)</code> which can be rearranged (using associativity) to:<br/><code>clmul(x, clmul(clmul(clmul(clmul(3, 5), 17), 257), 65537))</code> which works out to <code>clmul(x, -1)</code>. Of course it was supposed to work out that way, because every bit of the result should be the XOR of all bits up to (and including) that bit, but it's nice that it also follows easily from a basic property. Incidentally if you have a full-width carryless multiplication, multiplying by -1 also computes the parallel <i>prefix</i> with XOR in the upper bits (the upper bit of the low word, which is the parity of the input, is shared by the suffix and the prefix.)<br/>
<br/>
Carryless multiplication also shares an other property with normal multiplication: there are multiplicative inverses modulo 2<sup>n</sup> (and also modulo other numbers, but 2<sup>n</sup> is of particular interest since we're working in that by default anyway). Again there are only inverses for odd numbers, and it's equally obvious (as for normal multiplication) why that should be so - an even multiplier will throw out at least one high order bit. First, here's an example of carrlessly multiplying <code>x</code> by -1 and then carrylessly multiplying that by 3.<br/>
<pre>x = {d}{c}{b}{a} // the letters are bits
y = cl_mul(x, -1) = {d^c^b^a}{c^b^a}{b^a}{a}
z = cl_mulinv(-1) = 0011
cl_mul(y, z) = {d^c^b^a ^ c^b^a}{c^b^a ^ b^a}{b^a ^ a}{a} = {d}{c}{b}{a}</pre>
Ok, so that worked out well, and it also gives part the answer to exercise 3 in chapter 5 of Hacker's Delight (about whether parallel prefix/suffix with XOR is invertible and how) because a carryless multiplication by -1 is the same as the parallel suffix with XOR. A carryless multiplication of <code>y</code> by 3 is of course just <code>y ^ (y << 1)</code>.<br/>
<br/>
But back to actually computing the inverse. The inverse had better be odd, so bit 0 is already known, and for all the other bits follow these steps
<ol>
<li>if the remainder is 1, stop</li>
<li>if bit <code>k</code> is 0, go to step 5</li>
<li>set bit <code>k</code> in the inverse</li>
<li>xor the remainder with <code>input << k</code></li>
<li>increment k and go to step 1</li>
</ol>
Step 4 always resets the offending bit because the input had to be odd, so it's obvious that the remainder always ends up being 1 eventually, and so the algorithm always terminates. Moreover, even in the worst case it only has to process every bit but one, and continuing after the remainder becomes 1 simply does nothing, so step 1 could read "if <code>k</code> is 32" (or some other number, depending on how many bits your ints are wide), which is easier to unroll and better suited for a hardware implementation (not that I've seen this operation implemented in hardware anywhere).<br/>
For example, in C# it could look like this:
<pre class="brush: csharp">static uint clmulinv(uint x)
{
uint inv = 1;
uint rem = x;
for (int i = 1; i < 32; i++)
{
if (((rem >> i) & 1) != 0)
{
rem ^= x << i;
inv |= 1u << i;
}
}
return inv;
}</pre>
<br/>
A variation of the algorithm to find a multiplicative inverse modulo a power of two (see <tt>inv</tt> <a href="http://bitmath.blogspot.com/2012/09/divisibility-and-modular-multiplication.html">here</a>) also works, which is useful when clmul is fast:
<pre class="brush: csharp">static uint clmulinv(uint d)
{
uint x = 1;
for (int i = 0; i < 5; i++)
{
x = clmul(x, clmul(x, d));
}
return x;
}</pre>
<br/>
The first iteration sets x to d, that can be done immediately to skip an iteration.
<br/>
<br/>
Some sample inverses
<pre>1 0x00000001
3 0xFFFFFFFF
5 0x55555555
7 0xDB6DB6DB
9 0x49249249
11 0x72E5CB97
13 0xD3A74E9D
15 0x33333333</pre>
<br/>
<p>The definition of <tt>clmul</tt> at the start of the post was meant to be just that, a faster way to emulate it is this:
<pre class="brush: csharp">static uint clmul(uint a, uint b)
{
uint r = 0;
do
{
r ^= a * (b & (0 - b));
b &= b - 1;
r ^= a * (b & (0 - b));
b &= b - 1;
r ^= a * (b & (0 - b));
b &= b - 1;
r ^= a * (b & (0 - b));
b &= b - 1;
} while (b != 0);
return r;
}</pre>This works by extracting a bit from <tt>b</tt> and multiplying by it (which just shifts <tt>a</tt> left), then resetting that bit. This can be unrolled safely since when <tt>b == 0</tt>, no further changes are made to <tt>r</tt> automatically. The <tt>0 - b</tt> thing is due to an unfortunate misfeature of C#, negating unsigned integers converts it to a <tt>long</tt>.</p>
<p>A similar trick works for the inverse:
<pre class="brush: csharp">static uint clmulinv(uint x)
{
uint inv = 1;
uint rem = x - 1;
while (rem != 0)
{
uint m = rem & (0 - rem);
rem ^= x * m;
inv += m;
}
return inv;
}</pre></p>
<br/>
<br/>
By the way, why is this post so popular? Please let me know in the comments down below.
</span>Haroldhttp://www.blogger.com/profile/16934800558256607460noreply@blogger.com