Tuesday, 3 January 2023

Weighted popcnt

Since I showed the code below on Twitter, and some people understood it and some didn't, I suppose I should explain how it works and what the more general technique is.

// sum of indexes of set bits
int A073642(uint64_t n)
    return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
          (__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);

A sum of several multi-digit numbers, such as 12 + 34, can be rewritten to make the place-value notation explicit, giving 1*10 + 2*1 + 3*10 + 4*1. Then the distributive property of multiplication can be used to group digits of equal place-value together, giving (1 + 3)*10 + (2 + 4)*1. I don't bring up something so basic as an insult to the reader, the basis of that possibly-intimidating piece of code really is this basic.

The masks 0xAAAAAAAAAAAAAAAA, 0xCCCCCCCCCCCCCCCC, etc, represent nothing more than the numbers 0..63, each written vertically in binary, starting with the least significant bit. Each column of bits contains its own index. AND-ing the masks with n leaves, in each column, either zero (if the corresponding bit of n was zero) or the index of that column.

The first popcnt then adds together all the digits of the "filtered" indexes with a place-value of 1, the second popcnt adds together the digits with a place-value of 2, and so on. The result of each popcnt is shifted to match the corresponding place-value, and all of them are added together.

Generalizing the technique

The numbers in the columns did not have to be 0..63, the columns can contain arbitrary numbers, even negative numbers. This gives some sort of weighted popcnt: rather than each bit having a weight of 1, we are free to choose an arbitrary weight for each bit.

In general, starting with some arbitrary numbers, interpret them as a binary matrix. Transpose that matrix. Every row of the transposed matrix is the mask that goes into the corresponding popcnt-step.

Some simplifications may be possible:

  • If a row of the matrix is zero, the corresponding step can be left out.
  • If two or more rows are equal, their popcnt-steps can be merged into one with a weight that is the sum of the weights of the steps that are merged.
  • If a row is a linear combination of some set of other rows, the popcnt corresponding to that row can be computed as a linear combination of the popcnts corresponding to those other rows.
  • If a row has exactly one bit set, its step can be implemented with an AND and a shift, without popcnt, by shifting the bit to its target position.
  • If every row has exactly one bit set, we're really dealing with a bit-permutation and there are better ways to implement them.

For example, let's construct "the sum of squares of indexes of set bits" (with 1-indexed indexes) aka oeis.org/A003995. Transposing the squares 1,2,4,9..4096 gives these masks:


The second mask is zero because a square cannot be congruent to 2 or 3 modulo 4. The last mask has exactly one bit set, so its step can be implemented without popcnt. Putting it together:

int A003995(uint64_t n)
    return __popcnt64(n & 0x5555555555555555) +
          (__popcnt64(n & 0x2222222222222222) << 2) +
          (__popcnt64(n & 0x1414141414141414) << 3) +
          (__popcnt64(n & 0x0d580d580d580d58) << 4) +
          (__popcnt64(n & 0x0335566003355660) << 5) +
          (__popcnt64(n & 0x00f332d555a66780) << 6) +
          (__popcnt64(n & 0x555a5b6666387800) << 7) +
          (__popcnt64(n & 0x66639c78783f8000) << 8) +
          (__popcnt64(n & 0x787c1f807fc00000) << 9) +
          (__popcnt64(n & 0x7f801fff80000000) << 10) +
          (__popcnt64(n & 0x7fffe00000000000) << 11) +
          ((n & 0x8000000000000000) >> 51);

Aside from implementing strange sequences from the OEIS, possible uses of this technique may include

  • Summing the face-values of a hand of cards. Unfortunately this cannot take "special combinations" into account, unlike other techniques.
  • A basic evaluation of a board state in Reversi/Othello, based on weights for each captured square that differ by their position.
  • Determining the price for a pizza based on a set of selected toppings, I don't know, I'm grasping at straws here.

Friday, 22 July 2022

Bit-level commutativity and "sum gap" revisited

In an earlier post, bit-level commutativity, it was shown that addition is not only invariant under swapping the operands, but also under swapping individual bits between the operands, provided that they are of equal place-value. For example the least significant bits of the operands of a sum could be exchanged without changing the value of the sum. Bit-level commutativity of addition implies that A + B = (A & B) + (A | B).

In the range a sum cannot fall within, it was shown that the "gap" which a sum cannot fall within can be widened from: $$A + B \begin{cases} \ge \max(A,B) & \text{ if addition does not carry} \\ \le \min(A,B) & \text{ if addition does carry} \end{cases}\tag{Eq1}$$ To: $$A + B \begin{cases} \ge A\:|\:B & \text{ if addition does not carry} \\ \le A\:\&\:B & \text{ if addition does carry} \end{cases}\tag{Eq2}$$ This makes the gap wider because A & B is often less than min(A, B), and A | B is often greater than max(A, B).

To start with, let's assume that $$A + B \begin{cases} \ge A & \text{ if addition does not carry} \\ \le A & \text{ if addition does carry} \end{cases}\tag{Eq3}$$ Since addition is commutative, the roles of A and B could be exchanged without affecting the truth of equation Eq3, so this must also hold: $$A + B \begin{cases} \ge B & \text{ if addition does not carry} \\ \le B & \text{ if addition does carry} \end{cases}\tag{Eq4}$$ If A + B is greater than or equal to both A and B, then it must also be greater than and equal to max(A, B), and a similar argument applies to the upper bound. Therefore, Eq1 follows from Eq3.

Substituting A and B with A & B and A | B (respectively) in Eq3 and Eq4 gives $$(A\:\&\:B) + (A\:|\:B) \begin{cases} \ge (A\:\&\:B) & \text{ if addition does not carry} \\ \le (A\:\&\:B) & \text{ if addition does carry} \end{cases}\tag{Eq5}$$ And $$(A\:\&\:B) + (A\:|\:B) \begin{cases} \ge (A\:|\:B) & \text{ if addition does not carry} \\ \le (A\:|\:B) & \text{ if addition does carry} \end{cases}\tag{Eq6}$$ Picking the "does carry" case from Eq5 and the "does not carry" case from Eq6, and using that A + B = (A & B) + (A | B), gives Eq2.

Saturday, 10 July 2021

Integer promotion does not help performance

There is a rule in the C language which roughly says that arithmetic operations on short integer types implicitly convert their operands to normal-sized integers, and also give their result as a normal-sized integer. For example in C:

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

Various other languages have similar rules. For example C#, the specification of which is not as jargon-infested as the C specifiction:

In the case of integral types, those operators (except the ++ and -- operators) are defined for the int, uint, long, and ulong types. When operands are of other integral types (sbyte, byte, short, ushort, or char), their values are converted to the int type, which is also the result type of an operation.

There may be various reasons to include such a rule in a programming language (and some reasons not to), but one that is commonly mentioned is that "the CPU prefers to work at its native word size, other sizes are slower". There is just one problem with that: it is based on an incorrect assumption about how the compiler will need to implement "narrow arithmetic".

To give that supposed reason the biggest benefit that I can fairly give it, I will be using MIPS for the assembly examples. MIPS completely lacks narrow arithmetic operations.

Implementing narrow arithmetic as a programmer

Narrow arithmetic is often required, even though various languages make it a bit cumbersome. C# and Java both demand that you explicitly convert the result back to a narrow type. Despite that, code that needs to perform several steps of narrow arithmetic is usually not littered with casts. The usual pattern is to do the arithmetic without intermediate casts, then only in the end use one cast just to make the compiler happy. In C, even that final cast is not necessary.

For example, let's reverse the bits of a byte in C#. This code was written by Igor Ostrovsky, in his blog post Programming job interview challenge. It's not a unique or special case, and I don't mean that negatively: it's good code that anyone proficient could have written, a job well done. Code that senselessly casts back to byte after every step is also sometimes seen, perhaps because in that case, the author does not really understand what they are doing.

// Reverses bits in a byte 
static byte Reverse(byte b)
    int rev = (b >> 4) | ((b & 0xf) << 4);
    rev = ((rev & 0xcc) >> 2) | ((rev & 0x33) << 2);
    rev = ((rev & 0xaa) >> 1) | ((rev & 0x55) << 1); 

    return (byte)rev;

Morally, all of the operations in this function are really narrow operations, but C# cannot express that. A special property of this code is that none of the intermediate results exceed the limits of a byte, so in a language without integer promotion it could be written in much the same way, but without going through int for the intermediate results.

Implementing narrow arithmetic as a compiler

The central misconception that (I think) gave rise to the myth that integer promotion helps performance, is the assumption that without integer promotion, the compiler must implement narrow operations by inserting an explicit narrowing operation after every arithmetic operation. But that's not true, a compiler for a language that lacks integer promotion can use the same approach that programmers use to implement narrow arithmetic in languages that do have integer promotion. For example, what if two bytes were added together (from memory, and storing the result in memory) in a hypothetical language that lacks integer promotion, and what if that code was compiled for MIPS? The assumption is that it will cost an additional operation to get rid of the "trash bits", but it does not:

        lbu     $2,0($4)
        lbu     $3,0($5)
        addu    $2,$2,$3
        sb      $2,0($4)

The sb instruction does not care about any "trash" in the upper 24 bits, those bits simply won't be stored. This is not a cherry-picked case. Even if there were more arithmetic operations, in most cases the "trash" in the upper bits could safely be left there, being mostly isolated from the bits of interest by the fact that carries only propagate from the least significant bit up, never down. For example, let's throw in a multiplication by 42 and a shift-left by 3 as well for good measure:

        lbu     $6,0($4)
        li      $3,42
        lbu     $2,0($5)
        mul     $5,$3,$6
        addu    $2,$5,$2
        sll     $2,$2,3
        sb      $2,0($4)

What is true, is that before some operations, the trash in the upper bits must be cleared. For example before a division, shift-right, or comparison. That is not an exhaustive list, but the list of operations that require the upper bits to be clean is shorter (and have less "total frequency") than the list of operations that do not require that, see for example which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? "Before some operations" is not the same thing as "after every operation", but that still sounds like an additional cost. However, the trash-clearing operations that a compiler for a language that lacks integer promotion would have to insert are not additional operations: they are the same ones that a programmer would write explicitly in a language with integer promotion.

It may be possible to construct a contrived case in which a human would know that the high bits of an integer are clean, while a compiler would struggle to infer that. For example, a compiler may have more trouble reasoning about bits that were cancelled by an XOR, or worse, by a multiplication. Such cases are not likely to be the reason behind the myth. A more likely reason is that many programmers are not as familiar with basic integer arithmetic as they perhaps ought to be.

So integer promotion is useless?

Integer promotion may prevent accidental use of narrow operations where wide operations were intended, whether that is worth it is another question. All I wanted to say with this post, is that "the CPU prefers to work at its native word size" is a bogus argument. Even when it is true, it is irrelevant.

Wednesday, 9 June 2021

Partial sums of blsi and blsmsk

blsi is an x86 operation which extracts the rightmost set bit from a number, it can be implemented efficiently in terms of more familiar operations as i&-i. blsmsk is a closely related operation which extracts the rightmost set bit and also "smears" it right, setting the bits to the right of that bit as well. blsmsk can be implemented as i^(i-1). The smearing makes the result of blsmsk almost (but not quite) twice as high as the result of blsi for the same input: blsmsk(i) = floor(blsi(i) * (2 - ฮต)).

The partial sums of blsi and blsmsk can be defined as b(n) = sum(i=1, n, blsi(i)) and a(n) = sum(i=1, n, blsmsk(i)) respectively. These sequences are on OEIS as A006520 and A080277. Direct evaluation of those definitions would be inefficient, is there a better way?

Unrecursing the recursive definition

Let's look at the partial sums of blsmsk first. Its OEIS entry suggests the recursion below, which is already significantly better than the naive summation:

a(1) = 1, a(2*n) = 2*a(n) + 2*n, a(2*n+1) = 2*a(n) + 2*n + 1

A "more code-ish"/"less mathy" way to implement that could be like this:

int PartialSumBLSMSK(int i) {
    if (i == 1)
        return 1;
    return (PartialSumBLSMSK(i >> 1) << 1) + i;

Let's understand what this actually computes, and then find an other way to do it. The overal big picture of the recursion is that n is being shifted right on the way "down", and the results of the recursive calls are being shifted left on the way "up", in a way that cancels each other. So in total, what happens is that a bunch of "copies" of n is added up, except that at the kth step of the recursion, the kth bit of n is reset.

This non-tail recursion can be turned into tail-recursion using a standard trick: turning the "sum on the way"-logic into "sum on the way down", by passing two accumulators in extra arguments, one to keep track of the sum, and an other to keep track of how much to multiply i by:

int PartialSumBLSMSKTail(int i, int accum_a, int accum_m) {
    if (i <= 1)
        return accum_a + i * accum_m;
    return PartialSumBLSMSKTail(i >> 1, accum_a + i * accum_m, accum_m * 2);

Such a tail-recursive function is then simple to transform into a loop. Shockingly, GCC (even quite old versions) manages to compile the original non-tail recursive function into a loop as well without much help (just changing the left-shift into the equivalent multiplication), although some details differ.

Anyway, let's put in a value and see what happens. For example, if the input was 11101001, then the following numbers would be added up:

11101001 (reset bit 0)
11101000 (reset bit 1, it's already 0)
11101000 (reset bit 2)
11101000 (reset bit 3)
11100000 (reset bit 4)
11100000 (reset bit 5)
11000000 (reset bit 6)
10000000 (base case)

Look at the columns of the matrix of numbers above, the column for bit 3 has four ones in it, the column for bit 5 has six ones in it. If bit k is set bit in n, there is a pattern that that bit is set in k+1 rows.

Using the pattern

Essentially what that pattern means, is that a(n) can be expressed as the dot-product between n viewed as a vector of bits (weighted according to their position) (๐“ท), and a constant vector (๐“ฌ) with entries 1, 2, 3, 4, etc, up to the size of the integer. For example for n=5, ๐“ท would be (1, 0, 4), and the dot-product with ๐“ฌ would be 13. A dot-product like that can be implemented with some bitwise trickery, by using bit-slicing. The trick there is that instead of multiplying the entries of ๐“ท by the entries of ๐“ฌ directly, we multiply the entries of ๐“ท by the least-significant bits of the entries of ๐“ฌ, and then seperately multiply it by all the second bits of the entries of ๐“ฌ, and so on. Multiplying every entry of ๐“ท at once by a bit of an entry of ๐“ฌ can be implemented using just a bitwise-AND operation.

Although this trick lends itself well to any vector ๐“ฌ, I will use 0,1,2,3.. and add an extra n separately (this corresponds to factoring out the +1 that appears at the end of the recursive definition), because that way part of the code can be reused directly by the solution of the partial sums of blsi (and also because it looks nicer). The masks that correspond to the chosen vector ๐“ฌ are easy to compute: each column across the masks is an entry of that vector. In this case, for 32bit integers:

c0 10101010101010101010101010101010
c1 11001100110011001100110011001100
c2 11110000111100001111000011110000
c3 11111111000000001111111100000000
c4 11111111111111110000000000000000

The whole function could look like this:

int PartialSumBLSMSK(int n)
    int sum = n;
    sum += (n & 0xAAAAAAAA);
    sum += (n & 0xCCCCCCCC) << 1;
    sum += (n & 0xF0F0F0F0) << 2;
    sum += (n & 0xFF00FF00) << 3;
    sum += (n & 0xFFFF0000) << 4;
    return sum;

PartialSumBLSI works almost the same way, with its recursive formula being b(1) = 1, b(2n) = 2b(n) + n, b(2n+1) = 2b(n) + n + 1. The +1 can be factored out as before, and the other part (n instead of 2*n) is exactly half of what it was before. Dividing ๐“ฌ by half seems like a problem, but it can be done implicitly by shifting the bit-slices of the product to the right by 1 bit. There are no problems with bits being lost that way, because the least significant bit is always zero in this case (๐“ฌ has zero as its first element).

int PartialSumBLSI(int n)
    int sum = n;
    sum += (n & 0xAAAAAAAA) >> 1;
    sum += (n & 0xCCCCCCCC);
    sum += (n & 0xF0F0F0F0) << 1;
    sum += (n & 0xFF00FF00) << 2;
    sum += (n & 0xFFFF0000) << 3;
    return sum;

Wrapping up

The particular set of constants I used is very useful and appears in more tricks, such as collecting indexes of set bits. They are the bitwise complements of a set of masks that Knuth (in The Art of Computer Programming volume 4, section 7.1.3) calls "magic masks", labeled ยตk.

This post was inspired by this question on Stack Overflow and partially based on my own answer and the answer of Eric Postpischil, without which I probably would not have come up with any of this, although I used a used a different derivation and explanation for this post.

Saturday, 26 September 2020

The range a sum cannot fall within

Throughout this post, the variables A, B, and R are used, with R defined as R = A + B, and A ≤ B. Arithmetic in this post is unsigned and modulo 2k. Note that A ≤ B is not a restriction on the input, it is a choice to label the smaller input as A and the larger input as B. Addition is commutative, so this choice can be made without loss of generality.

R < A || R ≥ B

The sum is less than A iff the addition wraps (1), otherwise it has to be at least B (2).

  1. B cannot be so high that the addition can wrap all the way up to or past A. To make A + B add up to A, B would have had to be 2k, which is one beyond the maximum value it can be. R = A is possible only if B is zero, in which case R ≥ B holds instead.
  2. Since A is at least zero, in the absence of wrapping there is no way to reduce the value below the inputs.

Perhaps that all looks obvious, but this has a useful application: if the carry-out of the addition is not available, it can be computed via carry = (x + y) < x, which is a relatively well-known trick. It does not matter which of x or y is the smaller or larger input, the sum cannot fall within the "forbidden zone" between them. The occasionally seen carry = (x + y) < max(x, y) adds an unnecessary complication.

R < (A & B) || R ≥ (A | B)

This is a stronger statement, because A & B is usually smaller than A and A | B is usually greater than B.

If no wrapping occurs, then R ≥ (A | B). This can be seen for example by splitting the addition into a XOR and adding the carries separately, (A + B) = (A ^ B) + (A & B) * 2, while bitwise OR can be decomposed similarly into (A | B) = (A ^ B) + (A & B)(see below). Since there is no wrapping (by assumption), (A & B) * 2 ≥ (A & B) and therefore (A + B) ≥ (A | B). Or, with less algebra: addition sometimes produces a zero where the bitwise OR produces a one, but then addition compensates doubly for it by carrying into the next position.

For the case in which wrapping occurs I will take a bit-by-bit view. In order to wrap, the carry out of bit k-1 must be 1. In order for the sum to be greater than or equal to A & B, bit k-1 of the sum must be greater than or equal to bit k-1 of A & B. That combination means that the carry into bit k-1 of the sum must have been 1 as well. Furthermore, bit k-1 of the sum can't be greater than bit k-1 of A & B, at most it can be equal, which means bit k-2 must be examined as well. The same argument applies to bit k-2 and so on, until finally for the least-significant bit it becomes impossible for it to be carried into, so the whole thing falls down: by contradiction, A + B must be less than A & B when the sum wraps.

What about (A | B) = (A ^ B) + (A & B) though?

The more obvious version is (A | B) = (A ^ B) | (A & B), compensating for the bits reset by the XOR by ORing exactly those bits back in. Adding them back in also works, because the set bits in A ^ B and A & B are disjoint: a bit being set in the XOR means that exactly one of the input bits was set, which makes their AND zero.

Monday, 3 August 2020

Why does AND distribute over XOR

AND distributes over XOR, unsurprisingly both from the left and right, that is:

x & y ^ z & y == (x ^ z) & y
x & y ^ x & z == x & (y ^ z)
a & c ^ a & d ^ b & c ^ b & d == (a ^ b) & (c ^ d)
A somewhat popular explanation for why is,
Conjunction and exclusive or form the multiplication and addition operations of a field GF(2), and as in any field they obey the distributive law.

Which is true and a useful way to think about it, but it is also the type of backwards explanation that relies on a concept that is more advanced than the thing which is being explained.

Diagrams with crossing lines

Let's represent an expression such as a & c ^ a & d ^ b & c ^ b & d by putting the variables on the left of every AND along the top of a grid, and the variables on the right of every AND along the side. Then for example the grid cell on the intersection between the column of a and the row of c corresponds to the term a & c. Further, let's draw lines for variables that are True, in this example all variables are True:

The overall expression a & c ^ a & d ^ b & c ^ b & d counts the number of crossings, modulo 2. Rather than counting the crossings one by one, the number of crossings could be computed by counting how many variables along the top are True, how many along the side are True, and taking the product, again modulo 2. A sum modulo 2 is XOR and a product modulo 2 is AND, so this gives the equivalent expression (a ^ b) & (c ^ d).

The simpler cases x & y ^ z & y and x & y ^ x & z correspond to 1x2 and 2x1 diagrams.

Diagrams with bites taken out of them

Such a diagram with a section of it missing can be dealt with by completing the grid and subtracting the difference. For example the unwieldy a & e ^ a & f ^ a & g ^ a & h ^ b & e ^ b & f ^ b & g ^ b & h ^ c & e ^ c & f ^ d & e ^ d & f (shown in the diagram below) is "incomplete", it misses the 2x2 square that corresponds to (c ^ d) & (g ^ h). Completing the grid and subtracting the difference gives ((a ^ b ^ c ^ d) & (e ^ f ^ g ^ h)) ^ ((c ^ d) & (g ^ h)), which is correct.

This all has a clear connection to the FOIL method and its generalizations, after all conjunction and exclusive or form the multiplication and addition operations of a field GF(2).

The same diagrams also show why AND distributes over OR (the normal, inclusive, OR), which could alternatively be explained in terms of the Boolean semiring.

Sunday, 3 May 2020

Information on incrementation

Defining increment

Just to avoid any confusion, the operation that this post is about is adding 1 (one) to a value: $$\text{increment}(x) = x + 1$$ Specifically, performing that operation in the domain of bit-vectors.

Incrementing is very closely related to negating. After all, -x = ~x + 1 and therefore x + 1 = -~x, though putting it that way feel oddly reversed to me.

Bit-string notation

In bit-string notation (useful for analysing compositions of operations at the bit level), increment can be represented as: $$a01^k + 1 = a10^k$$

An "English" interpretation of that form is that an increment carries through the trailing set bits, turning them to zero, and then carries into the right-most unset bit, setting it.

That "do something special with the right-most unset bit" aspect of increment is the basis for various right-most bit manipulations, some of which were implemented in AMD Trailing Bit Manipulation (TBM) (which has been discontinued).

For example, the right-most unset bit in x can be set using x | (x + 1), which has a nice symmetry with the more widely known trick for unsetting the right-most set bit, x & (x - 1).

Increment by XOR

As was the case with negation, there is a way to define increment in terms of XOR. The bits that flip during an increment are all the trailing set bits and the right-most unset bit, the TBM instruction for which is BLCMSK. While that probably does not seem very useful yet, the fact that x ^ (x + 1) takes the form of some number of leading zeroes followed by some number of trailing ones turns out to be useful.

Suppose one wants to increment a bit-reversed integer, a possible (and commonly seen) approach is looping of the bits from top the bottom and implementing the "carry through the ones, into the first zero" logic by hand. However, if the non-reversed value was also available (let's call it i), the bit-reversed increment could be implemented by calculating the number of ones in the mask as tzcnt(i + 1) + 1 (or popcnt(i ^ (i + 1))) and forming a mask with that number of ones located at the desired place within an integer:

// i   = normal counter
// rev = bit-reversed counter
// N   = 1 << number_of_bits
int maskLen = tzcnt(i + 1) + 1;
rev ^= N - (N >> maskLen);
That may still not seem useful, but this enables an implementation of the bit-reversal permutation (not a bit-reversal itself, but the permutation that results from bit-reversing the indices). The bit-reversal permutation is sometimes used to re-order the result of a non-auto-sorting Fast Fourier Transform algorithm into the "natural" order. For example,
// X = array of data
// N = length of X, power of two
for (uint32_t i = 0, rev = 0; i < N; ++i)
    if (i < rev)
        swap(X[i], X[rev]);
    int maskLen = tzcnt(i + 1) + 1;
    rev ^= N - (N >> maskLen);
This makes no special effort to be cache-efficient.