Saturday, 10 July 2021

Integer promotion does not help performance

There is a rule in the C language which roughly says that arithmetic operations on short integer types implicitly convert their operands to normal-sized integers, and also give their result as a normal-sized integer. For example in C:

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

Various other languages have similar rules. For example C#, the specification of which is not as jargon-infested as the C specifiction:

In the case of integral types, those operators (except the ++ and -- operators) are defined for the int, uint, long, and ulong types. When operands are of other integral types (sbyte, byte, short, ushort, or char), their values are converted to the int type, which is also the result type of an operation.

There may be various reasons to include such a rule in a programming language (and some reasons not to), but one that is commonly mentioned is that "the CPU prefers to work at its native word size, other sizes are slower". There is just one problem with that: it is based on an incorrect assumption about how the compiler will need to implement "narrow arithmetic".

To give that supposed reason the biggest benefit that I can fairly give it, I will be using MIPS for the assembly examples. MIPS completely lacks narrow arithmetic operations.

Implementing narrow arithmetic as a programmer

Narrow arithmetic is often required, even though various languages make it a bit cumbersome. C# and Java both demand that you explicitly convert the result back to a narrow type. Despite that, code that needs to perform several steps of narrow arithmetic is usually not littered with casts. The usual pattern is to do the arithmetic without intermediate casts, then only in the end use one cast just to make the compiler happy. In C, even that final cast is not necessary.

For example, let's reverse the bits of a byte in C#. This code was written by Igor Ostrovsky, in his blog post Programming job interview challenge. It's not a unique or special case, and I don't mean that negatively: it's good code that anyone proficient could have written, a job well done. Code that senselessly casts back to byte after every step is also sometimes seen, perhaps because in that case, the author does not really understand what they are doing.

// Reverses bits in a byte 
static byte Reverse(byte b)
{
    int rev = (b >> 4) | ((b & 0xf) << 4);
    rev = ((rev & 0xcc) >> 2) | ((rev & 0x33) << 2);
    rev = ((rev & 0xaa) >> 1) | ((rev & 0x55) << 1); 

    return (byte)rev;
}

Morally, all of the operations in this function are really narrow operations, but C# cannot express that. A special property of this code is that none of the intermediate results exceed the limits of a byte, so in a language without integer promotion it could be written in much the same way, but without going through int for the intermediate results.

Implementing narrow arithmetic as a compiler

The central misconception that (I think) gave rise to the myth that integer promotion helps performance, is the assumption that without integer promotion, the compiler must implement narrow operations by inserting an explicit narrowing operation after every arithmetic operation. But that's not true, a compiler for a language that lacks integer promotion can use the same approach that programmers use to implement narrow arithmetic in languages that do have integer promotion. For example, what if two bytes were added together (from memory, and storing the result in memory) in a hypothetical language that lacks integer promotion, and what if that code was compiled for MIPS? The assumption is that it will cost an additional operation to get rid of the "trash bits", but it does not:

        lbu     $2,0($4)
        lbu     $3,0($5)
        addu    $2,$2,$3
        sb      $2,0($4)

The sb instruction does not care about any "trash" in the upper 24 bits, those bits simply won't be stored. This is not a cherry-picked case. Even if there were more arithmetic operations, in most cases the "trash" in the upper bits could safely be left there, being mostly isolated from the bits of interest by the fact that carries only propagate from the least significant bit up, never down. For example, let's throw in a multiplication by 42 and a shift-left by 3 as well for good measure:

        lbu     $6,0($4)
        li      $3,42
        lbu     $2,0($5)
        mul     $5,$3,$6
        addu    $2,$5,$2
        sll     $2,$2,3
        sb      $2,0($4)

What is true, is that before some operations, the trash in the upper bits must be cleared. For example before a division, shift-right, or comparison. That is not an exhaustive list, but the list of operations that require the upper bits to be clean is shorter (and have less "total frequency") than the list of operations that do not require that, see for example which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? "Before some operations" is not the same thing as "after every operation", but that still sounds like an additional cost. However, the trash-clearing operations that a compiler for a language that lacks integer promotion would have to insert are not additional operations: they are the same ones that a programmer would write explicitly in a language with integer promotion.

It may be possible to construct a contrived case in which a human would know that the high bits of an integer are clean, while a compiler would struggle to infer that. For example, a compiler may have more trouble reasoning about bits that were cancelled by an XOR, or worse, by a multiplication. Such cases are not likely to be the reason behind the myth. A more likely reason is that many programmers are not as familiar with basic integer arithmetic as they perhaps ought to be.

So integer promotion is useless?

Integer promotion may prevent accidental use of narrow operations where wide operations were intended, whether that is worth it is another question. All I wanted to say with this post, is that "the CPU prefers to work at its native word size" is a bogus argument. Even when it is true, it is irrelevant.

Wednesday, 9 June 2021

Partial sums of blsi and blsmsk

blsi is an x86 operation which extracts the rightmost set bit from a number, it can be implemented efficiently in terms of more familiar operations as i&-i. blsmsk is a closely related operation which extracts the rightmost set bit and also "smears" it right, setting the bits to the right of that bit as well. blsmsk can be implemented as i^(i-1). The smearing makes the result of blsmsk almost (but not quite) twice as high as the result of blsi for the same input: blsmsk(i) = floor(blsi(i) * (2 - ฮต)).

The partial sums of blsi and blsmsk can be defined as b(n) = sum(i=1, n, blsi(i)) and a(n) = sum(i=1, n, blsmsk(i)) respectively. These sequences are on OEIS as A006520 and A080277. Direct evaluation of those definitions would be inefficient, is there a better way?

Unrecursing the recursive definition

Let's look at the partial sums of blsmsk first. Its OEIS entry suggests the recursion below, which is already significantly better than the naive summation:

a(1) = 1, a(2*n) = 2*a(n) + 2*n, a(2*n+1) = 2*a(n) + 2*n + 1

A "more code-ish"/"less mathy" way to implement that could be like this:

int PartialSumBLSMSK(int i) {
    if (i == 1)
        return 1;
    return (PartialSumBLSMSK(i >> 1) << 1) + i;
}

Let's understand what this actually computes, and then find an other way to do it. The overal big picture of the recursion is that n is being shifted right on the way "down", and the results of the recursive calls are being shifted left on the way "up", in a way that cancels each other. So in total, what happens is that a bunch of "copies" of n is added up, except that at the kth step of the recursion, the kth bit of n is reset.

This non-tail recursion can be turned into tail-recursion using a standard trick: turning the "sum on the way"-logic into "sum on the way down", by passing two accumulators in extra arguments, one to keep track of the sum, and an other to keep track of how much to multiply i by:

int PartialSumBLSMSKTail(int i, int accum_a, int accum_m) {
    if (i <= 1)
        return accum_a + i * accum_m;
    return PartialSumBLSMSKTail(i >> 1, accum_a + i * accum_m, accum_m * 2);
}

Such a tail-recursive function is then simple to transform into a loop. Shockingly, GCC (even quite old versions) manages to compile the original non-tail recursive function into a loop as well without much help (just changing the left-shift into the equivalent multiplication), although some details differ.

Anyway, let's put in a value and see what happens. For example, if the input was 11101001, then the following numbers would be added up:

11101001 (reset bit 0)
11101000 (reset bit 1, it's already 0)
11101000 (reset bit 2)
11101000 (reset bit 3)
11100000 (reset bit 4)
11100000 (reset bit 5)
11000000 (reset bit 6)
10000000 (base case)

Look at the columns of the matrix of numbers above, the column for bit 3 has four ones in it, the column for bit 5 has six ones in it. If bit k is set bit in n, there is a pattern that that bit is set in k+1 rows.

Using the pattern

Essentially what that pattern means, is that a(n) can be expressed as the dot-product between n viewed as a vector of bits (weighted according to their position) (๐“ท), and a constant vector (๐“ฌ) with entries 1, 2, 3, 4, etc, up to the size of the integer. For example for n=5, ๐“ท would be (1, 0, 4), and the dot-product with ๐“ฌ would be 13. A dot-product like that can be implemented with some bitwise trickery, by using bit-slicing. The trick there is that instead of multiplying the entries of ๐“ท by the entries of ๐“ฌ directly, we multiply the entries of ๐“ท by the least-significant bits of the entries of ๐“ฌ, and then seperately multiply it by all the second bits of the entries of ๐“ฌ, and so on. Multiplying every entry of ๐“ท at once by a bit of an entry of ๐“ฌ can be implemented using just a bitwise-AND operation.

Although this trick lends itself well to any vector ๐“ฌ, I will use 0,1,2,3.. and add an extra n separately (this corresponds to factoring out the +1 that appears at the end of the recursive definition), because that way part of the code can be reused directly by the solution of the partial sums of blsi (and also because it looks nicer). The masks that correspond to the chosen vector ๐“ฌ are easy to compute: each column across the masks is an entry of that vector. In this case, for 32bit integers:

c0 10101010101010101010101010101010
c1 11001100110011001100110011001100
c2 11110000111100001111000011110000
c3 11111111000000001111111100000000
c4 11111111111111110000000000000000

The whole function could look like this:

int PartialSumBLSMSK(int n)
{
    int sum = n;
    sum += (n & 0xAAAAAAAA);
    sum += (n & 0xCCCCCCCC) << 1;
    sum += (n & 0xF0F0F0F0) << 2;
    sum += (n & 0xFF00FF00) << 3;
    sum += (n & 0xFFFF0000) << 4;
    return sum;
}

PartialSumBLSI works almost the same way, with its recursive formula being b(1) = 1, b(2n) = 2b(n) + n, b(2n+1) = 2b(n) + n + 1. The +1 can be factored out as before, and the other part (n instead of 2*n) is exactly half of what it was before. Dividing ๐“ฌ by half seems like a problem, but it can be done implicitly by shifting the bit-slices of the product to the right by 1 bit. There are no problems with bits being lost that way, because the least significant bit is always zero in this case (๐“ฌ has zero as its first element).

int PartialSumBLSI(int n)
{
    int sum = n;
    sum += (n & 0xAAAAAAAA) >> 1;
    sum += (n & 0xCCCCCCCC);
    sum += (n & 0xF0F0F0F0) << 1;
    sum += (n & 0xFF00FF00) << 2;
    sum += (n & 0xFFFF0000) << 3;
    return sum;
}

Wrapping up

The particular set of constants I used is very useful and appears in more tricks, such as collecting indexes of set bits. They are the bitwise complements of a set of masks that Knuth (in The Art of Computer Programming volume 4, section 7.1.3) calls "magic masks", labeled ยตk.

This post was inspired by this question on Stack Overflow and partially based on my own answer and the answer of Eric Postpischil, without which I probably would not have come up with any of this, although I used a used a different derivation and explanation for this post.