Let's say that we have a graph of at most 8 nodes (you can also think of it as a relation between 8 things), represented as an 8 by 8 Boolean matrix, and we want to compute its transitive closure. A basic translation of Warshall's algorithm (Warshall's formulation of the Floyd–Warshall algorithm computes the transitive closure instead of shortest paths) into what I imagine a C++ programmer might write would look like this:
// version 0 std::bitset<64> closure8x8(std::bitset<64> g) { for (size_t k = 0; k < 8; k++) { for (size_t i = 0; i < 8; i++) { if (g[i * 8 + k]) { for (size_t j = 0; j < 8; j++) { if (g[k * 8 + j]) g.set(i * 8 + j); } } } } return g; }
Some may be tempted to suggest switching to a non-bit-packed representation instead of std::bitset, but then you're reading the wrong blog. Working with individual Booleans one-by-one is a dead end whether they're packed or not, and a packed representation has more opportunities to work with multiple bits at once, which is exactly what we're doing to do. The "trick" here, if it can even be called a trick, is to extract the k-th row of the 8x8 matrix and OR it into the i-th row, if and only if node k is directly reachable from node i.
//version 1 uint64_t closure8x8(uint64_t x) { for (size_t k = 0; k < 8; k++) { uint64_t kth_row = (x >> (k * 8)) & 0xff; for (size_t i = 0; i < 8; i++) { if (x & (UINT64_C(1) << (i * 8 + k))) x |= kth_row << (i * 8); } } return x; }
Although x is obviously modified in the i-loop, it is modified in a way such that the test doesn't really depend on the new value, we could use a copy of x that is taken just before entering the i-loop (challenge for compiler writers: automate this optimization). This isn't just for show, it's measurably more efficient (at least it was with my tests, on my PC, using a specific compiler..) and makes the next step less of a leap.
//version 2 uint64_t closure8x8(uint64_t x) { for (size_t k = 0; k < 8; k++) { uint64_t x_old = x; uint64_t kth_row = (x >> (k * 8)) & 0xff; for (size_t i = 0; i < 8; i++) { if (x_old & (UINT64_C(1) << (i * 8 + k))) x |= kth_row << (i * 8); } } return x; }
Now we can unroll the i-loop. The 8 condition bits, the bits that control whether the if is entered or not, can be pulled out with (x >> k) & 0x0101010101010101. In each byte of that mask we get either 0, if the k-th row shouldn't be ORed into the corresponding row, or 1, if it should be. Multiplying the mask by the k-th row keeps every zero, but changes every 1 to a copy of the k-th row, precisely what we need to OR by.
//version 3 uint64_t closure8x8(uint64_t x) { uint64_t lsb_of_byte = 0x0101010101010101; for (size_t k = 0; k < 8; k++) { uint64_t kth_row = (x >> (k * 8)) & 0xff; uint64_t m = (x >> k) & lsb_of_byte; x |= kth_row * m; } return x; }
That's decent, if you're not into weird operations or SIMD you can stop there.
MOR
Similar to how (Floyd-)Warshall "looks like" a kind of matrix multiplication, transitive closure "look like" MMIX's MOR operation (which is also a kind of matrix multiplication), which I implemented like this:
uint64_t MOR(uint64_t a, uint64_t b) { uint64_t lsb_of_byte = 0x0101010101010101; uint64_t r = 0; for (size_t i = 0; i < 8; i++) { uint64_t ith_row = (a >> (8 * i)) & 0xFF; uint64_t m = (b >> i) & lsb_of_byte; r |= ith_row * m; } return r; }
Beyond merely looking similar, if we had an efficient implementation of MOR (do we?) it would be possible to use that to implement the transitive closure, although that's not as simple as taking the "MOR square" of the matrix. The "internal dependency" of Warshall's algorithm may look like a small matter on the surface but it's a crucial difference. Essentially, if A is the matrix (with Boolean entries and using OR-AND-algebra instead of the more familiar PLUS-MULTIPLY) representing our graph, we need the sum of A1, A2, ... A8, not just A2. That can be done with only 3 uses of MOR and 3 bitwise OR. Squaring A1 + A2 gives us A2 + A3 + A4, adding A1 and squaring again gives us A2 + ... A8 (1+1=1 in this context so multiple copies of A raised to the same exponent collapse into one when summed).
//version 4 uint64_t closure8x8(uint64_t x) { x |= MOR(x, x); x |= MOR(x, x); x |= MOR(x, x); return x; }
Although MOR cannot easily be built on top of GF2P8AFFINEQB, there is a trick that uses it (but only for its ability to transpose):
uint64_t MOR(uint64_t a, uint64_t b) { __m512i va = _mm512_set1_epi64(a); __m512i z = _mm512_setzero_si512(); __m512i m_id = _mm512_set1_epi64(0x8040201008040201); __m512i masked = _mm512_mask_blend_epi8(b, z, va); __m512i maskedt = _mm512_gf2p8affine_epi64_epi8(m_id, masked, 0); return _mm512_cmpneq_epi8_mask(maskedt, z); }
On my Intel Rocket Lake, version 4 (using the AVX-512 trick for MOR) is not great. The latency is surprisingly decent. In terms of throughput, it beats version 3 by some 35% .. but that's bad. This approach "wastes" the entire SIMD width on processing a single graph. If we have multiple graphs, which we probably do if we're talking about throughput, this is not the way to go. There may be a niche use for it in a context where there's only one graph to process, and the latency isn't critical, and it's competing with another computation.
No more MOR
Taking the same approach as version 3, but using real SIMD instead of "faking SIMD" with scalar operations, the SIMD width is still available to process 4 graphs at once with AVX2:
// version 5 __m256i closure8x8_avx2(__m256i g) { uint64_t ones = 0x0101010101010101; __m256i z = _mm256_setzero_si256(); __m256i m, shuf; shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(0, ones * 8)); m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 7)); g = _mm256_or_si256(g, m); shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones, ones * 9)); m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 6)); g = _mm256_or_si256(g, m); shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 2, ones * 10)); m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 5)); g = _mm256_or_si256(g, m); shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 3, ones * 11)); m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 4)); g = _mm256_or_si256(g, m); shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 4, ones * 12)); m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 3)); g = _mm256_or_si256(g, m); shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 5, ones * 13)); m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 2)); g = _mm256_or_si256(g, m); shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 6, ones * 14)); m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 1)); g = _mm256_or_si256(g, m); shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * 7, ones * 15)); m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), g); g = _mm256_or_si256(g, m); return g; }
An AVX-512 version could process 8 graphs at once, but it would have to change a bit because there is no 512-bit vpblendvb. Instead you could use vpblendmb, and vptestmb to extract the masks.