[HN Gopher] 8-bit number to binary string (2013)
       ___________________________________________________________________
        
       8-bit number to binary string (2013)
        
       Author : modinfo
       Score  : 47 points
       Date   : 2022-05-25 19:01 UTC (3 hours ago)
        
 (HTM) web link (gynvael.coldwind.pl)
 (TXT) w3m dump (gynvael.coldwind.pl)
        
       | vardump wrote:
       | Those constants are somewhat obfuscated.
       | 
       | 3472328296227680304 == 0x3030303030303030
       | 
       | 9241421688590303745 == 0x8040201008040201
       | 
       | 72340172838076673 == 0x0101010101010101
       | 
       | ... and suddenly things are a lot less mysterious.
       | 
        | Similar tricks were used for poor man's SIMD in one 32-bit
        | register before MMX, AltiVec etc. You could, for example, do RGB
        | 5-5-5 alpha blending with just _one_ multiply instead of 3
        | ("cooked" / premultiplied alpha, of course).
        
         | AdamH12113 wrote:
         | Thank you, that makes much more sense.
        
         | glouwbug wrote:
          | Isn't that poor man's SIMD an Abrash Black Book trick?
        
           | vardump wrote:
            | Perhaps, although a lot of people invented these tricks
            | independently. I think some of the first ones have been known
            | since the seventies.
        
       | pitaj wrote:
       | Wow that's really clever. I never considered that the 8 unicode
       | bytes corresponding to the 8 bits of u8 would fit in a u64.
        
       | leni536 wrote:
       | The distribution of bits can also be achieved with a PDEP
       | instruction, I wonder if that would make this somewhat practical
       | on a modern CPU.
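        | 
        | For the curious, a hypothetical sketch of that idea with the BMI2
        | intrinsic _pdep_u64 (compile with -mbmi2; __builtin_bswap64 is a
        | GCC/Clang builtin, and a little-endian target is assumed):
        | 
        |     #include <stdint.h>
        |     #include <string.h>
        |     #include <immintrin.h>   /* _pdep_u64 (BMI2) */
        | 
        |     void u8_to_binary_pdep(uint8_t c, char out[9]) {
        |         /* Deposit bit k of c into bit 0 of byte k (LSB ends up
        |            in byte 0). */
        |         uint64_t spread = _pdep_u64(c, 0x0101010101010101ULL);
        |         /* Byte-swap so the string reads MSB first, then add '0'
        |            to every byte. */
        |         uint64_t ascii = __builtin_bswap64(spread)
        |                          + 0x3030303030303030ULL;
        |         memcpy(out, &ascii, 8);
        |         out[8] = '\0';
        |     }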
        
       | vanderZwan wrote:
       | In the previous discussion someone lamented they wished they
       | could do this in JavaScript[0].
       | 
       | Nowadays we have TextDecoder and BigInt64Array so I'm pretty sure
       | we actually could do it[1][2].
       | 
       | I doubt it would be faster than calling <number>.toString(2), but
       | in principle we could port this.
       | 
       | [0] https://news.ycombinator.com/item?id=10517332
       | 
       | [1] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder
       | 
        | [1] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder
        | 
        | [2] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
        
         | Mr_Modulo wrote:
          | <number>.toString(2) wouldn't be quite right because the output
          | needs to be padded out with zeros.
        
           | vanderZwan wrote:
            | Oh, right, good catch! So <number>.toString(2).padStart(8, "0")
            | then. I still expect it to be faster than this clever hack
            | though.
            | 
            | https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
        
       | mmphosis wrote:
        | x86_64, no loop needed:
        | 
        |     movabsq $3472328296227680304, %rdx
        |     movzbl  %dil, %eax
        |     movabsq $-9205322385119247871, %rdi
        |     imulq   %rdi, %rax
        |     movabsq $72340172838076673, %rdi
        |     shrq    $7, %rax
        |     andq    %rdi, %rax
        |     addq    %rdx, %rax
        |     movq    %rax, (%rsi)
        | 
        | 6502, with a loop:
        | 
        |           LDX #$08
        |     loop  LSR $EB
        |           LDA #$30
        |           ADC #$00
        |           DEX
        |           STA result,X
        |           BNE loop
        
       | dragontamer wrote:
       | To truly understand this stuff... you have to study SIMD.
       | 
       | But wait, what SIMD instructions? No no no. The 64-bit register
       | here is being used as a 64-way 1-bit SIMD.
       | 
        | You can see that the addition and the "&" are just simple
        | element-wise SIMD operations. The multiplication effectively
        | packs up to 64 bitshift + add operations (one per set bit of the
        | multiplier) into a single instruction.
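        | 
        | Concretely (my own sketch, not from the post), for this constant
        | the multiply is just eight shifted copies of c added together,
        | and the 9-bit stride keeps the copies from overlapping, so no
        | carries occur:
        | 
        |     #include <assert.h>
        |     #include <stdint.h>
        | 
        |     /* 0x8040201008040201 is the sum of 2^(9k) for k = 0..7, so
        |        multiplying by it adds eight shifted copies of c. */
        |     uint64_t spread_by_shifts(uint8_t c) {
        |         uint64_t x = c;
        |         return (x << 0) + (x << 9) + (x << 18) + (x << 27)
        |              + (x << 36) + (x << 45) + (x << 54) + (x << 63);
        |     }
        | 
        |     int main(void) {
        |         for (unsigned c = 0; c < 256; c++)
        |             assert(spread_by_shifts((uint8_t)c)
        |                    == (uint64_t)c * 0x8040201008040201ULL);
        |     }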
       | 
       | In general, this is called SWAR (SIMD within a register).
       | 
       | ---------
       | 
       | The "popcount" trick is also SIMD/SWAR (bitshift and add the &
       | 0xcccccccc / 0x33333333 / etc. etc.), and is in fact a
       | "reduction" / "prefix sum" applied over 1-bit values in a 32-bit
       | or 64-bit register applied in a 1-bit SIMD as well.
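        | 
        | For reference, a common form of the 64-bit SWAR popcount (the
        | usual masks are 0x55..., 0x33... and 0x0f..., pairing up lanes at
        | each step):
        | 
        |     #include <stdint.h>
        | 
        |     /* Treat the register as 64 one-bit lanes, then 32 two-bit
        |        lanes, then 16 four-bit lanes, summing adjacent lanes at
        |        each step; the final multiply adds up the byte sums. */
        |     int popcount64(uint64_t x) {
        |         x = x - ((x >> 1) & 0x5555555555555555ULL);
        |         x = (x & 0x3333333333333333ULL)
        |           + ((x >> 2) & 0x3333333333333333ULL);
        |         x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
        |         return (int)((x * 0x0101010101010101ULL) >> 56);
        |     }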
       | 
       | ------------
       | 
        | This achieves very fast results because it's innately parallel. A
        | 64-bit processor, after all, can be seen as a 64-way parallel
        | 1-bit processor. Learning the parallelization techniques and
        | applying them in this fashion leads to this kind of super-fast
        | code.
       | 
        | Furthermore: PEXT / PDEP from Intel can be seen as "bitwise
        | 64-way gather" and "bitwise 64-way scatter" bit instructions. The
        | 64-bit register as a 64-way parallel computer is truly a
        | marvelous abstraction.
        
       | sfblah wrote:
       | It's a clever solution. For converting unsigned chars to strings,
       | though, a lookup table is going to be the fastest solution, since
       | there are only 256 possibilities.
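        | 
        | A rough sketch of that approach (my own illustration; the table
        | is 256 x 8 = 2 KiB and is built once up front):
        | 
        |     #include <stdint.h>
        |     #include <string.h>
        | 
        |     /* 256 precomputed 8-character strings, 2 KiB in total. */
        |     static char table[256][8];
        | 
        |     static void init_table(void) {
        |         for (int c = 0; c < 256; c++)
        |             for (int bit = 0; bit < 8; bit++)
        |                 table[c][bit] = (char)('0' + ((c >> (7 - bit)) & 1));
        |     }
        | 
        |     static void u8_to_binary_lut(uint8_t c, char out[9]) {
        |         memcpy(out, table[c], 8);   /* one 8-byte copy */
        |         out[8] = '\0';
        |     }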
        
         | userbinator wrote:
         | Possibly fastest in a microbenchmark, but that's 2k of cache
         | (at least) which the table will consume.
        
         | mortenlarsen wrote:
         | Are you sure? A cache miss might be very expensive.
        
         | dragontamer wrote:
         | Lookup tables hit the load/store units, IIRC only 2 loads per
         | clock tick on modern Intel or AMD-Zen chips (though if anyone
         | remembers, correct me if I'm wrong).
         | 
          | Modern bitshift/add instructions can instead execute 3x per
          | clock tick on Intel, and 4x per clock tick on AMD Zen.
          | 
          | Multiplication is 5 cycles of latency, but at least 1 per clock
          | tick, maybe 2 per clock tick depending on architecture. The
          | /128 is just a "bitshift right by 7".
         | 
         | -------
         | 
         | EDIT: So you need 8x lookups (at least 4 clock ticks worth of
         | resources, assuming the CPU can perform 2-lookups per clock
         | tick) for your lookup table approach.
         | 
         | But this "*(unsigned long long*)out = 3472328296227680304ULL +
         | (((c * 9241421688590303745ULL) / 128) & 72340172838076673ULL)"
         | operation only has 3x bit operations + 1x multiply, using
         | 2-clock ticks worth of resources.
         | 
         | The multiply and lookups have multiple clock-ticks of latency,
         | but throughput tends to be the more important metric rather
         | than latency in high-speed CPU code in my experience. (CPUs are
         | very smart about rearranging code out-of-order to become
         | throughput bound rather than latency bound)
        
       ___________________________________________________________________
       (page generated 2022-05-25 23:01 UTC)