[HN Gopher] 8-bit number to binary string (2013)
___________________________________________________________________
 
  8-bit number to binary string (2013)
 
  Author : modinfo
  Score  : 47 points
  Date   : 2022-05-25 19:01 UTC (3 hours ago)
 
  (HTM) web link (gynvael.coldwind.pl)
  (TXT) w3m dump (gynvael.coldwind.pl)
 
  | vardump wrote:
  | Those constants are somewhat obfuscated.
  |
  | 3472328296227680304 == 0x3030303030303030
  | 9241421688590303745 == 0x8040201008040201
  | 72340172838076673   == 0x0101010101010101
  |
  | ... and suddenly things are a lot less mysterious.
  |
  | Similar tricks were used for poor man's SIMD in one 32-bit
  | register before MMX, AltiVec, etc. You could, for example, do
  | RGB 5-5-5 alpha blending with just _one_ multiply instead of
  | three! ("Cooked" / premultiplied alpha, of course.)
  | AdamH12113 wrote:
  | Thank you, that makes much more sense.
  | glouwbug wrote:
  | Isn't that poor man's SIMD an Abrash Black Book trick?
  | vardump wrote:
  | Perhaps, although a lot of people invented these tricks
  | independently. I think some of the first ones have been known
  | since the seventies.
  | pitaj wrote:
  | Wow, that's really clever. I never considered that the 8 ASCII
  | bytes corresponding to the 8 bits of a u8 would fit in a u64.
  | leni536 wrote:
  | The distribution of bits can also be achieved with a PDEP
  | instruction; I wonder if that would make this somewhat
  | practical on a modern CPU.
  | vanderZwan wrote:
  | In the previous discussion someone lamented that they wished
  | they could do this in JavaScript [0].
  |
  | Nowadays we have TextDecoder and BigInt64Array, so I'm pretty
  | sure we actually could do it [1][2].
  |
  | I doubt it would be faster than calling <number>.toString(2),
  | but in principle we could port this.
  |
  | [0] https://news.ycombinator.com/item?id=10517332
  |
  | [1] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder
  |
  | [2] https://developer.mozilla.org/en-
  | US/docs/Web/JavaScript/Refe...
  | Mr_Modulo wrote:
  | <number>.toString(2) wouldn't be quite right, because the
  | output needs to be padded out with zeros.
  | vanderZwan wrote:
  | Oh, right, good catch! So <number>.toString(2).padStart(8, "0")
  | then. I still expect it to be faster than this clever hack,
  | though.
  |
  | https://developer.mozilla.org/en-
  | US/docs/Web/JavaScript/Refe...
  | mmphosis wrote:
  | x86_64, no loop needed:
  |
  |           movabsq $3472328296227680304, %rdx
  |           movzbl  %dil, %eax
  |           movabsq $-9205322385119247871, %rdi
  |           imulq   %rdi, %rax
  |           movabsq $72340172838076673, %rdi
  |           shrq    $7, %rax
  |           andq    %rdi, %rax
  |           addq    %rdx, %rax
  |           movq    %rax, (%rsi)
  |
  | 6502, with a loop:
  |
  |           LDX #$08
  |     loop  LSR $EB
  |           LDA #$30
  |           ADC #$00
  |           DEX
  |           STA result,X
  |           BNE loop
  | dragontamer wrote:
  | To truly understand this stuff... you have to study SIMD.
  |
  | But wait, what SIMD instructions? No no no. The 64-bit register
  | here is being used as 64-way 1-bit SIMD.
  |
  | You can see that the addition and the "&" instructions are just
  | simple SIMD. The multiplication instruction can be broken up
  | into 64x bitshift + 64x add instructions, all in a single
  | statement.
  |
  | In general, this is called SWAR (SIMD within a register).
  |
  | ---------
  |
  | The "popcount" trick is also SIMD/SWAR (bitshift, add, and &
  | with 0xcccccccc / 0x33333333 / etc.), and is in fact a
  | "reduction" / "prefix sum" over 1-bit values in a 32-bit or
  | 64-bit register, applied as 1-bit SIMD as well.
  |
  | ------------
  |
  | This achieves very fast results because it's innately parallel.
  | A 64-bit processor, after all, can be seen as a 64-way parallel
  | 1-bit processor. Learning the parallelization techniques, and
  | applying them in this fashion, leads to this kind of super-fast
  | code.
  |
  | Furthermore: PEXT / PDEP from Intel can be seen as "bitwise
  | 64-way gather" and "bitwise 64-way scatter" instructions. The
  | 64-bit register as a 64-way parallel computer is truly a
  | marvelous abstraction.
  | sfblah wrote:
  | It's a clever solution.
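  [Editor's note: the SWAR "popcount" trick dragontamer refers to
  is the classic mask-and-add reduction; a minimal sketch of the
  standard 32-bit version follows. The masks are the well-known
  ones, not taken from this thread, and the function name
  popcount32 is mine.]

```c
#include <stdint.h>

/* SWAR popcount: treat the 32-bit register as parallel 1-bit
 * lanes and sum adjacent groups, doubling the lane width at each
 * step -- a reduction within a single register. */
static uint32_t popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* byte sums  */
    return (x * 0x01010101u) >> 24;  /* add the four bytes together */
}
```

  [The final multiply plays the same role as in the binary-string
  trick: one imul replaces a chain of shifts and adds.]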
  | For converting unsigned chars to strings, though, a lookup
  | table is going to be the fastest solution, since there are only
  | 256 possibilities.
  | userbinator wrote:
  | Possibly fastest in a microbenchmark, but that's 2k of cache
  | (at least) which the table will consume.
  | mortenlarsen wrote:
  | Are you sure? A cache miss might be very expensive.
  | dragontamer wrote:
  | Lookup tables hit the load/store units; IIRC only 2 loads per
  | clock tick on modern Intel or AMD Zen chips (though if anyone
  | remembers, correct me if I'm wrong).
  |
  | Modern bitshift/add instructions can instead perform 3x per
  | clock tick on Intel, and 4x per clock tick on AMD Zen.
  |
  | Multiplication is 5 cycles of latency, but at least 1 per clock
  | tick, maybe 2 per clock tick depending on architecture. The
  | /128 is just a "bitshift-right 7".
  |
  | -------
  |
  | EDIT: So you need 8x lookups (at least 4 clock ticks' worth of
  | resources, assuming the CPU can perform 2 lookups per clock
  | tick) for your lookup-table approach.
  |
  | But this "*(unsigned long long*)out = 3472328296227680304ULL +
  | (((c * 9241421688590303745ULL) / 128) & 72340172838076673ULL)"
  | operation only has 3x bit operations + 1x multiply, using 2
  | clock ticks' worth of resources.
  |
  | The multiply and lookups have multiple clock ticks of latency,
  | but throughput tends to be the more important metric than
  | latency in high-speed CPU code, in my experience. (CPUs are
  | very smart about rearranging code out of order to become
  | throughput-bound rather than latency-bound.)
___________________________________________________________________
(page generated 2022-05-25 23:01 UTC)
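  [Editor's note: for comparison, the lookup-table alternative
  sfblah suggests might look like the sketch below -- the 256 x 8
  = 2 KiB table userbinator mentions, built once up front. The
  names table, init_table, and u8_to_binary_lut are mine, and this
  version has no endianness assumption.]

```c
#include <stdint.h>
#include <string.h>

/* 256 entries x 8 chars = 2 KiB, the cache cost noted in the
 * thread. */
static char table[256][8];

/* Fill the table once, most significant bit first. */
static void init_table(void) {
    for (int c = 0; c < 256; c++)
        for (int bit = 0; bit < 8; bit++)
            table[c][bit] = (char)('0' + ((c >> (7 - bit)) & 1));
}

/* Conversion is then a single 8-byte copy out of the table. */
static void u8_to_binary_lut(uint8_t c, char out[9]) {
    memcpy(out, table[c], 8);
    out[8] = '\0';
}
```

  [Whether this beats the multiply trick is exactly the
  throughput-vs-cache-pressure trade-off debated above.]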