[HN Gopher] The radix 2^51 trick (2017)
___________________________________________________________________
The radix 2^51 trick (2017)
Author : mooreds
Score  : 355 points
Date   : 2020-05-29 15:22 UTC (7 hours ago)
(HTM) web link (www.chosenplaintext.ca)
(TXT) w3m dump (www.chosenplaintext.ca)

| posix_me_less wrote:
| Very interesting. It may seem that the bad performance of the ADC
| instruction is due to bad design of the ALU, but I think this is
| more of an example of how replacing (single sequential task) by
| (several parallel tasks whose results are combined at the end)
| allows the CPU to finish faster.
| csense wrote:
| It's a data flow problem. Four ADD instructions can be
| parallelized. Four ADC instructions have to run serially.
| a1369209993 wrote:
| And in particular, sixteen ADC instructions in four chains
| _shouldn't_ run any slower than sixteen ADD instructions
| (assuming four execution ports and a large enough reorder
| buffer), but a: Haswell sucks, and b: even on properly-
| designed CPUs, making parallelization easier is almost always
| _some_ kind of win, because general-purpose CPUs always have
| dispatch overhead.
| cwzwarich wrote:
| He's testing on Haswell, where (at least according to Agner
| Fog) ADC has an extra uop compared to ADD. On Broadwell and
| later, non-memory ADC is a single uop with a single cycle of
| latency. He would probably see a big speedup on a newer CPU.
| gautamcgoel wrote:
| What I want to know is what, if any, are the implications for CPU
| architecture. How should bignum arithmetic be efficiently
| implemented in hardware?
| kwillets wrote:
| There's a whole subfield of adder design focused on carrying,
| since it's a big dependency chain even within one register. I
| believe this is a software version of what some adder circuits
| already do (although I'll need some reading to check).
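For concreteness, here is a minimal Python sketch of the trick the thread is discussing: five 51-bit limbs held in oversized words, carry-free limb-wise adds, and one deferred normalization pass. Python's unbounded integers stand in for 64-bit registers, and all names are illustrative.

```python
LIMB_BITS = 51
LIMBS = 5
MASK = (1 << LIMB_BITS) - 1

def to_limbs(x):
    """Split an integer (up to 255 bits here) into five 51-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(LIMBS)]

def from_limbs(limbs):
    """Evaluate the limbs as a polynomial in radix 2^51; limbs are
    allowed to exceed 51 bits, which is the 'denormalized' state."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def add_limbs(a, b):
    """Limb-wise add: the five additions are independent of each
    other, so a CPU can issue them in parallel (add, not adc)."""
    return [x + y for x, y in zip(a, b)]

def normalize(limbs):
    """The single sequential carry-propagation pass, deferred to the end."""
    out, carry = [], 0
    for limb in limbs[:-1]:
        total = limb + carry
        out.append(total & MASK)
        carry = total >> LIMB_BITS
    out.append(limbs[-1] + carry)  # top limb keeps its extra carry bits
    return out

# With 13 spare bits per 64-bit word, up to 2^13 worst-case summands
# can be accumulated before any limb overflows a 64-bit register:
acc = [0] * LIMBS
worst = to_limbs((1 << 255) - 1)          # every limb all-ones
for _ in range(1 << 13):
    acc = [a + w for a, w in zip(acc, worst)]
assert max(acc) < 1 << 64                 # still fits the registers
assert from_limbs(acc) == ((1 << 255) - 1) << 13
```

Since `from_limbs` just evaluates the radix-2^51 polynomial, the sum is already correct before `normalize` runs; normalization only restores every limb to under 2^51 (and, in the real 256-bit case, wraps the top limb).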
| not2b wrote:
| Since good bignum libraries already use tricks like this, and
| many others besides (like using FFT to handle very large bignum
| multiplication), it doesn't seem that the hardware needs to
| change.
| bobbiechen wrote:
| Cool trick. I wonder how it would compare in performance to a
| carry-lookahead adder [1], which uses the fact that you can
| compute any carry digit directly from a combination of the
| original inputs (which is increasingly long the deeper you go).
|
| [1] https://en.wikipedia.org/wiki/Carry-lookahead_adder
| jonsen wrote:
| To compare you would have to implement carry-lookahead in
| software. How would you do that?
| londons_explore wrote:
| It can be done by AND-ing the input numbers.
|
| Not sure you'd gain any speed boost though.
| bobbiechen wrote:
| Yikes, good question. I thought about it for a bit:
|
| Here's a toy example adding three-bit numbers, where all
| letters are bits (0 or 1) and you can normally only add one
| bit at a time.
|
|       a b c
|     + d e f
|     -------
|     g h i j
|
| Instead of doing
|
|     j = c ^ f
|     j_carry = c & f
|     i = j_carry ^ (b ^ e)
|
| in two steps, we can condense it to
|
|     j = c ^ f
|     i = (c & f) ^ (b ^ e)
|
| And similarly, we can get g and h combinationally from the
| input:
|
|     # "j_carry AND one of b, e"
|     i_carry = (b & e) | ((c & f) & (b | e))
|     h = i_carry ^ (a ^ d)
|     h = ((b & e) | ((c & f) & (b | e))) ^ (a ^ d)
|     g = (a & d) | (i_carry & (a | d))
|     g = (a & d) | (((b & e) | ((c & f) & (b | e))) & (a | d))
|
| So we end up directly computing
|
|     g = (a & d) | (((b & e) | ((c & f) & (b | e))) & (a | d))
|     h = ((b & e) | ((c & f) & (b | e))) ^ (a ^ d)
|     i = (c & f) ^ (b ^ e)
|     j = c ^ f
|
| All of these innermost combinations like (a & d) you can get
| just by AND / OR of the original inputs.
|
| But as you suggest (I think), then you need to combine these
| further to actually get results. In theory they're all
| parallelizable (if you picture this as a tree), but I don't see
| a good way to do that quickly.
Though I'm no assembly expert,
| so maybe I'm missing something.
|
| I guess thinking in terms of gate delays doesn't really help
| much at this abstraction level.
| smithza wrote:
| The post did not mention the potential cost of splitting and
| masking these into 5 limbs instead of 4. Perhaps I am missing
| that in this proposed system, 256-bit ints are always stored in 5
| registers, and then you only normalize when the result is needed?
| What are the costs of splitting/masking into 5 limbs, and could
| the scaling impact the gains of parallel adds/subtracts
| meaningfully?
| Strilanc wrote:
| In hardware this would be called a carry-save adder [1].
|
| One interesting fact about carry-save adders is that the carry
| part of the register can be extended in order to avoid any
| carrying between words for very long periods of time. Instead of
| using 53 bits to represent 52 bits, use 60 bits, and now you can
| perform 256 sequential additions with no carrying between words
| before the carry part saturates and needs to be handled.
|
| Somewhat surprisingly, it's even possible to use this
| construction in the context of a quantum computation [2][3]. The
| reason it's surprising is that tacking on additional registers
| like this can act as an information-leakage mechanism (which is
| an error in a quantum computation). For analogy, you can imagine
| that if you computed a public key using this trick, you would
| not be comfortable handing an attacker the registers storing your
| public key before you removed the carry padding that made the
| computation faster. But it turns out that if you just initialize
| the carry padding randomly, subtracting out of the next word at
| the start to ensure everything sums to the correct result at the
| end, then you can show that the information leakage and chance-
| of-bad-overflow are exponentially suppressed in the length of the
| carry padding.
Currently the most efficient known way to
| implement Shor's algorithm uses this technique [4].
|
| 1: https://en.wikipedia.org/wiki/Carry-save_adder
|
| 2: https://arxiv.org/abs/1905.08488
|
| 3: https://youtu.be/upTipX9yXNg?t=777
|
| 4: https://arxiv.org/abs/1905.09749
| munificent wrote:
| Ah, this post is so good. This is why I come to HN.
| CliffStoll wrote:
| Carries were a major problem in mechanical calculators as well.
| For example, the upper section of the Curta had a special bell &
| gearing to allow carries. The Friden STW - a mechanical behemoth
| - had a complex set of gears within the moving carriage to allow
| carries & borrows. Same with Monroe & Marchant mechanical
| calculators. For electric motor driven calculators, the
| additional weight of the moving carriage - as well as the gearing
| - set limits on how fast the machine could add and subtract.
| dreamcompiler wrote:
| TLDR: Denormalized polynomial addition followed by normalization
| is faster than normalized polynomial addition on Haswell. Nice
| trick!
| ramshorns wrote:
| It's amazing that the 17 instructions at the end to tidy up all
| the carries, which look like they have to be done serially, are
| still faster. But I guess each register's carry bits are
| independent from the ones that get carries added from the last
| register, so it could be ...
|
|     mov W, E
|     mov V, D
|     mov U, C
|     mov T, B
|     shr W, 51
|     shr V, 51
|     shr U, 51
|     shr T, 51
|     add D, W
|     add C, V
|     add B, U
|     add A, T
|
| which does seem like it could be parallelized.
| Suncho wrote:
| > each register's carry bits are independent from the ones that
| get carries added from the last register
|
| No they're not. Let's say B's non-carry bits are all 1. If you
| carry anything from C, that will affect B's carry bits.
| londons_explore wrote:
| You could check if the middle 64-12-12 bits are all 1's, and
| if they are branch to a slow codepath.
|
| It would be an incredibly rare case, so branch prediction
| will always get it right, and the cost of a branch is nearly
| nil.
| a1369209993 wrote:
| If you're trying to add two 256-bit numbers, there's
| _probably_ cryptography involved, and data-dependent
| branches are poison to crypto code.
| gpvos wrote:
| Well, it's only faster if you do three or more additions before
| that. But otherwise yes.
| eumenides1 wrote:
| This is really cool, but gives me PTSD from my compliers course.
| jonsen wrote:
| Were you forced to comply against your will?
| asciimov wrote:
| edit: Ignore this, I'm tired and dumb.
|
| You aren't gaining performance by reducing carries, you are
| gaining performance by running these additions in parallel.
|
| The ALU is doing the addition in binary. Even though you add
| padding to the beginning of the number chunks, it's still a
| binary number. When you recombine the chunks of numbers you will
| be doing all those carries you feel like you got for free.
| tspiteri wrote:
| But the carry propagation blocks parallelism. So by reducing
| the need to carry, you are getting the parallel gains.
| asciimov wrote:
| But you are doing the same number of carries. Op is making it
| sound like you are doing fewer carries.
| tspiteri wrote:
| You're grouping carries. If you have 13 spare bits, you can
| accumulate 13 additions before you have to take care of
| carries. Then you have to add some overhead for bit
| manipulation, but the article's point is that the
| parallelism gains more than make up for the bit
| manipulation.
| asciimov wrote:
| Right, I am not saying that this isn't faster, I am saying
| you aren't reducing the total number of carries.
| tspiteri wrote:
| You are reducing the number of carries. The article
| explains it clearly. If you're not convinced by the
| article, I won't try to convince you. (Edit: the last
| thing I'll add: you are reducing the number of carries
| for 64-bit additions.
When adding 13 51-bit numbers, that
| does not count as carries instruction-wise, as it's still
| add, not adc.)
| asciimov wrote:
| I see what I was doing wrong: I was considering total bit
| carries during the addition, i.e. 1+1 = 10 is a one-bit
| carry, and not the carry operation done after
| overflowing the word boundary.
|
| Next time, I'll keep from reading tech articles while
| running on too little sleep.
|
| Thanks for your understanding.
| tspiteri wrote:
| No worries, I kinda realized just after posting the
| comment, which explains why I added the edit. :)
| phkahler wrote:
| It depends how many numbers you add. If you compute A+B it
| will be slower because the carries still have to be
| handled after the addition. But if you compute A+B+C+D+E
| you do 4 fast additions and only deal with carry bits once.
| TFA allows up to about 8000 numbers to be summed for a
| single pass of carry propagation.
| asciimov wrote:
| Right, I understand we are parallelizing the addition
| operations and the performance gain is due to adding
| chunks of the numbers in parallel.
|
| What I am saying is the total number of carry operations
| is conserved, even though they are delayed.
| MauranKilom wrote:
| I can't tell from your comments whether you realized yet,
| but the trick is only useful if you add _several_ 256-bit
| numbers. You can add up to 2^13 numbers (each one 256
| bits) before you actually have to do the four carry
| shift-outs _once_. In effect, you _do_ reduce the number
| of carry operations (to four, albeit slightly more
| involved), and that is what ends up saving time.
|
| There is no performance gain if you only add two numbers,
| even though that's the setup the article discusses for
| 90% of its length. It seems that this is leading to some
| confusion.
| asciimov wrote:
| Yeah, I re-read the whole thing again, and I see the
| mistake I made.
I took a wrong turn when op went into
| base 37 numbers, which made me consider every bit carry,
| not just the carry on the word boundary.
| brohee wrote:
| He didn't talk about it, but energy usage is likely going up,
| so maybe if you keep doing it long enough the frequency will
| have to be throttled down in order to cool down.
| tspiteri wrote:
| I don't think so. If a core is busy, it's going to be
| consuming almost the same amount of power whether it's using
| one ALU or four. So the best strategy is to finish your
| processing as fast as possible so that you can idle, where
| the significant power saving happens. Basically, if the
| processing finishes faster, you save energy.
| Slartie wrote:
| What I'm wondering about now is: does that 'adc' opcode have any
| other useful application except for adding numbers bigger than
| the width of a single register?
|
| Because it does look a little bit as if it was engineered for
| that particular situation, and if that was the case, with this
| situation now in practice being solved much more efficiently
| without a single 'adc' instruction, the opcode might suddenly be
| completely useless bloat - which of course would still have to be
| maintained forever, even though nobody uses it in practice,
| because it's part of the spec.
| mrfredward wrote:
| I believe adc is still commonly used as a way of reading the
| carry bit. For example, assuming eax has already been zeroed,
| adc eax, eax would be the idiomatic way of reading the carry
| bit into eax.
|
| The carry flag gets set by all sorts of instructions other than
| addition and subtraction (comparisons, bit shifts,
| multiplication, etc.), so this is more useful than you might
| think.
| kccqzy wrote:
| The carry flag is also the canonical way to signal "something
| has gone wrong" in low-level code. Example: Intel virtual
| machine extension instructions (like vmwrite, vmxon) all set
| the carry flag when they fail. But no, you'd usually use jc
| or setc to read the flag.
| giovannibajo1 wrote:
| Actually, setc is the idiomatic way, and doesn't depend on
| the previous value of the destination register, so it's much
| faster to execute because it has no dependencies.
| fegu wrote:
| This is the reason some of the esoteric x86 instructions
| actually perform slower on newer hardware. They are just there
| for backwards compatibility.
| waynecochran wrote:
| Hopefully this will all be replaced with a quantum computer that
| does addition with ripple carry or uses QFT... :)
|
| https://link.springer.com/article/10.1007/s11432-015-5411-x
| https://arxiv.org/pdf/1411.5949.pdf
| londons_explore wrote:
| ADC is still the quickest way if you need to add two 256-bit
| numbers.
| 0x0 wrote:
| x86 has a lot of "useless bloat" with legacy high-level
| instructions. Modern compilers avoid those, and newer CPUs
| aren't engineered with performance in mind for those
| instructions since nobody sane uses them anymore.
| andrewla wrote:
| That might be another reason why it is slow -- it is an old
| opcode that has to be supported for compatibility, but isn't
| prioritized in any of the pipelines in newer chip designs.
| FartyMcFarter wrote:
| It's not useless bloat though. It's probably hard to beat in
| terms of code size, which is often even relevant for
| performance (due to cache size considerations).
|
| Furthermore, the article is talking about adding many numbers,
| which is just one of the possible use cases. As far as I can
| tell, adc is still useful for adding a pair of 128-bit numbers
| on a 64-bit CPU, for example.
| pachico wrote:
| I really enjoyed it. I might never use what I learned by reading
| it, but it was so well explained. Bravo!
| steerablesafe wrote:
| Redundant number systems are a really great topic.
|
| https://en.wikipedia.org/wiki/Redundant_binary_representatio...
|
| https://en.wikipedia.org/wiki/Carry-save_adder
|
| http://lux.dmcs.pl/csII/ca2_RedundantNS.pdf
|
| I have an arbitrary precision arithmetic library in my mind that
| represents numbers as streams of digits starting from the most
| significant digit. A redundant numeric representation could avoid
| unbounded backtracking (think flipping between 0.9999... and
| 1.0000... in a regular representation) and could guarantee
| forward progress for the generated digits.
| jchook wrote:
| For my undergrad thesis I studied "Fibinary", or Fibonacci
| coding. You can think of it like "base phi", using Fibonacci
| numbers as the place values, and only 0,1.
|
| The "normalized" representation, called Zeckendorf
| representation, perfectly packs error correction into the
| binary, as it ensures no number will ever have consecutive
| ones.
|
| It also has applications in data compression.
|
| https://en.m.wikipedia.org/wiki/Fibonacci_coding
| jjoonathan wrote:
| I love that the same trick works for human brains as well as
| CPUs. Your brain already pays the price to represent numbers as
| high-level abstract concepts rather than tightly packed fields.
| Just let the digits overflow, then clean them up later.
|
|     14*14
|     140 + 14*4
|     140 + 4(16)   <-- 4 tens + 16 ones
|     18(16)
|     196
|
| HeWhoLurksLate wrote:
| I did the same thought and broke them up a little
| differently:
|
|     14*14
|     7*7*2*2   [b/c 14 = 2*7]
|     49*2*2
|     98*2
|     196
|
| For whatever reason, my brain likes using twos (and powers of
| it), so I play to that strong spot wherever possible.
|
| _EDIT: as noted below, I made a mistake (7*7=49, not 47) and
| carried it halfway through my solution. Oops._
| iso947 wrote:
| 7x7 isn't 47, and 94x2 isn't 196
| HeWhoLurksLate wrote:
| Woops, fixed it.
|
| Thanks for noticing!
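jchook's Zeckendorf representation above is easy to sketch: a greedy decomposition of an integer into distinct, non-consecutive Fibonacci numbers. This is an illustrative Python toy (the function name and structure are my own, not from the thread):

```python
def zeckendorf(n):
    """Greedy Zeckendorf decomposition of n >= 1: the largest Fibonacci
    number <= n is taken at each step, which guarantees that no two
    chosen Fibonacci numbers are adjacent (so the 0/1 code written over
    Fibonacci place values never has two consecutive ones)."""
    fibs = [1, 2]                     # Fibonacci place values: 1, 2, 3, 5, 8, ...
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):          # take the largest fib <= remainder
        if f <= n:
            parts.append(f)
            n -= f
    return parts
```

For example, `zeckendorf(100)` gives `[89, 8, 3]`: no two of those are adjacent Fibonacci numbers, which is the property Fibonacci coding exploits.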
| dr_zoidberg wrote:
| Despite that, the breakup does hold up the correct
| calculation:
|
|     14 * 14
|     (7 * 2) * (7 * 2)
|     7 * 7 * 2 * 2
|     49 * 2 * 2
|     98 * 2
|     196
|
| And on my side, 49 * 2 quickly resolves to "a hundred
| minus 2" (because 49 * 2 = (50 - 1) * 2), and pretty much
| the same for 196 (200 - 4).
| tspiteri wrote:
| > Aside: Why 13 bits instead of 12? For our purposes, we're going
| to ignore the carries in the most significant limb, allowing
| numbers to wrap when they overflow past 2^256 - 1 (just like how
| unsigned addition works in C with normal size integer types). As
| a result, we can assign 52 bits to the most significant limb and
| ignore the fact that it will run out of room for carries before
| the other limbs do.
|
| If you're going to wrap, you could go all the way and assign 64
| bits to the most significant limb; that way you save 12 bits
| which you can spread to the other four limbs. You can go from
| {52, 51, 51, 51, 51} to {64, 48, 48, 48, 48}, so you have a spare
| 16 bits instead of 13 bits.
| gene91 wrote:
| Exactly. Can anyone offer an explanation why they didn't take
| this path for 256-bit numbers instead?
| [deleted]
| dgoldstein0 wrote:
| Taking a guess: if you use this form to do a bunch of
| operations before normalizing, you don't need to worry as
| much about intermediate overflow. E.g. you could easily add
| 10 different 256-bit numbers in this form and normalize at
| the end. Whereas if the leading chunk is 64 bits, every
| addition can overflow.
|
| So 52/51/51/51/51 seems like the more generally useful choice.
| bonzini wrote:
| Why would you ever worry about overflow of the leading
| chunk (aka limb)?
| romwell wrote:
| To expand on that: the whole idea is that normalization is
| not applied after _every_ step.
|
| This approach is efficient when you know you have to add a
| bunch of numbers together. You add them up, and _then_
| normalize.
|
| The example can be, e.g.
doing long multiplication, or
| adding all numbers in a list.
|
| (Just repeating what was said in the parent comment for
| emphasis)
| ChrisLomont wrote:
| "Unfortunately, we can't do that here - a 64-bit integer only
| has so many possible values" and "The remaining 12 or 13 bits
| give us the extra "digits" we need for preventing carries."
|
| Using 64-bit registers, but only using 51 bits, has no
| overflow. Using 64-bit adds in 64-bit registers overflows,
| making the carry updates more costly.
| beagle3 wrote:
| Indeed, if you use only 48 bits, you could also parallelize
| using the FP hardware - the mantissa is 52 bits, so if you use
| 48-bit limbs, you have 16 rounds before carry. Which is much
| less than 16 (or even 13) bits, but for processors which have
| distinct FP vs. integer adders, and that can issue them in
| parallel - you might get a speed boost.
| majke wrote:
| Funnily enough this is the code I was reading today
|
| https://github.com/constructor-igor/cudafy/blob/11cdd4def4e7...
|
|     s = a ^ b;          // sum bits
|     r = a & 0x7f7f7f7f; // clear msbs
|     t = b & 0x7f7f7f7f; // clear msbs
|     s = s & 0x80808080; // msb sum bits
|     r = r + t;          // add without msbs, record carry-out in msbs
|     r = r ^ s;          // sum of msb sum and carry-in bits, w/o carry-out
| barbegal wrote:
| What sort of real-world use case adds large numbers like this?
| commandlinefan wrote:
| public-key crypto.
| Y_Y wrote:
| hn upvotes
| brandmeyer wrote:
| The design of Poly1305 was directly informed by this very
| technique, with the addition (heh) of performing the arithmetic
| on the x87 FPU operating in 80-bit mode. It even has some zero
| padding bits placed strategically to aid the technique, hence the
| security bound of 2^106 instead of 2^128 or 2^130.
|
| See also https://cr.yp.to/mac/poly1305-20050113.pdf
___________________________________________________________________
(page generated 2020-05-29 23:00 UTC)
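The C snippet majke quotes is the same delayed-carry idea in SWAR form: it adds the four bytes packed in a 32-bit word independently, discarding the carry out of each byte. A Python check of that behavior (illustrative names; not taken from the linked cudafy source):

```python
def swar_add_bytes(a, b):
    """Add the four bytes of two 32-bit words independently (mod 256
    each), with no carry crossing a byte boundary. Mirrors the C
    snippet from the thread."""
    s = a ^ b                    # sum bits, including each byte's msb
    r = a & 0x7f7f7f7f           # clear each byte's msb ...
    t = b & 0x7f7f7f7f           # ... so the adds below can't cross bytes
    s = s & 0x80808080           # msb-only sum bits
    r = (r + t) & 0xffffffff     # 7-bit adds; carry-in lands in each msb
    return r ^ s                 # fold in msb sums, dropping carry-outs

def ref_add_bytes(a, b):
    """Reference: per-byte addition mod 256."""
    out = 0
    for i in range(4):
        byte = (((a >> (8 * i)) & 0xff) + ((b >> (8 * i)) & 0xff)) & 0xff
        out |= byte << (8 * i)
    return out
```

Clearing the msbs first guarantees that the single 32-bit add in `r + t` cannot propagate a carry from one byte into the next, which is exactly the role the 13 spare bits play in the radix-2^51 layout.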