[HN Gopher] The radix 2^51 trick (2017)
       ___________________________________________________________________
        
       The radix 2^51 trick (2017)
        
       Author : mooreds
       Score  : 355 points
       Date   : 2020-05-29 15:22 UTC (7 hours ago)
        
 (HTM) web link (www.chosenplaintext.ca)
 (TXT) w3m dump (www.chosenplaintext.ca)
        
       | posix_me_less wrote:
        | Very interesting. It may seem that the bad performance of the ADC
        | instruction is due to bad design of the ALU, but I think this is
        | more an example of how replacing a single sequential task with
        | several parallel tasks, whose results are combined at the end,
        | allows the CPU to finish faster.
        
         | csense wrote:
         | It's a data flow problem. Four ADD instructions can be
         | parallelized. Four ADC instructions have to run serially.
        
           | a1369209993 wrote:
           | And in particular, sixteen ADC instructions in four chains
            | _shouldn't_ run any slower than sixteen ADD instructions
           | (assuming four execution ports and a large enough reorder
           | buffer), but a: Haswell sucks, and b: even on properly-
           | designed CPUs, making parallelization easier is almost always
           | _some_ kind of win, because general-purpose CPUs always have
           | dispatch overhead.
        
         | cwzwarich wrote:
         | He's testing on Haswell, where (at least according to Agner
         | Fog) ADC has an extra uop compared to ADD. On Broadwell and
         | later, non-memory ADC is a single uop with a single cycle of
         | latency. He would probably see a big speedup on a newer CPU.
        
       | gautamcgoel wrote:
       | What I want to know is what, if any, are the implications for CPU
       | architecture. How should bignum arithmetic be efficiently
       | implemented in hardware?
        
         | kwillets wrote:
         | There's a whole subfield of adder design focused on carrying,
         | since it's a big dependency chain even within one register. I
         | believe this is a software version of what some adder circuits
         | already do (although I'll need some reading to check).
        
         | not2b wrote:
         | Since good bignum libraries already use tricks like this, and
         | many others besides (like using FFT to handle very large bignum
         | multiplication), it doesn't seem that the hardware needs to
         | change.
        
       | bobbiechen wrote:
       | Cool trick. I wonder how it would compare in performance to a
       | carry-lookahead adder [1], which uses the fact that you can
       | compute any carry digit directly from a combination of the
       | original inputs (which is increasingly long the deeper you go).
       | 
       | [1] https://en.wikipedia.org/wiki/Carry-lookahead_adder
        
         | jonsen wrote:
         | To compare you would have to implement carry-lookahead in
         | software. How would you do that?
        
           | londons_explore wrote:
           | It can be done by AND-ing the input numbers.
           | 
           | Not sure you'd gain any speed boost though.
        
           | bobbiechen wrote:
           | Yikes, good question. I thought about it for a bit:
           | 
            | Here's a toy example adding three-bit numbers, where all
            | letters are bits (0 or 1) and you can normally only add one
            | bit at a time.
            | 
                  a b c
                + d e f
                -------
                g h i j
            | 
            | Instead of doing
            | 
                j = c ^ f
                j_carry = c & f
                i = j_carry ^ (b ^ e)
            | 
            | in two steps, we can condense it to
            | 
                j = c ^ f
                i = (c & f) ^ (b ^ e)
            | 
            | And similarly, we can get g and h combinationally from the
            | input:
            | 
                # i_carry's second term is "j_carry AND one of b, e"
                i_carry = (b & e) | ((c & f) & (b | e))
                h = i_carry ^ (a ^ d)
                h = ((b & e) | ((c & f) & (b | e))) ^ (a ^ d)
            | 
                g = (a & d) | (i_carry & (a | d))
                g = (a & d) | (((b & e) | ((c & f) & (b | e))) & (a | d))
            | 
            | So we end up directly computing
            | 
                g = (a & d) | (((b & e) | ((c & f) & (b | e))) & (a | d))
                h = ((b & e) | ((c & f) & (b | e))) ^ (a ^ d)
                i = (c & f) ^ (b ^ e)
                j = c ^ f
           | 
           | All of these innermost combinations like (a & d) you can get
           | just by AND / OR of the original inputs.
           | 
           | But as you suggest (I think), then you need to combine these
           | further to actually get results. In theory they're all
            | parallelizable (if you picture this as a tree), but I don't see
           | a good way to do that quickly. Though I'm no assembly expert
           | so maybe I'm missing something.
           | 
           | I guess thinking in terms of gate delays doesn't really help
           | much at this abstraction level.
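The combinational formulas in the comment above can be sanity-checked exhaustively; a quick sketch (plain Python, bit names matching the comment):

```python
# Exhaustively check the carry-lookahead formulas for 3-bit addition:
# abc + def = ghij, where every variable is a single bit.
for x in range(8):
    for y in range(8):
        a, b, c = (x >> 2) & 1, (x >> 1) & 1, x & 1
        d, e, f = (y >> 2) & 1, (y >> 1) & 1, y & 1

        j = c ^ f
        i = (c & f) ^ (b ^ e)
        i_carry = (b & e) | ((c & f) & (b | e))
        h = i_carry ^ (a ^ d)
        g = (a & d) | (i_carry & (a | d))

        # the four directly-computed output bits must equal the true sum
        assert (g << 3) | (h << 2) | (i << 1) | j == x + y
print("all 64 cases OK")
```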
        
       | smithza wrote:
        | The post did not mention the potential cost of splitting and
        | masking these into 5 limbs instead of 4. Perhaps the idea is
        | that in this proposed system, 256-bit ints are always stored in
        | 5 registers, and you only normalize when the result is needed?
        | What are the costs of splitting/masking into 5 limbs, and could
        | that overhead meaningfully erode the gains of the parallel
        | adds/subtracts?
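For a feel of the mechanics being asked about, here is a rough sketch of the split/add/normalize cycle in plain Python (limb widths from the article; the function names are illustrative, not from any library):

```python
MASK = (1 << 51) - 1

def to_limbs(x):
    """Split a 256-bit integer into five limbs: four of 51 bits,
    plus a 52-bit top limb (radix 2^51)."""
    return [(x >> (51 * i)) & MASK for i in range(4)] + [x >> 204]

def add_limbs(a, b):
    """Limb-wise addition: five independent adds, no carry chain."""
    return [x + y for x, y in zip(a, b)]

def normalize(limbs):
    """Propagate the deferred carries and recombine, wrapping mod 2^256."""
    total = sum(limb << (51 * i) for i, limb in enumerate(limbs))
    return total & ((1 << 256) - 1)

a, b = 2**256 - 1, 12345
assert normalize(add_limbs(to_limbs(a), to_limbs(b))) == (a + b) % 2**256
```

The split/mask step is a handful of shifts and ANDs per number, so the trick pays off when many additions are amortized over one normalize, as the thread discusses below.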
        
       | Strilanc wrote:
       | In hardware this would be called a carry save adder [1].
       | 
       | One interesting fact about carry save adders is that the carry
       | part of the register can be extended in order to avoid any
       | carrying between words for very long periods of time. Instead of
       | using 53 bits to represent 52 bits, use 60 bits, and now you can
       | perform 256 sequential additions with no carrying between words
       | before the carry part saturates and needs to be handled.
       | 
       | Somewhat surprisingly, it's even possible to use this
       | construction in the context of a quantum computation [2][3]. The
        | reason it's surprising is that tacking on additional registers
       | like this can act as an information leakage mechanism (which is
       | an error in a quantum computation). For analogy, you can imagine
       | that if you computed a public key using this trick that you would
       | not be comfortable handing an attacker the registers storing your
       | public key before you removed the carry padding that made the
       | computation faster. But it turns out that if you just initialize
       | the carry padding randomly, subtracting out of the next word at
       | the start to ensure everything sums to the correct result at the
       | end, then you can show that the information leakage and chance-
       | of-bad-overflow is exponentially suppressed in the length of the
       | carry padding. Currently the most efficient known way to
       | implement Shor's algorithm uses this technique [4].
       | 
       | 1: https://en.wikipedia.org/wiki/Carry-save_adder
       | 
       | 2: https://arxiv.org/abs/1905.08488
       | 
       | 3: https://youtu.be/upTipX9yXNg?t=777
       | 
       | 4: https://arxiv.org/abs/1905.09749
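The extended-headroom idea sketches easily with ordinary integers; a toy model (Python, using the 52-bits-in-60-bit-lanes layout from the comment above, with arbitrary test values):

```python
DIGIT = 52   # payload bits per lane
LANE = 60    # lane width; headroom = LANE - DIGIT = 8 bits
NUM_LANES = 5

def split(x):
    """Split an integer into NUM_LANES digits of DIGIT bits each."""
    return [(x >> (DIGIT * i)) & ((1 << DIGIT) - 1) for i in range(NUM_LANES)]

acc = [0] * NUM_LANES
total = 0
for v in range(256):               # 2^8 additions with no carry handling
    x = v * 0x123456789ABCDEF      # arbitrary test values
    total += x
    for i, digit in enumerate(split(x)):
        acc[i] += digit
    # no lane ever overflows its 60-bit width: 256 * (2^52 - 1) < 2^60
    assert all(lane < (1 << LANE) for lane in acc)

# a single deferred carry pass recombines everything
result = sum(lane << (DIGIT * i) for i, lane in enumerate(acc))
assert result == total
```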
        
       | munificent wrote:
       | Ah, this post is so good. This is why I come to HN.
        
       | CliffStoll wrote:
       | Carries were a major problem in mechanical calculators as well.
       | For example, the upper section of the Curta had a special bell &
       | gearing to allow carries. The Friden STW - a mechanical behemoth
       | - had a complex set of gears within the moving carriage to allow
       | carries & borrows. Same with Monroe & Marchant mechanical
       | calculators. For electric motor driven calculators, the
       | additional weight of the moving carriage - as well as the gearing
       | - set limits on how fast the machine could add and subtract.
        
       | dreamcompiler wrote:
       | TLDR: Denormalized polynomial addition followed by normalization
       | is faster than normalized polynomial addition on Haswell. Nice
       | trick!
        
       | ramshorns wrote:
        | It's amazing that the 17 instructions at the end to tidy up all
        | the carries, which look like they have to be done serially, are
        | still faster. But I guess each register's carry bits are
        | independent from the ones that get carries added from the last
        | register, so it could be ...
        | 
              mov W, E
              mov V, D
              mov U, C
              mov T, B
              shr W, 51
              shr V, 51
              shr U, 51
              shr T, 51
              add D, W
              add C, V
              add B, U
              add A, T
        | 
        | which does seem like it could be parallelized.
        
         | Suncho wrote:
         | > each register's carry bits are independent from the ones that
         | get carries added from the last register
         | 
         | No they're not. Let's say B's non-carry bits are all 1. If you
         | carry anything from C, that will affect B's carry bits.
        
           | londons_explore wrote:
           | You could check if the middle 64-12-12 bits are all 1's, and
           | if they are branch to a slow codepath.
           | 
            | It would be an incredibly rare case, so branch prediction
            | will almost always get it right, and the cost of the branch
            | is nearly nil.
        
             | a1369209993 wrote:
             | If you're trying to add two 256-bit numbers, there's
              | _probably_ cryptography involved, and data-dependent
             | branches are poison to crypto code.
        
         | gpvos wrote:
         | Well, it's only faster if you do three or more additions before
         | that. But otherwise yes.
        
       | eumenides1 wrote:
       | This is really cool, but gives me PTSD from my compliers course.
        
         | jonsen wrote:
         | Were you forced to comply against your will?
        
       | asciimov wrote:
       | edit: Ignore this, I'm tired and dumb.
       | 
       | You aren't gaining performance by reducing carries, you are
       | gaining performance by running these additions in parallel.
       | 
       | The ALU is doing the addition in binary. Even though you add
       | padding to the beginning of the number chunks it's still a binary
       | number. When you recombine the chunks of numbers you will be
       | doing all those carries you feel like you got for free.
        
         | tspiteri wrote:
         | But the carry propagation blocks parallelism. So by reducing
         | the need to carry, you are getting the parallel gains.
        
           | asciimov wrote:
            | But you are doing the same number of carries. Op is making it
            | sound like you are doing fewer carries.
        
             | tspiteri wrote:
             | You're grouping carries. If you have 13 spare bits, you can
              | accumulate 2^13 additions before you have to take care of
             | carries. Then you have to add some overhead for bit
             | manipulation, but the article's point is that the
             | parallelism gains more than make up for the bit
             | manipulation.
        
               | asciimov wrote:
                | Right, I am not saying that this isn't faster; I am
                | saying you aren't reducing the total number of carries.
        
               | tspiteri wrote:
               | You are reducing the number of carries. The article
               | explains it clearly. If you're not convinced by the
               | article, I won't try to convince you. (Edit: the last
               | thing I'll add: you are reducing the number of carries
               | for 64-bit additions. When adding 13 51-bit numbers, that
               | does not count as carries instruction-wise, as it's still
               | add, not adc.)
        
               | asciimov wrote:
                | I see what I was doing wrong: I was considering total
                | bit carries during the addition (i.e. 1+1 = 10 is a
                | one-bit carry), and not the carry operation done after
                | overflowing the word boundary.
               | 
               | Next time, I'll keep from reading tech articles while
               | running on too little sleep.
               | 
               | Thanks for your understanding.
        
               | tspiteri wrote:
               | No worries, I kinda realized just after posting the
               | comment, which explains why I added the edit. :)
        
             | phkahler wrote:
              | It depends how many numbers you add. If you compute A+B it
              | will be slower because the carries still have to be
              | handled after the addition. But if you compute A+B+C+D+E
              | you do 4 fast additions and only deal with carry bits once.
              | TFA allows up to about 8000 numbers to be summed for a
              | single pass of carry propagation.
        
               | asciimov wrote:
               | Right, I understand we are parallelizing the addition
               | operations and the performance gain is due to adding
               | chunks of the numbers in parallel.
               | 
               | What I am saying is the total number of carry operations
               | is conserved, even though they are delayed.
        
               | MauranKilom wrote:
               | I can't tell from your comments whether you realized yet,
               | but the trick is only useful if you add _several_ 256 bit
               | numbers. You can add up to 2^13 numbers (each one 256
               | bits) before you actually have to do the four carry
               | shift-outs _once_. In effect, you _do_ reduce the number
               | of carry operations (to four, albeit slightly more
               | involved), and that is what ends up saving time.
               | 
               | There is no performance gain if you only add two numbers,
               | even though that's the setup the article discusses for
               | 90% of its length. It seems that this is leading to some
               | confusion.
        
               | asciimov wrote:
               | Yeah, I re-read the whole thing again, and I see the
               | mistake I made. I took a wrong turn when op went into
               | base 37 numbers, which made me consider every bit carry
               | not just the carry on word boundary.
        
         | brohee wrote:
          | He didn't talk about it, but energy usage likely goes up, so
          | if you keep doing it long enough the frequency may have to be
          | throttled down to let the chip cool.
        
           | tspiteri wrote:
           | I don't think so. If a core is busy, it's going to be
           | consuming almost the same amount of power whether it's using
           | one ALU or four. So the best strategy is to finish your
           | processing as fast as possible so that you can idle, where
           | the significant power saving happens. Basically, if the
           | processing finishes faster, you save energy.
        
       | Slartie wrote:
       | What I'm wondering about now is: does that 'adc' opcode have any
       | other useful application except for adding numbers bigger than
       | the width of a single register?
       | 
       | Because it does look a little bit as if it was engineered for
       | that particular situation, and if that was the case, with this
       | situation now in practice being solved much more efficiently
       | without a single 'adc' instruction, the opcode might suddenly be
        | completely useless bloat - which of course still has to be
       | maintained forever, even though nobody uses it in practice,
       | because it's part of the spec.
        
         | mrfredward wrote:
         | I believe adc is still commonly used as a way of reading the
         | carry bit. For example, assuming eax has already been zeroed,
         | adc eax, eax would be the idiomatic way of reading the carry
         | bit into eax.
         | 
         | The carry flag gets set by all sorts of instructions other than
         | addition and subtraction (comparisons, bit shifts,
         | multiplication, etc) so this is more useful than you might
         | think.
        
           | kccqzy wrote:
           | The carry flag is also the canonical way to signal "something
           | has gone wrong" in low-level code. Example: Intel virtual
           | machine extension instructions (like vmwrite, vmxon) all set
           | the carry flag when they fail. But no, you'd usually use jc
           | or setc to read the flag.
        
           | giovannibajo1 wrote:
            | Actually, setc is the idiomatic way; it doesn't depend on
            | the previous value of the destination register, so it's much
            | faster to execute because it has no dependencies.
        
         | fegu wrote:
         | This is the reason some of the esoteric x86 instructions
         | actually perform slower on newer hardware. They are just there
         | for backwards compatibility.
        
         | waynecochran wrote:
          | Hopefully this will all be replaced with a quantum computer
          | that does addition with ripple carry or uses QFT... :)
         | 
         | https://link.springer.com/article/10.1007/s11432-015-5411-x
         | https://arxiv.org/pdf/1411.5949.pdf
        
         | londons_explore wrote:
         | ADC is still the quickest way if you need to add two 256 bit
         | numbers.
        
         | 0x0 wrote:
         | x86 has a lot of "useless bloat" with legacy high-level
         | instructions. Modern compilers avoid those and newer cpus
         | aren't engineered with performance in mind for those
         | instructions since nobody sane uses them anymore.
        
         | andrewla wrote:
         | That might be another reason why it is slow -- it is an old
         | opcode that has to be supported for compatibility, but isn't
         | prioritized in any of the pipelines in newer chip designs.
        
         | FartyMcFarter wrote:
         | It's not useless bloat though. It's probably hard to beat in
         | terms of code size, which is often even relevant for
         | performance (due to cache size considerations).
         | 
         | Furthermore, the article is talking about adding many numbers,
         | which is just one of the possible use cases. As far as I can
         | tell, adc is still useful for adding a pair of 128-bit numbers
         | on a 64-bit CPU for example.
        
       | pachico wrote:
       | I really enjoyed it. I might never use what I learned by reading
       | it but it was so well explained. Bravo!
        
       | steerablesafe wrote:
       | Redundant number systems are a really great topic.
       | 
       | https://en.wikipedia.org/wiki/Redundant_binary_representatio...
       | 
       | https://en.wikipedia.org/wiki/Carry-save_adder
       | 
       | http://lux.dmcs.pl/csII/ca2_RedundantNS.pdf
       | 
        | I have in mind an arbitrary precision arithmetic library that
       | represents numbers as streams of digits starting from the most
       | significant digit. A redundant numeric representation could avoid
       | unbounded backtracking (think flipping between 0.9999... and
       | 1.0000... in a regular representation) and could guarantee
       | forward progress for the generated digits.
        
         | jchook wrote:
         | For my undergrad thesis I studied "Fibinary", or Fibonacci
         | coding. You can think of it like "base phi", using Fibonacci
         | numbers as the place values, and only 0,1.
         | 
         | The "normalized" representation, called Zeckendorf
         | representation, perfectly packs error correction into the
         | binary, as it ensures no number will ever have consecutive
         | ones.
         | 
         | It also has applications in data compression.
         | 
         | https://en.m.wikipedia.org/wiki/Fibonacci_coding
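A minimal sketch of the greedy Zeckendorf encoding described above (plain Python; the function name is just for illustration):

```python
def zeckendorf(n):
    """Greedy Zeckendorf representation of n >= 1: 0/1 digits over
    Fibonacci place values (largest first), never two adjacent 1s."""
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    digits = []
    for f in reversed(fibs):
        if f <= n:
            digits.append(1)
            n -= f
        else:
            digits.append(0)
    return digits

d = zeckendorf(100)                                  # 100 = 89 + 8 + 3
assert not any(a and b for a, b in zip(d, d[1:]))    # no consecutive ones
```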
        
         | jjoonathan wrote:
         | I love that the same trick works for human brains as well as
         | CPUs. Your brain already pays the price to represent numbers as
         | high level abstract concepts rather than tightly packed fields.
         | Just let the digits overflow, then clean them up later.
          | 
              14*14
              140 + 14*4
              140 + 4(16)  <-- 4 tens + 16 ones
              18(16)
              196
        
           | HeWhoLurksLate wrote:
            | I had the same thought and broke them up a little
            | differently:
            | 
                14*14
                7*7*2*2 [b/c 14 = 2*7]
                49*2*2
                98*2
                196
           | 
           | For whatever reason, my brain likes using twos (and powers of
           | it), so I play to that strong spot wherever possible.
           | 
            |  _EDIT: as noted below, I made a mistake (7*7=49, not 47) and
            | carried it halfway through my solution. Oops._
        
             | iso947 wrote:
             | 7x7 isn't 47, and 94x2 isn't 196
        
               | HeWhoLurksLate wrote:
               | Woops, fixed it.
               | 
               | Thanks for noticing!
        
               | dr_zoidberg wrote:
               | Despite that, the breakup does hold up the correct
                | calculation:
                | 
                    14 * 14
                    (7 * 2) * (7 * 2)
                    7 * 7 * 2 * 2
                    49 * 2 * 2
                    98 * 2
                    196
               | 
               | And on my side, 49 * 2 quickly resolves to "a hundred
               | minus 2" (because 49 * 2 = (50 - 1) * 2), and pretty much
               | the same for 196 (200 - 4).
        
       | tspiteri wrote:
       | > Aside: Why 13 bits instead of 12? For our purposes, we're going
       | to ignore the carries in the most significant limb, allowing
       | numbers to wrap when they overflow past 2^256 - 1 (just like how
       | unsigned addition works in C with normal size integer types). As
       | a result, we can assign 52 bits to the most significant limb and
       | ignore the fact that it will run out of room for carries before
       | the other limbs do.
       | 
       | If you're going to wrap, you could go all the way and assign 64
       | bits to the most significant limb; that way you save 12 bits
       | which you can spread to the other four limbs. You can go from
       | {52, 51, 51, 51, 51} to {64, 48, 48, 48, 48}, so you have a spare
       | 16 bits instead of 13 bits.
        
         | gene91 wrote:
          | Exactly. Can anyone offer an explanation of why they didn't
          | take this path for 256-bit numbers instead?
        
           | [deleted]
        
           | dgoldstein0 wrote:
           | Taking a guess: if you use this form to do a bunch of
           | operations before normalizing, you don't need to worry as
           | much about intermediate overflow. E.g. you could easily add
           | 10 different 256 bit numbers in this form and normalize at
           | the end. Whereas if the leading chunk is 64 bits, every
           | addition can overflow.
           | 
           | So 52/51/51/51/51 seems like the more generally useful choice
        
             | bonzini wrote:
             | Why would you ever worry about overflow of the leading
             | chunk (aka limb)?
        
             | romwell wrote:
              | To expand on that: the whole idea is that normalization is
             | not applied after _every_ step.
             | 
             | This approach is efficient when you know you have to add a
             | bunch of numbers together. You add them up, and _then_
             | normalize.
             | 
             | The example can be, e.g. doing long multiplication, or
             | adding all numbers in a list.
             | 
             | (Just repeating what was said in the parent comment for
             | emphasis)
        
           | ChrisLomont wrote:
           | "Unfortunately, we can't do that here - a 64-bit integer only
           | has so many possible values" and "The remaining 12 or 13 bits
           | give us the extra "digits" we need for preventing carries."
           | 
            | Using 64-bit registers, but only using 51 bits, has no
            | overflow. Using 64-bit adds in 64-bit registers overflows,
            | making the carry updates more costly.
        
         | beagle3 wrote:
          | Indeed, if you use only 48 bits, you could also parallelize
          | using the FP hardware - the mantissa is 52 bits, so with
          | 48-bit limbs you get 16 rounds before you must carry. That's
          | much less headroom than 16 (or even 13) spare bits gives, but
          | for processors which have distinct FP vs. integer adders, and
          | that can issue them in parallel - you might get a speed boost.
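That FP variant can be sketched with Python floats, which are IEEE doubles: with 48-bit limbs, 16 accumulations stay below 2^53, so every float addition is exact (a toy model of the idea, not a claim about real FP-limb implementations):

```python
import random

LIMB = 48
MASK = (1 << LIMB) - 1

def split_f(x):
    """Split an integer into 48-bit limbs stored as floats.
    Exact, since 48 bits fit in a double's 53-bit significand."""
    limbs = []
    while x:
        limbs.append(float(x & MASK))
        x >>= LIMB
    return limbs or [0.0]

random.seed(1)
nums = [random.getrandbits(256) for _ in range(16)]

acc = [0.0] * 6                # 6 * 48 = 288 bits covers 256-bit inputs
for n in nums:
    for i, limb in enumerate(split_f(n)):
        acc[i] += limb         # stays exact: 16 * (2^48 - 1) < 2^53

# one deferred carry pass: convert back to integers and recombine
result = sum(int(limb) << (LIMB * i) for i, limb in enumerate(acc))
assert result == sum(nums)
```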
        
       | majke wrote:
       | Funnily enough this is the code I was reading today
       | 
       | https://github.com/constructor-igor/cudafy/blob/11cdd4def4e7...
            s = a ^ b;          // sum bits
            r = a & 0x7f7f7f7f; // clear msbs
            t = b & 0x7f7f7f7f; // clear msbs
            s = s & 0x80808080; // msb sum bits
            r = r + t;          // add without msbs, record carry-out in msbs
            r = r ^ s;          // sum of msb sum and carry-in bits, w/o carry-out
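That snippet is the same idea at byte granularity (a SWAR trick): four independent byte additions inside one 32-bit word, with each lane's MSB masked off so carries can't cross lanes, then patched back in with XOR. A quick check (Python emulating 32-bit words, not the original CUDA source):

```python
def vadd4(a, b):
    """Add four bytes packed in a 32-bit word; lanes wrap independently."""
    s = a ^ b                    # sum bits
    r = a & 0x7f7f7f7f           # clear msbs
    t = b & 0x7f7f7f7f           # clear msbs
    s = s & 0x80808080           # msb sum bits
    r = (r + t) & 0xffffffff     # add without msbs; carry-out lands in msbs
    return r ^ s                 # fold msb sums back in, dropping carry-out

# each byte lane adds modulo 256 with no carry into its neighbour
assert vadd4(0x01FF0203, 0x01010101) == 0x02000304
```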
        
       | barbegal wrote:
       | What sort of real world use case adds large numbers like this?
        
         | commandlinefan wrote:
         | public-key crypto.
        
         | Y_Y wrote:
         | hn upvotes
        
       | brandmeyer wrote:
       | The design of Poly1305 was directly informed by this very
       | technique, with the addition (heh) of performing the arithmetic
       | on the x87 FPU operating in 80-bit mode. It even has some zero
       | padding bits placed strategically to aid the technique, hence the
       | security bound of 2^106 instead of 2^128 or 2^130.
       | 
       | See also https://cr.yp.to/mac/poly1305-20050113.pdf
        
       ___________________________________________________________________
       (page generated 2020-05-29 23:00 UTC)