[HN Gopher] Benchmarking division and libdivide on Apple M1 and ...
       ___________________________________________________________________
        
       Benchmarking division and libdivide on Apple M1 and Intel AVX512
        
       Author : ridiculous_fish
       Score  : 144 points
       Date   : 2021-05-12 18:52 UTC (4 hours ago)
        
 (HTM) web link (ridiculousfish.com)
 (TXT) w3m dump (ridiculousfish.com)
        
       | CoastalCoder wrote:
       | I'm curious why the author showed the C++ source code, but not
       | the (per-architecture) disassembly.
       | 
        | I would think that's a much better starting point for trying to
        | understand the micro-architectural behavior.
        
       | pkw792 wrote:
       | I have just started to use a Mac M1 Mini, and am disappointed.
       | It's incredibly slow to download or install anything. Hangs all
       | the time, takes like 5 hours to install Xcode (it's done but the
       | UI hangs leaving you to believe it has more to do). Hangs when
       | cloning a git repo. Gets stuck anywhere and everywhere. Have to
       | force kill everything and restart to knock some sense into it. I
       | was always respectful of Mac users because Windows has had its
       | problems in the past, but after using a Mac for the first time, I
       | hate it more than ever.
        
         | pkw792 wrote:
         | It's very difficult to follow and engage in technical posts
         | like these benchmarking micro-instructions and so on, where on
         | the face of it the product is simply falling on its face in the
         | most basic use-cases.
        
         | rangewookie wrote:
         | uhhh... your mac might be broken. I have one and my friend has
         | one. We both engage in cpu/gpu intensive workloads and this
         | just doesn't happen. Still within the return window? Would be
         | interesting to find out your fan is DOA or something like
         | that...
        
           | pkw792 wrote:
           | Yeah it's pretty much brand new and for sure within the
           | return window. Maybe there's something wrong with it. I was
           | expecting cool fireworks for sure, but it's been nothing but
           | a PITA thus far.
        
       | pbsd wrote:
       | On Skylake-SP's AVX-512, instructions that previously were
       | dispatched to port 0 or 1 get instead dispatched to ports 0 _and_
       | 1. So instructions like vpsrlq get zero net speedup from
       | switching to AVX-512 from AVX2. Instructions that previously ran
       | on ports 0,1,5 will now run on ports 0 and 5, for a speedup of at
       | best 1.33.
       | 
       | Multiplication will depend on whether the chip has one or two FMA
        | units. With two, you can run vpmuludq on ports 0 and 5, which is a
       | 2x speedup compared to AVX2's ports 0 and 1. This 8275CL Xeon
       | does have 2 FMA units.
       | 
        | Looking at the two inner loops, we have
        | 
        |     up:
        |       vmovdqa64  zmm0,ZMMWORD PTR [rdi+rax*4]  # p23
        |       add        rax,0x10                      # p0156
        |       vpmuludq   zmm1,zmm0,zmm4                # p05 or only p0
        |       vpsrlq     zmm2,zmm1,0x20                # p0
        |       vpsrlq     zmm1,zmm0,0x20                # p0
        |       vpmuludq   zmm1,zmm1,zmm4                # p05 or only p0
        |       vpandd     zmm1,zmm6,zmm1                # p05
        |       vpord      zmm1,zmm1,zmm2                # p05
        |       vpsubd     zmm0,zmm0,zmm1                # p05
        |       vpsrld     zmm0,zmm0,0x1                 # p0
        |       vpaddd     zmm0,zmm0,zmm1                # p05
        |       vpsrld     zmm0,zmm0,xmm5                # p0+p5
        |       vpaddd     zmm3,zmm0,zmm3                # p05
        |       cmp        rax,rdx
        |       jb         up
        | 
        |     up:
        |       vmovdqa    ymm0,YMMWORD PTR [rdi+rax*4]  # p23
        |       add        rax,0x8                       # p0156
        |       vpmuludq   ymm1,ymm0,ymm4                # p01
        |       vpsrlq     ymm2,ymm1,0x20                # p01
        |       vpsrlq     ymm1,ymm0,0x20                # p01
        |       vpmuludq   ymm1,ymm1,ymm4                # p01
        |       vpand      ymm1,ymm1,ymm6                # p015
        |       vpor       ymm1,ymm1,ymm2                # p015
        |       vpsubd     ymm0,ymm0,ymm1                # p015
        |       vpsrld     ymm0,ymm0,0x1                 # p01
        |       vpaddd     ymm0,ymm0,ymm1                # p015
        |       vpsrld     ymm0,ymm0,xmm5                # p01+p5
        |       vpaddd     ymm3,ymm0,ymm3                # p01
        |       cmp        rax,rdx
        |       jb         up
       | 
       | All other things being equal, we have on average, and counting
       | only the differing instructions, a throughput of ~2.27
       | instructions per cycle on the AVX2 loop, whereas it is somewhere
       | around ~1.45-1.60 for AVX-512, depending whether you have 1 or 2
       | FMA units to run multiplications on port 5.
       | 
       | So based on this approximation, the AVX-512 code should probably
       | run around 2*(1.5/2.27) ~ 1.33 times faster. Add to this that
        | vpmuludq is actually one of the most thermally intensive
        | instructions around and will reduce your core's frequency by
       | 100-200 MHz, and the small speedup you see is more or less
       | explainable. (I actually do see some more noticeable speedup here
       | when switching to AVX-512; 0.25 vs 0.21).
       | 
       | PS: The Intel Icelake and later chips also manage to achieve a
       | throughput of 1/2 divisions per cycle for 32-bit divisors, and
       | 1/3 divisions per cycle for 64-bit divisors.
        
         | celrod wrote:
         | FWIW, llvm-mca estimates 448 clock cycles per 100 iterations of
         | the AVX2 loop vs 528 cycles for the AVX512 loop with
          | `-mcpu=cascadelake`. That suggests the AVX512 loop should be
          | about 2*(448/528) ~ 1.70 times faster.
        
           | pbsd wrote:
           | llvm-mca is highly unreliable when it comes to AVX-512. It
           | thinks 3 512-bit vpaddd, vpsubd can be run per cycle.
           | Adjusting for that you get 622 cycles instead of 528.
        
       | brigade wrote:
       | > Speculatively, AVX512 processes multiplies serially, one
       | 256-bit lane at a time, losing half its parallelism.
       | 
       | Sort of, in Skylake AVX512 fuses the 256-bit p0 and p1 together
       | for one 512-bit uop, and p5 becomes 512-bit wide. So
       | theoretically you get 2x 512-bit pipelines versus AVX2's 3x
        | 256-bit pipelines (two of which can do multiplies).
       | 
       | Unfortunately, p5 doesn't support integer multiplies, even in
       | SKUs where p5 _does_ support 512-bit floating-point multiplies.
       | So AVX512 has no additional throughput for integer multiplies on
       | current implementations.
        
         | celrod wrote:
         | p5 can do 512 bit operations, but not 256 bit, e.g. look at
         | Skylake-AVX512 and Cascadelake (Xeon benched in the blog post
         | was Cascadelake) ports for vaddpd:
         | 
         | https://uops.info/html-instr/VADDPD_YMM_YMM_YMM.html
         | 
          | Here is 256 bit VPMULUDQ:
          | https://uops.info/html-instr/VPMULUDQ_YMM_YMM_YMM.html
         | 
          | Here is 512 bit VPMULUDQ:
          | https://uops.info/html-instr/VPMULUDQ_ZMM_ZMM_ZMM.html
         | 
         | The 256 bit and 512 bit versions both have a reciprocal
         | throughput of 0.5 cycles/op, using p01 for 256 bit and p05 for
          | 512 bit (where, as you note, p0 for 512 bit really means both
          | ports 0 and 1).
         | 
         | So, given the same clock speed, this multiplication should have
         | twice the throughput with 512 bit vectors as with 256 bit. This
         | isn't true for those CPUs without p5, like icelake-client,
         | tigerlake, and rocketlake. But should be true for the Xeon
         | ridiculousfish benchmarked on.
        
       | twoodfin wrote:
       | I bet there's more than one integer division unit per core.
       | 
       | When I've been micro-optimizing performance-critical code,
       | integer division shows up as a hot spot regularly. I assume most
       | developers don't think about the performance implications of
       | coding up a / or % between two runtime values, preventing the
       | compiler from doing any strength reduction. Apple must have seen
       | this in their surely voluminous profiling of real-world
       | applications.
        
         | buildbot wrote:
          | I think you got the point most people miss - Apple has a unique
          | ability to profile every app on the Mac and iOS App Stores,
          | possibly in an automated way, as part of the app submission
          | pipeline. Intel and AMD could go out and profile real-world
          | applications, and I'm sure they do, but to get to the same
          | level of breadth is probably not possible.
        
           | Traster wrote:
           | I'm a little skeptical of this, this is the same as saying
           | Tesla is going to have self-driving because they can record
           | all the decisions current Teslas make. The truth is that it's
            | very path dependent. For Tesla it means you can't optimize
            | for scenarios the fleet never gets into in the first place,
            | and for Apple it means you can't actually know which code
            | paths will actually be regularly used.
        
           | pvg wrote:
           | Is that really such a unique advantage for Apple? Intel and
           | AMD can work with Microsoft to achieve something similar, for
           | instance.
        
             | bch wrote:
             | I'm only speculating, but I'd think they wouldn't even
             | "have to work with" anybody - couldn't they just instrument
             | whatever they want?
        
         | mhh__ wrote:
         | The fact that Intel and AMD apparently don't prioritize integer
         | division could suggest that their profiling suggests it's not
         | worth it, but with Apple's transistor budget at the moment they
         | can afford it.
         | 
         | Also keep in mind that this Xeon might not be really made for
         | number crunching (not really sure)?
        
           | rodgerd wrote:
           | Apple's chip designers have the advantage, I assume, of being
           | able to wander down a hallway and ask what the telemetry from
           | iOS and MacOS devices are telling them about real-world use.
        
             | mhh__ wrote:
             | Most of Intel's volume is probably shipped to customers who
             | either don't care or buy _a lot_ of CPUs in one go, so the
              | advantage of this probably isn't quite as apparent as
             | you'd imagine.
             | 
              | What can definitely play a role (I don't think it's as much
              | of a problem these days, but it definitely has been in the
              | past) is the standard "benchmark" suites that chipmakers
              | can beat each other over the head with. E.g. I think it was
              | Itanium that had a bunch of integer functional units mainly
              | for the purpose of getting better SPEC numbers rather than
              | working on the things that actually make programs fast
              | (MEMORY) - I was maybe 1 or 2 when this chip came out, so
              | this is nth-hand gossip, however.
        
           | masklinn wrote:
           | > The fact that Intel and AMD apparently don't prioritize
           | integer division could suggest that their profiling suggests
           | it's not worth it, but with Apple's transistor budget at the
           | moment they can afford it.
           | 
            | Another possibility is that Apple has a very different
           | profiling base e.g. iOS applications, whereas Intel and AMD
           | would have more artificial workloads, or be bound by
           | workloads / profiles from scientific computing or the like
           | (video games)?
        
           | pbsd wrote:
           | Intel greatly improved their divider implementation between
           | Skylake and Icelake. The measurements in the OP are on
           | Skylake-SP, prior to these improvements.
        
           | criddell wrote:
           | Would it be fair to characterize the M1 as being made for
           | number crunching?
        
             | mhh__ wrote:
             | Refining number crunching to mean single threaded
             | performance I would say yes, or at least definitely more so
             | than the Intel chip
        
             | gameswithgo wrote:
             | Not any more than an intel/amd/etc cpu is. Like that XEON
             | cpu is gonna crunch more numbers, just due to more cores.
        
               | mhh__ wrote:
               | If it was intended to be used in the cloud for example
               | it's going to be doing more _work_ but probably designed
               | around a memory-bound load rather than integer
               | throughput.
        
       | seumars wrote:
       | I'm having a hard time concentrating on the article with that
       | background
        
         | codezero wrote:
         | Funny, what size screen are you on? My wife said the same and
         | tbh I didn't even notice it was a paper towel (funny gag) on my
         | desktop system. I may have just not paid attention.
         | 
         | Go into reader mode, the article is great.
        
       | gigatexal wrote:
       | Geez. I wonder if the M2 will just be higher clocked and more
       | cores or if they'll improve the arch even more?
        
         | sroussey wrote:
         | Yes
        
       | lmilcin wrote:
       | As somebody who worked for Intel I am deeply ashamed for this
       | result.
       | 
       | I mean, seriously, all that tradition and experience and you have
       | a phone company make circles around you on your own field.
        
       | mhh__ wrote:
       | Some notes:
       | 
       | * What's the variance of the measurements?
       | 
        | * Per core, the two processors actually have a roughly similar
        | power budget (going by Intel's TDP figure), i.e. 205/26 vs.
        | 39/(4 or 8, depending on whether you count the bigs, the
        | littles, or both). Taking into account that the Apple processor
        | is on a process that is something like 4 or 5 times denser, it's
        | not that surprising to me that it's faster.
        
       | phkahler wrote:
       | Phoronix recently did some benchmarks with AVX512 and while it
       | was (modestly if I recall) faster, it was horribly worse in terms
       | of performance per watt.
       | 
        | I really hope AMD doesn't adopt AVX512, and if they do, I hope
        | it's just the minimum for software compatibility.
       | 
       | On a related note, my Ryzen 2400G does not benefit from
        | recompiling code with -march=x86-64-v3; in fact it seems a tiny
       | bit slower. I assume Zen2 and 3 will actually run faster with
       | that option.
        
       | johnklos wrote:
       | There was a time when division was expensive enough to look for
        | alternatives, but nowadays with the M1 it seems that adding even
        | one or two adds or shifts may end up being more expensive than
        | division. My goodness, how times have changed!
        
       | rock_artist wrote:
       | I didn't read through the entire thing but...
       | 
       | * Would be wise to compare x86_64 under Rosetta as it'll support
       | some AVX translation if I remember correctly.
       | 
        | * I didn't see use of Apple's Accelerate framework. On top of
        | standard ARM64, additional custom Apple magic lives in private
        | extensions / ops that you are meant to reach through higher-level
        | frameworks such as Accelerate.
        
         | hajile wrote:
         | No AVX support with Rosetta2.
         | 
         | Rosetta2 supports up through SSE2. That's the latest
         | instruction set to no longer be patented as of around 2020.
         | They can use x86_64 only because AMD released x86_64 spec in
         | 1999 (even though actual chips came much later).
        
           | gsnedders wrote:
            | It certainly claims to support many things later than SSE2,
            | including everything up to SSE4.2, on this MacBook Air (M1):
           | 
           | % arch -x86_64 sysctl -a | grep machdep.cpu.features
           | 
           | machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC
           | SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE
            | SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST
           | TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 AES SEGLIM64
        
         | floatboth wrote:
         | Custom Apple instructions do... matrix stuff IIRC. Not
         | something applicable to simple number division.
        
           | mhh__ wrote:
           | Has someone actually found out what instructions do what? I
           | assume we won't get to play with them ourselves unless they
           | get the Microsoft treatment over the hidden APIs
        
       | GeekyBear wrote:
       | Some information from Anandtech's deep dive into Apple's "big"
       | Firestorm core.
       | 
       | >On the Integer side, whose in-flight instructions and renaming
       | physical register file capacity we estimate at around 354
       | entries, we find at least 7 execution ports for actual arithmetic
       | operations. These include 4 simple ALUs capable of ADD
       | instructions, 2 complex units which feature also MUL (multiply)
       | capabilities, and what appears to be a dedicated integer division
       | unit. The core is able to handle 2 branches per cycle, which I
       | think is enabled by also one or two dedicated branch forwarding
       | ports, but I wasn't able to 100% confirm the layout of the design
       | here.
       | 
       | On the floating point and vector execution side of things, the
        | new Firestorm cores are actually more impressive as they add a 33%
       | increase in capabilities, enabled by Apple's addition of a fourth
       | execution pipeline. The FP rename registers here seem to land at
       | 384 entries, which is again comparatively massive. The four
       | 128-bit NEON pipelines thus on paper match the current throughput
       | capabilities of desktop cores from AMD and Intel, albeit with
       | smaller vectors. Floating-point operations throughput here is 1:1
       | with the pipeline count, meaning Firestorm can do 4 FADDs and 4
       | FMULs per cycle with respectively 3 and 4 cycles latency. That's
       | quadruple the per-cycle throughput of Intel CPUs and previous AMD
       | CPUs, and still double that of the recent Zen3, of course, still
        | running at lower frequency. This might be one reason why Apple
        | does so well in browser benchmarks (JavaScript numbers are
       | floating-point doubles).
       | 
       | Vector abilities of the 4 pipelines seem to be identical, with
       | the only instructions that see lower throughput being FP
        | divisions, reciprocals and square-root operations, which only
        | have a throughput of 1, on one of the four pipes.
       | 
       | https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
        
         | gsnedders wrote:
         | > This might be one reason why Apples does so well in browser
         | benchmarks (JavaScript numbers are floating-point doubles).
         | 
         | Reminder that browsers try to avoid using doubles for the
         | Number type, preferring integers with overflow checks. Much of
         | layout uses fixed point for subpixels, too. Using doubles all
         | the time would be a notable perf regression.
        
       | amelius wrote:
       | What's the fastest way to implement integer division in hardware?
        
         | pcwalton wrote:
         | I always assumed that CPUs used Newton's method, though that
         | could be wrong.
         | https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%8...
         | 
         | Edit: Yeah, that's only used for floating point. Looks like
         | integer division is usually an algorithm called SRT:
         | https://en.wikipedia.org/wiki/Division_algorithm#SRT_divisio...
        
       | petermcneeley wrote:
       | " Is the hardware divider pipelined, is there more than one per
       | core?"
       | 
       | Chaining the divisions (as a series of dependencies) would enable
       | one to see the full latency of a single divide. You could use
       | this data to estimate the number of divide units on the core.
        
       | torstenvl wrote:
       | Great and interesting work! The slowness of division operations
       | is overlooked too often IMHO and is key to (my approach to)
       | avoiding things like integer overflows (there may be a better way
       | than dividing TYPE_MAX by one of the operands but I don't know an
       | alternate technique). Pretty impressive if the M1 really can
       | achieve two-clock-cycle division on a consistent basis.
       | 
       | May I offer a nitpicking correction? 1.058ns compared to 6.998ns
       | is an 85% savings, not 88%. The listing you have suggests that
       | going down to 1.058ns is a bigger speed-up than going down to
       | 0.891ns.
       | 
       | (PS - Verizon's sale of Yahoo has been in the news lately so I
       | thought of you and the other regulars of the Programming chat
       | room the other day. Hope all is well.)
        
         | ridiculous_fish wrote:
         | Fixed the percentage, thank you. Hope you are doing well too!
        
         | david2ndaccount wrote:
         | Depending on what you're doing, you can usually just use the
         | compiler intrinsics and check for overflow aka,
         | `__builtin_mul_overflow` and similar instead of guarding
         | against it.
        
           | torstenvl wrote:
            | Useful to know. And if we don't care about portability, we
            | can just write a function in assembler that checks the carry
            | or overflow flag, or whatever the architecture's equivalent
            | is.
        
             | mhh__ wrote:
             | https://gcc.gnu.org/wiki/DontUseInlineAsm
        
               | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-05-12 23:00 UTC)