[HN Gopher] Benchmarking division and libdivide on Apple M1 and ...
___________________________________________________________________
 Benchmarking division and libdivide on Apple M1 and Intel AVX512
 Author : ridiculous_fish
 Score  : 144 points
 Date   : 2021-05-12 18:52 UTC (4 hours ago)
 (HTM) web link (ridiculousfish.com)
 (TXT) w3m dump (ridiculousfish.com)

| CoastalCoder wrote:
| I'm curious why the author showed the C++ source code, but not
| the (per-architecture) disassembly.
|
| I would think that's a much better starting point for trying to
| understand the micro-architectural behavior.
| pkw792 wrote:
| I have just started to use a Mac M1 Mini, and am disappointed.
| It's incredibly slow to download or install anything. Hangs all
| the time, takes like 5 hours to install Xcode (it's done but the
| UI hangs, leaving you to believe it has more to do). Hangs when
| cloning a git repo. Gets stuck anywhere and everywhere. Have to
| force kill everything and restart to knock some sense into it. I
| was always respectful of Mac users because Windows has had its
| problems in the past, but after using a Mac for the first time, I
| hate it more than ever.
| pkw792 wrote:
| It's very difficult to follow and engage in technical posts
| like these benchmarking micro-instructions and so on, when on
| the face of it the product is simply falling on its face in the
| most basic use-cases.
| rangewookie wrote:
| uhhh... your Mac might be broken. I have one and my friend has
| one. We both engage in cpu/gpu intensive workloads and this
| just doesn't happen. Still within the return window? Would be
| interesting to find out your fan is DOA or something like
| that...
| pkw792 wrote:
| Yeah, it's pretty much brand new and for sure within the
| return window. Maybe there's something wrong with it. I was
| expecting cool fireworks for sure, but it's been nothing but
| a PITA thus far.
| pbsd wrote:
| On Skylake-SP's AVX-512, instructions that previously were
| dispatched to port 0 or 1 get instead dispatched to ports 0 _and_
| 1. So instructions like vpsrlq get zero net speedup from
| switching to AVX-512 from AVX2. Instructions that previously ran
| on ports 0, 1, and 5 will now run on ports 0 and 5, for a speedup
| of at best 1.33.
|
| Multiplication will depend on whether the chip has one or two FMA
| units. If it has two, you can run vpmuludq on ports 0 and 5,
| which is a 2x speedup compared to AVX2's ports 0 and 1. This
| 8275CL Xeon does have 2 FMA units.
|
| Looking at the two inner loops, we have:
|
|     up: vmovdqa64 zmm0,ZMMWORD PTR [rdi+rax*4]   # p23
|         add       rax,0x10                       # p0156
|         vpmuludq  zmm1,zmm0,zmm4                 # p05 or only p0
|         vpsrlq    zmm2,zmm1,0x20                 # p0
|         vpsrlq    zmm1,zmm0,0x20                 # p0
|         vpmuludq  zmm1,zmm1,zmm4                 # p05 or only p0
|         vpandd    zmm1,zmm6,zmm1                 # p05
|         vpord     zmm1,zmm1,zmm2                 # p05
|         vpsubd    zmm0,zmm0,zmm1                 # p05
|         vpsrld    zmm0,zmm0,0x1                  # p0
|         vpaddd    zmm0,zmm0,zmm1                 # p05
|         vpsrld    zmm0,zmm0,xmm5                 # p0+p5
|         vpaddd    zmm3,zmm0,zmm3                 # p05
|         cmp       rax,rdx
|         jb        up
|
|     up: vmovdqa   ymm0,YMMWORD PTR [rdi+rax*4]   # p23
|         add       rax,0x8                        # p0156
|         vpmuludq  ymm1,ymm0,ymm4                 # p01
|         vpsrlq    ymm2,ymm1,0x20                 # p01
|         vpsrlq    ymm1,ymm0,0x20                 # p01
|         vpmuludq  ymm1,ymm1,ymm4                 # p01
|         vpand     ymm1,ymm1,ymm6                 # p015
|         vpor      ymm1,ymm1,ymm2                 # p015
|         vpsubd    ymm0,ymm0,ymm1                 # p015
|         vpsrld    ymm0,ymm0,0x1                  # p01
|         vpaddd    ymm0,ymm0,ymm1                 # p015
|         vpsrld    ymm0,ymm0,xmm5                 # p01+p5
|         vpaddd    ymm3,ymm0,ymm3                 # p01
|         cmp       rax,rdx
|         jb        up
|
| All other things being equal, we have on average, and counting
| only the differing instructions, a throughput of ~2.27
| instructions per cycle on the AVX2 loop, whereas it is somewhere
| around ~1.45-1.60 for AVX-512, depending on whether you have 1 or
| 2 FMA units to run multiplications on port 5.
|
| So based on this approximation, the AVX-512 code should probably
| run around 2*(1.5/2.27) ~ 1.33 times faster.
| Add to this that vpmuludq is actually one of the most thermally
| intensive instructions around and will reduce your core's
| frequency by 100-200 MHz, and the small speedup you see is more
| or less explainable. (I actually do see some more noticeable
| speedup here when switching to AVX-512; 0.25 vs 0.21.)
|
| PS: The Intel Icelake and later chips also manage to achieve a
| throughput of 1/2 divisions per cycle for 32-bit divisors, and
| 1/3 divisions per cycle for 64-bit divisors.
| celrod wrote:
| FWIW, llvm-mca estimates 448 clock cycles per 100 iterations of
| the AVX2 loop vs 528 cycles for the AVX512 loop with
| `-mcpu=cascadelake`. That suggests the AVX512 loop should be
| about 2*(448/528) ~ 1.70 times faster.
| pbsd wrote:
| llvm-mca is highly unreliable when it comes to AVX-512. It
| thinks 3 512-bit vpaddd, vpsubd can be run per cycle.
| Adjusting for that you get 622 cycles instead of 528.
| brigade wrote:
| > Speculatively, AVX512 processes multiplies serially, one
| 256-bit lane at a time, losing half its parallelism.
|
| Sort of: in Skylake, AVX512 fuses the 256-bit p0 and p1 together
| for one 512-bit uop, and p5 becomes 512-bit wide. So
| theoretically you get 2x 512-bit pipelines versus AVX2's 3x
| 256-bit pipelines (two of which can do multiplies).
|
| Unfortunately, p5 doesn't support integer multiplies, even in
| SKUs where p5 _does_ support 512-bit floating-point multiplies.
| So AVX512 has no additional throughput for integer multiplies on
| current implementations.
| celrod wrote:
| p5 can do 512 bit operations, but not 256 bit, e.g.
look at
| Skylake-AVX512 and Cascadelake (the Xeon benched in the blog
| post was Cascadelake) ports for vaddpd:
|
| https://uops.info/html-instr/VADDPD_YMM_YMM_YMM.html
|
| Here is 256 bit VPMULUDQ:
| https://uops.info/html-instr/VPMULUDQ_YMM_YMM_YMM.html
|
| Here is 512 bit VPMULUDQ:
| https://uops.info/html-instr/VPMULUDQ_ZMM_ZMM_ZMM.html
|
| The 256 bit and 512 bit versions both have a reciprocal
| throughput of 0.5 cycles/op, using p01 for 256 bit and p05 for
| 512 bit (where, as you note, p0 for 512 bit really means both 0
| and 1).
|
| So, given the same clock speed, this multiplication should have
| twice the throughput with 512 bit vectors as with 256 bit. This
| isn't true for those CPUs without p5, like icelake-client,
| tigerlake, and rocketlake. But it should be true for the Xeon
| ridiculousfish benchmarked on.
| twoodfin wrote:
| I bet there's more than one integer division unit per core.
|
| When I've been micro-optimizing performance-critical code,
| integer division shows up as a hot spot regularly. I assume most
| developers don't think about the performance implications of
| coding up a / or % between two runtime values, preventing the
| compiler from doing any strength reduction. Apple must have seen
| this in their surely voluminous profiling of real-world
| applications.
| buildbot wrote:
| I think you got the point most people miss - Apple has a unique
| ability to profile every app on the Mac and iOS App Stores,
| possibly in an automated way, as part of the app submission
| pipeline. Intel and AMD could go out and profile real-world
| applications, and I'm sure they do, but to get to the same
| level of breadth is probably not possible.
| Traster wrote:
| I'm a little skeptical of this; it's the same as saying
| Tesla is going to have self-driving because they can record
| all the decisions current Teslas make. The truth is that it's
| very path dependent.
For Tesla this
| means you can't optimize getting into the scenario in the first
| place, and for Apple it means you can't actually know which code
| paths will regularly be used.
| pvg wrote:
| Is that really such a unique advantage for Apple? Intel and
| AMD can work with Microsoft to achieve something similar, for
| instance.
| bch wrote:
| I'm only speculating, but I'd think they wouldn't even
| "have to work with" anybody - couldn't they just instrument
| whatever they want?
| mhh__ wrote:
| The fact that Intel and AMD apparently don't prioritize integer
| division could suggest that their profiling suggests it's not
| worth it, but with Apple's transistor budget at the moment they
| can afford it.
|
| Also keep in mind that this Xeon might not be really made for
| number crunching (not really sure)?
| rodgerd wrote:
| Apple's chip designers have the advantage, I assume, of being
| able to wander down a hallway and ask what the telemetry from
| iOS and macOS devices is telling them about real-world use.
| mhh__ wrote:
| Most of Intel's volume is probably shipped to customers who
| either don't care or buy _a lot_ of CPUs in one go, so the
| advantage of this probably isn't quite as apparent as you'd
| imagine.
|
| What can definitely play a role (I don't think it's as much
| of a problem these days, but it definitely has been in the
| past) is the standard "benchmark" suites that chipmakers can
| beat each other over the head with. E.g. I think it was
| Itanium that had a bunch of integer functional units mainly
| for the purpose of getting better SPEC numbers rather than
| working on the things that actually make programs fast
| (MEMORY) - I was maybe 1 or 2 when this chip came out, so
| this is nth-hand gossip, however.
| masklinn wrote:
| > The fact that Intel and AMD apparently don't prioritize
| integer division could suggest that their profiling suggests
| it's not worth it, but with Apple's transistor budget at the
| moment they can afford it.
|
| Another possibility is that Apple has a very different
| profiling base, e.g. iOS applications, whereas Intel and AMD
| would have more artificial workloads, or be bound by
| workloads / profiles from scientific computing or the like
| (video games)?
| pbsd wrote:
| Intel greatly improved their divider implementation between
| Skylake and Icelake. The measurements in the OP are on
| Skylake-SP, prior to these improvements.
| criddell wrote:
| Would it be fair to characterize the M1 as being made for
| number crunching?
| mhh__ wrote:
| Refining number crunching to mean single-threaded performance,
| I would say yes, or at least definitely more so than the Intel
| chip.
| gameswithgo wrote:
| Not any more than an Intel/AMD/etc CPU is. Like, that Xeon
| CPU is gonna crunch more numbers, just due to more cores.
| mhh__ wrote:
| If it was intended to be used in the cloud, for example,
| it's going to be doing more _work_ but probably designed
| around a memory-bound load rather than integer throughput.
| seumars wrote:
| I'm having a hard time concentrating on the article with that
| background
| codezero wrote:
| Funny, what size screen are you on? My wife said the same and
| tbh I didn't even notice it was a paper towel (funny gag) on my
| desktop system. I may have just not paid attention.
|
| Go into reader mode, the article is great.
| gigatexal wrote:
| Geez. I wonder if the M2 will just be higher clocked and more
| cores or if they'll improve the arch even more?
| sroussey wrote:
| Yes
| lmilcin wrote:
| As somebody who worked for Intel, I am deeply ashamed of this
| result.
|
| I mean, seriously: all that tradition and experience, and you
| have a phone company run circles around you on your own field.
| mhh__ wrote:
| Some notes:
|
| * What's the variance of the measurements?
|
| * Per core, the two processors actually have a roughly similar
| power budget (keep in mind this is based on Intel's TDP figure),
| i.e. 205/26 vs. 39/(4 or 8, depending on whether you count the
| bigs, the littles, or both). Taking into account that the Apple
| processor is on a process that is something like 4 or 5 times
| denser, it's not that surprising to me that it's faster.
| phkahler wrote:
| Phoronix recently did some benchmarks with AVX512 and while it
| was (modestly, if I recall) faster, it was horribly worse in
| terms of performance per watt.
|
| I really hope AMD doesn't adopt AVX512, and if they do I hope
| it's just the minimum for software compatibility.
|
| On a related note, my Ryzen 2400G does not benefit from
| recompiling code with -march=x86-64-v3; in fact it seems a tiny
| bit slower. I assume Zen 2 and 3 will actually run faster with
| that option.
| johnklos wrote:
| There was a time when division was expensive enough to look for
| alternatives, but nowadays with the M1 it seems that even one
| or two adds or shifts may end up being more expensive than a
| division. My goodness, how times have changed!
| rock_artist wrote:
| I didn't read through the entire thing but...
|
| * It would be wise to compare x86_64 under Rosetta, as it'll
| support some AVX translation if I remember correctly.
|
| * I didn't see use of Apple's Accelerate framework. On ARM64,
| additional custom Apple magic lives in private extensions /
| ops that should be used through higher-level frameworks such
| as Accelerate.
| hajile wrote:
| No AVX support with Rosetta 2.
|
| Rosetta 2 supports up through SSE2. That's the latest
| instruction set to no longer be patented as of around 2020.
| They can use x86_64 only because AMD released the x86_64 spec
| in 1999 (even though actual chips came much later).
| gsnedders wrote:
| It certainly claims to support many things later than SSE2,
| including everything up to SSE4.2, on this MacBook Air (M1):
|
|   % arch -x86_64 sysctl -a | grep machdep.cpu.features
|   machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC
|   SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE
|   SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST
|   TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 AES SEGLIM64
| floatboth wrote:
| Custom Apple instructions do... matrix stuff IIRC. Not
| something applicable to simple number division.
| mhh__ wrote:
| Has someone actually found out what instructions do what? I
| assume we won't get to play with them ourselves unless they
| get the Microsoft treatment over the hidden APIs
| GeekyBear wrote:
| Some information from Anandtech's deep dive into Apple's "big"
| Firestorm core.
|
| >On the Integer side, whose in-flight instructions and renaming
| physical register file capacity we estimate at around 354
| entries, we find at least 7 execution ports for actual arithmetic
| operations. These include 4 simple ALUs capable of ADD
| instructions, 2 complex units which feature also MUL (multiply)
| capabilities, and what appears to be a dedicated integer division
| unit. The core is able to handle 2 branches per cycle, which I
| think is enabled by also one or two dedicated branch forwarding
| ports, but I wasn't able to 100% confirm the layout of the design
| here.
|
| On the floating point and vector execution side of things, the
| new Firestorm cores are actually more impressive, as they feature
| a 33% increase in capabilities, enabled by Apple's addition of a
| fourth execution pipeline. The FP rename registers here seem to
| land at 384 entries, which is again comparatively massive. The
| four 128-bit NEON pipelines thus on paper match the current
| throughput capabilities of desktop cores from AMD and Intel,
| albeit with smaller vectors.
Floating-point operations throughput here is 1:1
| with the pipeline count, meaning Firestorm can do 4 FADDs and 4
| FMULs per cycle with respectively 3 and 4 cycles latency. That's
| quadruple the per-cycle throughput of Intel CPUs and previous AMD
| CPUs, and still double that of the recent Zen3, of course, still
| running at lower frequency. This might be one reason why Apple
| does so well in browser benchmarks (JavaScript numbers are
| floating-point doubles).
|
| Vector abilities of the 4 pipelines seem to be identical, with
| the only instructions that see lower throughput being FP
| divisions, reciprocals and square-root operations, which only
| have a throughput of 1, on one of the four pipes.
|
| https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
| gsnedders wrote:
| > This might be one reason why Apple does so well in browser
| benchmarks (JavaScript numbers are floating-point doubles).
|
| Reminder that browsers try to avoid using doubles for the
| Number type, preferring integers with overflow checks. Much of
| layout uses fixed point for subpixels, too. Using doubles all
| the time would be a notable perf regression.
| amelius wrote:
| What's the fastest way to implement integer division in hardware?
| pcwalton wrote:
| I always assumed that CPUs used Newton's method, though that
| could be wrong.
| https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%8...
|
| Edit: Yeah, that's only used for floating point. Looks like
| integer division is usually an algorithm called SRT:
| https://en.wikipedia.org/wiki/Division_algorithm#SRT_divisio...
| petermcneeley wrote:
| "Is the hardware divider pipelined, is there more than one per
| core?"
|
| Chaining the divisions (as a series of dependencies) would enable
| one to see the full latency of a single divide. You could use
| this data to estimate the number of divide units on the core.
| torstenvl wrote:
| Great and interesting work!
The slowness of division operations | is overlooked too often IMHO and is key to (my approach to) | avoiding things like integer overflows (there may be a better way | than dividing TYPE_MAX by one of the operands but I don't know an | alternate technique). Pretty impressive if the M1 really can | achieve two-clock-cycle division on a consistent basis. | | May I offer a nitpicking correction? 1.058ns compared to 6.998ns | is an 85% savings, not 88%. The listing you have suggests that | going down to 1.058ns is a bigger speed-up than going down to | 0.891ns. | | (PS - Verizon's sale of Yahoo has been in the news lately so I | thought of you and the other regulars of the Programming chat | room the other day. Hope all is well.) | ridiculous_fish wrote: | Fixed the percentage, thank you. Hope you are doing well too! | david2ndaccount wrote: | Depending on what you're doing, you can usually just use the | compiler intrinsics and check for overflow aka, | `__builtin_mul_overflow` and similar instead of guarding | against it. | torstenvl wrote: | Useful to know, but if we don't care about portability we can | just write a function in assembler that checks the carry or | overflow flag or whatever the architecture's equivalent is. | mhh__ wrote: | https://gcc.gnu.org/wiki/DontUseInlineAsm | [deleted] ___________________________________________________________________ (page generated 2021-05-12 23:00 UTC)