[HN Gopher] ARM or x86? ISA Doesn't Matter (2021)
___________________________________________________________________
ARM or x86? ISA Doesn't Matter (2021)
Author : NavinF
Score : 73 points
Date : 2023-05-14 20:38 UTC (2 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| TheLoafOfBread wrote:
| What does matter is standardization, for example the boot process. When I have an x86 image of Windows/Linux, I can boot it on any x86 processor. When I have an ARM image, well, then I can boot it on the SoC it is built for, and that's a big maybe, because if outside peripherals are different (e.g. a different LCD driver) or live on different pins of the SoC, then I am screwed and will have at best a partially working system.
|
| Standardization is something that will carry x86 very far into the future despite its inefficiency on low-power devices.
| dehrmann wrote:
| > When I have an x86 image of Windows/Linux, I can boot it on any x86 processor.
|
| Where this gets absurd is that modern Debian supports i686 and up. You should be able to get a 27-year-old Pentium Pro to boot the same image as a Raptor Lake CPU.
| nubinetwork wrote:
| > When I have an ARM image, well, then I can boot it on the SoC it is built for ... or live on different pins of the SoC, then I am screwed and will have at best a partially working system.
|
| That's not even the half of it either... what firmware does the board run? U-Boot is nice, but sometimes you aren't lucky and you're stuck with something proprietary. Although if you're extremely lucky, you'll have firmware that supports EFI kernels.
| dtx1 wrote:
| Yeah, no. Tooling support, driver support, and general optimization matter. So does platform maturity.
|
| You don't want your phone to run x86 (and it won't for a while), and though possible, it's a pain to deal with an ARM server at the moment because some random library you use just won't be compatible.
| And if single-threaded performance matters, ARM is behind by a decade.
| dehrmann wrote:
| It sounds like Atom could have found a home in phones.
| circuit10 wrote:
| Having an Oracle Cloud Free Tier ARM VPS, it's surprising how much just works. I think the only thing I couldn't run was Chrome Remote Desktop (yes, I want to remote into my VPS sometimes; for example, it's the easiest way to leave a GUI program running in the background without leaving my PC on), and only a few other things needed extra steps. But it's probably a lot different on desktop, or if you're running different types of programs.
| wolf550e wrote:
| Some libraries have x86 SIMD code but no ARM SIMD code, so when benchmarking real-world use cases you're comparing SIMD vs scalar code, and x86 is much faster. Server-side libraries for ARM are in a less mature state than for x86.
| rektide wrote:
| We don't know, is the only good answer. We haven't done much trying in the past 10 years.
|
| Intel's Lakefield was doing quite well in the tablet/MID space. It also had the disadvantage of comparatively ancient Atom-esque cores (far worse than Intel's new E cores) and a massive Skylake core.
|
| ARM is no longer behind by all that much on single-threaded performance. On Geekbench an M2 can do 1916 points, a 7950 2300 points. Slightly bigger gap on Cinebench, 1580 vs 2050. A big part of the gap here is almost certainly the clock speeds being so different.
|
| We just don't know. There are old beliefs we have held, but we had so little evidence for these biases. x86 rarely tried to be really tiny, and had so much more to learn if it was to succeed. ARM rarely tried to be big, and has been learning. There's scant evidence of real limiting factors for either.
| jsheard wrote:
| > You don't want your phone to run x86 (and it won't for a while)
|
| It's easily forgotten, but there were Android phones which used Intel x86 processors, such as the early Asus Zenfones.
| They didn't stick, though.
| hedora wrote:
| These benchmarks suggest ARM has been at single-threaded performance parity on servers since 2020:
|
| https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...
|
| (Apple Silicon blows them away on laptops, of course.)
| tyingq wrote:
| I imagine some of the remaining gap might be places where inline ASM or things like SIMD, AVX, etc. exist, where there have been more years and a larger set of people optimizing that ASM for x86-64 servers.
| kaelinl wrote:
| This isn't the point of the article.
|
| The article is commenting on CPU design: area efficiency, power efficiency, design cost, etc. They're proposing that the reason x86 CPUs have historically beaten ARM CPUs in performance, and the reason ARM CPUs have historically beaten x86 CPUs in power efficiency, has nothing to do with the design of the ISA itself. You could build an ARM CPU to beat an x86 CPU in high-performance computing, or vice versa. They're saying that the format of the instructions and the particular way the operations are structured isn't the driving factor. Instead, it's just a historical artifact of how the ISAs were used.
|
| In other words, yes, there are plenty of ecosystem reasons that these two (and potentially more) families of chips are better for some things vs. others, but if the two companies had swapped their ISAs 30 years ago, we might see exactly the same ecosystem just with different instruction formats.
| isidor3 wrote:
| It is interesting to me how both instruction sets have converged on splitting operations into simpler micro-ops. The author briefly mentions RISC-V as having "better" core instructions, but it makes me wonder if having the best possible instructions would even help that much.
|
| If you made a CPU that directly ran off of some convergent microcode, would you then lose because of the bandwidth needed to get those instructions to the chip?
| Or is compressing instruction streams already a pretty-well-solved problem if you're able to do it from a clean slate, instead of being tied to whatever instruction representations a chip happened to be using many years ago?
| circuit10 wrote:
| > If you made a CPU that directly ran off of some convergent microcode
|
| I think that's the original idea behind RISC.
| mafribe wrote:
| Microcode is often attributed to [1] from 1952.
|
| [1] M. V. Wilkes, J. B. Stringer, _Micro-programming and the design of the control circuits in an electronic digital computer._
| kwhitefoot wrote:
| It's arguable that Babbage's design for the Analytical Engine included microcode.
|
| See, i.a., https://www.fourmilab.ch/babbage/glossary.html
| isidor3 wrote:
| Yes, and obviously ARM didn't choose the instructions in its reduced set optimally, if the best implementations require those instructions to be split into smaller ones. But that doesn't really speak to whether that's because it's just _better_ to pack instructions that way, or because these implementations of ARM and x86 just need to do it to be performant in spite of deficiencies in their instruction sets.
| api wrote:
| Decoder complexity matters. ARM with its single instruction width allows arbitrarily parallel decoders with only linear growth in transistor count. x86 with its many widths and formats requires decoders that grow exponentially in complexity with parallelism, consuming more silicon and power to achieve higher levels of instruction-level parallelism. It requires a degree of brute force, with many possible size branches being explored at once, among other expensive tricks.
|
| This is one of the major areas where the instruction sets are not equal. ARM has a distinct efficiency advantage.
| userbinator wrote:
| How is it exponential? It's only a multiplicative increase in decode positions.
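[Editor's note: the "many possible size branches explored at once" point above can be illustrated with a toy model. This is purely illustrative Python, not hardware: the instruction lengths are invented, and real decoders resolve boundaries speculatively within a cycle or two rather than sequentially as this sketch does.]

```python
def fixed_width_boundaries(num_bytes, width=4):
    # Fixed-width ISA (e.g. AArch64): every instruction boundary is known
    # up front, so N decoders can work at offsets 0, 4, 8, ... in parallel
    # with no dependency between them.
    return list(range(0, num_bytes, width))

def variable_length_boundaries(length_at):
    # Variable-length ISA (e.g. x86): whether byte k starts an instruction
    # depends on the lengths of all earlier instructions. Hardware copes by
    # speculatively decoding at many byte offsets at once and discarding
    # the offsets that turn out not to be real instruction starts.
    boundaries, pos = [], 0
    while pos < len(length_at):
        boundaries.append(pos)
        pos += length_at[pos]
    return boundaries

# length_at[k] = length of the instruction *if* one started at byte k
# (values invented for illustration).
lengths = [3, 1, 2, 5, 1, 4, 2, 1, 3, 2, 6, 1, 2, 4, 1, 1]
print(fixed_width_boundaries(16))           # [0, 4, 8, 12]
print(variable_length_boundaries(lengths))  # [0, 3, 8, 11, 12, 14, 15]
```

The serial dependency in the second function is what the "multiplicative increase in decode positions" pays for: a wide variable-length decoder starts work at every candidate offset and throws most of it away.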
| thechao wrote:
| It's not exponential; it's not even quadratic (it is superlinear), if you put any thought into the design. I worked on an x86 part with 15 decoders/fetch unit. The area was annoying, but unimportant. (We didn't commit 15 ops/cycle; just PC-boundary determination.)
|
| I've also worked on ARM/custom GPU ISAs. The limiting factor is the total complexity of the ISA, not the encoding density.
|
| In fact, from an I$ point of view, the tighter x86 encodings are a pretty good win -- at least a few % on very long fetch sequences.
| jeffbee wrote:
| Does anyone actually care about this? The x86 decoders are not large on modern implementations, and putting more transistors on dice is a well-solved problem.
| api wrote:
| It uses more power. The decoder is like another ALU that is always screaming at 100%. It means you can easily keep up with ARM in speed but not power efficiency.
| jeffbee wrote:
| It doesn't seem to matter _in practice_. Current-generation Intel CPUs and the Apple M2 have very similar performance at the same power levels.
| rowanG077 wrote:
| How did you arrive at that conclusion? Comparing the M2 Max vs the 13650HX, it's very obvious the M2 Max uses a LOT less power. It's not even close; it's less than HALF the power. The M2 Max has a little worse performance, but it manages to beat the Intel in some benchmarks.
| jeffbee wrote:
| You don't have to let the Intel chips scale up the power like that. You can lock them to whatever power level suits you. An i7-1370P configured at 20W has broadly similar performance to an M2.
| rowanG077 wrote:
| Mind linking me some power measurements at the same wattage? I didn't even know you could set a power target on Intel or Apple Mx chips. Well, you can disable turbo boost on Intel, but even then Intel blows past its own marketed TDP by a lot.
| jeffbee wrote:
| Intel introduced the "running average power limit" over ten years ago.
| https://lkml.indiana.edu/hypermail/linux/kernel/1304.0/01322...
| rowanG077 wrote:
| RAPL doesn't allow setting a limit which is always obeyed. You can set PL1 and PL2 limits, but Intel CPUs will gladly go over those limits in the short term, for example when running a benchmark. That's why I asked for specific benchmarks which include power measurements.
|
| For example: https://www.notebookcheck.net/i7-1360P-vs-M2_14731_14521.247...
|
| This shows the M2 has a little worse performance compared to the 1360P, but the 1360P requires 2.5x the power to achieve that.
| Panzer04 wrote:
| Apple's chips are on better process nodes, which confuses the issue. That being said, you really have to test chips at the same power level to get an idea of performance per watt in a comparison.
|
| You can easily double CPU power for only a few hundred MHz or 10-20% extra performance.
|
| See https://www.pcworld.com/article/1359352/cool-down-a-deep-div..., which benchmarks chips at different power limits for an example.
| rowanG077 wrote:
| Yes, I agree with you. That doesn't mean this is easy to achieve. With the exception of AMD chips, it's unfortunately very hard to simply "benchmark with a fixed power budget".
| arp242 wrote:
| How much power does it (roughly) use? Are we talking about 1% of the overall usage? 10%? 50%?
| tester756 wrote:
| This article states differently, so which is it?
| mafribe wrote:
| One could argue that one of the reasons why SIMD instructions, and indeed GPUs, are popular is that they amortise the (transistor and power) cost of decoding over more compute units, in the case of GPUs over many more.
|
| There are also other considerations, like rolling back state in OOO machines, or precise exceptions. All this becomes more complex with an x86-style instruction set.
| tux3 wrote:
| The x86 decoders consume a reasonable amount of power, but the trouble is making them wider without affecting that.
|
| I have an AMD CPU. Zen CPUs come with a fairly wide backend, but the frontend is what it is (especially in early Zen), and without SMT it's essentially impossible to keep all those execution units fed. It's not that 8 x86 decoders wouldn't be a benefit; it's just that more decoders aren't cheap in x86 cores. Each extra decoder is a serious cost.
|
| If you compare with the big ARM cores, having a wide frontend is not a complex research problem or an impractical cost. 8-wide ARM decode is completely practical. You even have open-source superscalar RISC-V cores publicly available on GitHub, running on FPGAs with 8-wide decode. Large frontends are (relatively) cheap and easy, if you're not x86.
|
| So when we notice that the narrower x86 CPU's decode doesn't consume that much (a "drop in the ocean"), that's because it was designed narrower to keep the PPA reasonable! The reason I can't feed my Zen backend isn't that having a wide frontend is useless and I should just enable SMT anyway; it's that x86 makes wide decode much less practical than competing architectures.
| tester756 wrote:
| > The x86 decoders consume a reasonable amount of power
|
| This article states otherwise, so which is it?
| ip26 wrote:
| It's a trade-off between problems. Variable-length instructions are not as trivial to decode wide, so you need more cleverness there. However, fixed-length instructions decrease code density, which asks more of the instruction cache. Note Zen4 has a 32 KB L1 instruction cache while the M1 has a 192 KB L1 instruction cache, requiring extra cleverness instead to handle the higher latency and area. Meanwhile, micro-op caches hide both problems.
|
| There are ripple effects to consider as well. The large L1 caches of the M1 (320 KB total) put capacity pressure on L2, towards larger sizes and/or away from an inclusive policy. See the 12 MB shared L2.
| Meanwhile, the narrower decode of Zen4 puts pressure on things like branch prediction accuracy and mispredict correction latency - if you predicted the wrong codepath, you can't catch up as quickly. See the large branch predictors on Zen4.
| pclmulqdq wrote:
| ARM used to find itself on the wrong side of this tradeoff in the era of 4-wide x86 decode units and 4-6 wide ARM decoders. They lost too much perf to cache size for the decoder width to make up for it.
|
| It's unclear to me whether they will pull ahead in the perf/area game with the era of 8-wide x86 decoders coming.
| codedokode wrote:
| x86 also uses 2-address instructions, which means that you often need extra moves between registers (additional instructions); example: [1]. ARM uses 3-address instructions.
|
| Also, x86 code is compact, but not as compact as in the era of the 8080 [2] - here addition and multiplication require 3 bytes each, 6 bytes total. To my surprise, ARM has a multiply-add instruction, and it uses just 4 bytes (instead of 8) [3].
|
| And RISC-V uses 6 bytes because of a shortened instruction for the addition [4].
|
| Of course, this simple function cannot be a replacement for proper analysis, but it seems that x86 code is not significantly denser.
|
| Also, to my great disappointment, none of these CPUs has checked overflow for arithmetic operations.
|
| [1] https://godbolt.org/z/jsoccE5jv
|
| [2] https://godbolt.org/z/jTMs1MEzh
|
| [3] https://godbolt.org/z/nGb8qKcxe
|
| [4] https://godbolt.org/z/x9c115crY
| cesarb wrote:
| > Note Zen4 has a 32 KB L1 instruction cache while the M1 has a 192 KB L1 instruction cache, requiring extra cleverness here instead to handle the higher latency and area.
|
| There's another factor here: to have low latency, the L1 cache has to be indexed by the bits which don't change when translating from virtual addresses to physical addresses.
| That makes it harder to have a larger low-latency L1 cache when the native page size is 4KiB (AMD/Intel) instead of 16KiB (Apple M1/M2).
|
| That is, most of the "cleverness" allowing for a larger L1 instruction cache is simply a larger page size.
| mjevans wrote:
| If the instructions are on average 3-4x less dense (take about that much more space), then the trade-offs in associativity granularity and the corresponding increase in cache size are logical. The management logic would be around the same size, though the number of memory cells and the corresponding costs in silicon, power/thermal, and signal propagation/layout remain.
| phkahler wrote:
| >> I have an AMD CPU. Zen CPUs come with a fairly wide backend. But the frontend is what it is...
|
| Zen 5 is widening the front end. My guess is that with scaling coming to an end, one nice tweak in Zen 6 should be darn near the end of the performance road for a bit. Not saying the actual end, but it should be one of those sweet spots where you build a PC and it's really good for years to come.
|
| I'm still running Raven Ridge and have no need to upgrade, but I will when I can get double the cores or more at double the IPC or more, and maybe at lower power ;-)
| dehrmann wrote:
| This would explain part of why Apple hasn't been pushing the M2 for the data center. Its chips are a better fit for bursty human workloads, not server workloads.
| rowanG077 wrote:
| Apple doesn't see value in going after the server market.
| KerrAvon wrote:
| Apple Silicon chips aren't for sale outside Apple, and Apple hasn't made any products relevant to data centers since they terminated the Xserve line as part of the PowerPC -> Intel transition.
| dehrmann wrote:
| They have a CPU that's been labeled some version of fastest or most efficient, and they're hungry for more revenue, but somehow have no interest in the data center market? There must be a reason.
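[Editor's note: the page-size point made earlier in the thread (cesarb) reduces to simple arithmetic: a virtually-indexed, physically-tagged (VIPT) L1 that avoids aliasing tricks is capped at page_size x associativity, because its index bits must fit inside the page offset. A minimal sketch; the associativity values below are illustrative assumptions, not vendor-confirmed specs.]

```python
def max_vipt_l1_bytes(page_size, ways):
    # In a VIPT cache, the index + line-offset bits must all lie within
    # the page offset so that virtual and physical indexing agree. That
    # caps each way at one page, so total size <= page_size * ways.
    return page_size * ways

# 4 KiB pages (x86), 8-way: matches the 32 KB Zen4 L1I figure above.
print(max_vipt_l1_bytes(4 * 1024, 8))    # 32768

# 16 KiB pages, 12-way: matches the 192 KB M1 L1I figure above.
print(max_vipt_l1_bytes(16 * 1024, 12))  # 196608
```

In other words, quadrupling the page size lets a same-latency, same-associativity L1 grow fourfold before any extra cleverness is needed.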
___________________________________________________________________ (page generated 2023-05-14 23:00 UTC)