[HN Gopher] ARM or x86? ISA Doesn't Matter (2021)
       ___________________________________________________________________
        
       ARM or x86? ISA Doesn't Matter (2021)
        
       Author : NavinF
       Score  : 73 points
       Date   : 2023-05-14 20:38 UTC (2 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | TheLoafOfBread wrote:
        | What does matter is standardization, for example the boot
        | process. When I have an x86 image of Windows/Linux, I can boot
        | it on any x86 processor. When I have an ARM image, I can boot
        | it on the SoC it was built for, and even that's a big maybe:
        | if the external peripherals are different (e.g. a different
        | LCD driver) or live on different pins of the SoC, then I am
        | screwed and will at best have a partially working system.
        | 
        | Standardization is something that will carry x86 very far into
        | the future despite its inefficiency on low-power devices.
        
         | dehrmann wrote:
         | > When I have x86 image of Windows/Linux, I can boot it on any
         | x86 processor.
         | 
         | Where this gets absurd is modern Debian supports i686 and up.
         | You should be able to get a 27-year-old Pentium Pro to boot the
         | same image as a Raptor Lake CPU.
        
         | nubinetwork wrote:
         | > When I have ARM image, well, then I can boot it on a SoC it
         | is built for ... or lives on different pins of SoC, then I am
         | screwed and will have at best partially working system.
         | 
         | That's not even the half of it either... what firmware does the
         | board run? U-boot is nice, but sometimes you aren't lucky and
         | you're stuck with something proprietary. Although if you're
          | extremely lucky, you'll have a firmware that supports EFI
          | kernels.
        
       | dtx1 wrote:
        | Yeah, no. Tooling support, driver support and general
        | optimization matter. So does platform maturity.
        | 
        | You don't want your phone to run x86 (and it won't for a
        | while), and though possible, it's a pain to deal with an ARM
        | server at the moment because some random library you use just
        | won't be compatible. And if single-threaded performance
        | matters, ARM is behind by a decade.
        
         | dehrmann wrote:
         | It sounds like Atom could have found a home in phones.
        
         | circuit10 wrote:
          | Having an Oracle Cloud Free Tier ARM VPS, it's surprising
          | how much just works. I think the only thing I couldn't run
          | was Chrome Remote Desktop (yes, I want to remote into my VPS
          | sometimes; for example, it's the easiest way to leave a GUI
          | program running in the background without leaving my PC on),
          | and only a few other things needed extra steps. But it's
          | probably a lot different on desktop or if you're running
          | different types of programs.
        
           | wolf550e wrote:
            | Some libraries have x86 SIMD code but no ARM SIMD code,
            | so when benchmarking real-world use cases you end up
            | comparing SIMD against scalar code, and x86 is much
            | faster. The server-side library situation for ARM is less
            | mature than for x86.
        
         | rektide wrote:
         | We don't know, is the only good answer. We haven't done much
         | trying in the past 10 years.
         | 
          | Intel's Lakefield was doing quite well in the tablet/MID
          | space. It also had the disadvantage of comparatively ancient
          | Atom-esque cores - far worse than Intel's new E-cores - and
          | one massive Skylake core.
          | 
          | ARM is no longer behind by all that much on single-threaded
          | performance. On Geekbench an M2 scores about 1916 points, a
          | 7950X about 2300. There's a slightly bigger gap on
          | Cinebench, roughly 1580 vs. 2050. A big part of the gap here
          | is almost certainly the very different clock speeds.
          | 
          | We just don't know. There are old beliefs we have held, but
          | we had so little evidence for those biases then. x86 rarely
          | tried to be really tiny, and had much more to learn if it
          | was to succeed there. ARM rarely tried to be big, and has
          | been learning. There's scant evidence of real limiting
          | factors for either.
        
         | jsheard wrote:
         | > You don't want your phone to run x86 (and it won't for a
         | while)
         | 
         | It's easily forgotten but there were Android phones which used
         | Intel x86 processors, such as the early Asus Zenfones. They
         | didn't stick though.
        
         | hedora wrote:
         | These benchmarks suggest arm has been at single threaded
         | performance parity on server since 2020:
         | 
         | https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...
         | 
         | (Apple Silicon blows them away on laptops, of course.)
        
           | tyingq wrote:
            | I imagine some of the remaining gap might be in places
            | where inline ASM or things like SIMD, AVX, etc. exist,
            | where there have been more years and a larger set of
            | people optimizing that ASM for x86-64 servers.
        
         | kaelinl wrote:
         | This isn't the point of the article.
         | 
         | The article is commenting on CPU design: area efficiency, power
          | efficiency, design cost, etc. They're proposing that the
          | reason x86 CPUs have historically beaten ARM CPUs in
          | performance, and the reason ARM CPUs have historically
          | beaten x86 CPUs in power efficiency, has nothing to do with
          | the design of the ISA
         | itself. You could build an ARM CPU to beat an x86 CPU in high
         | performance computing, or vice versa. They're saying that the
         | format of the instructions and the particular way the
         | operations are structured isn't the driving factor. Instead,
          | it's just a historical artifact of how the ISAs were used.
         | 
         | In other words, yes, there are plenty of ecosystem reasons that
         | these two (and potentially, more) families of chips are better
         | for some things vs. others, but if the two companies swapped
         | their ISAs 30 years ago we might see exactly the same ecosystem
         | just with different instruction formats.
        
       | isidor3 wrote:
       | It is interesting to me how both instruction sets have converged
       | on splitting operations into simpler micro ops. The author
       | briefly mentions RISC-V as having "better" core instructions, but
       | it makes me wonder if having the best possible instructions would
       | even help that much.
       | 
       | If you made a CPU that directly ran off of some convergent
        | microcode, would you then lose because of the bandwidth
        | needed to get those instructions to the chip? Or is
        | compressing instruction
       | streams already a pretty-well-solved problem if you're able to do
       | it from a clean slate, instead of being tied to what instruction
       | representations a chip happened to be using many years ago?
        
         | circuit10 wrote:
         | > If you made a CPU that directly ran off of some convergent
         | microcode
         | 
         | I think that's the original idea behind RISC
        
           | mafribe wrote:
           | Microcode is often attributed to [1] from 1952.
           | 
           | [1] M. V. Wilkes, J. B. Stringer, _Micro-programming and the
           | design of the control circuits in an electronic digital
           | computer._
        
             | kwhitefoot wrote:
             | It's arguable that Babbage's design for the Analytical
             | Engine included microcode.
             | 
             | See, i.a., https://www.fourmilab.ch/babbage/glossary.html
        
           | isidor3 wrote:
            | Yes, and obviously ARM didn't choose the instructions in
            | its reduced set optimally, if the best implementations
            | require those instructions to be split into smaller ones.
            | But that doesn't really speak to whether that's because
            | it's just _better_ to pack instructions that way, or
            | because these implementations of ARM and x86 just need to
            | do it to be performant in spite of deficiencies in their
            | instruction sets.
        
       | api wrote:
        | Decoder complexity matters. ARM, with its single instruction
        | width, allows arbitrarily parallel decoders with only linear
        | growth in transistor count. x86, with its many widths and
        | formats, requires decoders that grow exponentially in
        | complexity with parallelism, consuming more silicon and power
        | to achieve higher levels of instruction-level parallelism. It
        | requires a degree of brute force, with many possible size
        | branches being explored at once, among other expensive tricks.
       | 
       | This is one of the major areas where the instruction sets are not
       | equal. ARM has a distinct efficiency advantage.
        
         | userbinator wrote:
         | How is it exponential? It's only a multiplicative increase in
         | decode positions.
        
         | thechao wrote:
         | It's not exponential; it's not even quadratic (it is
         | superlinear), if you put any thought into the design. I worked
         | on an x86 part with 15 decoders/fetch unit. The area was
         | annoying, but unimportant. (We didn't commit 15 ops/cycle; just
         | pc-boundary determination.)
         | 
         | I've also worked on ARM/custom GPU ISAs. The limiting rate is
         | the total complexity of the ISA, not the encoding density.
         | 
         | In fact, from an I$ point-of-view, the tighter x86 encodings
         | are a pretty good win -- at least a few % on very long fetch
         | sequences.
        
         | jeffbee wrote:
         | Does anyone actually care about this? The x86 decoders are not
         | large on modern implementations, and putting more transistors
         | on dice is a well-solved problem.
        
           | api wrote:
           | It uses more power. The decoder is like another ALU that is
           | always screaming at 100%. It means you can easily keep up
           | with ARM in speed but not power efficiency.
        
             | jeffbee wrote:
             | It doesn't seem to matter _in practice_. Current generation
             | Intel CPUs and the Apple M2 have very similar performance
             | at the same power levels.
        
               | rowanG077 wrote:
                | How did you arrive at that conclusion? Comparing the
                | M2 Max vs. the 13650HX, it's very obvious the M2 Max
                | uses a LOT less power. It's not even close; it's less
                | than HALF the power. The M2 Max has slightly worse
                | performance, but it manages to beat the Intel in some
                | benchmarks.
        
               | jeffbee wrote:
               | You don't have to let the Intel chips scale up the power
               | like that. You can lock them to whatever power level
               | suits you. An i7-1370P configured at 20W has broadly
               | similar performance to an M2.
        
               | rowanG077 wrote:
                | Mind linking me some power measurements at the same
                | wattage? I didn't even know you could set a power
                | target on Intel or Apple Mx chips. Well, you can
                | disable turbo boost on Intel, but even then Intel
                | blows past their own marketed TDP by a lot.
        
               | jeffbee wrote:
               | Intel introduced the "running average power limit" over
               | ten years ago. https://lkml.indiana.edu/hypermail/linux/k
               | ernel/1304.0/01322...
        
               | rowanG077 wrote:
                | RAPL doesn't allow setting a limit which is always
                | obeyed. You can set PL1 and PL2 limits, but Intel CPUs
                | will gladly go over those limits in the short term,
                | for example when running a benchmark. That's why I
                | asked for specific benchmarks which include power
                | measurements.
               | 
               | For example: https://www.notebookcheck.net/i7-1360P-vs-M2
               | _14731_14521.247...
               | 
                | this shows the M2 has slightly worse performance
                | compared to the 1360P, but the 1360P requires 2.5x the
                | power to achieve that.
        
               | Panzer04 wrote:
               | Apple's chips are on better process nodes, which confuses
               | the issue. That being said, you really have to test chips
               | at the same power level to get an idea of performance per
               | watt in a comparison.
               | 
               | You can easily double CPU power for only a few hundred
               | MHz or 10-20% extra performance.
               | 
               | See https://www.pcworld.com/article/1359352/cool-down-a-
               | deep-div..., which benchmarks chips at different power
               | limits for an example.
        
               | rowanG077 wrote:
                | Yes, I agree with you. That doesn't mean it is easy
                | to achieve. With the exception of AMD chips, it's
                | unfortunately very hard to simply "benchmark with a
                | fixed power budget".
        
             | arp242 wrote:
             | How much power does it (roughly) use? Are we talking about
             | 1% of the overall usage? 10%? 50%?
        
             | tester756 wrote:
              | This article states otherwise, so which is it?
        
           | mafribe wrote:
           | One could argue that one of the reasons why SIMD
           | instructions, and indeed GPUs, are popular, is because they
           | amortise the (transistor and power) cost of decoding over
           | more compute units, in the case of GPUs over many more.
           | 
           | There are also other considerations, like rolling back state
           | in OOO machines, or precise exceptions. All this becomes more
           | complex with an x86-style instruction set.
        
       | tux3 wrote:
       | The x86 decoders consume a reasonable amount of power, but the
       | trouble is making them wider without affecting that.
       | 
       | I have an AMD CPU. Zen CPUs come with a fairly wide backend. But
       | the frontend is what it is (especially early Zen), and without
       | SMT it's essentially impossible to keep all those execution units
        | fed. It's not that 8 x86 decoders wouldn't be a benefit;
        | it's just that more decoders aren't cheap in x86 cores - each
        | extra decoder is a serious cost.
       | 
       | If you compare with the big ARM cores, having a wide frontend is
       | not a complex research problem or an impractical cost. 8 wide ARM
       | decode is completely practical. You even have open source
       | superscalar RISC-V cores just publicly available on Github
       | running on FPGAs with 8 wide decode. Large frontends are
       | (relatively) cheap and easy, if you're not x86.
       | 
       | So when we notice that the narrower x86 CPU's decode doesn't
       | consume that much (a "drop in the ocean"), that's because it was
       | designed narrower to keep the PPA reasonable! The reason I can't
       | feed my Zen backend isn't because having a wide frontend is
       | useless and I should just enable SMT anyways, it's because x86
       | makes wide decodes much less practical than competing
       | architectures.
        
         | tester756 wrote:
         | >The x86 decoders consume a reasonable amount of power
         | 
          | This article states otherwise, so which is it?
        
         | ip26 wrote:
         | It's a trade-off between problems. Variable length instructions
         | are not as trivial to decode wide, so you need more cleverness
         | here. However, fixed length instructions decrease code density,
         | which asks more of the instruction cache. Note Zen4 has a 32 KB
         | L1 instruction cache while the M1 has a 192 KB L1 instruction
         | cache, requiring extra cleverness here instead to handle the
         | higher latency and area. Meanwhile, micro-op caches hide both
         | problems.
         | 
          | There are ripple effects to consider as well. The large L1
          | caches
         | of M1 (320 KB total) put capacity pressure on L2, towards
         | larger sizes and/or away from inclusive policy. See the 12MB
         | shared L2. Meanwhile, the narrower decode of Zen4 puts pressure
         | on things like branch prediction accuracy & mispredict
         | correction latency - if you predicted the wrong codepath, you
         | can't catch up as quickly. See the large branch predictors on
         | Zen4.
        
           | pclmulqdq wrote:
           | ARM used to find itself on the wrong side of this tradeoff in
           | the era of 4-wide x86 decode units and 4-6 wide ARM decoders.
           | They lost too much perf to cache size for the decoder width
           | to make up for it.
           | 
           | It's unclear to me if they will pull ahead on the perf/area
           | game with the era of 8-wide x86 decoders coming.
        
           | codedokode wrote:
            | x86 also uses 2-address instructions, which means that
            | you often need extra moves between registers (additional
            | instructions); see [1]. ARM uses 3-address instructions.
           | 
            | Also, x86 code is compact, but not as compact as in the
            | era of the 8080 [2] - here addition and multiplication
            | require 3 bytes each, 6 bytes total. To my surprise, ARM
            | has a multiply-add instruction, and it uses just 4 bytes
            | (instead of 8) [3].
           | 
            | And RISC-V uses 6 bytes, thanks to a shortened
            | (compressed) instruction for the addition [4].
           | 
           | Of course, this simple function cannot be a replacement for
           | proper analysis, but it seems that x86 code is not
           | significantly denser.
           | 
            | Also, to my great disappointment, none of those CPUs has
            | checked overflow for arithmetic operations.
           | 
           | [1] https://godbolt.org/z/jsoccE5jv
           | 
           | [2] https://godbolt.org/z/jTMs1MEzh
           | 
           | [3] https://godbolt.org/z/nGb8qKcxe
           | 
           | [4] https://godbolt.org/z/x9c115crY
        
           | cesarb wrote:
           | > Note Zen4 has a 32 KB L1 instruction cache while the M1 has
           | a 192 KB L1 instruction cache, requiring extra cleverness
           | here instead to handle the higher latency and area.
           | 
           | There's another factor here: to have a low latency, the L1
           | cache has to be indexed by the bits which don't change when
           | translating from virtual addresses to physical addresses.
           | That makes it harder to have a larger low-latency L1 cache
           | when the native page size is 4KiB (AMD/Intel) instead of
           | 16KiB (Apple M1/M2).
           | 
           | That is, most of the "cleverness" allowing for a larger L1
           | instruction cache is simply a larger page size.
        
             | mjevans wrote:
              | If the instructions are on average 3-4x less dense
              | (i.e. take about that much more space), then the
              | trade-off in associativity granularity and the
              | corresponding increase in cache size are logical. The
              | management circuits would be around the same size,
              | though the number of memory cells, and the corresponding
              | costs in silicon, power/thermal, and signal
              | propagation/layout, remain.
        
         | phkahler wrote:
         | >> I have an AMD CPU. Zen CPUs come with a fairly wide backend.
         | But the frontend is what it is...
         | 
         | Zen 5 is widening the front end. My guess is with scaling
         | coming to an end, one nice tweak with Zen 6 should be darn near
         | the end of the performance road for a bit. Not saying the
         | actual end, but it should be one of those sweet spots where you
         | build a PC and it's really good for years to come.
         | 
         | I'm still running Raven Ridge and have no need to upgrade, but
         | I will when I can get double the cores or more at double the
         | IPC or more, and maybe at lower power ;-)
        
       | dehrmann wrote:
        | This would explain part of why Apple hasn't been pushing the
        | M2 for the data center: its chips are a better fit for bursty
        | human workloads, not sustained server workloads.
        
         | rowanG077 wrote:
         | Apple doesn't see value in going after the server market.
        
         | KerrAvon wrote:
         | Apple Silicon chips aren't for sale outside Apple, and Apple
         | hasn't made any products relevant to data centers since they
         | terminated the Xserve line as part of the PowerPC -> Intel
         | transition.
        
           | dehrmann wrote:
           | They have a CPU that's been labeled some version of fastest
           | or most efficient, they're hungry for more revenue, but
           | somehow have no interest in the data center market? There
           | must be a reason.
        
       ___________________________________________________________________
       (page generated 2023-05-14 23:00 UTC)