[HN Gopher] Nvidia Unveils Grace: A High-Performance Arm CPU for Use in Big AI Systems
___________________________________________________________________

Nvidia Unveils Grace: A High-Performance Arm CPU for Use in Big AI Systems

Author : haakon
Score  : 249 points
Date   : 2021-04-12 16:32 UTC (6 hours ago)

(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)

| crb002 wrote:
| +1 ECC RAM

| legulere wrote:
| Big Data, Big AI, what's next? Big Bullshit?

| jhgb wrote:
| Nah, that's already been here for quite a while.

| rexreed wrote:
| Honestly, the bottom down-voted comment has it right. What AI application is actually driving demand here? What can't be accomplished now (or with reasonable expenditures) that can be accomplished by this one CPU that will be released in 2 yrs? What AI applications will need this 2 yrs from now that don't need it now?
| 
| I understand the here-and-now AI applications. But this is smelling more like Big AI Hype than Big AI need.

| cracker_jacks wrote:
| "640K ought to be enough for anybody."

| cma wrote:
| Real business-class features we want to know about:
| 
| Will they auto-detect workloads and cripple performance (like the mining stuff recently)? Only work through special drivers with extra licensing fees depending on the name of the building they're in (data center vs office)?

| rubatuga wrote:
| Market segmentation is practiced by every chip company that you use. Intel: ECC. AMD: ROCm. Qualcomm: cost as a percentage of the phone price.

| cma wrote:
| I still think Nvidia takes it further.

| volta83 wrote:
| Every company does market segmentation: it makes sense to have customers that want a feature pay more for it.
| 
| Still, every company does it differently.
| 
| For example, both NVIDIA and AMD compute GPUs are necessarily more expensive than gamer GPUs because of hardware costs (e.g. HBM).
| 
| However, NVIDIA gamer GPUs can do CUDA, while AMD gamer GPUs can't do ROCm.
| 
| The reason is that NVIDIA has one architecture for gaming and compute (Ampere), while AMD has two different architectures (RDNA and CDNA).

| cma wrote:
| It's common, but only possible in a very dominant position or with competitors that are borderline colluding.

| volta83 wrote:
| You must be the only gamer in the world who wants an HBM2e GPU for gaming that's 10x more expensive while only delivering a negligible improvement in FPS.

| cma wrote:
| I'm only talking about driver/license locks, not different RAM types.

| Aissen wrote:
| GPU-to-CPU interface >900GB/sec NVLink 4. What kind of interconnect is that? Is that even physically realistic?

| freeone3000 wrote:
| Depends on how big you want to make it. If they're willing to go four inches, that'd do it with existing per-pin speeds from NVLink 3.

| rincebrain wrote:
| Well, according to [1], NVIDIA lists NVLink 3.0 as being 50 Gb/s per lane per direction, and lists the total maximum bandwidth of NVSwitch for Ampere (using NVLink 3.0) as 900 GB/s each direction, so it doesn't seem completely out of reach.
| 
| [1] - https://en.wikipedia.org/wiki/NVLink

| Aissen wrote:
| With 50Gb/s per lane, that would be 144 lanes to reach 900GB/s. Quite impressive.

| [deleted]

| rincebrain wrote:
| Fascinatingly, NVIDIA's own docs [1] claim GPU<->GPU bandwidth on that device of 600 GB/s (though they claim total aggregate bandwidth of 9.6 TB/s). Which would be what, 96 and 1536 lanes, respectively? That's quite the pinout.
| 
| [1] - https://www.nvidia.com/en-us/data-center/nvlink/
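A minimal back-of-the-envelope sketch of the lane arithmetic in the sub-thread above, in C (assuming the 50 Gb/s per-lane, per-direction NVLink 3.0 figure from the Wikipedia link; real NVLink bundles lanes into multi-lane links, which this ignores):

    /* lanes.c - bandwidth-to-lane-count arithmetic, illustration only */
    #include <stdio.h>

    /* convert GB/s to Gb/s (x8), then divide by the per-lane rate */
    static double lanes_needed(double target_GBps, double lane_Gbps)
    {
        return target_GBps * 8.0 / lane_Gbps;
    }

    int main(void)
    {
        printf("900 GB/s -> %4.0f lanes\n", lanes_needed(900.0, 50.0));  /*  144 */
        printf("600 GB/s -> %4.0f lanes\n", lanes_needed(600.0, 50.0));  /*   96 */
        printf("9.6 TB/s -> %4.0f lanes\n", lanes_needed(9600.0, 50.0)); /* 1536 */
        return 0;
    }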
| robomartin wrote:
| Well, PCIe 6 x16 will do 128 GB/s. Of course, the real question is how many transactions per second you get. PCIe 6 signals at about 64 GT/s per lane.
| 
| Speaking in general terms, data rate and transaction rate don't necessarily match, because a transaction might require the transmitter to wait for the receiver to check packet integrity and then issue an acknowledgement to the transmitter before a new packet can be sent.
| 
| Yet another case, again speaking in general terms, would be having to insert wait states to deal with memory access or other processor architecture issues.
| 
| Simple example: on an STM32 processor you cannot toggle I/O in software at anywhere close to the CPU clock rate, due to architectural constraints (including the instruction set). On a processor running at 48 MHz you can only do a max toggle rate of about 3 MHz (toggle rate = number of state transitions per second).

| alexhutcheson wrote:
| The fact that they are using a Neoverse core licensed from ARM seems to imply that there won't be another generation of NVidia's Denver/Carmel microarchitectures. Somewhat of a shame, because those microarchitectures were unorthodox in some ways, and it would have been interesting to see where that line of evolution would have led.
| 
| I believe this leaves Apple, ARM, Fujitsu, and Marvell as the only companies currently designing and selling cores that implement the ARM instruction set. That may drop to three in the next generation, since it's not obvious that Marvell's ThunderX3 cores are really seeing enough traction to be worth the non-recurring engineering costs of a custom core. Are there any others?

| klelatti wrote:
| Designing but not yet selling: Qualcomm / Nuvia?

| alexhutcheson wrote:
| Yeah, it will be interesting to see if and when they bring a design to market.

| Bluestein wrote:
| The whole combination of AI and the name gives "watched over by machines of loving grace" a whole new twist, eh?

| TheMagicHorsey wrote:
| Is anyone but Apple making big investments in ARM for the desktop? This is another ARM-for-the-datacenter design.
| 
| If other companies don't make genuine investments in ARM for the desktop, there's a real chance that Apple will get a huge and difficult-to-assail application performance advantage as application developers begin to focus on making Mac apps first and port to x86 as an afterthought.
| 
| Something similar happened back in the day when Intel was the de facto king, and everything on other platforms was a handicapped afterthought.
| 
| I wouldn't want my desktops to be 15 to 30% slower than Macs running the same software, simply because of emulation or a lack of local optimizations.
| 
| So I'm really looking forward to ARM competition on the desktop.

| callesgg wrote:
| Couldn't super-parallel ARM chips be a future thing for Nvidia or another chip manufacturer? A normal CPU die with thousands of independent cores.

| modeless wrote:
| I hope they make workstations. I want to see some competition for the eventual Apple Silicon Mac Pro.

| macksd wrote:
| You probably mean less powerful than this, but they do: https://www.nvidia.com/en-us/deep-learning-ai/solutions/work....

| modeless wrote:
| Yes, they make workstations, but they don't make ARM workstations. Yet. They already have ARM chips they could use for it, but they went with x86 instead, despite the fact that they have to purchase the x86 chips from their direct competitor. Also, yes, less than a $100k starting price would be nice.
| dhruvdh wrote:
| They are licensing ARM cores, which as of now cannot compete with Apple silicon.
| 
| While they are using some future ARM core, and I've read rumors that future designs might try to emulate what has made Apple cores successful, we cannot say whether Apple designs will stagnate or continue to improve at the current rate.
| 
| There is potential for competition from Qualcomm after their Nuvia acquisition, though.

| adgjlsfhk1 wrote:
| It seems weird to me to say that ARM cores can't compete with Apple silicon, given that Apple doesn't own fabs. They are using ARM cores on TSMC silicon (exactly the same as this).

| seabrookmx wrote:
| > They are using ARM cores on TSMC silicon (exactly the same as this)
| 
| No, the Apple Silicon chips use the ARM _instruction set_, but they do not use ARM's core designs. Apple designs its cores in house, much like Qualcomm does with Snapdragon. Both of these companies have an architectural license which allows them to do this.

| tibbydudeza wrote:
| Qualcomm no longer makes their own cores - they have just used ARM reference IP designs since the Kryo.
| 
| That will probably change with their Nuvia acquisition.

| ac29 wrote:
| Maybe not in single-threaded performance, but Apple has no server-grade parts. Ampere, for example, is shipping an 80-core ARM N1 processor that puts out some truly impressive multithreaded performance. An M1 Mac is an entirely different market - making a fast 4+4-core laptop processor doesn't necessarily translate into making a fast 64+ core server processor.

| devmor wrote:
| What do you mean ARM cores can't compete with Apple silicon? "Apple silicon" is ARM cores.

| dharmab wrote:
| Apple Silicon is compatible with the ARM instruction set, but the chips are not "just ARM cores" in their internal design.

| mlyle wrote:
| He means cores made by ARM, not cores implementing the ARM ISA. Currently, the cores designed by ARM cannot touch the Apple M1.

| [deleted]

| titzer wrote:
| I think Apple did Arm an unbelievable favor by absolutely trouncing all CPU competitors with the M1. By being so fast, Apple's chip attracts many new languages and compiler backends to Arm that want a piece of that sweet performance pie. Which means that other vendors will want to have Arm offerings, and not, e.g., RISC-V.
| 
| I have no idea what Apple's plans for the M1 chip are, but if they had the manufacturing capacity, they could put oodles of these chips into datacenters and workstations the world over and basically eat the x86 high-performance market. The fact that the chip uses so little power (15W) means they can absolutely cram them into servers where CPUs can easily consume 180W. That means 10x the number of chips for the same power, and not all concentrated in one spot. A lot of very interesting server designs are now possible.

| klelatti wrote:
| It's hard to imagine that until a few months ago it was very difficult to get a decent Arm desktop / laptop. I imagine lots of developers are working now to fix outstanding Arm bugs / issues.

| giantrobot wrote:
| While I'm sure lots of projects have actual ARM-related bugs, there was a whole class of "we didn't expect this platform/arch combination" compilation bugs that have seen fixes lately.
| It's not that the code has bugs on ARM - a lot of OSS has been compiling on ARM for a decade (or more) thanks to Raspberry Pis, Chromebooks, and Android - but build scripts didn't understand "darwin/arm64". Back in December, installing stuff on an M1 Mac via Homebrew was a pain, but it's gotten significantly easier over the past few months.
| 
| But a million (est.) new general-purpose ARM computers hitting the population certainly affects the prioritization of ARM issues in a bug tracker.

| mhh__ wrote:
| > compiler backends to Arm that want a piece of that sweet performance pie
| 
| How many compilers didn't support ARM?

| GrumpyNl wrote:
| I need a new video card and there are no Nvidia cards to buy; they're all bought by miners. Will it go the same way with this one?

| redtriumph wrote:
| Currently there are no plans for consumer-grade CPUs. Even this new CPU class is shipping in 2023.

| remexre wrote:
| > Today at GTC 2021 NVIDIA announces its first CPU
| 
| Wait, Nvidia's been making ARM CPUs for years now; most memorably Project Denver.

| 015a wrote:
| Arguably, most memorably, Tegra: the CPU/GPU which powers the Nintendo Switch.

| Jasper_ wrote:
| That uses a licensed ARM Cortex design under the hood.

| jdsully wrote:
| NVIDIA called it their first "data center CPU". Our helpful reporter simplified it to the point of being flat-out wrong. Not uncommon.

| justin66 wrote:
| I expected more from a site called VideoCardz.

| titzer wrote:
| Given that there are essentially no architectural details here other than bandwidth estimates, and the release timeline is in 2023, how exactly does this count as "unveiling"? The headline should read "NVidia working on new Arm chip due in two years", or something else much more bland.

| mrlento234 wrote:
| Not quite. The CSCS supercomputing center in Switzerland has already started receiving the hardware (https://www.cscs.ch/science/computer-science-hpc/2021/cscs-d...). Perhaps we may see some benchmarks. To wider HPC users it will only be available in 2023, as the article mentions.

| IanCutress wrote:
| I suspect that's more racks of storage, not racks of compute. Nothing to suggest it's compute.

| seniorivn wrote:
| As I understand it, it's compute, just not CPU compute; those CPUs are designed to be good enough for CUDA servers.

| DetroitThrow wrote:
| Hey Ian, I love reading your posts on Anandtech, you're a fantastic technical communicator.

| titzer wrote:
| Hopefully some architectural details are forthcoming then! But that is not what is in this article.

| allie1 wrote:
| As AMD proved to us, a lot can happen in three years.

| valine wrote:
| I like the sound of a non-Apple Arm chip for workstations. Given my positive experience with the M1, I'd be perfectly happy never using x86 again after this market niche is filled.

| webaholic wrote:
| I don't think this will be anywhere near as good as the M1, since they are using the ARM Neoverse cores.

| ac29 wrote:
| Apple throws a lot of transistors at their 4 performance cores in the M1 to get the performance they do - it's not clear that approach would realistically scale to a workstation CPU with 16, 32, or more cores (at least not with current fab capabilities).

| awill wrote:
| Me too. But my decades-old Steam collection isn't looking forward to it. That's one advantage of cloud gaming: it won't matter what your desktop runs on.

| nabla9 wrote:
| Finally, news from Nvidia that really moved markets.
| Nvidia +4.68%, Intel -4.65%, AMD -4.47%.

| 01100011 wrote:
| I wonder how permanent this is. As an Nvidian who sells his shares as soon as they vest and who owns some Intel for diversification, I wonder if I should load up on Intel? You really can't compete with their fab availability. Having a great design means nothing unless you can get TSMC to grant you production capacity.

| nabla9 wrote:
| TSMC takes orders years ahead and builds capacity to match, working together with big customers. Those who pay more (price per unit and large volume) get first shot. That's why Apple is always first, followed by Nvidia and AMD, then Qualcomm.
| 
| There is pent-up demand because Intel's failure to deliver was not fully anticipated by anyone.

| gchadwick wrote:
| It'd be interesting to know if NVidia are going for an Armv9 core, and in particular whether they'll have a core with an SVE2 implementation.
| 
| It may be they don't want to detract from the focus on GPUs for vector computation, so they prefer a CPU without much vector muscle.
| 
| Also interesting that they're picking up an Arm core rather than continuing with their own design. Something to do with the potential takeover (the merged company would only want to support so many microarchitectural lines)?

| adrian_b wrote:
| They have said clearly that the core is licensed from ARM and is one of the future Neoverse models.
| 
| There was no information on whether it will have any good SVE2 implementation. On the contrary, they insisted only on the integer performance and on the high-speed memory interface.

| dragontamer wrote:
| Neoverse V1 has SVE; Neoverse E and N do not.
| 
| "E" is efficiency, N is standard, V is high-speed. IIRC, N is the overall winner in performance/watt. Efficiency cores have the lowest clock speed (and overall use the least power). V purposefully goes beyond the performance/watt curve for higher per-core compute capabilities.

| Teongot wrote:
| Neoverse N2 will have SVE2 (source: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aar...)

| gchadwick wrote:
| Here's Anandtech's article on the previous Neoverse V1/N2 announcement: https://www.anandtech.com/show/16073/arm-announces-neoverse-... Arm wasn't saying anything official, but Anandtech did a little digging and reckons V1 is Armv8 with SVE and N2 could be Armv9 with SVE2.
| 
| I'd suspect NVidia would be using the V1 here, as it's the higher-performing core, but there's no way to be certain.

| klelatti wrote:
| This has got me wondering whether an Nvidia-owned Arm could limit SVE2 implementations so as not to compete with Nvidia's GPUs. That would certainly be possible for Arm-designed cores - not a desirable outcome.

| MikeCapone wrote:
| I doubt it. It's not like the market for acceleration is stagnant and saturated and they need to steal some market-share points from one side to help the other.
| 
| It's all greenfield and growing so far; they'll win more by having the very best products they can make on both sides.

| mlyle wrote:
| You'd think. But it wouldn't be the first time a new product was hampered so as not to cannibalize, even slightly or theoretically, an existing product family.

| theonlyklas wrote:
| I think they will use SVE2 because I assume they'll need to perform vector reads/writes to NVLink-connected peripherals to reach that 900GB/s GPU-to-CPU bandwidth metric they described.
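For readers who haven't seen SVE: a minimal, vector-length-agnostic copy loop in C using the Arm C Language Extensions. This is generic ACLE code, not anything Grace-specific; it just illustrates the kind of predicated wide loads and stores the comment above has in mind:

    /* sve_copy.c - build with e.g. gcc -O2 -march=armv8-a+sve */
    #include <arm_sve.h>
    #include <stdint.h>

    void copy_f32(float *dst, const float *src, int64_t n)
    {
        /* svcntw() = 32-bit lanes per vector, unknown until runtime,
           so the same binary works at any hardware vector width */
        for (int64_t i = 0; i < n; i += svcntw()) {
            svbool_t    pg = svwhilelt_b32_s64(i, n);   /* mask off the tail */
            svfloat32_t v  = svld1_f32(pg, src + i);    /* predicated load   */
            svst1_f32(pg, dst + i, v);                  /* predicated store  */
        }
    }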
| api wrote:
| Tangent: Apple should bring back the Xserve with their M1 line, or alternately license the M1 core IP to another company to produce a differently-branded server-oriented chip. The performance of that thing is mind-blowing, and I don't see how this would compete with or harm their desktop and mobile business.

| bombcar wrote:
| How much of that performance is the on-chip memory, and how usable/scalable is that? An Xserve that is limited to one CPU and can't have more RAM would be pretty mediocre.

| AnthonyMouse wrote:
| The cheapest available Epyc (7313P) has 16 cores, and dual-socket systems have up to 128 cores and 256 threads. Server workloads are massively parallel, so a 4+4-core M1 would be embarrassed, and Apple wouldn't want to subject themselves to that comparison.
| 
| But another reason they won't do it is that TSMC has a finite amount of 5nm fab capacity. They can't make more of the chips than they already do.

| api wrote:
| I'm thinking of a 64-core M1. It would not be the laptop chip.

| ac29 wrote:
| A 4+4-core M1 is 16 billion transistors. Some of that is the little cores, GPU, etc., but it's not clear to me it's practical to get, say, 8x larger. That would be 128 billion transistors. As a point of comparison, NVIDIA's RTX 3090 is 28B transistors, and that's a huge, expensive chip.

| [deleted]

| [deleted]

| rektide wrote:
| There are a lot of interconnects (CCIX, CXL, OpenCAPI, NVLink, GenZ) brewing. Nvidia going big is, hopefully, a move that will prompt some uptake from the other chip makers. A 900GB/s link, more than main memory: big numbers there. Side note: I miss AMD being actively involved with interconnects. Infinity Fabric seems core to everything they are doing, but back in the HyperTransport days it was something known, that folks could build products for and interoperate with. Not many did, but it's still frustrating seeing AMD keep its cards so much closer to the chest.

| filereaper wrote:
| Looks like NVidia broke up with IBM's POWER and made their own chip.
| 
| They have interconnects from Mellanox, GPUs, and their own CPUs now.
| 
| I suspect the supercomputing lists will be dominated by NVidia now.

| arcanus wrote:
| That is certainly the trend. AMD is bringing Frontier online later this year, which might be the only counter to this.

| DonHopkins wrote:
| I love the name "Grace", after Grace Hopper.

| paulmd wrote:
| There's a tendency to use first names to refer to women in professional settings or positions of political power that is somewhat infantilizing and demeaning.
| 
| I doubt anyone really deliberately sets out to be like "haha yessss, today I shall elide this woman's credentials", but this is one of those unconscious gender-bias things that is commonplace in our society and is probably best to try and make a point of avoiding.
| 
| https://news.cornell.edu/stories/2018/07/when-last-comes-fir...
| 
| https://metro.co.uk/2018/03/04/referring-to-women-by-their-f...
| 
| (etc etc)
| 
| I'd prefer they used "Hopper" instead, in the same way they have chosen to refer to previous architectures by the last names of their namesakes (Maxwell, Pascal, Ampere, Volta, Kepler, Fermi, etc). I'd see that as being more professionally respectful of her contributions.

| bloak wrote:
| Perhaps you're being downvoted because it's a tangent. It's a real phenomenon, though, and an interesting one.
| Of course there are many things that influence which parts of someone's full name get used, and if the tendency is a problem it's a trivial one compared to all the other problems that women face, but, yes, in general it would probably be a good idea to be more consistent in this respect.
| 
| Vaguely related: J. K. Rowling's "real" full name is Joanne Rowling. The publisher "thought a book by an obviously female author might not appeal to the target audience of young boys".
| 
| There's another famous (in the UK at least) computer scientist called Hopper: Andy Hopper. So "G.B.M. Hopper", perhaps? That would have more gravitas than "Andy"!

| hderms wrote:
| I feel like there's a non-zero chance they named it Grace instead of Hopper so their new architecture doesn't sound like a bug or a frog or something. You could be right, though.

| trynumber9 wrote:
| Hopper was already reserved for an Nvidia GPU: https://en.wikipedia.org/wiki/Hopper_(microarchitecture)

| paulmd wrote:
| Yeah, I dunno what is going on with that; I assumed that had changed if they were going to use the name "Grace" for another product.
| 
| I guess I'm not sure if "Hopper" refers to the product as a whole (like Tegra) and early leakers misunderstood that, or whether Hopper is the name of the microarchitecture and "Grace" is the product, or if it's changed from Hopper to Grace because they didn't like the name, or what.
| 
| Otherwise it's a little awkward to have products named both "Grace" and "Hopper"...

| lprd wrote:
| So is ARM the future at this point? After seeing how well Apple's M1 performed against a traditional AMD/Intel CPU, it has me wondering. I used to think that ARM was really only suited for smaller devices.

| hilios wrote:
| Depends. Performance-wise it should be able to compete with or even outperform x86 in many areas. A big problem until now was cross-compatibility regarding peripherals, which complicates running a common OS on ARM chips from different vendors. There is currently a standardization effort (Arm SystemReady SR) that might help with that issue, though.

| Hamuko wrote:
| Based on initial testing, AWS EC2 instances with ARM chips performed as well as, if not better than, the Intel instances, but they cost 20% less. The only drawback that I've really encountered thus far was that it complicates the build process.

| moistbar wrote:
| Does ARM have a uniquely complex build process, or is it the mix of architectures that makes it more difficult?

| sumtechguy wrote:
| ARM is all over the place with its platforms. x86 has the benefit that most companies made it 'IBM compatible'. There are one-off x86 platforms, but they are mostly forgotten at this point. The ARM CPU family itself is fairly consistent (mostly), but the included hardware is a very mixed bag. x86, on the other hand, has the history of "build it to work like IBM": all the way from how things boot up, to memory space addresses, to must-have I/O, etc. ARM may or may not have any of that, depending on which platform you target or are creating. Things like the Raspberry Pi have changed some of that, as many boards mimic the Broadcom setup, and specifically the Raspberry Pi's. The x86 arch has also picked up some interesting baggage along the way because of what it is. We can mostly ignore it, but it is there. For example, you would not build an ARM board these days with an IDE interface, but some of those bits still exist in the x86 world.
| ARM is more of a toolkit for building different purpose-built computers (you even see them show up in USB sticks), while x86 is a particular platform with a long history behind it. So you may see something like 'Amazon builds its own ARM computers'. That means they spun their own boards, built their own toolchains (more likely recompiled existing ones), and probably have their own OS distro to match. Each one of those is a fairly large endeavor. When you see something like 'Amazon builds its own x86 boards', they have shaved off the other two parts of that and are focusing on hardware. That they are building their own means they see the value in owning the whole stack. Also, having your own distro means you usually have to 'own' building the whole thing. I can go grab an x86 gcc stack from my repo provider; they will need to act as the repo owner, build it themselves, and keep up with the patches. Depending on what has been added, that can be quite the task all by itself.

| Hamuko wrote:
| The mix of architectures, and the fact that our normal CI server is still x86-based and really didn't want to do ARM builds.

| ksec wrote:
| Based on a future ARM Neoverse core, so basically nothing much to see here from a CPU perspective. What really stands out are those ridiculous numbers from its memory system and interconnect.
| 
| CPU: LPDDR5X with ECC at 500+ GB/s memory bandwidth. (Something Apple may dip into. R.I.P. for Macs with upgradable memory.)
| 
| GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a typo.
| 
| NVLink: 500 GB/s.
| 
| This will surely further solidify CUDA's dominance. Not entirely sure how Intel's Xe with oneAPI and AMD's ROCm are going to compete.

| Dylan16807 wrote:
| > GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a typo.
| 
| It's a good step forward, but your average consumer GPU is already around a quarter to a third of that, and a Radeon VII had 1000 GB/s two years ago.

| jabl wrote:
| The Nvidia A100 80GB already provides 2 TB/s memory bandwidth today. Also using HBM2e.

| m_mueller wrote:
| I think what you're missing here is the NVLink part. The fact that you can get a small cluster of these linked up like that for $400k, all wrapped in a box, makes HPC quite a bit more accessible. Even 5 years ago, if you wanted to run a regional-sized weather model at reasonable resolution, you needed some serious funding (say, nation states or oil / insurance companies). Nowadays you could do it with some angel investment: get one of these Nvidia boxes and just program it like it's one GPU.

| kllrnohj wrote:
| Critically, it's CPU-to-GPU NVLink here, not the "boring" GPU-to-GPU NVLink that's common on Quadros. 500GB/s of bandwidth between CPU & GPU massively changes when & how you can GPU-accelerate things; that's a 10X difference over the status quo.

| kimixa wrote:
| Also, "CPU->CPU" NVLink is interesting. Though it was my understanding that NVLink is point-to-point, and it would require some massive switching system to be able to access any node in the cluster anywhere near that rate without some locality bias (i.e. nodes on the "first" downstream switch are faster to access, with less contention).

| de6u99er wrote:
| Don't know if it's just me, but this product looks like a beta product for early adopters.

| rektide wrote:
| It's initially for two huge HPC systems. It'll be interesting to see what kind of availability it ever has to the rest of the world.
| lprd wrote:
| So is ARM the future at this point? After seeing how well Apple's M1 performed against a traditional AMD/Intel CPU, it has me wondering. I used to think that ARM was really only suited for smaller devices.

| fulafel wrote:
| The instruction set doesn't make a significant difference technically; the main things about them are monopolies (patents) tied to ISAs, and software compatibility.

| rvanlaar wrote:
| I'm interested in your thoughts on why this doesn't make a significant difference. From what I've read, the M1 has a lot of tricks up its sleeve that are next to impossible on x86. For example, ARM instructions can be decoded in parallel.

| kllrnohj wrote:
| It will come down entirely to who can sustain a good CPU core.
| 
| Currently, Apple is the only company making performance-competitive ARM cores that can make a reasonable justification for an architecture switch.
| 
| Otherwise, AMD's CPUs are still ahead of everyone else, including all other ARM CPU cores not made by Apple. And even Intel is still faster in places where performance matters more than power efficiency (e.g. desktop & PC gaming).

| aeyes wrote:
| Amazon's ARM chips are performance-competitive as well; for many workloads you can expect at least similar performance per core at the same clock speed.

| floatboth wrote:
| Arm's Neoverse cores are doing pretty well in the datacenter space -- on AWS, the Graviton2 instances are currently the best ones for lots of use cases. It's clear that core designs by Arm are really good. The problem currently is the lag between the design being done and various vendors' chips incorporating it.
| 
| upd: oh, also in the HPC world, Fujitsu with the A64FX seems to be like the best thing ever now.

| rubatuga wrote:
| Fujitsu flying under the radar while having the fastest CPU ever made, haha.

| kllrnohj wrote:
| Graviton2 is sometimes competitive with Epyc, but it also falls far behind in some tests (e.g. Java performance is a bloodbath). Overall, across the majority of tests, Neoverse consistently comes up short of Milan, even when Neoverse is given a core-count advantage. And critically, the per-core performance of Graviton2 / Neoverse is worse, and per-core performance is what matters in the consumer space.
| 
| But it can't just be competitive; it needs to be significantly better in order for the consumer space to care. Nobody is going to run Windows on ARM just to get performance equivalent to Windows on x86, especially not when that means most apps will be worse. That's what's really impressive about the M1, and so far it's unique to Apple's ARM CPUs.
| 
| > oh, also in the HPC world, Fujitsu with the A64FX seems to be like the best thing ever now
| 
| A64FX doesn't appear to be a particularly good CPU core; rather, it's a SIMD powerhouse. It's the AVX-512 problem - when you can use it, it can be great. But you mostly can't, so it's mostly dead weight. Obviously the HPC space is a different scenario entirely, but that's not going to translate to the consumer space at all (and it's not an ARM advantage, either - 512-bit SIMD hit the consumer space via x86 first, with Intel's Rocket Lake).

| klelatti wrote:
| Not sure why you're placing so much weight on Epyc outperforming Graviton while discounting designs / use cases where Arm is clearly now better. Plus, it's clear that we are just at the beginning of a period where some firms with very deep pockets are starting to invest seriously in Arm on the server and the desktop.
| If the x64 ISA had major advantages over Arm, then that would be significant, but I've not heard anyone make that case: instead it's a debate about how big the Arm advantage is.
| 
| Can x64 remain competitive in some segments? Probably, and inertia will work in its favour. I do think it's inevitable that we will see a major shift to Arm, though.

| huac wrote:
| So then we think about what makes Apple's M1 so good. One hard-to-replicate factor is that they designed their hardware and software together; the ops which macOS uses often are heavily optimized on chip.
| 
| But one factor that you can replicate is colocating memory, CPU, and GPU: the system-on-chip architecture. That's what Nvidia looks to be going after with Grace, and I'm sure they've learned lessons from their integrated designs, e.g. Jetson. Very excited to see how this plays out!

| kllrnohj wrote:
| > one hard-to-replicate factor is that they designed their hardware and software together, the ops which macOS uses often are heavily optimized on chip.
| 
| Not really; they are still just using the same ARM ISA as everyone else. The only hardware/software integration magic of the M1 so far seems to be the x86 memory-model emulation mode, which others could definitely replicate.
| 
| > but one factor that you can replicate is colocating memory, CPU, and GPU, the system-on-chip architecture.
| 
| AMD introduced that in the x86 world back in 2013 with their Kaveri APU (https://www.zdnet.com/article/a-closer-look-at-amds-heteroge...), and it's been fairly typical since then for on-die integrated GPUs on all ISAs.

| dkjaudyeqooe wrote:
| ARM is the present, RISC-V is the future, and Intel is the past.
| 
| The magic of Apple's M1 comes from the engineers who worked on the CPU implementation and the TSMC process.
| 
| The architecture has some impact on performance, but I think it is simplicity and ease of implementation that factor most into how well it can perform (as per the RISC idea). In that sense Intel lags for small, fast, and efficient processors, because their legacy architecture pays a penalty for decoding and translation (into simpler ops) overhead. Eventually designs will abandon ARM for RISC-V for similar reasons, as well as financial ones.
| 
| Really, today it's a question of who has the best implementation of any given architecture.

| mhh__ wrote:
| The next decade is ARM's for the taking, _but_ if Intel and AMD can make good cores then it's not anywhere close to a slam dunk.
| 
| One of the reasons why the M1 is good is pure and simple that it has a pretty enormous transistor budget, not solely because it's ARM.

| api wrote:
| Being ARM has something to do with it. The x86 instruction decoder may be only about ~5% of the die, but it's 5% of the die that has to run _all the time_. Think about how warm your CPU gets when you run, e.g., heavy FPU loads, and then imagine that's happening all the time. You can see the power difference right there.
| 
| It's also very hard to achieve more than 4X parallelism in decode (though I think Ice Lake got 6X at some additional cost), making instruction-level parallelism harder. x86's hack to get around this is SMT/hyperthreading, to keep the core fed with 2X instruction streams, but that adds a lot more complexity and is a security minefield.
| 
| Last but not least: ARM's looser default memory model allows for more read/write reordering and a simpler cache.
| 
| ARM has a distinct simplicity and low-overhead advantage over x86/x64.
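A hedged illustration of that memory-model point, in portable C11 rather than anything core-specific: under x86's stronger (TSO) model the two stores below are already observed in order, while a weakly ordered Arm core may reorder them unless the code requests ordering explicitly - which is where the simpler-hardware argument comes from:

    /* mp.c - classic message-passing idiom with C11 atomics */
    #include <stdatomic.h>

    static atomic_int data;
    static atomic_int ready;

    void producer(void)
    {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        /* release: earlier writes can't be reordered past this store */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* acquire: later reads can't be reordered before this load */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;   /* spin */
        return atomic_load_explicit(&data, memory_order_relaxed); /* 42 */
    }

On AArch64 the release/acquire pair maps to the dedicated stlr/ldar instructions; on x86, ordinary stores and loads already provide this ordering.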
| NortySpock wrote:
| > x86 instruction decoder may be only about ~5% of the die
| 
| What percent of the die is an ARM instruction decoder?

| duskwuff wrote:
| Much less. x86 instruction decoding is complicated by the fact that instructions are variable-width and byte-aligned (i.e. any instruction can begin at any address). This makes decoding more than one instruction per clock cycle complicated -- I believe the silicon has to try decoding instructions at every possible offset within the decode buffer, then mask out the instructions which are actually inside another instruction.
| 
| ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.

| monocasa wrote:
| I totally agree with the core of your argument (aarch64 decoding is inherently simpler and more power-efficient than x86), but I'll throw out there that it's not quite as bad as you say on x86, as there are some nonobvious efficiencies (I've been writing a parallel x86 decoder).
| 
| What nearly everyone uses is a 16-byte buffer, aligned to the program counter, being fed into the first-stage decode. This first stage, yes, has to look at each byte offset as if it could be a new instruction, but it doesn't have to do a full decode. It only finds instruction length information. From there you feed this length information in and do a full decode only on the byte offsets that represent actual instruction boundaries. That's how you end up with x86 cores with '4-wide decode' despite needing to initially look at each byte.
| 
| Now for the efficiencies. The length decoders for the byte offsets aren't symmetric. Only the length decoder at offset 0 in the buffer has to handle everything; the other length decoders can simply flag "I can't handle this", the buffer won't be shifted down past where they were on the next cycle, and the byte-0 decoder can fix up any goofiness. Because of this, they can:
| 
| * be stripped of instructions that aren't really used much anymore, if that helps them
| 
| * be stripped of weird cases, like handling crazy usages of prefix bytes
| 
| * skip handling instructions bigger than their portion of the decode buffer. For instance, a length decoder starting at byte 12 can't handle more than a 4-byte instruction anyway, so that can simplify its logic considerably. That means that the simpler length decoders end up feeding into the later full-decoder selection, so some of the overhead cancels out in a nice way.
| 
| On top of that, I think that 5% includes pieces like the microcode ROMs. Modern ARM cores almost certainly have (albeit much smaller) microcode ROMs as well, to handle the more complex state transitions.
| 
| Once again, totally agreed with your main point, but it's closer than what the general public consensus says.

| ant6n wrote:
| I wonder whether a modern byte-sized instruction encoding would sort of look like Unicode, where every byte is self-synchronizing... I guess it can be even weaker than that; probably only every second or fourth byte needs to synchronize.
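A toy software model of the two-stage length-decode scheme monocasa describes, with a hypothetical insn_length() oracle standing in for the per-offset length decoders (in hardware, stage 1 runs in parallel with one decoder per byte offset; this is an illustration of the idea, not how any shipping core is wired):

    /* lendec.c - sketch of parallel x86 length decode plus masking */
    #include <stdint.h>

    #define WINDOW 16   /* 16-byte fetch window aligned to the PC */

    /* hypothetical oracle: encoded length (1..15) of the instruction
       starting at p, or 0 if this decoder slot can't handle it */
    extern int insn_length(const uint8_t *p, int max_bytes);

    int find_boundaries(const uint8_t win[WINDOW], int starts[WINDOW])
    {
        int len_at[WINDOW];

        /* Stage 1: guess a length at every byte offset; decoders at
           higher offsets see fewer bytes, so they can be simpler */
        for (int off = 0; off < WINDOW; off++)
            len_at[off] = insn_length(win + off, WINDOW - off);

        /* Stage 2: walk the chain from offset 0; offsets that fall
           inside another instruction are skipped - the masking step */
        int n = 0;
        for (int off = 0; off < WINDOW && len_at[off] > 0; off += len_at[off])
            starts[n++] = off;

        return n;   /* instructions found in this window this cycle */
    }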
| pbsd wrote:
| The x86 decoder is not running all the time; the uops cache and the LSD exist precisely to avoid this. With instructions fed from the decoders you can only sustain 4 instructions per cycle, while to get to 5 or 6 your instructions need to be coming from either the uops cache or the LSD. In the case of Zen 3, the cache can deliver 8 uops per cycle to the pipeline (but the overall throughput is limited elsewhere at 6)!
| 
| Furthermore, the high-performance ARM designs, starting with the Cortex-A77, started using the same trick---the 6-wide execution happens only when instructions are being fed from the decoded macro-op cache.

| ant6n wrote:
| How can you run 8 instructions at the same time if you only have 16 general-purpose registers? You'd have to either be doing float ops or constantly spilling. So in integer code, how many of those instructions are just moving data between memory and registers (push/pop?).
| 
| I'd say ARM has a big advantage for instruction-level parallelism with 32 registers.

| mhh__ wrote:
| Register renaming, for a start - and this is about decoding, not execution.

| ant6n wrote:
| Okay, fair. But the bigger subject is the inherent performance advantage of the architecture. You don't just want to decode many instructions per cycle, you also want to issue them. So decoding width and issuing width are related.
| 
| And it seems to me that ARM has an advantage here. If you want to execute 8 instructions in parallel, you have to actually have 8 independent things that need to get executed. I guess you could have a giant out-of-order buffer and include stack locations in your register-renaming scheme, but it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent. Which is much easier if you have more registers - the compiler can then help the CPU keep all those instruction units fed.

| mhh__ wrote:
| The decoder might not be running strictly all the time, but I would wager that for some applications at least it doesn't make much of a difference. For HPC or DSP or whatever, where you spend a lot of time in relatively dense loops, the uop cache is probably big enough to ease the strain on the decoder, but for sparser code (compilers come to mind: lots of function calls and memory-bound work) I wouldn't be surprised if it didn't make as much difference.
| 
| I have vTune installed, so I guess I could investigate this if I dig out the right PMCs.

| pbsd wrote:
| I agree; compiler-type code will miss the cache most of the time. A simple test with clang++ compiling some nontrivial piece of C++:
| 
|   0              lsd_uops
|   1,092,318,746  idq_dsb_uops    ( +- 0.49% )
|   4,045,959,682  idq_mite_uops   ( +- 0.06% )
| 
| The LSD is disabled in this chip (Skylake) due to errata, but we can see only 1/5th of the uops come from the uops cache. However, the more relevant experiment in terms of power is how many cycles the cache is active instead of the decoders:
| 
|   0              lsd_cycles_active
|   378,993,057    idq_dsb_cycles  ( +- 0.18% )
|   1,616,999,501  idq_mite_cycles ( +- 0.07% )
| 
| The ratio is similar: the regular decoders are idle only around 1/5th of the time.
| 
| In comparison, gzipping a 20M file looks a lot better:
| 
|   0              lsd_cycles_active
|   2,900,847,992  idq_dsb_cycles  ( +- 0.07% )
|   407,705,985    idq_mite_cycles ( +- 0.33% )

| mhh__ wrote:
| This is why I said it's ARM's for the taking.
| 
| I'm not familiar with how ARM's memory model affects the cache design - source?
| jayd16 wrote:
| Another reason is the something-like-150% memory bandwidth, and I'm sure there are other simple wins along those lines.
| 
| The M1 isn't necessarily a win for Arm in general. Other manufacturers weren't competing before, and it's yet to be seen if they will.

| mhh__ wrote:
| It's the memory, stupid!

| to11mtm wrote:
| Specifically, the memory -latency-.
| 
| By going on-package there are almost certainly latency advantages, in addition to the much-vaunted bandwidth gains.
| 
| That's going to pan out to better perf, and likely better power usage as well.

| NathanielK wrote:
| 150% compared to what?

| jayd16 wrote:
| The latest i9 and the latest Ryzen 9, i.e. the competition.

| NathanielK wrote:
| Intel Tiger Lake and AMD Renoir both support 128-bit LPDDR4X at 4266 MHz. Maybe you're confusing them with the desktop chips that use conventional DDR4? The M1 isn't competitive with them.

| jayd16 wrote:
| Oh, those are pretty new, and I haven't seen any benchmarks with LPDDR in an equivalent laptop chip. Do you have a link to any?

| ravi-delia wrote:
| I've seen things like this a lot, and it's a bit confusing. If parts of the M1's performance are due to throwing compute at the problem, why hasn't Intel been doing that for years? What about ARM, or the M1, allowed this to happen?

| NathanielK wrote:
| Intel has. Many M1 design choices are fairly typical for desktop x86 chips, but unheard of with ARM.
| 
| For example, the M1 has 128-bit-wide memory. This has been standard for decades on the desktop (dual channel), but unheard of in cellphones. The M1 also has similar amounts of cache to the new AMD and Intel chips, but that's several times more than the latest Snapdragon. Qualcomm also doesn't just design for the latest node. Most of their volume is on cheaper, less dense nodes.

| dpatterbee wrote:
| Buying the majority of TSMC's 5nm process output helped. It's a combination of good engineering, the most advanced process, and Intel shitting themselves, I would say.

| tambourine_man wrote:
| > ...is pure and simple that it has a pretty enormous transistor budget
| 
| There's a lot of brute force, yes, but it's not the only reason. There are lots of smart design decisions as well.

| amelius wrote:
| Yes, but those decisions optimize for the single-user laptop case, not for, e.g., servers.

| mhh__ wrote:
| "One of the reasons" I did say.

| tambourine_man wrote:
| True, I misread it.

| phendrenad2 wrote:
| It really comes down to how well they can emulate x86. People aren't going to give up access to three decades of Windows software.

| pjerem wrote:
| I'm sure ARM has already overtaken x86 if you use a wider definition of personal computers. And a lot of people already gave up access to three decades of Windows software by using their phone or tablet as their main device.
| 
| Plus, most software of the last decade runs on some sort of VM or another (be it the JVM, the CLR, a JavaScript engine, or even LLVM).
| 
| Soon (in years), x86 will only be needed by professionals that are tied to really old software. And those particular needs will probably be satisfied by decent emulation.

| kllrnohj wrote:
| > Soon (in years), x86 will only be needed by professionals that are tied to really old software.
| 
| There are also the PC & console gaming markets, which are not small and have not made any movements of any kind towards ARM so far.

| bitwize wrote:
| > So is ARM the future at this point?
| 
| The near future.
| A few years out, RISC-V is gonna change everything.

| CalChris wrote:
| Apple isn't entering the cloud market. Moreover, the M1 isn't a cloud CPU. The M1 SoC emphasizes low latency and performance per watt over throughput.

| 1MachineElf wrote:
| I wonder what percentage of its supported toolchain components will be proprietary.

| CalChris wrote:
| _Grace, in contrast, is a much safer project for NVIDIA; they're merely licensing Arm cores rather than building their own ..._
| 
| NVIDIA is buying ARM.

| klelatti wrote:
| Trying to buy Arm.
| 
| Multiple competition investigations permitting.
___________________________________________________________________
(page generated 2021-04-12 23:00 UTC)