[HN Gopher] Nvidia Unveils Grace: A High-Performance Arm CPU for...
       ___________________________________________________________________
        
       Nvidia Unveils Grace: A High-Performance Arm CPU for Use in Big AI
       Systems
        
       Author : haakon
       Score  : 249 points
       Date   : 2021-04-12 16:32 UTC (6 hours ago)
        
 (HTM) web link (www.anandtech.com)
 (TXT) w3m dump (www.anandtech.com)
        
       | crb002 wrote:
       | +1 ECC RAM
        
       | legulere wrote:
       | Big Data, Big AI, what's next? Big Bullshit?
        
         | jhgb wrote:
         | Nah, that's already been here for quite a while.
        
       | rexreed wrote:
       | Honestly the bottom down-voted comment has it right. What AI
       | application is actually driving demand here? What can't be
       | accomplished now (or with reasonable expenditures) that can be
       | accomplished by this one CPU that will be released in 2 yrs? What
       | AI applications will need this 2 yrs from now that don't need it
       | now?
       | 
       | I understand the here-and-now AI applications. But this is
       | smelling more like Big AI Hype than Big AI need.
        
         | cracker_jacks wrote:
         | "640K ought to be enough for anybody."
        
       | cma wrote:
       | Real business-class features we want to know about:
       | 
       | Will they auto-detect workloads and cripple performance (like the
       | mining stuff recently)? Only work through special drivers with
        | extra licensing fees depending on the name of the building it is
       | in (data center vs office)?
        
         | rubatuga wrote:
         | Market segmentation is practiced by every chip company that you
          | use. Intel: ECC. AMD: ROCm. Qualcomm: cost as percentage of the
         | phone price.
        
           | cma wrote:
           | I still think Nvidia takes it further.
        
             | volta83 wrote:
             | Every company does market segmentation: it makes sense to
             | have customers that want a feature pay more for it.
             | 
             | Still, every company does it differently.
             | 
             | For example, both NVIDIA and AMD compute GPUs are
             | necessarily more expensive than gamer GPUs because of
             | hardware costs (e.g. HBM).
             | 
             | However, NVIDIA gamer GPUs can do CUDA, while AMD gamer
             | GPUs can't do ROCm.
             | 
             | The reason is that NVIDIA has 1 architecture for gaming and
             | compute (Ampere), while AMD has two different architectures
             | (RDNA and CDNA).
        
               | cma wrote:
               | It's common, but only possible in a very dominant
               | position or with competitors that are borderline
               | colluding.
        
               | volta83 wrote:
               | You must be the only gamer in the world that wants an
               | HBM2e GPU for gaming that's 10x more expensive while only
               | delivering a negligible improvement in FPS.
        
               | cma wrote:
               | I'm only talking about driver/license locks, not
               | different ram types.
        
       | Aissen wrote:
       | GPU-to-CPU interface >900GB/sec NVLink 4. What kind of
        | interconnect is that? Is that even physically realistic?
        
         | freeone3000 wrote:
         | Depends on how big you want to make it. If they're willing to
         | go four inches, that'd do it with existing per-pin speeds from
         | NVLink 3.
        
         | rincebrain wrote:
         | Well, according to [1], NVIDIA lists NVLink 3.0 as being 50
         | Gb/s per lane per direction, and lists the total maximum
         | bandwidth of NVSwitch for Ampere (using NVLink 3.0) as 900 GB/s
         | each direction, so it doesn't seem completely out of reach.
         | 
         | [1] - https://en.wikipedia.org/wiki/NVLink
        
           | Aissen wrote:
           | With 50Gb/s per lane, that would be 144 lanes to reach
           | 900GB/s. Quite impressive.
        
             | [deleted]
        
             | rincebrain wrote:
             | Fascinatingly, NVIDIA's own docs [1] claim GPU<->GPU
             | bandwidth on that device of 600 GB/s (though they claim
             | total aggregate bandwidth of 9.6 TB/s). Which would be
             | what, 96 and 1536 lanes, respectively? That's quite the
             | pinout.
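              | 
              | As a rough sanity check on these lane counts (a sketch
              | that assumes a flat 50 Gb/s per lane per direction and
              | ignores encoding overhead):
              | 
              |     #include <stdio.h>
              | 
              |     /* lanes = (GB/s target * 8 bits) / 50 Gb/s */
              |     int main(void) {
              |         double lane_gbps = 50.0;
              |         double gbs[] = { 900, 600, 9600 };
              |         for (int i = 0; i < 3; i++)
              |             printf("%6.0f GB/s -> %4.0f lanes\n",
              |                    gbs[i], gbs[i] * 8 / lane_gbps);
              |         return 0;   /* 144, 96, and 1536 lanes */
              |     }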
             | 
             | [1] - https://www.nvidia.com/en-us/data-center/nvlink/
        
         | robomartin wrote:
          | Well, PCIe 6.0 x16 will do 128 GB/s. Of course, the real
          | question is how many transactions per second you get. PCIe
          | 6.0 signals at about 64 GT/s per lane.
         | 
         | Speaking in general terms, data rate and transaction rate don't
         | necessarily match because a transaction might require the
         | transmitter to wait for the receiver to check packet integrity
         | and then issue acknowledgement to the transmitter before a new
         | packet can be sent.
         | 
         | Yet another case, again, speaking in general terms, would be
         | the case of having to insert wait states to deal with memory
         | access or other processor architecture issues.
         | 
          | Simple example: on an STM32 processor you cannot toggle I/O in
          | software at anywhere close to the CPU clock rate, due to
          | architectural constraints (including the instruction set). On
          | a processor running at 48 MHz you can only do a max toggle rate
          | of about 3 MHz (toggle rate = number of state transitions per
          | second).
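          | 
          | To make that concrete, here's a minimal sketch of the
          | bottleneck (the register address is hypothetical; on a real
          | STM32 you'd use the vendor header, and exact cycle counts
          | vary by part):
          | 
          |     #include <stdint.h>
          | 
          |     /* memory-mapped GPIO output register (made-up addr) */
          |     #define GPIO_ODR (*(volatile uint32_t *)0x48000014u)
          | 
          |     void toggle_forever(void) {
          |         for (;;)
          |             GPIO_ODR ^= 1u << 5; /* load, xor, store... */
          |     }                            /* ...plus the branch  */
          | 
          | Each pass is a read-modify-write plus a branch back, several
          | cycles per toggle, which is how a 48 MHz core ends up with
          | only ~3 MHz on the pin.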
        
       | alexhutcheson wrote:
       | The fact that they are using a Neoverse core licensed from ARM
       | seems to imply that there won't be another generation for
       | NVidia's Denver/Carmel microarchitectures. Somewhat of a shame,
       | because those microarchitectures were unorthodox in some ways,
       | and it would have been interesting to see where that line of
        | evolution would have led.
       | 
       | I believe this leaves Apple, ARM, Fujitsu, and Marvell as the
       | only companies currently designing and selling cores that
       | implement the ARM instruction set. That may drop to 3 in the next
       | generation, since it's not obvious that Marvell's ThunderX3 cores
        | are really seeing enough traction to be worth the non-
       | recurring engineering costs of a custom core. Are there any
       | others?
        
         | klelatti wrote:
         | Designing but not yet selling Qualcomm / Nuvia?
        
           | alexhutcheson wrote:
           | Yeah will be interesting to see if and when they bring a
           | design to market.
        
       | Bluestein wrote:
       | The whole combination of AI and the name gives "watched over by
       | machines of loving grace" a whole new twist, eh?
        
       | TheMagicHorsey wrote:
       | Is anyone but Apple making big investments in ARM for the
       | desktop? This is another ARM for the datacenter design.
       | 
       | If other companies don't make genuine investments in ARM for the
        | desktop there's a real chance that Apple will get a huge and
        | difficult-to-assail application performance advantage as
       | application developers begin to focus on making Mac apps first,
       | and port to x86 as an afterthought.
       | 
       | Something similar happened back in the day when Intel was the de
       | facto king, and everything on other platforms was a handicapped
       | afterthought.
       | 
       | I wouldn't want to have my desktops be 15 to 30% slower than Macs
       | running the same software, simply because of emulation or lack of
       | local optimizations.
       | 
       | So I'm really looking forward to ARM competition on the desktop.
        
       | callesgg wrote:
        | Super-parallel ARM chips: could that not be a future thing for
        | Nvidia or another chip manufacturer? A normal CPU die with
        | thousands of independent cores.
        
       | modeless wrote:
       | I hope they make workstations. I want to see some competition for
       | the eventual Apple Silicon Mac Pro.
        
         | macksd wrote:
         | You probably mean less powerful than this, but they do:
         | https://www.nvidia.com/en-us/deep-learning-
         | ai/solutions/work....
        
           | modeless wrote:
           | Yes they make workstations, but they don't make ARM
           | workstations. Yet. They already have ARM chips they could use
           | for it, but they went with x86 instead despite the fact that
           | they have to purchase the x86 chips from their direct
           | competitor. Also, yes, less than $100k starting price would
           | be nice.
        
         | dhruvdh wrote:
          | They are licensing ARM cores, which as of now cannot compete
          | with Apple silicon.
          | 
          | While they are using some future ARM core, and I've read
          | rumors that future designs might try to emulate what has made
          | Apple cores successful, we cannot say whether Apple designs
          | will stagnate or continue to improve at the current rate.
         | 
         | There is potential for competition from Qualcomm after their
         | Nuvia acquisition though.
        
           | adgjlsfhk1 wrote:
           | It seems weird to me to say that arm cores can't compete with
           | apple silicon given that apple doesn't own fabs. They are
           | using arm cores on TSMC silicon (exactly the same as this).
        
             | seabrookmx wrote:
             | > They are using arm cores on TSMC silicon (exactly the
             | same as this)
             | 
              | No, the Apple Silicon chips use the arm _instruction
              | set_ but they do not use their core design. Apple designs
              | their cores in-house, much like Qualcomm does with
              | Snapdragon.
             | Both of these companies have an architectural license which
             | allows them to do this.
        
               | tibbydudeza wrote:
               | Qualcomm no longer makes their own cores - they just use
               | ARM reference IP designs since the Kryo.
               | 
               | That will probably change with their Nuvia acquisition.
        
           | ac29 wrote:
           | Maybe not in single threaded performance, but Apple has no
           | server grade parts. Ampere, for example, is shipping an 80
            | core ARM N1 processor that puts out some truly impressive
           | multithreaded performance. An M1 Mac is an entirely different
           | market - making a fast 4+4 core laptop processor doesn't
            | necessarily translate into making a fast 64+ core server
           | processor.
        
           | devmor wrote:
           | What do you mean ARM cores can't compete with Apple silicon?
           | "Apple silicon" are ARM cores.
        
             | dharmab wrote:
             | Apple Silicon is compatible with the ARM instruction set
             | but they are not "just ARM cores" in their internal design.
        
             | mlyle wrote:
             | He means cores made by ARM, not cores implementing the ARM
             | ISA. Currently, the cores designed by ARM cannot touch the
             | Apple M1.
        
         | [deleted]
        
         | titzer wrote:
         | I think Apple did Arm an unbelievable favor by absolutely
         | trouncing all CPU competitors with the M1. By being so fast,
         | Apple's chip attracts many new languages and compiler backends
         | to Arm that want a piece of that sweet performance pie. Which
         | means that other vendors will want to have arm offerings, and
          | not, e.g., RISC-V.
         | 
         | I have no idea what Apple's plans for the M1 chip are, but if
         | they had manufacturing capacity, they could put oodles of these
         | chips into datacenters and workstations the world over and
         | basically eat the x86 high-performance market. The fact that
         | the chip uses so little power (15W) means they can absolutely
         | cram them into servers where CPUs can easily consume 180W. That
         | means 10x the number of chips for the same power, and not all
         | concentrated in one spot. A lot of very interesting server
         | designs are now possible.
        
           | klelatti wrote:
           | It's hard to imagine that until a few months ago it was very
           | difficult to get a decent Arm desktop / laptop. I imagine
           | lots of developers working now to fix outstanding Arm bugs /
           | issues.
        
             | giantrobot wrote:
             | While I'm sure lots of projects have actual ARM-related
             | bugs, there was a whole class of "we didn't expect this
             | platform/arch combination" compilation bugs that have seen
             | fixes lately. It's not that the code has bugs on ARM, a lot
             | of OSS has been compiling on ARM for a decade (or more)
              | thanks to Raspberry Pis, Chromebooks, and Android, but
              | build scripts didn't understand "darwin/arm64". Back in
              | December
             | installing stuff on an M1 Mac via Homebrew was a pain but
             | it's gotten significantly easier over the past few months.
             | 
             | But a million (est) new general purpose ARM computers
             | hitting the population certainly affects the prioritizing
             | of ARM issues in a bug tracker.
        
           | mhh__ wrote:
           | > compiler backends to Arm that want a piece of that sweet
           | performance pie
           | 
           | How many compilers didn't support ARM?
        
       | GrumpyNl wrote:
        | I need a new video card and there are no Nvidia cards to buy;
        | they're all bought up by miners. Will it go the same way with
        | this one?
        
         | redtriumph wrote:
         | Currently, there are no plans for consumer-grade CPUs. Even
         | this new CPU class is shipping in 2023.
        
       | remexre wrote:
       | > Today at GTC 2021 NVIDIA announces its first CPU
       | 
       | Wait, Nvidia's been making ARM CPUs for years now; most memorably
       | Project Denver.
        
         | 015a wrote:
         | Arguably, most memorably, Tegra; the CPU/GPU which powers the
         | Nintendo Switch.
        
           | Jasper_ wrote:
           | That uses a licensed ARM Cortex design under the hood.
        
         | jdsully wrote:
         | NVIDIA called it their first "data center CPU". Our helpful
         | reporter simplified it to the point of being flat out wrong.
         | Not uncommon.
        
           | justin66 wrote:
           | I expected more from a site called VideoCardz.
        
       | titzer wrote:
       | Given that there are essentially no architectural details here
       | other than bandwidth estimates, and the release timeline is in
       | 2023, how exactly does this count as "unveiling"? Headline should
       | read: "NVidia working on new arm chip due in two years", or
       | something else much more bland.
        
         | mrlento234 wrote:
          | Not quite. The CSCS supercomputing center in Switzerland has
          | already started receiving the hardware
          | (https://www.cscs.ch/science/computer-science-
          | hpc/2021/cscs-d...). Perhaps we'll see some benchmarks. To
          | wider HPC users it will only be available in 2023, as the
          | article mentions.
        
           | IanCutress wrote:
           | I suspect that's more racks of storage, not racks of compute.
           | Nothing to suggest it's compute.
        
             | seniorivn wrote:
              | As I understand it, it's compute, just not CPU compute;
              | those CPUs are designed to be good enough for CUDA
              | servers.
        
             | DetroitThrow wrote:
             | Hey Ian, I love reading your posts on Anandtech, you're a
             | fantastic technical communicator.
        
           | titzer wrote:
           | Hopefully some architectural details are forthcoming then!
           | But that is not what is in this article.
        
         | allie1 wrote:
          | As AMD showed us, a lot can happen in 3 years.
        
       | valine wrote:
       | I like the sound of a non-Apple arm chip for workstations. Given
       | my positive experience with the M1 I'd be perfectly happy never
       | using x86 again after this market niche is filled.
        
         | webaholic wrote:
         | I don't think this will be anywhere near as good as the M1,
         | since they are using the ARM Neoverse cores.
        
           | ac29 wrote:
           | Apple throws a lot of transistors at their 4 performance
            | cores in the M1 to get the performance they do - it's not
           | clear that approach would realistically scale to a
           | workstation CPU with 16, 32, or more cores (at least not with
           | current fab capabilities).
        
         | awill wrote:
          | Me too. But my decades-old Steam collection isn't looking
         | forward to it. That's one advantage of cloud gaming. It won't
         | matter what your desktop runs on.
        
       | nabla9 wrote:
        | Finally, news from Nvidia that really moved markets.
        | 
        |   Nvidia +4.68%
        |   Intel  -4.65%
        |   AMD    -4.47%
        
         | 01100011 wrote:
         | I wonder how permanent this is. As a Nvidian who sells his
         | shares as soon as they vest and who owns some Intel for
         | diversification, I wonder if I should load up on Intel? You
         | really can't compete with their fab availability. Having a
         | great design means nothing unless you can get TSMC to grant you
         | production capacity.
        
           | nabla9 wrote:
            | TSMC takes orders years ahead and builds capacity to
            | match, working together with big customers. Those who pay
            | more
           | (price per unit and large volume) get first shot. That's why
           | Apple is always first, followed by Nvidia and AMD, then
           | Qualcomm.
           | 
            | There is bottled-up demand because Intel's failure to
           | was not fully anticipated by anyone.
        
       | gchadwick wrote:
       | It'd be interesting to know if NVidia are going for an ARMv9
       | core, in particular if they'll have a core with an SVE2
       | implementation.
       | 
       | It may be they don't want to detract from focus on the GPUs for
       | vector computation so prefer a CPU without much vector muscle.
       | 
       | Also interesting that they're picking up an arm core rather than
       | continuing with their own design. Something to do with the
       | potential takeover (the merged company would only want to support
       | so many micro-architectural lines)?
        
         | adrian_b wrote:
          | They have said clearly that the core is licensed from ARM
          | and is one of the future Neoverse models.
          | 
          | There was no information on whether it will have a good SVE2
          | implementation; on the contrary, they stressed only the
          | integer performance and the high-speed memory interface.
        
           | dragontamer wrote:
            | Neoverse V1 has SVE; the Neoverse E and N do not.
           | 
           | "E" is efficiency, N is standard, V is high-speed. IIRC, N is
           | the overall winner in performance/watt. Efficiency cores have
           | the lowest clock speed (overall use the least amount of
           | watts/power). V purposefully goes beyond the performance/watt
            | curve for higher per-core compute capabilities.
        
             | Teongot wrote:
             | Neoverse-N2 will have SVE2 (source https://github.com/gcc-
             | mirror/gcc/blob/master/gcc/config/aar... )
        
           | gchadwick wrote:
           | Here's Anandtech's article on the previous Neoverse V1/N2
           | announcement: https://www.anandtech.com/show/16073/arm-
            | announces-neoverse-... Arm weren't saying anything
            | official, but Anandtech did a little digging and reckons
            | the V1 is Armv8 with SVE and the N2 could be Armv9 with
            | SVE2.
            | 
            | I'd suspect NVidia would be using the V1 here, as it's the
            | higher-performing core, but there's no way to be certain.
        
         | klelatti wrote:
          | This has got me wondering whether an Nvidia-owned Arm could
         | limit SVE2 implementations so as not to compete with Nvidia's
         | GPU. That would certainly be the case for Arm designed cores -
         | not a desirable outcome.
        
           | MikeCapone wrote:
           | I doubt it, it's not like the market for acceleration is
           | stagnant and saturated and they need to steal some
           | marketshare points from one side to help the other.
           | 
           | It's all greenfield and growing so far, they'll win more by
           | having the very best products they can make on both sides.
        
             | mlyle wrote:
              | You'd think. But it wouldn't be the first time a new
              | product is hampered to avoid even slightly cannibalizing
              | an existing product family.
        
         | theonlyklas wrote:
         | I think they will use SVE2 because I assume they'll need to
         | perform vector reads/writes to NVLink connected peripherals to
         | reach that 900GB/s GPU-to-CPU bandwidth metric they described.
        
       | api wrote:
       | Tangent: Apple should bring back the Xserve with their M1 line,
       | or alternately license the M1 core IP to another company to
       | produce a differently-branded server-oriented chip. The
       | performance of that thing is mind blowing and I don't see how
       | this would compete with or harm their desktop and mobile
       | business.
        
         | bombcar wrote:
         | How much of that performance is on-chip memory and how
         | usable/scalable is that? An Xserve that is limited to one CPU
          | and can't have more RAM would be pretty mediocre.
        
         | AnthonyMouse wrote:
         | The cheapest available Epyc (7313P) has 16 cores and dual
         | socket systems have up to 128 cores and 256 threads. Server
         | workloads are massively parallel, so a 4+4 core M1 would be
         | embarrassed and Apple wouldn't want to subject themselves to
         | that comparison.
         | 
         | But another reason they won't do it is that TSMC has a finite
         | amount of 5nm fab capacity. They can't make more of the chips
         | than they already do.
        
           | api wrote:
           | I'm thinking of a 64-core M1. It would not be the laptop
           | chip.
        
             | ac29 wrote:
              | A 4+4 core M1 is 16 billion transistors. Some of that is
              | the little cores, GPU, etc, but it's not clear to me it's
              | practical to get, say, 8x larger. That would be 128
              | billion transistors. As a point of comparison, NVIDIA's
              | RTX 3090 is 28B transistors, and that's a huge, expensive
              | chip.
        
       | [deleted]
        
       | [deleted]
        
       | rektide wrote:
       | There's a lot of interconnects (CCIX, CXL, OpenCAPI, NVLink,
       | GenZ) brewing. Nvidia going big is, hopefully, a move that will
       | prompt some uptake from the other chip makers. 900GBps link, more
       | than main memory: big numbers there. Side note, I miss AMD being
       | actively involved with interconnects. InfinityFabric seems core
       | to everything they are doing, but back in the HyperTransport days
       | it was something known, that folks could build products for,
       | interoperate with. Not many did, but it's still frustrating
       | seeing AMD keeping cards so much closer to the chest.
        
       | filereaper wrote:
        | Looks like NVidia broke up with IBM's POWER and made their own
       | chip.
       | 
       | They have interconnects from Mellanox, GPUs and their own CPUs
       | now.
       | 
       | I suspect the supercomputing lists will be dominated by NVidia
       | now.
        
         | arcanus wrote:
         | That is certainly the trend. AMD is bringing Frontier online
         | later this year, which might be the only counter to this.
        
       | DonHopkins wrote:
       | I love the name "Grace", after Grace Hopper.
        
         | paulmd wrote:
          | There's a tendency to refer to women by their first names in
          | professional or political settings, which is somewhat
          | infantilizing and demeaning.
         | 
         | I doubt anyone really deliberately sets out to be like "haha
         | yessss today I shall elide this woman's credentials", but this
         | is one of those unconscious gender-bias things that is
         | commonplace in our society and is probably best to try and make
         | a point of avoiding.
         | 
         | https://news.cornell.edu/stories/2018/07/when-last-comes-fir...
         | 
         | https://metro.co.uk/2018/03/04/referring-to-women-by-their-f...
         | 
         | (etc etc)
         | 
         | I'd prefer they used "Hopper" instead, in the same way they
         | have chosen to refer to previous architectures by the last
         | names of their namesakes (Maxwell, Pascal, Ampere, Volta,
         | Kepler, Fermi, etc). I'd see that as being more professionally
         | respectful for her contributions.
         | 
         | But yes I very much like the idea of naming it after Hopper.
        
           | bloak wrote:
           | Perhaps you're being downvoted because it's a tangent. It's a
           | real phenomenon, though, and an interesting one. Of course
           | there are many things that influence which parts of someone's
           | full name get used, and if the tendency is a problem it's a
           | trivial one compared to all the other problems that women
           | face, but, yes, in general it would probably be a good idea
           | to be more consistent in this respect.
           | 
           | Vaguely related: J. K. Rowling's "real" full name is Joanne
           | Rowling. The publisher "thought a book by an obviously female
           | author might not appeal to the target audience of young
           | boys".
           | 
           | There's another famous (in the UK at least) computer
           | scientist called Hopper: Andy Hopper. So "G.B.M. Hopper",
           | perhaps? That would have more gravitas than "Andy"!
        
           | hderms wrote:
           | I feel like there's a non-zero chance they named it Grace
           | instead of Hopper so their new architecture doesn't sound
           | like a bug or a frog or something. You could be right, though
        
           | trynumber9 wrote:
           | Hopper was already reserved for an Nvidia GPU:
           | https://en.wikipedia.org/wiki/Hopper_(microarchitecture)
        
             | paulmd wrote:
             | Yeah, I dunno what is going on with that, I assumed that
             | had changed if they were going to use the name "grace" for
             | another product.
             | 
             | I guess I'm not sure if "Hopper" refers to the product as a
             | whole (like Tegra) and early leakers misunderstood that, or
             | whether Hopper is the name of the microarchitecture and
             | "Grace" is the product, or if it's changed from Hopper to
             | Grace because they didn't like the name, or what.
             | 
             | Otherwise it's a little awkward to have products named both
             | "grace" and "hopper"...
        
       | lprd wrote:
       | So is ARM the future at this point? After seeing how well Apple's
       | M1 performed against a traditional AMD/Intel CPU, it has me
       | wondering. I used to think that ARM was really only suited for
       | smaller devices.
        
         | hilios wrote:
          | Depends. Performance-wise it should be able to compete with or
         | even outperform x86 in many areas. A big problem until now was
         | cross compatibility regarding peripherals, which complicates
         | running a common OS on ARM chips from different vendors. There
         | is currently a standardization effort (Arm SystemReady SR) that
         | might help with that issue though.
        
         | Hamuko wrote:
         | Based on initial testing, AWS EC2 instances with ARM chips
         | performed as well if not better than the Intel instances, but
         | they cost 20% less. The only drawback that I've really
         | encountered thus far was that it complicates the build process.
        
           | moistbar wrote:
           | Does ARM have a uniquely complex build process, or is it the
           | mix of architectures that makes it more difficult?
        
             | sumtechguy wrote:
              | ARM is all over the place with its ISA. x86 has the
              | benefit that most companies made it 'IBM compatible'.
              | There are one-off x86 ISAs but they are mostly forgotten
              | at this point. The ARM CPU family itself is fairly
              | consistent (mostly), but the included hardware is a very
              | mixed bag. x86, on the other hand, has the history of
              | 'build it to work like IBM': all the way from how things
              | boot, memory space addresses, must-have I/O, etc. An ARM
              | system may or may not have any of that, depending on
              | which ISA you target or are creating. Things like the
              | Raspberry Pi have changed some of that, as many boards
              | now mimic the Broadcom setup, specifically the Raspberry
              | Pi one. The x86 arch has also picked up some interesting
              | baggage along the way because of what it is. We can
              | mostly ignore it, but it is there. For example, you would
              | not build an ARM board these days with an IDE interface,
              | but some of those bits still exist in the x86 world.
              | 
              | ARM is more of a toolkit for building different purpose-
              | built computers (you even see them show up in USB
              | sticks), while x86 is a particular ISA with a long
              | history behind it. So you may see something like 'Amazon
              | builds its own ARM computers'. That means they spun their
              | own boards, built their own toolchains (more likely
              | recompiled existing ones), and probably have their own OS
              | distro to match. Each of those is a fairly large
              | endeavor. When you see 'Amazon builds its own x86
              | boards', they have shaved off the other two parts and are
              | focusing on the hardware. That they are building their
              | own means they see the value in owning the whole stack.
              | Also, having your own distro means you usually have to
              | 'own' building the whole thing. I can go grab an x86 gcc
              | stack from my repo provider; they will need to act as the
              | repo owner, build it themselves, and keep up with the
              | patches. Depending on what has been added, that can be
              | quite a task all by itself.
        
             | Hamuko wrote:
             | Mix of architectures and the fact that our normal CI server
             | is still x86-based and really didn't want to do ARM builds.
        
       | ksec wrote:
        | Based on a future ARM Neoverse core, so basically nothing much
        | to see here from a CPU perspective. What really stands out are
        | those ridiculous numbers from its memory system and
        | interconnect.
        | 
        | CPU: LPDDR5X with ECC memory at 500+ GB/s memory bandwidth.
        | (Something Apple may dip into. R.I.P. for Macs with upgradable
        | memory.)
        | 
        | GPU: HBM2e at _2000_ GB/s. Yes, three zeros, this is not a typo.
        | 
        | NVLink: 500 GB/s
        | 
        | This will surely further solidify CUDA dominance. Not entirely
        | sure how Intel's Xe with oneAPI and AMD's ROCm are going to
        | compete.
        
         | Dylan16807 wrote:
         | > GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a
         | typo.
         | 
         | It's a good step forward but your average consumer GPU is
         | already around a quarter to a third of that and a Radeon VII
         | had 1000 GB/s two years ago.
        
           | jabl wrote:
           | The Nvidia A100 80GB already provides 2 TB/s mem BW today.
           | Also using HBM2e.
        
           | m_mueller wrote:
           | I think what you're missing here is the NVLink part. The fact
           | that you can get a small cluster of these linked up like that
           | for 400k, all wrapped in a box, makes HPC quite a bit more
           | accessible. Even 5 years ago, if you wanted to run a regional
           | sized weather model at reasonable resolution, you needed to
           | have some serious funding (say, nation states or oil /
           | insurance companies). Nowadays you could do it with some
           | angel investment and get one of these Nvidia boxes and just
           | program them like they're one GPU.
        
             | kllrnohj wrote:
             | Critically it's CPU to GPU NVLink here, not the "boring"
             | GPU to GPU NVLink that's common on Quadros. 500GB/s
             | bandwidth between CPU & GPU massively changes when & how
             | you can GPU accelerate things, that's a 10X difference over
             | the status quo.
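              | 
              | Rough arithmetic behind that 10X claim (assuming PCIe 4.0
              | x16 at roughly 2 GB/s per lane per direction as the
              | status quo):
              | 
              |     #include <stdio.h>
              | 
              |     int main(void) {
              |         double pcie4_x16 = 16 * 2.0; /* ~32 GB/s/dir */
              |         double nvlink    = 500.0;    /* announced    */
              |         /* ~15.6x: "10X" is if anything conservative,
              |            depending on how you count directions */
              |         printf("%.1fx\n", nvlink / pcie4_x16);
              |         return 0;
              |     }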
        
               | kimixa wrote:
               | Also "cpu->cpu" NVLink is interesting. Though it was my
               | understanding that NVLink is point-to-point, and would
               | require some massive switching system to be able to
               | access any node in the cluster anywhere near that rate
                | without some locality bias (i.e. nodes on the "first"
                | downstream switch are faster to access and have less
                | contention).
        
       | de6u99er wrote:
       | Don't know if it's just me but this product looks like a beta-
       | product for early adopters.
        
         | rektide wrote:
         | It's initially for two huge HPC systems. It'll be interesting
         | to see what kind of availability it ever has to the rest of the
         | world.
        
       | lprd wrote:
       | So is ARM the future at this point? After seeing how well Apple's
       | M1 performed against a traditional AMD/Intel CPU, it has me
       | wondering. I used to think that ARM was really only suited for
       | smaller devices.
        
         | fulafel wrote:
          | The instruction set doesn't make a significant difference
          | technically; the main things about ISAs are the monopolies
          | (patents) tied to them, and software compatibility.
        
           | rvanlaar wrote:
           | I'm interested in your thoughts on why this doesn't make a
           | significant difference. From what I've read, the M1 has a lot
           | of tricks up its sleeve that are next to impossible on X86.
           | For example ARM instructions can be decoded in parallel.
        
         | kllrnohj wrote:
         | It will come down entirely to who can sustain a good CPU core.
         | 
         | Currently Apple is the only company making performance-
         | competitive ARM cores that can make a reasonable justification
         | for an architecture switch.
         | 
         | Otherwise AMD's CPUs are still ahead of everyone else,
         | including all other ARM CPU cores not made by Apple. And even
         | Intel is still faster in places where performance matters more
         | than power efficiency (eg, desktop & PC gaming)
        
           | aeyes wrote:
            | Amazon's ARM chips are performance-competitive as well; for
           | many workloads you can expect at least similar performance
           | per core at the same clock speed.
        
           | floatboth wrote:
           | Arm's Neoverse cores are doing pretty well in the datacenter
           | space -- on AWS, the Graviton2 instances are currently the
           | best ones for lots of use cases. It's clear that core designs
           | by Arm are really good. The problem currently is the lag
           | between the design being done and various vendors' chips
           | incorporating it.
           | 
           | upd: oh also in the HPC world, Fujitsu with the A64FX seems
           | to be like the best thing ever now
        
             | rubatuga wrote:
             | Fujitsu flying under the radar while having the fastest cpu
             | ever made haha
        
             | kllrnohj wrote:
             | Graviton2 is competitive sometimes with Epyc, but also
             | falls far behind in some tests (eg, Java performance is a
              | bloodbath). Overall, across the majority of tests, Neoverse
             | consistently comes up short of Milan even when Neoverse is
             | given a core-count advantage. And critically the per-core
             | performance of Graviton2 / Neoverse is worse, and per-core
             | performance is what matters to consumer space.
             | 
              | But it can't just be competitive; it needs to be
             | significantly better in order for the consumer space to
             | care. Nobody is going to run Windows on ARM just to get
             | equivalent performance to Windows on X86, especially not
             | when that means most apps will be worse. That's what's
             | really impressive about the M1, and so far is very unique
             | to Apple's ARM cpus.
             | 
             | > oh also in the HPC world, Fujitsu with the A64FX seems to
             | be like the best thing ever now
             | 
             | A64FX doesn't appear to be a particularly good CPU core,
             | rather it's a SIMD powerhouse. It's the AVX-512 problem -
             | when you can use it, it can be great. But you mostly can't,
             | so it's mostly dead weight. Obviously in HPC space this is
             | different scenario entirely, but that's not going to
             | translate to consumer space at all (and it's not an ARM
             | advantage, either - 512bit SIMD hit consumer space via x86
             | first with Intel's Rocket Lake).
        
               | klelatti wrote:
               | Not sure why you're placing so much weight on Epyc
               | outperforming Graviton but discounting designs / use
               | cases where Arm is clearly now better. Plus it's clear
               | that we are just at the beginning of a period where some
               | firms with very deep pockets are starting to invest
               | seriously in Arm on the server and the desktop.
               | 
               | If x64 ISA had major advantages over Arm then that would
               | be significant, but I've not heard anyone make that case:
               | instead it's a debate about how big the Arm advantage is.
               | 
                | Can x64 remain competitive in some segments? Probably,
                | and inertia will work in its favour. I do think it's
               | inevitable that we will see a major shift to Arm though.
        
           | huac wrote:
            | So then we think about what makes Apple's M1 so good. One
            | hard-to-replicate factor is that they designed their
            | hardware and software together; the ops which macOS uses
            | often are heavily optimized on chip.
            | 
            | But one factor that you can replicate is colocating memory,
            | CPU, and GPU: the system-on-chip architecture. That's what
            | Nvidia looks to be going after with Grace, and I'm sure
            | they've learned lessons from their integrated designs, e.g.
            | Jetson. Very excited to see how this plays out!
        
             | kllrnohj wrote:
             | > one hard-to-replicate factor is that they designed their
             | hardware and software together, the ops which MacOS uses
             | often are heavily optimized on chip.
             | 
             | Not really, they are still just using the same ARM ISA as
             | everyone else. The only hardware/software integration magic
             | of the M1 so far seems to be the x86 memory model emulation
             | mode, which others could definitely replicate.
             | 
             | > but one factor that you can replicate is colocating
             | memory, CPU, and GPU, the system-on-chip architecture.
             | 
             | AMD introduced that in the x86 world back in 2013 with
              | their Kaveri APU ( https://www.zdnet.com/article/a-closer-
             | look-at-amds-heteroge... ), and it's been fairly typical
             | since then for on-die integrated GPUs on all ISAs.
        
         | dkjaudyeqooe wrote:
         | ARM is the present, RISC-V is the future and Intel is the past.
         | 
         | The magic of Apple's M1 comes from the engineers who worked on
         | the CPU implementation and the TSMC process.
         | 
         | The architecture has some impact on performance but I think it
          | is simplicity and ease of implementation that factors most
         | into how well it can perform (as per the RISC idea). In that
         | sense Intel lags for small, fast and efficient processors
         | because their legacy architecture pays a penalty for decoding
         | and translation (into simpler ops) overhead. Eventually designs
         | will abandon ARM for RISC-V for similar reasons as well as
         | financial ones.
         | 
         | Really, today it's a question of who has the best
         | implementation of any given architecture.
        
         | mhh__ wrote:
          | The next decade is ARM's for the taking, _but_ if Intel and
          | AMD can make good cores then it's not anywhere close to a
          | slam dunk.
         | 
         | One of the reasons why M1 is good is pure and simple that it
         | has a pretty enormous transistor budget, not solely because
         | it's ARM.
        
           | api wrote:
           | Being ARM has something to do with it. The x86 instruction
           | decoder may be only about ~5% of the die, but it's 5% of the
           | die that has to run _all the time_. Think about how warm your
           | CPU gets when you run e.g. heavy FPU loads and then imagine
            | that's happening all the time. You can see the power
           | difference right there.
           | 
           | It's also very hard to achieve more than 4X parallelism
           | (though I think Ice Lake got 6X at some additional cost) in
           | decode, making instruction level parallelism harder. X86's
           | hack to get around this is SMT/hyperthreading to keep the
           | core fed with 2X instruction streams, but that adds a lot
           | more complexity and is a security minefield.
           | 
           | Last but not least: ARM's looser default memory model allows
           | for more read/write reordering and a simpler cache.
           | 
           | ARM has a distinct simplicity and low-overhead advantage over
           | X86/X64.
        
             | NortySpock wrote:
             | > x86 instruction decoder may be only about ~5% of the die
             | 
             | What percent of the die is an ARM instruction decoder?
        
               | duskwuff wrote:
               | Much less. x86 instruction decoding is complicated by the
               | fact that instructions are variable-width and are byte-
               | aligned (i.e. any instruction can begin at any address).
               | This makes decoding more than one instruction per clock
               | cycle complicated -- I believe the silicon has to try
               | decoding instructions at every possible offset within the
               | decode buffer, then mask out the instructions which are
               | actually inside another instruction.
               | 
               | ARM A32/A64 instruction decoding is dramatically simpler
               | -- all instructions are 32 bits wide and word-aligned, so
               | decoding them in parallel is trivial. T32 ("Thumb") is a
               | bit more complex, but still easier than x86.
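                | 
                | A sketch of the difference (length_of here is a
                | hypothetical one-instruction length decoder, not a
                | real API):
                | 
                |     #include <stddef.h>
                |     #include <stdint.h>
                |     #include <string.h>
                | 
                |     /* A64: instruction k always starts at 4*k, so
                |        N decoders can work fully in parallel. */
                |     uint32_t a64_fetch(const uint8_t *b, size_t k) {
                |         uint32_t insn;
                |         memcpy(&insn, b + 4 * k, sizeof insn);
                |         return insn;
                |     }
                | 
                |     /* x86: where instruction k starts depends on
                |        the length of every instruction before it --
                |        a serial chain, unless the hardware tries
                |        every byte offset speculatively. */
                |     typedef size_t lenfn(const uint8_t *);
                |     size_t x86_start(const uint8_t *b, size_t k,
                |                      lenfn *length_of) {
                |         size_t off = 0;
                |         while (k--)
                |             off += length_of(b + off);
                |         return off;
                |     }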
        
               | monocasa wrote:
               | I totally agree with the core of your argument (aarch64
               | decoding is inherently simpler and more power efficient
               | than x86), but I'll throw out there that it's not quite
                | as bad as you say on x86, as there are some nonobvious
               | efficiencies (I've been writing a parallel x86 decoder).
               | 
               | What nearly everyone uses is a 16 byte buffer aligned to
               | the program counter being fed into the first stage
                | decode. This first stage, yes, has to look at each byte
               | offset as if it could be a new instruction, but doesn't
               | have to do full decode. It only finds instruction length
               | information. From there you feed this length information
               | in and do full decode on the byte offsets that represent
               | actual instruction boundaries. That's how you end up with
               | x86 cores with '4 wide decode' despite needing to
               | initially look at each byte.
               | 
               | Now for the efficiencies. Each length decoder for each
               | byte offset isn't symmetric. Only the length decoder at
               | offset 0 in the buffer has to handle everything, and the
               | other length decoders can simply flag "I can't handle
               | this", and the buffer won't be shifted down past where
               | they were on the next cycle and the byte 0 decoder can
               | fix up any goofiness. Because of this, they can
               | 
                | * be stripped of instructions that aren't really used
                | much anymore, if that helps them
               | 
               | * can be stripped of weird cases like handling crazy
               | usages of prefix bytes
               | 
                | * don't have to handle instructions bigger than their
                | portion of the decode buffer. For instance a length
                | decoder starting at byte 12 can't handle more than a 4
                | byte instruction anyway, so that can simplify its logic
                | considerably. That means the simpler length decoders
                | feed into the selection of the full decoders higher up
                | the stack, so some of the overhead cancels out in a
                | nice way.
               | 
               | On top of that, I think that 5% includes pieces like the
               | microcode ROMs. Modern ARM cores almost certainly have
               | (albeit much smaller) microcode ROMs as well to handle
               | the more complex state transitions.
               | 
               | Once again, totally agreed with your main point, but it's
               | closer than what the general public consensus says.
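                | 
                | In code, the stage-2 walk might look something like
                | this (my sketch, not anyone's real decoder; len[i] is
                | the stage-1 length decoder's output at byte offset i,
                | 0 meaning "I can't handle this"):
                | 
                |     #include <stdint.h>
                | 
                |     /* mark up to 4 instruction starts for the full
                |        decoders -- the "4 wide decode" */
                |     int pick_starts(const uint8_t len[16],
                |                     int starts[4]) {
                |         int n = 0, off = 0;
                |         while (off < 16 && n < 4) {
                |             if (len[off] == 0)  /* punt: buffer    */
                |                 break;          /* shifts, byte-0  */
                |                                 /* decoder gets it */
                |             starts[n++] = off;
                |             off += len[off];
                |         }
                |         return n;
                |     }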
        
               | ant6n wrote:
               | I wonder whether a modern byte-sized instruction encoding
                | would sort of look like Unicode, where every byte is
                | self-synchronizing... I guess it can be even weaker than
                | that,
               | probably only every second or fourth byte needs to
               | synchronize.
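                | 
                | For reference, UTF-8's version of the trick: every
                | continuation byte matches 10xxxxxx, so a decoder
                | dropped at any byte offset can resynchronize by
                | scanning forward:
                | 
                |     #include <stdint.h>
                | 
                |     /* skip 10xxxxxx continuation bytes to find the
                |        next code-point start */
                |     const uint8_t *next_start(const uint8_t *p) {
                |         while ((*p & 0xC0) == 0x80)
                |             p++;
                |         return p;
                |     }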
        
             | pbsd wrote:
             | The x86 decoder is not running all the time; the uops cache
             | and the LSD exist precisely to avoid this. With
             | instructions fed from the decoders you can only sustain 4
             | instructions per cycle, while to get to 5 or 6 your
             | instructions need to be coming from either the uops cache
             | or the LSD. In the case of the Zen 3, the cache can deliver
              | 8 uops per cycle to the pipeline (but the overall throughput
             | is limited elsewhere at 6)!
             | 
             | Furthermore, the high-performance ARM designs, starting
             | with the Cortex-A77, started using the same trick---the
             | 6-wide execution happens only when instructions are being
             | fed from the decoded macro-op cache.
        
               | ant6n wrote:
               | How can you run 8 instructions at the same time if you
               | only have 16 general purpose registers? You'd have to
                | either be doing float ops or constantly spilling. So in
               | integer code, how many of those instructions are just
               | moving data between memory and registers (push/pop?).
               | 
               | I'd say ARM has a big advantage for instruction level
               | parallelism with 32 registers.
        
               | mhh__ wrote:
               | Register renaming for a start, and this is about decoding
               | not execution
        
               | ant6n wrote:
               | Okay fair. But the bigger subject is inherent performance
               | advantage of the architecture. You don't just want to
               | decode many instructions per cycle, you also want to
               | issue them. So decoding width and issuing width are
               | related.
               | 
               | And it seems to me that ARM has an advantage here. If you
               | want execute 8 instructions in parallel, you gotta
               | actually have 8 independent things that need to get
               | executed. I guess you could have a giant out of order
               | buffer, and include stack locations in your register
               | renaming scheme, but it seems much easier to find
               | parallelism if a bunch of adjacent instructions are
               | explicitly independent. Which is much easier if you have
               | more registers - the compiler can then help the cpu
               | keeping all those instruction units fed.
        
               | mhh__ wrote:
               | The decoder might not be running strictly all the time,
               | but I would wager that for some applications at least it
               | doesn't make much of a difference. For HPC or DSP or
               | whatever where you spend a lot of time in relatively
               | dense loops the uop cache is probably big enough to ease
               | the strain on the decoder, but for sparser code
               | (Compilers come to mind, lots of function calls and
               | memory bound work) I wouldn't be surprised if it didn't
               | make as much difference.
               | 
               | I have vTune installed so I guess I could investigate
               | this if I dig out the right PMCs
        
               | pbsd wrote:
                | I agree; compiler-type code will miss the cache most of
                | the time. A simple test with clang++ compiling some
                | nontrivial piece of C++:
                | 
                |     0              lsd_uops
                |     1,092,318,746  idq_dsb_uops    ( +- 0.49% )
                |     4,045,959,682  idq_mite_uops   ( +- 0.06% )
                | 
                | The LSD is disabled in this chip (Skylake) due to
                | errata, but we can see only 1/5th of the uops come from
                | the uops cache. However, the more relevant experiment
                | in terms of power is how many cycles the cache is
                | active instead of the decoders:
                | 
                |     0              lsd_cycles_active
                |     378,993,057    idq_dsb_cycles  ( +- 0.18% )
                |     1,616,999,501  idq_mite_cycles ( +- 0.07% )
                | 
                | The ratio is similar: the regular decoders are idle
                | only around 1/5th of the time.
                | 
                | In comparison, gzipping a 20M file looks a lot better:
                | 
                |     0              lsd_cycles_active
                |     2,900,847,992  idq_dsb_cycles  ( +- 0.07% )
                |     407,705,985    idq_mite_cycles ( +- 0.33% )
        
             | mhh__ wrote:
             | This is why I said it's ARM's for the taking.
             | 
              | I'm not familiar with how ARM's memory model affects the
             | cache design - Source?
        
           | jayd16 wrote:
            | Another reason is the something like 150% memory
            | bandwidth, and I'm sure there are other simple wins along
            | those lines.
            | 
            | The M1 isn't necessarily a win for Arm in general. Other
            | manufacturers weren't competing before, and it's yet to be
            | seen if they will.
        
             | mhh__ wrote:
              | It's the memory, stupid!
        
               | to11mtm wrote:
               | Specifically, the memory -latency-.
               | 
               | By going on-package there's almost certainly latency
               | advantages in addition to the much-vaunted bandwidth
               | gains.
               | 
               | That's going to pan out to better perf, and likely better
               | power usage as well.
        
             | NathanielK wrote:
             | 150% compared to what?
        
               | jayd16 wrote:
                | The latest i9 and the latest Ryzen 9, i.e. the competition.
        
               | NathanielK wrote:
                | Intel Tiger Lake and AMD Renoir both support 128-bit
                | LPDDR4X at 4266 MHz. Maybe you're confusing the desktop
               | chips that use conventional DDR4? The M1 isn't
               | competitive with them.
        
               | jayd16 wrote:
               | Oh those are pretty new and I haven't seen any benchmarks
               | with LPDDR in an equivalent laptop chip. Do you have a
               | link to any?
        
           | ravi-delia wrote:
           | I've seen things like this a lot, and it's a bit confusing.
           | If parts of the M1's performance are due to throwing compute
           | at the problem, why hasn't Intel been doing that for years?
           | What about ARM, or the M1, allowed this to happen?
        
             | NathanielK wrote:
             | Intel has. Many M1 design choices are fairly typical for
             | desktop x86 chips, but unheard of with ARM.
             | 
             | For example, the M1 has 128 bit wide memory. This has been
             | standard for decades on the desktop(dual channel), but
             | unheard of in cellphones. The M1 also has similar amounts
             | of cache to the new AMD and Intel chips, but thats several
             | times more than the latest snapdragon. Qualcomm also
             | doesn't just design for the latest node. Most of their
             | volume is on cheaper, less dense nodes.
        
             | dpatterbee wrote:
             | Buying the majority of TSMC's 5nm process output helped.
             | It's a combination of good engineering, the most advanced
            | process, and Intel shitting themselves, I would say.
        
           | tambourine_man wrote:
           | >...is pure and simple that it has a pretty enormous
           | transistor budget
           | 
           | There's a lot of brute force, yes, but it's not the only
           | reason. There are lots of smart design decisions as well.
        
             | amelius wrote:
             | Yes, but those decisions optimize for the single user
             | laptop case, not for e.g. servers.
        
             | mhh__ wrote:
             | "One of the reasons" I did say.
        
               | tambourine_man wrote:
               | True, I misread it.
        
           | phendrenad2 wrote:
           | It really comes down to how well they can emulate X86. People
           | aren't going to give up access to 3 decades of Windows
           | software.
        
             | pjerem wrote:
              | I'm sure ARM has already overtaken x86 if you use a
              | wider definition of personal computers. And a lot of
              | people have already given up access to 3 decades of
              | Windows software by using their phone or tablet as their
              | main device.
              | 
              | Plus, most of the last decade's software runs on some
              | sort of VM or another (be it the JVM, the CLR, a
              | Javascript engine or even LLVM).
             | 
             | Soon (in years), x86 will only be needed by professionals
             | that are tied to really old software. And those particular
             | needs will probably be satisfied by decent emulation.
        
               | kllrnohj wrote:
               | > Soon (in years), x86 will only be needed by
               | professionals that are tied to really old software.
               | 
               | There's also that PC & console gaming markets, which are
               | not small and have not made any movements of any kind
               | towards ARM so far.
        
         | bitwize wrote:
         | > So is ARM the future at this point?
         | 
         | The near future. A few years out, RISC-V is gonna change
         | everything.
        
         | CalChris wrote:
          | Apple isn't entering the cloud market. Moreover, the M1
          | isn't a cloud CPU. The M1 SoC emphasizes low latency and
          | performance per watt over throughput.
        
       | 1MachineElf wrote:
        | I wonder what percentage of its supported toolchain components
       | will be proprietary.
        
       | CalChris wrote:
       | _Grace, in contrast, is a much safer project for NVIDIA; they're
       | merely licensing Arm cores rather than building their own ..._
       | 
       | NVIDIA is buying ARM.
        
         | klelatti wrote:
         | Trying to buy Arm.
         | 
         | Multiple competition investigations permitting.
        
       ___________________________________________________________________
       (page generated 2021-04-12 23:00 UTC)