[HN Gopher] Graviton 3: First Impressions
___________________________________________________________________
 Graviton 3: First Impressions
 Author : ingve
 Score  : 136 points
 Date   : 2022-05-29 13:30 UTC (9 hours ago)
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
  | jadbox wrote:
  | I'd love to see benchmarks for webservers like Node or Py.
  | wmf wrote:
  | Related Graviton 3 benchmarks: https://www.phoronix.com/scan.php?page=article&item=graviton...
  | bullen wrote:
  | What about memory contention when many cores try to write/read the same memory?
  | There is no point in adding more cores if they can't cooperate.
  | How come I'm the only one pointing this out?
  | I think 4 cores will max out the memory contention, so keep on piling these 128-core heaters on. But they will not outlive a simple Raspberry Pi 4!?
  | electricshampo1 wrote:
  | The whole chip in general will be used in aggregate by independent VMs/containers etc. that do NOT read and write to the same memory. Some kernel data structures within a given VM are still shared, ditto for within a single process, but good design minimizes that (per-CPU/thread data structures, sharded locks, etc.).
  | MaxBarraclough wrote:
  | I don't think they were referring to contention across VM boundaries.
  | gpderetta wrote:
  | Reading the same memory is usually ok.
  | Writing is not, but respecting the single-writer principle is usually rule zero of parallel programming optimisation.
  | If you mean reading/writing to the same memory bus in general, then yes, the bus needs to be sized according to the needs of the expected loads (i.e. the machine needs to be balanced).
  | Sirened wrote:
  | It's likely that it's going to need a post of its own since it's an extremely complicated topic. Someone else wrote an awesome post about this for the Neoverse N2 chips [1] and they found that with LSE atomics, the N2 performs as well as or better than Icelake. Given Graviton 3 has a much wider fabric, I would assume this lead only improves.
  | [1] https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
  | bullen wrote:
  | Ah, yes I remember this post, but it reads pretty cryptic to me. I would like to know what the slowdowns actually become in practice: does it add latency to the execution of other threads, and how will the machine as a whole behave?
  | I know M4 had much better multicore shared-memory perf. than M3, but now both of those are old and I don't have users to test anything now.
  | WhitneyLand wrote:
  | How much can SVE instructions help with machine learning?
  | I've wondered why Apple Silicon made the trade-off decision to not include SVE support yet, given that support for lower-precision FP vectorization seems like it could have made their NVidia perf gap smaller.
  | tomrod wrote:
  | Very interesting! I'm not terribly well versed in ARM vs x86, so it's helpful to see these kinds of benchmarks and reports.
  | One bit of feedback for the author: the sliding scale is helpful, but the y axes are different between the visualizations, so you cannot see the apples-to-apples comparison needed. Suggest regenerating those.
  | rwmj wrote:
  | _> GCC will flat out refuse to emit SVE instructions (at least in our limited experience), even if you use assembly,_
  | This seems ... wrong? I haven't tried it but according to the link below SVE2 intrinsics are supported in GCC 10 (and Clang 9):
  | https://community.arm.com/arm-community-blogs/b/tools-softwa...
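For reference, this is roughly what using the SVE ACLE intrinsics looks like: a minimal, vector-length-agnostic "scale and accumulate" loop. The function name, file name and build flags are illustrative only; it assumes a GCC 10+ or Clang 9+ toolchain targeting SVE.

    /* saxpy_sve.c -- dst[i] += a * src[i], written with SVE ACLE intrinsics.
       Illustrative build: gcc -O2 -march=armv8-a+sve -c saxpy_sve.c */
    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    void saxpy(float *dst, const float *src, float a, size_t n)
    {
        for (size_t i = 0; i < n; i += svcntw()) {                  /* svcntw() = floats per vector */
            svbool_t pg = svwhilelt_b32_u64(i, n);                  /* predicate covers the tail */
            svfloat32_t x = svld1_f32(pg, src + i);
            svfloat32_t y = svld1_f32(pg, dst + i);
            y = svmla_n_f32_x(pg, y, x, a);                         /* y + x * a, per lane */
            svst1_f32(pg, dst + i, y);
        }
    }

As the reply below notes, newer GCC can also auto-vectorize ordinary loops to SVE given the right -march flag, but only if the installed compiler is recent enough.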
  | adrian_b wrote:
  | Yes, gcc 10.1 has introduced support for the SVE2 intrinsics (ACLE).
  | Moreover, starting with the 8.1 version, gcc began to use SVE in certain cases when it succeeded in auto-vectorizing loops (if the correct -march option had been used).
  | Nevertheless, many Linux distributions are still shipped with older gcc versions, so SVE/SVE2 does not work with the available compiler or cross-compiler. You must upgrade gcc to 10.1 or a newer version.
  | ykevinator2 wrote:
  | No burstable Graviton 3s yet :-(
  | DeathArrow wrote:
  | Such a shame we can't play with a socketed CPU like this and a motherboard with EFI support as a workstation.
  | jazzythom wrote:
  | I hate reading about all the new chips I can't afford. If only there were a standardized universal open-source motherboard and some type of subscription model where I would always have the best chip at the latest fab mailed straight to me on release. I mean I only just got my hands on a 32-core Epyc. Linus Torvalds has had a Threadripper 3970x for years and I still can't afford it and I'm still jealous, although to be fair my C skills hit their limit when I tried to write pong. I don't like the idea of building a new computer around a chip. It's messy and stupid. These systems can be made modular if the motherboards packed unnecessary bandwidth into the interconnect/planar.
  | Erlangen wrote:
  | I don't understand these graphs titled "Branch Predictor Pattern Recognition". What do they mean? Could someone here explain it a bit in detail? Thanks in advance.
  | Hizonner wrote:
  | It feels like we've gone badly wrong somewhere when processors have to spend so many of their resources guessing about the program. I am not saying I have a solution, just that feeling.
  | staticassertion wrote:
  | IDK, that seems like how brains work, and brains are pretty cool. They guess all the time in order to save time.
  | Cthulhu_ wrote:
  | It always did feel like a weird hack to me, to avoid parts of the CPU being idle. I mean the performance benefits are there, but it's at the cost of power usage in the end.
  | Can branch prediction be turned off at a compiler or application level? If you're optimizing for energy use, that is. Disclaimer: I don't actually know if disabling branch prediction is more energy efficient.
  | imtringued wrote:
  | Turning off branch prediction sounds like a weird hack that serves no purpose; just underclock and undervolt your CPU if you care about power consumption that much.
  | Veedrac wrote:
  | Disabling branch prediction would have such a catastrophic effect on performance that there is no way it would pay for itself. Actually this is true for most parts of a CPU; Apple's little cores are extremely power efficient and yet they are fairly impressive out-of-order designs. It would take a very meaningful redesign of how a CPU works to beat a processor like that, at least at useful speeds.
  | [deleted]
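To make the "Branch Predictor Pattern Recognition" question above a bit more concrete: microbenchmarks of this kind typically time a single branch whose outcomes follow a fixed, repeating pattern, and grow the pattern length until the predictor can no longer track it. The sketch below is only an illustration of that idea, not the article's actual methodology; note also that at high optimization levels the compiler may turn the branch into a branch-free conditional select, so check the generated code if the timings look flat.

    /* branch_pattern.c -- time one data-dependent branch whose outcome
       repeats with a fixed period.  Short periods are learned by the
       branch predictor; once the period exceeds its history capacity,
       mispredictions (and ns/iteration) jump sharply. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ITERS 50000000UL

    static double time_pattern(const unsigned char *pattern, int period)
    {
        volatile long acc = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < ITERS; i++) {
            if (pattern[i % period])      /* the branch under test */
                acc += 1;
            else
                acc -= 3;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
    }

    int main(void)
    {
        for (int period = 4; period <= 8192; period *= 2) {
            unsigned char *pattern = malloc(period);
            if (!pattern)
                return 1;
            for (int i = 0; i < period; i++)
                pattern[i] = rand() & 1;  /* random but fixed, repeating outcomes */
            printf("period %5d: %6.2f ns/iteration\n", period,
                   time_pattern(pattern, period));
            free(pattern);
        }
        return 0;
    }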
  | tyingq wrote:
  | There's the Mill CPU, which sounds terrific on paper. Hard to gauge when it might turn into something commercially usable though.
  | 0xCMP wrote:
  | Mill would definitely make things more interesting. They were supposed to have their simulator online a while ago, but it sounds like they needed to redo the work on the compiler (from what I understood). Once that comes out, it sounds like the next step is getting the simulator online for people to play with.
  | cesaref wrote:
  | I thought this was the reasoning behind Itanium: the idea that scheduling could be worked out in advance by the compiler (probably profile-guided from tests or something like that), which would reduce the latency and silicon cost of implementations.
  | However, it wasn't exactly a raging success; I think the predicted amazing compiler tech never materialised. But maybe it is the right answer and the implementation was wrong? I'm no CPU expert...
  | Hizonner wrote:
  | I'm not sure what happened with Itanium.
  | I do think a big part of the problem is that people want to distribute binaries that will run on a lot of CPUs that are physically really different inside. But nowadays there's JIT compilation even for JavaScript, so you could distribute something like LLVM, or even (ecch) JavaScript itself, and have the "compiler scheduling" happen at installation time or even at program start.
  | imtringued wrote:
  | You can't distribute LLVM for that purpose without defining a stable format like WebAssembly or SPIR-V.
  | Veedrac wrote:
  | Itanium was a really badly designed architecture, which a lot of people skip over when they try to draw analogies to it. It was a worst of three worlds, in that it was big and hot like an out-of-order, it had the serial dependency issues of an in-order, and it had all the complexity of fancy static scheduling without that fancy scheduling actually working.
  | There have been a small number of attempts since Itanium, like NVIDIA's Denver, which make for much better baselines. I don't think those are anywhere close to optimal designs, or really that they tried hard enough to solve in-order issues at all, but they at least seem sane.
  | speed_spread wrote:
  | Would Itanium have been better served with bytecode and a modern JIT? Also, doesn't RISC-V kinda get back on that VLIW track with macro-op fusion, using a very basic instruction set and letting the compiler figure out the best way to order stuff to help the target CPU make sense of it?
  | nine_k wrote:
  | I heard that the desire to make x86 emulation performant on Itanium made things really bad, compared to a "clean" VLIW architecture.
  | canarypilot wrote:
  | Why would you consider prediction based on dynamic conditions to be the sign of a dystopian optimization cycle? Isn't it mostly intuitive that interesting program executions are not going to be things you can determine statically (otherwise your compiler would have cleaned them up for you with inlining etc.), or could be determined statically but at too great a cost to meet execution deadlines (JITs and so on), or resource constraints (you don't really want N code clones specialising each branch backtrace to create strictly predictable chains)?
  | Or is the worry on the other side: that processors have gotten so out-of-order that only huge dedication to guesswork can keep the beast sated? I don't see this as a million miles from software techniques in JIT compilers to optimistically optimize and later de-optimize when an assumption proves wrong.
  | I think you might be right to be nervous if you wrote programs that took fairly regular data and did fairly regular things to it. But, as Itanium learned the hard way, programs have much more dynamic, emergent and interesting behaviour than that!
  | [deleted]
  | amelius wrote:
  | I guess the fear is that the CPU might start guessing wrong, causing your program to miss deadlines.
  | Also, the heuristics are practically useless for realtime computing, where timings must be guaranteed.
  | nine_k wrote:
  | I suppose that if you assume in-order execution and count the clock cycles, you should get a guaranteed lower bound of performance. It may be, say, 30-40% of the performance you really observe, but having some headroom should feel good.
  | rwmj wrote:
  | Uli Drepper has this tool which you can use to annotate source code with explanations of which optimisations are applied. In this case it would rely on GCC recognizing branches which are hard to predict (e.g. a branch in an inner loop which is data-dependent), and I'm not sure GCC is able to do that.
  | https://github.com/drepper/optmark
  | bastawhiz wrote:
  | Isn't that the whole promise of general-purpose computing? That you don't need to find specialized hardware for most workloads? Nobody wants to be out shopping for CPUs that have features that align particularly well with their use case, then switching to different CPUs when they need to release an update or some customer comes along with data that runs less efficiently with the algorithms as written.
  | Since processors are expensive and hard to change, they do tricks to allow themselves to be used more efficiently in common cases. That seems like a reasonable behavior to me.
  | adrian_b wrote:
  | A majority of the non-deterministic and speculative hardware mechanisms that exist in a modern CPU are required due to the consequences of one single hardware design decision: to use a data cache memory.
  | The data cache memory is one of the solutions to avoid the extremely long latency of loading data from a DRAM memory.
  | The alternative to a data cache memory is to have a hierarchy of memories with different speeds, which are addressed explicitly.
  | The latter variant is sometimes chosen for embedded computers where determinism is more important than programmer convenience. However, for general-purpose computers this variant could be acceptable only if the hierarchy of memories were managed automatically by a high-level language compiler.
  | It appears that writing a compiler that could handle the allocation of data into a heterogeneous set of memories and the transfers between them is a more difficult task than designing a CPU that becomes an order of magnitude more complex due to having a hierarchy of data cache memories and a long list of other hardware mechanisms that must be added due to the existence of the data cache memory.
  | Once it is decided that the CPU must have a data cache memory, a lot of other hardware design decisions follow from it.
  | Because there is an inverse relationship between the load latency and the data cache memory size, the cache memory must be split into a multi-level hierarchy of cache memories.
  | To reduce the number of cache misses, data cache prefetchers must be added, to speculatively fill the cache lines in advance of load requests.
  | Now, when a data cache exists, most loads have a small latency, but from time to time there is still a cache miss, when the latency is huge, long enough to execute hundreds of instructions.
  | There are 2 solutions to the problem of finding instructions to be executed during cache misses, instead of stalling the CPU: simultaneous multi-threading and out-of-order execution.
  | For explicitly addressed heterogeneous memories, neither of these 2 hardware mechanisms is needed, because independent instructions can be scheduled statically to overlap the memory transfers. With a data cache, this is not possible, because it cannot be predicted statically when cache misses will occur (mainly due to the activity of other execution threads, but even an if-then-else can prevent the static prediction of the cache state, unless additional load instructions are inserted by the compiler, to ensure that the cache state does not depend on the selected branch of the conditional statement; this does not work for external library functions or other execution threads).
  | With a data cache memory, one or both of SMT and OoOE must be implemented. If out-of-order execution is implemented, then the number of registers needed to avoid false dependencies between instructions becomes larger than it is convenient to encode in the instructions, so register renaming must also be implemented.
  | And so on.
  | In conclusion, to avoid the huge amount of resources needed by a CPU for guessing about the programs, the solution would be a high-level language compiler able to transparently allocate the data into a hierarchy of heterogeneous memories and schedule transfers between them when needed, like the compilers do now for register allocation, loading and storing.
  | Unfortunately, nobody has succeeded in demonstrating a good compiler of this kind.
  | Moreover, the existing compilers frequently have difficulties in discovering the optimal allocation and transfer schedule for registers, which is a simpler problem.
  | Doing efficiently the same for a hierarchy of heterogeneous memories seems out of reach for the current compilers.
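The statically scheduled overlap described above is what DSP-style code over an explicitly managed scratchpad tends to look like. A hedged sketch of the pattern: dma_start()/dma_wait() are hypothetical stand-ins for a platform's DMA engine, implemented here with memcpy so the example compiles and runs; the n % TILE remainder is ignored for brevity.

    /* scratchpad.c -- double-buffered reduction over an explicitly managed
       "fast" memory.  While tile t is being processed, the transfer of tile
       t+1 is (conceptually) already in flight, so no cache or out-of-order
       machinery is needed to hide the memory latency. */
    #include <stddef.h>
    #include <string.h>

    #define TILE 1024
    static float scratch[2][TILE];   /* stand-in for on-chip scratchpad RAM */

    /* Hypothetical DMA interface; a real DSP would kick off an asynchronous
       transfer in dma_start() and wait for its completion in dma_wait(). */
    static void dma_start(float *dst, const float *src, size_t n) {
        memcpy(dst, src, n * sizeof *dst);
    }
    static void dma_wait(void) { /* transfer already done in this stand-in */ }

    float sum_tiles(const float *data, size_t n)
    {
        size_t tiles = n / TILE;
        float sum = 0.0f;
        int cur = 0;

        if (tiles == 0)
            return 0.0f;
        dma_start(scratch[cur], data, TILE);               /* fetch tile 0 */
        for (size_t t = 0; t < tiles; t++) {
            dma_wait();                                    /* tile t is resident */
            int nxt = cur ^ 1;
            if (t + 1 < tiles)                             /* overlap: fetch t+1 ... */
                dma_start(scratch[nxt], data + (t + 1) * TILE, TILE);
            for (size_t i = 0; i < TILE; i++)              /* ... while computing on t */
                sum += scratch[cur][i];
            cur = nxt;
        }
        return sum;
    }

This is essentially the model the next reply refers to for DSPs and embedded parts.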
  | solarexplorer wrote:
  | We do have these architectures already in the embedded space and as DSPs. I suppose they would be interesting for supercomputers as well. But for general-purpose CPUs they would be a difficult sell. Since the memory size and latency would be part of the ISA, binaries could not run unchanged on different memory configurations; you would need another software layer to take care of that. Context switching and memory mapping would also need some rethinking. Of course, all of this can be solved, but it would make adoption more difficult.
  | And last but not least, unknown memory latency is not the only source of problems; branch (mis)predictions are another. And they have the same remedies as cache misses: multithreading and speculative execution.
  | So if you wanted to get rid of branch prediction as well, you could come up with something like the CRAY-1.
  | adrian_b wrote:
  | You are right that a kind of multi-threading can be useful to mitigate the effects of branch mispredictions.
  | However, for this, fine-grained multi-threading is enough. Simultaneous multi-threading does not bring any advantage, because the thread with the mispredicted branch cannot progress.
  | Out-of-order execution cannot be used during branch mispredictions, so, as I have said, both SMT and OoOE are techniques useful only when a data cache memory exists.
  | Any CPU with pipelined instruction execution needs a branch predictor and it needs to speculatively execute the instructions on the predicted path, in order to avoid the pipeline stalls caused by control dependencies between instructions. An instruction cache memory is also always needed for a CPU with pipelined instruction execution, to ensure that the instruction fetch rate is high enough.
  | Unlike simultaneous multi-threading, fine-grained multi-threading is useful in a CPU without a data cache memory, not only because it can hide the latencies of branch mispredictions, but also because it can hide the latencies of any long operations, as is done in all GPUs.
  | Fine-grained multi-threading is significantly simpler to implement than simultaneous multi-threading.
  | mhh__ wrote:
  | People have tried over and over again to "fix" this and it hasn't worked.
  | The interesting probabilities are all decided at runtime.
  | Now that we have AI workloads there is a place for a big lump of dumb compute again, but not in general-purpose code.
  | Dunedan wrote:
  | > The final result is a chip that lets AWS sell each Graviton 3 core at a lower price, while still delivering a significant performance boost over their previous Graviton 2 chip.
  | That's not correct. AWS sells Graviton 3-based EC2 instances at a higher price than Graviton 2-based instances!
  | For example a c6g.large instance (powered by Graviton 2) costs $0.068/hour in us-east-1, while a c7g.large instance (powered by Graviton 3) costs $0.0725/hour [1]. Both instances have the same core count and memory, although c7g instances have slightly better network throughput.
  | I believe that is pretty unusual as, if my memory serves me right, newer instance family generations are usually cheaper than the previous generation.
  | [1]: https://aws.amazon.com/ec2/pricing/on-demand/
  | adrian_b wrote:
  | Based on the first published benchmarks, even the programs which have not been optimized for Neoverse V1, and which do not benefit from its much faster floating-point and large-integer computation abilities, still show a performance increase of at least 40%, so greater than the price increase.
  | So I believe that using Graviton 3 at these prices is still a much better deal than using Graviton 2.
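For scale, a back-of-the-envelope using the on-demand figures quoted above: $0.0725 / $0.068 ≈ 1.066, i.e. roughly a 6.6% higher hourly price. If the "at least 40%" performance figure holds for a given workload, performance per dollar improves by about 1.40 / 1.066 ≈ 1.31, on the order of 30%, which is the trade-off being pointed at here. (Rough arithmetic only; actual gains depend on the workload.)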
  | myroon5 wrote:
  | Definitely unusual, as the graph here shows: https://github.com/patmyron/cloud/
  | WASDx wrote:
  | Could it be due to increasing global energy prices?
  | usefulcat wrote:
  | I don't follow. You seem to be implying that Amazon would like to reduce their electricity usage. If so, shouldn't they be charging _less_ for the more efficient instance type?
  | nine_k wrote:
  | No, they charge for compute, which the new CPU provides more of, even though it consumes the same amount of electricity as a unit.
  | jeffbee wrote:
  | It would be irrational to expect a durable lower price on Graviton. Amazon will price it lower initially to entice customers to port their apps, but after they get a critical mass of demand the price will rise to a steady state where it costs the same as Intel. The only difference will be which guy is taking your money.
  | zrail wrote:
  | Do you have a cite on Amazon raising prices like that at any other point in their history?
  | greatpostman wrote:
  | I don't think Amazon has ever raised their prices. This comment is based on nothing.
  | losteric wrote:
  | Prime has gone up quite a bit.
  | Nearly every business seeks to maximize profit. Right now AWS is in a growth phase - why wouldn't they raise rates in the future?
  | orf wrote:
  | I mean, they just raised their Graviton prices between generations.
  | I don't think the point was that they would increase the cost of existing instance types, only that over time and generations the price will trend upwards as more workloads shift over.
  | staticassertion wrote:
  | I wouldn't call that "raising prices"... you can still use Graviton 2 if it's a better price for you.
  | jhugo wrote:
  | I dunno, this take is a bit weird to me. The work we did to support Graviton wasn't "moving" from Intel to ARM, it was making our build pipeline arch-agnostic. If Intel one day works out cheaper again we'll use it again with basically zero friction.
  | ykevinator2 wrote:
  | Same
  | dilyevsky wrote:
  | Considering the blank stares that I get when mentioning ARM as a potential cost-saving measure, it will take years and maybe decades before that happens, by which point you're def getting your money's worth as an early adopter.
  | spookthesunset wrote:
  | When was the last time Amazon raised cloud prices?
  | jeffbee wrote:
  | Literally 6 days ago when they introduced this thing.
  | dragonwriter wrote:
  | > Literally 6 days ago when they introduced this thing.
  | Offering a new option is not a price increase. You can still do all the same things at the same prices, plus if the new thing is more efficient for your particular task you have an additional option.
  | jeffbee wrote:
  | When they introduced c6i they did it at the same price as c5, even though the c6i is a lot more efficient. They're raising the price on c7g vs. c6g to bring it closer to the pricing of c6i, which is pretty much exactly what I suggested?
  | deanCommie wrote:
  | You're being highly obtuse.
  | Universally everyone understands "raising prices" to be "raising prices without any customer action".
  | As in you consider your options, take into consideration pricing, design your architecture, you deploy it, and you get a bill. Then suddenly, later, without any action of your own, your bill goes up.
  | THAT is raising prices, and it is something AWS has essentially never done.
  | What you're describing is a situation where a customer CHOOSES to upgrade to a new generation of instances, and in doing so gets a larger bill. That is nowhere near the same thing.
  | arpinum wrote:
  | Graviton 2 (c6g) also cost more than the Graviton 1 (a1) instances.
  | mastax wrote:
  | Given the surrounding context I read that sentence to mean that focusing on compute density allowed them to sell each core at a lower price vs focusing on performance, not that Graviton 3 is cheaper than Graviton 2.
  | invalidname wrote:
  | While the article is interesting I would be more interested in details about carbon footprint and cost reduction. Also, how would this impact more typical Node, Java loads?
  | Hizonner wrote:
  | You know, if you wanted to improve carbon footprint, a better place to look might be at software bloat. The sheer number of times things get encoded and decoded to text is mind-boggling. Especially in "typical Node, Java loads".
  | tyingq wrote:
  | Logging and cybersecurity are bloaty areas as well. I've seen plenty of AWS cost breakdowns where the cybersec functions were the highest percentage of spend. Or desktops where Carbon Black or Windows Defender were using most of the CPU or IO cycles. And networks where syslog traffic was the biggest percentage of traffic.
  | Dunedan wrote:
  | As AWS doesn't price services based on carbon footprint, you can't infer the carbon footprint from the cost.
  | I agree, however, that certain AWS services are disproportionately expensive.
  | maxerickson wrote:
  | Presumably the price provides some sort of bounds.
  | (Unless they are doing something like putting profits towards some sort of carbon maximization scheme)
  | tyingq wrote:
  | Well, and a fair amount of cybersec-oriented services are a pattern of _"sniff and copy every bit of data and do something with it"_ or _"trawl all state"_. Which is inherently heavy.
  | orangepurple wrote:
  | Norton, Symantec, and McAfee contribute greatly to global warming in the financial services sector. At least half of CPU cycles on employee laptops are devoted to them.
  | Cthulhu_ wrote:
  | But do they actually work? For years I've been of the opinion that most anti-virus solutions don't actually stop viruses; instead they give you a false sense of security and their messaging is intentionally alarmist to make individuals and organizations pay their subscription fees.
  | In my limited and sheltered experience, the only viruses I've gotten in the past decade or so were from dodgy pirated stuff or big "download" button ads on download sites.
  | MrBuddyCasino wrote:
  | At best they don't work; in reality they are an attack vector themselves and a performance nightmare. They should (mostly) not exist.
  | MaxBarraclough wrote:
  | Presumably then they're knocking hours off the laptops' battery lives?
  | jeffbee wrote:
  | Virtually 100% of cloud operating expenses are electricity, so you can pretty much assume that if it costs less it has a lower carbon footprint.
  | _joel wrote:
  | + Rent, support staff, development costs, regulation and compliance, network, maintenance (cooling, fire suppression + lots more), marketing.
  | Speaking as someone who did sys admin for a small independent cloud provider, it definitely isn't virtually 100% of operating costs.
  | jeffbee wrote:
  | No offense intended to your personal experience, but I don't think "small independent cloud" is terribly important in the global analysis. This paper concludes that TDP and TCO have become the same thing, i.e. power is heat, power is opex.
  | https://www.gwern.net/docs/ai/scaling/hardware/2021-jouppi.p...
  | shepherdjerred wrote:
  | AWS is pushing to move its internal services (most of which are in Java) to Graviton, so I would expect it to be excellent for "normal" workloads/languages.
___________________________________________________________________
(page generated 2022-05-29 23:00 UTC)