[HN Gopher] Esperanto Champions the Efficiency of Its 1,092-Core...
___________________________________________________________________

Esperanto Champions the Efficiency of Its 1,092-Core RISC-V Chip

Author : rbanffy
Score  : 91 points
Date   : 2021-08-29 18:12 UTC (4 hours ago)

(HTM) web link (www.hpcwire.com)
(TXT) w3m dump (www.hpcwire.com)

| [deleted]

| klelatti wrote:
| Mentions that each ET-Minion core has a vector / tensor unit.
| From [1]
|
| > The ET-Minion core, based on the open RISC-V ISA, adds
| proprietary extensions optimized for machine learning. This
| general-purpose 64-bit microprocessor executes instructions in
| order, for maximum efficiency, while extensions support vector
| and tensor operations on up to 256 bits of floating-point data
| (using 16-bit or 32-bit operands) or 512 bits of integer data
| (using 8-bit operands) per clock period.
|
| So sounds like at least 8736 SP FP operations per cycle.
|
| [1] https://www.esperanto.ai/technology/
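
  A quick back-of-the-envelope check of klelatti's figure (a sketch
  only: it takes the headline 1,092-core count at face value, though
  a comment further down puts the ET-Minion count at 1,088, and it
  ignores any doubling from fused multiply-add):

      # arithmetic behind "at least 8736 SP FP operations per cycle"
      cores = 1092                 # headline core count
      vector_bits = 256            # FP data per core per clock, per [1]
      lanes = vector_bits // 32    # = 8 single-precision (32-bit) lanes
      print(cores * lanes)         # -> 8736 FP operations per cycle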

| langarto wrote:
| More information (HotChips 33 presentation):
| https://www.esperanto.ai/wp-content/uploads/2021/08/HC2021.E...

| turminal wrote:
| I'm not any kind of expert in the field, but trading single-core
| speed for more cores surely has its downsides, which aren't
| mentioned in the article at all.

| jmercouris wrote:
| Of course it does; usually systems have slower and faster compute
| units to compensate for the performance penalty of
| non-parallelisable operations.

| cogman10 wrote:
| The biggest one that comes up quickly is memory bandwidth.
|
| The more cores you have, the more memory bandwidth is needed to
| keep them all fed.

| touisteur wrote:
| Well, it really depends on the computational intensity your
| algorithm needs. I've stumbled upon things of beauty porting
| things to GPUs, especially if you're going to perform a huge
| number of operations based on a very small amount of data. As
| long as you don't have too much intermediate data, register
| spilling, etc., these GPU things do fly. They're also very
| impressive on NN-based workloads... Even something 2 or 3 gens
| behind can be game-changing, with some optimization effort.
| Tensor libraries leave a lot of performance on the floor to pick
| up, especially if you're not using the canned 'competition
| winning' networks.

| Mikeb85 wrote:
| Read the article. It's about ML workloads, which scale well
| across many cores. It's also being compared to GPUs. The whole
| point of what they're doing is to be able to pack more cores than
| a CPU but with a larger instruction set than a GPU core.

| zozbot234 wrote:
| I like ML but it's not a very good language for this highly
| parallel HPC'ish stuff. We'll see how Rust does, it should be a
| lot closer to what's actually needed here.

| medo-bear wrote:
| ML as in machine learning

| mirker wrote:
| Yes, but it's meant to do ML inference, which can be parallelized
| decently. On those workloads, you can use GPUs, which are also
| composed of thousands of "wimpy" cores.

| R0b0t1 wrote:
| A full CPU is useful for decision-intensive or time-series-
| intensive data. Normal ML inference is not necessarily either of
| those. You could have more complicated neurons (or just make
| normal compute tiles, which they may be doing).

| goldenkey wrote:
| I thought the same thing back in 2015, considering the way GPUs
| supposedly handle branches with warps. However, my stock trading
| simulator ran way better on 3 GTX Titans than on the Intel
| "Knights Many Cores" Phi preview I had exclusively been able to
| obtain. I was excited because it had something like 100
| Pentium-class cores on it, and was supposed to be much faster
| than a GPU for logical code. Disappointment set in when the GPU
| stomped it performance-wise. I still don't even understand why,
| but I do know now that the whole "GPUs can't handle branching
| performantly" thing is a bit overstated. Intel discontinued their
| Phi, which I can only guess was due to its lack of
| competitiveness.

| sdenton4 wrote:
| A standard way to handle branching in GPU code is with masking,
| like so (where x is a vector, and broadcasting is implied):
|
|     M = x > 0
|     y = M * f(x) + (1-M) * g(x)
|
| So you end up evaluating both sides of the decision branch and
| adding the results. But this is fine if you've got a dumb number
| of cores. And often traditional CPUs wind up evaluating both
| branches anyway.
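
  sdenton4's masking trick in runnable form, as a minimal numpy
  sketch (f and g are stand-in branch bodies, not anything from the
  thread; note that both must be safe to evaluate on every element,
  since masking computes both sides everywhere):

      import numpy as np

      def f(x): return x * x    # stand-in "then" branch
      def g(x): return -x       # stand-in "else" branch

      x = np.array([-4.0, 1.0, 9.0, -16.0])
      m = (x > 0).astype(x.dtype)      # 1.0 where x > 0, else 0.0
      y = m * f(x) + (1.0 - m) * g(x)  # f where x > 0, g elsewhere
      print(y)                         # [ 4.  1. 81. 16.]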

| monocasa wrote:
| > And often traditional CPUs wind up evaluating both branches
| anyway.
|
| That's actually really overstated. Evaluating both sides isn't
| really something CPUs tend to do; instead they predict one path
| and roll back on a mispredict. This is because the out-of-order
| hardware isn't fundamentally a tree, but is generally better
| thought of as a ring buffer where the uncommitted state is what's
| between the head and tail. Storing diverging paths is incredibly
| expensive there. I'm not going to say something as strong as
| "it's never been done", but I certainly don't know of a
| general-purpose CPU arch that'll compute both sides of an
| architectural branch, instead of relying on making good
| predictions down one instruction stream, then rolling back and
| restarting when it's clear you mispredicted.

| jasonwatkinspdx wrote:
| It's not even about the expense of implementing diverging paths
| in hardware.
|
| This concept of exploring like a tree vs. a path was explored
| under the name Disjoint Eager Execution. You know what killed it?
| Branch predictors. In a world where branch predictors are maybe
| only 75% effective, DEE could make sense. We live in a world
| where branch predictors are _far_ better than that. So it just
| isn't worth speculating off the predicted most-likely path.

| monocasa wrote:
| What killed it was more the effectiveness of Tomasulo-style
| out-of-order machinery, and the real problem not being control
| hazards, but data hazards. DEE was thought of in a day when
| memory was about as fast as the processor. That's why it's always
| being compared with cores like the R3000.

| monocasa wrote:
| Sort of. GPU "cores" in the CPU space would be called SIMD lanes.
| Counting apples to apples, GPU cores in CPU terminology would put
| an Nvidia 3060 at 28 cores and a 3090 at 82 cores.

| klyrs wrote:
| Sure, but this is a coprocessor on an expansion card, similar to
| a GPU. I've worked on a few systolic algorithms and this kind of
| chip has massive potential in that space. TPUs have been a big
| letdown in that regard, as they don't even have the comparison
| operation needed for the matrix-based shortest-path algorithm.

| joe_the_user wrote:
| Looking at this, I'm confused by basic questions. Is this a MIMD
| or SIMD architecture chip? [1] What is the memory/caching
| structure here, and would it be fast or slow? Is this to replace
| a GPU, or to replace the CPU you connect to the GPU, or both?
| Would you get code and/or data divergence here? I.e., "many
| cores" seems to imply each has its own instructions, but ML
| usually runs on vector machines like a GPU.
|
| Edit: OK, I can see this has a "network on a chip" architecture,
| but I think that only answers some of my questions.
|
| [1] https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

| jasonwatkinspdx wrote:
| MIMD via 1088 cores per chip; each core has 512-bit short-vector
| SIMD and a 1024-bit tensor unit.
|
| ~100 MB of SRAM on chip, 8 GDDR buses to DRAM off chip.
|
| It's purpose-designed for parallel sparse-matrix ML problems.
| It's more efficient than both a CPU and a GPU at these, as well
| as faster in absolute terms, taking their numbers at face value.

| dragontamer wrote:
| I mean, GPUs are only SIMD for 32 lanes (Nvidia or AMD RDNA) or
| 64 lanes (AMD CDNA).
|
| The rest of those lanes come from MIMD techniques.
|
| -------
|
| CPU cores are more MIMD today than SISD because of out-of-order
| and superscalar operation. So honestly, I think it's about time
| to retire Flynn's taxonomy. Everything is MIMD.

| joe_the_user wrote:
| _I mean, GPUs are only SIMD for 32 lanes (Nvidia or AMD RDNA) or
| 64 lanes (AMD CDNA)._
|
| _The rest of those lanes come from MIMD techniques._
|
| Not sure what you mean here. Are there places where different
| groups of kernels can simultaneously execute different code?

| qwerty456127 wrote:
| With all this RISC-V hype going on, I'm curious how compatible
| RISC-V processors from different vendors actually are.

| zozbot234 wrote:
| The RISC-V folks are in the process of ratifying standards for
| vector processing ("V") and lightweight packed-SIMD/DSP compute
| ("P") that should make it easier to expand compatibility in these
| domains. As of now, these standards are not quite ready and
| Esperanto are still using proprietary extensions for their stuff.

| OneEyedRobot wrote:
| So what in the heck does the CPU interconnect look like?

| sakras wrote:
| This looks similar to a research project I worked on called the
| Hammerblade Manycore. The cores were connected by a network where
| you could read/write to an address by sending a packet which
| contained the address of the word, whether you were reading or
| writing, and, if writing, the new value. The packet would then
| hop along the network one unit per cycle until it reached its
| destination.
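
  A toy model of the kind of memory packet sakras describes. The
  field names and the dimension-ordered (X-then-Y) routing policy are
  illustrative assumptions, not the Hammerblade design; the one-hop-
  per-cycle latency rule is from the comment above:

      from dataclasses import dataclass

      @dataclass
      class Packet:
          dest: tuple       # (x, y) of the tile owning the word
          addr: int         # word address within that tile
          is_write: bool    # read request or write request
          value: int = 0    # payload, used only for writes

      def hops(src, dest):
          """Dimension-ordered (X then Y) route; one hop per cycle."""
          (x, y), path = src, []
          while x != dest[0]:
              x += 1 if dest[0] > x else -1
              path.append((x, y))
          while y != dest[1]:
              y += 1 if dest[1] > y else -1
              path.append((x, y))
          return path

      pkt = Packet(dest=(3, 2), addr=0x40, is_write=True, value=7)
      print(len(hops((0, 0), pkt.dest)))  # 5 cycles: Manhattan distance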

| monocasa wrote:
| Classic "network on chip", for those who want the most searchable
| term.

| vletal wrote:
| - Transistor on chip
|
| - CPU on chip
|
| - System on chip
|
| - Network on chip
|
| I'm looking forward to the next X-on-chip, which I'm not yet
| aware of.

| trissylegs wrote:
| I've heard the term "Cluster on a chip", which I guess would
| apply here too.

| zh3 wrote:
| Looks efficient, at least on the face of it. Certainly seems
| credible (Dave Ditzel), and as a way of lowering the cost /
| improving the efficiency of targeted ad-serving, they could be on
| to a winner.

| russellbeattie wrote:
| Off-topic thought: "Esperanto Technologies" is apparently the
| name of the company, in case you were confused by the headline
| like I was. I was amused to discover their offices are literally
| 3 blocks away from where I live. (So are YCombinator and a
| thousand other tech companies, so not that surprising, really,
| but amusing.)
|
| At this point I think we need to go back to descriptive old-
| school '70s company names like "West Coast Microprocessor
| Solutions", "Digital Logic, Inc.", "Mountain View Artificial
| Intelligence Laboratories", etc.
|
| You know, something that would blend into this map:
| https://s3.amazonaws.com/rumsey5/silicon/11492000.jpg
|
| Edit: Looking at that map, some of the company names are
| fantastically generic! "Electronics Corporation", "California
| Devices", "General Technology", "Test International".

| drmpeg wrote:
| Descriptive would be "Bus Stop Systems". No, drop the "Systems",
| just "Bus Stop".

| Causality1 wrote:
| Given recent trends I'm just glad it isn't named Esperantr
| Technologr.

| iamtedd wrote:
| Esprantly Technify.
___________________________________________________________________
(page generated 2021-08-29 23:00 UTC)