[HN Gopher] Esperanto Champions the Efficiency of Its 1,092-Core...
       ___________________________________________________________________
        
       Esperanto Champions the Efficiency of Its 1,092-Core RISC-V Chip
        
       Author : rbanffy
       Score  : 91 points
       Date   : 2021-08-29 18:12 UTC (4 hours ago)
        
 (HTM) web link (www.hpcwire.com)
 (TXT) w3m dump (www.hpcwire.com)
        
       | [deleted]
        
       | klelatti wrote:
       | Mentions that each ET-Minion core has a vector / tensor unit.
       | From [1]
       | 
       | > The ET-Minion core, based on the open RISC-V ISA, adds
       | proprietary extensions optimized for machine learning. This
       | general-purpose 64-bit microprocessor executes instructions in
       | order, for maximum efficiency, while extensions support vector
       | and tensor operations on up to 256 bits of floating-point data
       | (using 16-bit or 32-bit operands) or 512 bits of integer data
       | (using 8-bit operands) per clock period.
       | 
       | So sounds like at least 8736 SP FP operations per cycle.
       | 
       | [1] https://www.esperanto.ai/technology/
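        | 
        | A quick back-of-the-envelope check of that figure (a sketch
        | taking the headline 1,092-core count at face value, and
        | assuming one full-width 256-bit FP vector op per core per
        | clock):
        | 
        |   lanes_per_core = 256 // 32     # 256-bit FP datapath, 32-bit operands
        |   cores = 1092                   # headline core count
        |   print(cores * lanes_per_core)  # 8736 SP FP ops per cycle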
        
       | langarto wrote:
       | More information (HotChips 33 presentation):
       | https://www.esperanto.ai/wp-content/uploads/2021/08/HC2021.E...
        
       | turminal wrote:
        | I'm not any kind of expert in the field, but trading single-chip
        | speed for more chips surely has its downsides, which aren't
        | mentioned in the article at all.
        
         | jmercouris wrote:
          | Of course it does; usually systems have slower and faster
          | compute units to compensate for the performance penalty of
          | non-parallelisable operations.
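          | 
          | (Amdahl's law is the classic way to quantify that penalty; a
          | minimal sketch with a made-up serial fraction:)
          | 
          |   def speedup(serial_fraction, cores):
          |       # Amdahl's law: the serial fraction caps the speedup.
          |       return 1 / (serial_fraction + (1 - serial_fraction) / cores)
          | 
          |   print(speedup(0.05, 1092))   # ~19.7x, even with 1,092 cores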
        
         | cogman10 wrote:
          | The biggest one that comes up quickly is memory bandwidth.
          | 
          | The more cores you have, the more memory bandwidth you need to
          | keep them all fed.
        
           | touisteur wrote:
            | Well, it really depends on the computational intensity your
            | algorithm needs. I've stumbled upon things of beauty porting
            | code to GPUs, especially when you're going to perform a huge
            | number of operations on a very small amount of data. As long
            | as you don't have too much intermediate data, register
            | spilling, etc., these GPU things do fly. They're also very
            | impressive on NN-based workloads... Even something 2 or 3
            | gens behind can be game-changing, with some optimization
            | effort. Tensor libraries leave a lot of performance on the
            | floor, especially if you're not using the canned
            | 'competition-winning' networks.
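            | 
            | (A minimal roofline-style sketch of that trade-off, with
            | made-up hardware numbers:)
            | 
            |   PEAK_FLOPS, PEAK_BW = 10e12, 500e9  # hypothetical accelerator
            |   RIDGE = PEAK_FLOPS / PEAK_BW        # ~20 FLOPs/byte breakeven
            | 
            |   def bound(flops, bytes_moved):
            |       intensity = flops / bytes_moved  # FLOPs per byte of traffic
            |       return "compute" if intensity >= RIDGE else "bandwidth"
            | 
            |   n = 4096
            |   print(bound(2 * n**3, 3 * n**2 * 4))  # big GEMM: compute
            |   print(bound(n**2, 3 * n**2 * 4))      # elementwise: bandwidth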
        
         | Mikeb85 wrote:
          | Read the article. It's about ML workloads, which scale well
          | across many cores. It's also being compared to GPUs. The whole
          | point of what they're doing is to be able to pack more cores
          | than a CPU, but with a larger instruction set than a GPU core.
        
           | zozbot234 wrote:
           | I like ML but it's not a very good language for this highly
           | parallel HPC'ish stuff. We'll see how Rust does, it should be
           | a lot closer to what's actually needed here.
        
             | medo-bear wrote:
             | ML as in machine learning
        
         | mirker wrote:
         | Yes, but it's meant to do ML inference, which can be
         | parallelized decently. On those workloads, you can use GPUs,
         | which are also composed of thousands of "wimpy" cores.
        
           | R0b0t1 wrote:
            | A full CPU is useful for decision-intensive or time-series-
            | intensive data. Normal ML inference is not necessarily either
            | of those. You could have more complicated neurons (or just
            | make normal compute tiles, which they may be doing).
        
             | goldenkey wrote:
              | I thought the same thing back in 2015, considering the way
              | GPUs supposedly handle branches with warps. However, my
              | stock trading simulator ran way better on 3 GTX Titans
              | than on the Intel "Knights Many Cores" Phi preview I had
              | exclusively been able to obtain. I was excited because it
              | had something like 100 Pentium 4 cores on it and was
              | supposed to be much faster than a GPU for logical code.
              | Disappointment set in when the GPU stomped it performance-
              | wise. I still don't even understand why, but I do know now
              | that the whole "GPUs can't handle branching performantly"
              | line is a bit overstated. Intel discontinued the Phi,
              | which I can only guess was due to its lack of
              | competitiveness.
        
               | sdenton4 wrote:
                | A standard way to handle branching in GPU code is with
                | masking, like so (where x is a vector, and broadcasting
                | is implied):
                | 
                |   M = x > 0
                |   y = M * f(x) + (1 - M) * g(x)
                | 
                | So you end up evaluating both sides of the decision
                | branch and adding the results. But this is fine if
                | you've got a dumb number of cores. And often traditional
                | CPUs wind up evaluating both branches anyway.
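                | 
                | (A runnable NumPy version of the same trick, with toy f
                | and g of my own choosing:)
                | 
                |   import numpy as np
                | 
                |   def f(x): return x * 2.0  # "taken" side
                |   def g(x): return -x       # "not taken" side
                | 
                |   x = np.array([-1.0, 0.5, 3.0, -2.0])
                |   M = (x > 0).astype(x.dtype)
                |   y = M * f(x) + (1 - M) * g(x)  # both sides, then blend
                |   print(y)                       # [1. 1. 6. 2.]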
        
               | monocasa wrote:
                | > And often traditional CPUs wind up evaluating both
                | branches anyway.
                | 
                | That's actually really overstated. Evaluating both sides
                | isn't really something CPUs tend to do; instead they
                | predict one path and roll back on a mispredict. This is
                | because the out-of-order hardware isn't fundamentally a
                | tree, but is generally better thought of as a ring
                | buffer where uncommitted state is what's between the
                | head and tail. Storing diverging paths is incredibly
                | expensive there. I'm not going to say something as
                | strong as "it's never been done", but I certainly don't
                | know of a general-purpose CPU arch that'll compute both
                | sides of an architectural branch; they instead rely on
                | making good predictions down one instruction stream,
                | then rolling back and restarting when it's clear you
                | mispredicted.
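                | 
                | (A toy sketch of that ring-buffer view, hypothetical and
                | much simplified:)
                | 
                |   class ReorderBuffer:
                |       def __init__(self, size):
                |           self.slots = [None] * size
                |           self.head = self.tail = 0  # committed state sits behind head
                | 
                |       def issue(self, uop):
                |           # Speculate down ONE predicted path only.
                |           self.slots[self.tail] = uop
                |           self.tail = (self.tail + 1) % len(self.slots)
                | 
                |       def squash(self, branch_slot):
                |           # Mispredict: roll the tail back past the branch.
                |           self.tail = (branch_slot + 1) % len(self.slots)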
        
               | jasonwatkinspdx wrote:
                | It's not even about the expense of implementing
                | diverging paths in hardware.
                | 
                | This concept of executing down a tree rather than a
                | single path was explored under the name Disjoint Eager
                | Execution. You know what killed it? Branch predictors.
                | In a world where branch predictors are maybe only 75%
                | effective, DEE could make sense. We live in a world
                | where branch predictors are _far_ better than that. So
                | it just isn't worth speculating off the predicted most
                | likely path.
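                | 
                | (Rough arithmetic on why, with illustrative numbers:)
                | 
                |   FLUSH_PENALTY = 15  # assumed pipeline-flush cost, cycles
                |   for accuracy in (0.75, 0.97):
                |       # Average cycles lost to mispredicts per branch.
                |       print(accuracy, (1 - accuracy) * FLUSH_PENALTY)
                |   # 0.75 -> 3.75; 0.97 -> ~0.45 cycles/branch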
        
               | monocasa wrote:
                | What killed it was more the effectiveness of Tomasulo-
                | style out-of-order machinery, and the real problem not
                | being control hazards but data hazards. DEE was thought
                | up in a day when memory was about as fast as the
                | processor. That's why it's always being compared with
                | cores like the R3000.
        
           | monocasa wrote:
            | Sort of. GPU "cores" in the CPU space would be called SIMD
            | lanes. An apples-to-apples count of GPU cores using CPU
            | terminology would put an Nvidia 3060 at 28 cores and a 3090
            | at 82 cores.
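            | 
            | (Where those numbers come from, assuming Ampere's 128 FP32
            | lanes per SM:)
            | 
            |   cuda_cores = {"RTX 3060": 3584, "RTX 3090": 10496}
            |   LANES_PER_SM = 128  # FP32 lanes per Ampere SM
            |   for gpu, n in cuda_cores.items():
            |       print(gpu, "->", n // LANES_PER_SM, "CPU-style cores")
            |   # RTX 3060 -> 28, RTX 3090 -> 82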
        
         | klyrs wrote:
         | Sure, but this is a coprocessor on an expansion card, similar
         | to a GPU. I've worked on a few systolic algorithms and this
         | kind of chip has massive potential in that space. TPUs have
         | been a big letdown in that regard, as they don't even have the
         | comparison operation needed for the matrix-based shortest-path
         | algorithm.
        
       | joe_the_user wrote:
        | Looking at this, I'm confused by basic questions. Is this a MIMD
        | or SIMD architecture chip? [1] What is the memory/caching
        | structure here, and would it be fast or slow? Is this to replace
        | a GPU, or to replace the CPU you connect to the GPU, or both?
        | Would you get code and/or data divergence here? I.e., "many
        | cores" seems to imply each has its own instructions, but ML
        | usually runs on vector machines like a GPU.
        | 
        | Edit: OK, I can see this has a "network on a chip" architecture,
        | but I think that only answers some of my questions.
        | 
        | [1] https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
        
         | jasonwatkinspdx wrote:
         | MIMD via 1088 cores per chip, each core has 512 bit short
         | vector SIMD, and a 1024 bit tensor unit.
         | 
         | ~100 MB of SRAM on chip, 8 GDDR busses to DRAM off chip.
         | 
          | It's purpose-designed for parallel sparse-matrix ML problems.
          | It's more efficient than both a CPU and a GPU at these, as
          | well as faster in absolute terms, taking their numbers at
          | face value.
        
         | dragontamer wrote:
         | I mean, GPUs are only SIMD for 32 lanes (Nvidia or AMD RDNA) or
         | 64 lanes (AMD CDNA).
         | 
         | The rest of those lanes come from MIMD techniques.
         | 
         | -------
         | 
          | CPU cores are more MIMD today than SISD because of
          | out-of-order and superscalar execution. So honestly, I think
          | it's about time to retire Flynn's taxonomy. Everything is
          | MIMD.
        
           | joe_the_user wrote:
           | _I mean, GPUs are only SIMD for 32 lanes (Nvidia or AMD RDNA)
           | or 64 lanes (AMD CDNA).
           | 
           | The rest of those lanes come from MIMD techniques._
           | 
            | Not sure what you mean here. Are there places where different
            | groups of kernels can simultaneously execute different code?
        
       | qwerty456127 wrote:
       | With all this RISC-V hype going on I'm curious how compatible
       | RISC-V processors from different vendors actually are.
        
         | zozbot234 wrote:
         | The RISC-V folks are in the process of ratifying standards for
         | vector processing ("V") and lightweight packed-SIMD/DSP compute
         | ("P") that should make it easier to expand compatibility in
         | these domains. As of now, these standards are not quite ready
         | and Esperanto are still using proprietary extensions for their
         | stuff.
        
       | OneEyedRobot wrote:
       | So what in the heck does the CPU interconnect look like?
        
         | sakras wrote:
          | This looks similar to a research project I worked on called
          | the Hammerblade Manycore. The cores were connected by a
          | network, where you could read from or write to an address by
          | sending a packet containing the address of the word, whether
          | you were reading or writing, and, if writing, the new value.
          | The packet would then hop along the network one unit per
          | cycle until it reached its destination.
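          | 
          | (A hypothetical sketch of such a request packet; the field
          | names are mine, not Hammerblade's actual format:)
          | 
          |   from dataclasses import dataclass
          |   from typing import Optional
          | 
          |   @dataclass
          |   class NocPacket:
          |       dest_x: int    # destination tile on the mesh
          |       dest_y: int
          |       addr: int      # word address at the destination
          |       is_write: bool
          |       data: Optional[int] = None  # payload only on writes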
        
           | monocasa wrote:
           | Classic "network on chip" for those that want the most
           | searchable term.
        
             | vletal wrote:
             | - Transistor on chip
             | 
             | - CPU on chip
             | 
             | - System on chip
             | 
             | - Network on chip
             | 
             | I'm looking forward to the next X on chip which I'm not
             | aware of.
        
               | trissylegs wrote:
                | I've heard the term "cluster on a chip", which I guess
                | would apply here too.
        
       | zh3 wrote:
        | Looks efficient, at least on the face of it. Certainly seems
        | credible (Dave Ditzel is involved), and as a way of lowering
        | the cost / improving the efficiency of targeted ad-serving
        | they could be on to a winner.
        
       | russellbeattie wrote:
       | Off topic thought: "Esperanto Technologies" is apparently the
       | name of the company, in case you're confused with the headline
       | like I was. I was amused to discover their offices are literally
        | 3 blocks away from where I live. (So is Y Combinator and a
       | thousand other tech companies, so not that surprising, really,
       | but amusing.)
       | 
       | At this point I think we need to go back to descriptive old-
       | school 70s company names like, "West Coast Microprocessor
       | Solutions", "Digital Logic, Inc.", "Mountain View Artificial
       | Intelligence Laboratories", etc.
       | 
       | You know, something that would blend into this map:
       | https://s3.amazonaws.com/rumsey5/silicon/11492000.jpg
       | 
       | Edit: Looking at that map, some of the company names are
       | fantastically generic! "Electronics Corporation", "California
       | Devices", "General Technology", "Test International".
        
         | drmpeg wrote:
         | Descriptive would be "Bus Stop Systems". No, drop the
         | "Systems", just "Bus Stop".
        
         | Causality1 wrote:
         | Given recent trends I'm just glad it isn't named Esperantr
         | Technologr.
        
           | iamtedd wrote:
           | Esprantly Technify.
        
       ___________________________________________________________________
       (page generated 2021-08-29 23:00 UTC)