[HN Gopher] New SiFive RISC-V core P650 with 40% IPC increase
       ___________________________________________________________________
        
       New SiFive RISC-V core P650 with 40% IPC increase
        
       Author : FullyFunctional
       Score  : 131 points
       Date   : 2021-12-02 16:21 UTC (6 hours ago)
        
 (HTM) web link (www.sifive.com)
 (TXT) w3m dump (www.sifive.com)
        
       | snvzz wrote:
       | Some context: RISC-V Summit is next week, and RISC-V
       | International has just approved a batch of important
       | extensions[0]. With these extensions, RISC-V is not missing
       | anything relative to ARM and x86 ISAs in terms of functionality.
       | 
       | I expect a lot of tape-outs to happen this month, as core vendors
       | were probably holding out for the announced ratifications, for
       | fear of last-minute changes. Next year is going to be exciting.
       | 
       | [0]: https://riscv.org/announcements/2021/12/riscv-
       | ratifies-15-ne...
        
         | [deleted]
        
         | socialdemocrat wrote:
         | That is great news! Is there any friendly intro/coverage
         | anywhere of the new vector extension?
         | 
          | I am curious about the final design. It would be interesting
          | to hear how people think it compares with ARM's Scalable
          | Vector Extension.
        
           | snvzz wrote:
            | There have been a few talks on the topic. They're archived
            | on e.g. YouTube.
            | 
            | I like it. It's fairly simple and clean, yet powerful.
            | 
            | There was also some discussion here on HN months ago, about
            | an article comparing the RISC-V V extension and ARM SVE.
           | 
           | The article itself got several things wrong about V, but the
           | discussion[0] was interesting.
           | 
           | [0] https://news.ycombinator.com/item?id=27063748
        
             | [deleted]
        
         | monocasa wrote:
         | I wouldn't say RISC-V isn't missing anything. The lack of
         | add/subtract with carry is an issue for efficient runtime of
         | many JITed languages like JavaScript.
         | 
          | That being said, I don't think it's the worst thing in the
          | world like some do. The focus now should be on compiled code,
          | since JITs by definition can make runtime decisions on whether
          | some future extension that fixes this deficiency exists or
          | not. The J extension has stalled for the moment, but with
          | these other extensions ratified there should hopefully be more
          | bandwidth available.
        
           | teruakohatu wrote:
            | Can't vendors making desktop/mobile class CPUs detect the
           | equivalent pattern and optimize it in microcode or silicon?
           | 
           | Or is that what we are trying to get away from?
        
             | monocasa wrote:
             | Maybe, but it's a leap, IMO. The equivalent patterns are 3x
             | as long, and modify tons of arch visible state for their
             | intermediate results which leaves more work for those
             | combined instructions to do.
             | 
             | The complaint is valid, IMO, and would show up on the
             | filtration test they used to come up with ops if they were
             | working with JITs too rather than just what's in AOT code.
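
As a rough illustration of the pattern being discussed: without a carry flag, the carry out of an unsigned add has to be recovered with a comparison (on RISC-V this is typically done with `sltu`). The Python sketch below models a two-limb 128-bit add on 64-bit "registers". The function names are hypothetical and chosen only for this example; it is not output from any compiler or JIT.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def add64(a, b):
    """Wrapping 64-bit add, modelling a plain `add` with no flags."""
    return (a + b) & MASK64


def add128_no_flags(a_lo, a_hi, b_lo, b_hi):
    """Add two 128-bit values held as (lo, hi) 64-bit limbs, no carry flag."""
    lo = add64(a_lo, b_lo)
    # The low add wrapped if and only if the result is smaller than an
    # operand; this comparison stands in for the missing carry flag.
    carry = 1 if lo < a_lo else 0
    hi = add64(add64(a_hi, b_hi), carry)
    return lo, hi


print(add128_no_flags(MASK64, 0, 1, 0))
```

With a carry flag the same work is one add plus one add-with-carry; here it is three adds and a compare, which is the kind of expansion the comment above refers to.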
        
       | socialdemocrat wrote:
       | Anyone able to put this in context? How fast are these cores
       | compared to various ARM, Intel and AMD cores? At what level can
       | they compete?
        
         | sanxiyn wrote:
         | > With a projected score of 11+ SPECInt2006/GHz, the SiFive
         | Performance P650 brings RISC-V into a new category of high-end
         | computing applications.
         | 
         | 11+ SPECInt2006/GHz is comparable to Apple Icestorm
         | microarchitecture. Apple Firestorm microarchitecture is roughly
         | 2x better at 22 SPECInt2006/GHz.
        
           | Symmetry wrote:
           | How impressive that number is rather depends on how many GHz
            | they're managing. In general, the slower you design your core
            | to clock, the faster you can make all your caches. Plus the
           | slower you clock your core, designed in or not, the lower the
           | number of clock cycles it takes to talk to main memory.
        
           | pantalaimon wrote:
            | Mind you that raw core performance is not everything; memory
           | bandwidth and caches are crucial to make sure the CPU isn't
           | waiting for data all the time.
        
             | sanxiyn wrote:
             | Yes, but SPECint includes all such effects. As long as
             | SPECint benchmarks (such as GCC) are representative of your
             | workload, it works fine.
        
               | tlb wrote:
               | I trust that the Apple benchmarks include all such
               | effects. I'm less convinced that the RISC-V "projections"
               | include them. SPECint2006 is supposed to be measured with
               | real memory and an OS. Per-GHz numbers can't accurately
               | reflect main memory latency, since its speed doesn't
               | scale with the CPU clock.
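
The point about main-memory latency can be made concrete with simple arithmetic: DRAM latency is roughly fixed in nanoseconds, so the number of core cycles lost per miss grows with the clock. The sketch below assumes a hypothetical round-number latency of 100 ns purely for illustration; it is not a measurement of any real system.

```python
def dram_latency_cycles(latency_ns, clock_ghz):
    """Core cycles spent waiting on one main-memory access.

    1 GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return latency_ns * clock_ghz


ASSUMED_LATENCY_NS = 100  # hypothetical value, for illustration only

for ghz in (1.0, 2.0, 3.0):
    cycles = dram_latency_cycles(ASSUMED_LATENCY_NS, ghz)
    print(f"{ghz:.1f} GHz -> {cycles:.0f} cycles per miss")
```

Because the cycle cost of a miss scales with frequency, a benchmark score divided by GHz will tend to look better at low clocks, all else being equal.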
        
               | spear wrote:
               | Right, and "per GHz" numbers are also not very useful
               | because you can't just crank up the GHz when you need
               | performance. Even with the same process technology, you
               | can't assume different microarchitectures will max out at
               | the same frequency.
        
         | sebow wrote:
          | If I recall correctly, the SiFive Unmatched is still pretty
          | slow compared to ARM (
          | https://www.phoronix.com/scan.php?page=article&item=hifive-u...
          | ). Now, this board does not use the core in question (P650),
          | so we'll have to wait for upcoming benchmarks (for which I
          | recommend Phoronix).
          | 
          | Obviously you can't even think about comparing it with Intel
          | and AMD yet, but when you look at the history of something
          | like ARM (which I believe is 30-40 years old), RISC-V has come
          | a long way pretty fast, and the good thing is that it's a
          | solid choice for the future due to being open.
        
       | sebow wrote:
       | Sweet. Are there any resources on transitioning/migrating, or on
       | the differences between x86_64 and RISC-V? Or are the ISAs so
       | drastically different that it's better to just dive in head-first?
        
       | bruce343434 wrote:
       | > With a projected score of 11+ SPECInt2006/GHz
       | 
       | That seems to imply a certain integer arithmetic performance, but
       | I wonder what the floating point performance is. They could have
       | just said "X flops".
       | 
       | Comparing to other benchmarks at [1], I have no idea, because
       | they all have denormalized results, so totals, rather than per
       | GHz per core. Nice reporting.
       | 
       | How fast is this thing? Pentium? First-gen i3? Current-gen Ryzen
       | 5? The fact that they are being so obtuse about it leads me to
       | believe performance isn't great.
       | 
       | [1] https://www.spec.org/cgi-
       | bin/osgresults?conf=cint2006;op=dum...
        
         | wmf wrote:
         | I'd compare it to an Atom "efficiency" core.
        
       | marcodiego wrote:
       | Faster than ARM A-77:
       | https://www.phoronix.net/image.php?id=2021&image=sifive_p650... .
       | Performance comparable to Apple Icestorm architecture, the
       | 'efficiency' cores in M1. Considering A-710 is the fastest ARM
       | core currently available and its successor will only be available
       | next year, SiFive is just a few years before real competition
       | starts in an arena currently dominated by ARM.
       | 
       | This will be beautiful to watch.
        
         | [deleted]
        
         | zozbot234 wrote:
         | It will be interesting to see a comparison on power-efficiency
         | as well as performance. RISC-V implementations have shown a
         | pretty sizeable advantage wrt. power use in the past, and we
         | don't quite know how this advantage compares in these larger,
         | performance-focused designs.
        
         | dmitrygr wrote:
         | > just a few years before real competition starts
         | 
         | Are you assuming the competition will just sit and do nothing?
        
           | GhettoComputers wrote:
            | "Good enough" matters more than benchmarks. They can make
           | supercomputers but it doesn't matter to someone who wants a
           | $100 computer.
        
             | dmitrygr wrote:
              | All the RISC-V thingies I see today are decidedly not $100.
              | I do see plenty of ARM designs running Linux under $10,
              | though.
        
       | baybal2 wrote:
       | This is something genuinely interesting from the RISC-V crowd
       | for the first time.
        
       | danielEM wrote:
       | Once it gets to the shelves at a reasonable price, I will be
       | happy to work with/on it.
       | 
       | Curious how IP pricing compares to ARM in this case, and how much
       | I would need to put on top of it to tape out my own batch of
       | processors.
        
         | snvzz wrote:
         | The license to the ISA itself is free.
         | 
          | There are several vendors besides SiFive offering cores for
          | licensing. There are even some OSHW cores that can be freely
          | used.
         | 
         | Even if we choose to ignore the technical prowess of being a
         | true 5th generation RISC ISA built with hindsight no other ISA
         | has, what's IMHO a big deal in RISC-V is the mere availability
         | of this market of cores.
         | 
         | It poses a threat to ARM's business model, where ARM licenses
          | cores and ISA, but nobody other than ARM can license cores to
         | others.
        
           | Teknoman117 wrote:
           | As far as OSHW cores go, it's so very nice to be able to
            | throw something together in Verilog and be able to inherit a
           | compiler and not be trampling on someone else's copyright...
        
           | dmitrygr wrote:
           | > built with hindsight no other ISA has
           | 
            | Why do all the riscv fans conveniently ignore aarch64 when
            | they make statements like this? It was in fact a completely
            | clean new design, based on hindsight, by people who know what
            | they are doing, and with no legacy cruft.
        
             | FullyFunctional wrote:
             | I'm a fan of RISC-V but the freedom is a large part of it.
             | Aarch64 _is_ a very well designed ISA and _clearly_ has a
              | lot of benefit of hindsight. The load pair/store pair
             | instructions, the addressing modes, fixed 32-bit
             | instruction size, etc. It all really helps. I suspect that
             | Apple was actively part of designing it.
             | 
             | I think however that RISC-V isn't that much worse and
             | because of the freedom we will almost certainly see more
              | implementations of RISC-V. I'd be watching Tenstorrent,
             | SiFive, Rivos, Esperanto, and maybe Alibaba/T-Head.
        
             | brucehoult wrote:
              | Aarch64 obviously _isn't_ a completely clean sheet design.
             | It was constrained by having to execute on the same CPU
             | pipelines as 32 bit code, at least for the first decade or
             | so. And the 32 bit mode has to perform well. There are tens
             | of millions of Raspberry Pi 3s and 4s (and later model Pi
             | 2s) which have 64 bit CPUs but have never seen a 64 bit
             | instruction in their lives. Android phones have been
             | supporting both 32 and 64 bit apps for a long time.
             | 
             | The "by people who know what they are doing" thing is just
             | pure FUD. Sure, ARM employs some competent people, but no
             | more so than IBM, Intel, AMD or the various members of
             | RISC-V International.
        
             | snvzz wrote:
              | >Why do all the riscv fans conveniently ignore aarch64 when
              | they make statements like this? It was in fact a completely
              | clean new design, based on hindsight, by people who know
              | what they are doing, and with no legacy cruft.
             | 
             | aarch64 seems poorly designed to me.
             | 
              | ARMv7 had Thumb, but for some reason ARMv8 did not
             | incorporate any lessons from that. As a result, code
             | density is bad; ARMv8 binaries are huge.
             | 
             | ARMv9, to be available in chips next year, is just a higher
             | profile of required extensions, and does nothing to fix
             | that.
             | 
             | Ever wonder why M1 needs such huge L1 cache? Well, now you
             | know.
             | 
             | Considering ARMv9 will be competing against RVA22, I don't
             | have much hope for ARM.
        
               | dmitrygr wrote:
               | > for some reason ARMv8 did not incorporate any lessons
               | from that.
               | 
               | I used to think so too, until I asked some more
               | knowledgeable people about it. Turns out the lesson _IS_
                | that not having it is better. Fixed-size instructions
                | make decoding significantly simpler, making it much
                | easier to make very wide front ends.
        
               | brucehoult wrote:
               | A little easier, not much easier. A number of
               | organisations are making very wide RISC-V
               | implementations, and one has already published how their
               | decoder works. It's modular, with each block looking at
               | 48 bits of code (the first 16 overlapping with the
               | previous block) and decoding either two 16 bit
               | instructions, or one aligned 32 bit instruction, or one
               | misaligned 32 bit instruction with a following 16 bit
               | instruction, or one misaligned 32 bit instruction
               | followed by an ignored start of another misaligned 32 bit
               | instruction.
               | 
               | You can put as many of these modules side by side as you
               | want. There is a serial dependency between them in that
               | each block has to tell the next block whether its last 16
               | bits are the start of a misaligned 32 bit instruction or
               | not. That could become an issue with really really wide
               | but for something decoding e.g. 16 bytes at a time (4 to
               | 8 instructions) it's not an issue.
               | 
               | There is a trade-off between a little bit of decoder
               | complexity and a lot of improved code density -- but
               | nowhere near to the same extent as say x86.
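
The serial dependency described above can be sketched in a few lines. In the RISC-V base encoding, a 16-bit parcel whose two low bits are `0b11` begins a 32-bit instruction, and anything else is a 16-bit compressed instruction (longer encodings are ignored here). The function below is a simplified software model, not the published decoder design mentioned in the comment; the names and the block interface are assumptions made for this example.

```python
def is_32bit(parcel):
    """A 16-bit parcel starts a 32-bit instruction if its low two bits are 0b11."""
    return (parcel & 0b11) == 0b11


def find_boundaries(parcels, continues_from_prev=False):
    """Locate instruction starts inside one block of 16-bit parcels.

    Returns (starts, spills_over):
      starts      -- list of (parcel_index, size_in_bits) for instructions
                     that begin in this block
      spills_over -- True if the last parcel begins a 32-bit instruction
                     whose second half is in the next block
    """
    if not parcels:
        return [], continues_from_prev
    starts = []
    # If the previous block ended mid-instruction, parcel 0 is the tail
    # of that instruction and must not be decoded as an opcode.
    i = 1 if continues_from_prev else 0
    while i < len(parcels):
        if is_32bit(parcels[i]):
            starts.append((i, 32))
            i += 2
        else:
            starts.append((i, 16))
            i += 1
    return starts, i > len(parcels)


# 0x0001 is c.nop; 0x0013 followed by 0x0000 is the 32-bit nop, low half first.
block_a = [0x0001, 0x0013, 0x0000, 0x0013]
block_b = [0x0000, 0x0001]
starts_a, spill = find_boundaries(block_a)
starts_b, _ = find_boundaries(block_b, continues_from_prev=spill)
print(starts_a, spill, starts_b)
```

The only information one block needs from its neighbour is the single `spills_over` bit, which is the one-bit dependency chain the comment describes.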
        
               | adrian_b wrote:
               | ARMv8 code density is quite good for a fixed-length ISA
               | and is of course much better than that of RISC-V.
               | 
               | RISC-V has only one good feature for code density, the
               | combined compare-and-branch instructions, but even this
               | feature was designed poorly, because it does not have all
               | the kinds of compare-and-branch that are needed, e.g. if
               | you want safe code that checks for overflows, the number
               | of required instructions and the code size explode. Only
               | unsafe code, without run-time checks, can have an
               | acceptable size in RISC-V.
               | 
               | ARMv8 has an adequate unused space in the branch opcode
               | map, where combined compare-and-branch instructions could
               | be added, and with a larger branch offset range than in
               | RISC-V, in which case the code size advantage of ARMv8
               | vs. RISC-V would increase significantly.
               | 
               | While the combined compare-and-branch of RISC-V are good
               | for code density, because branches are very frequent, the
               | rest of the ISA is bad and the worst is the lack of
               | indexed addressing, which frequently requires 2 RISC-V
               | instructions instead of 1 ARM instruction.
        
               | brucehoult wrote:
               | I'm not sure how you missed RISC-V's big feature for code
               | density -- the "C" extension, giving it arbitrarily mixed
               | 16 and 32 bit opcodes.
               | 
               | I've heard of that feature before somewhere else. It gave
               | the company that invented it unparalleled code density in
               | their 32 bit systems and propelled them to the heights of
               | success in mobile devices. What was their name? Wait ..
               | oh, yes ... ARM.
               | 
               | Why they forgot this in their 64 bit ISA is a mystery.
               | The best theory I can come up with is that they thought
               | the industry had shaken out and amd64 was the only
               | competition they were going to have, ever. Aarch64 does
               | indeed have very good code density for a fixed-length 32
               | bit opcode ISA, and comes very close to matching amd64.
               | They may have thought that was going to be good enough.
               | 
               | Note: the RISC-V "C" extension is technically optional,
               | but the only CPU cores I know of that don't implement it
               | are academic toys, student projects, and tiny cores for
               | use in FPGAs where they are running programs with only a
               | few hundred instructions in them. Once you get over even
               | maybe 1 KB of code it's cheaper in resources to implement
               | "C" than to provide more program storage.
        
               | zozbot234 wrote:
               | The thing with lack of shifted indexed addressing is that
               | it just might not matter all that much beyond toy
               | examples. Address calculations can generally be folded in
               | with other code, particularly in loops which are a common
               | case. So it's only rarely that you actually need those
               | extra instructions.
        
               | adrian_b wrote:
               | Shifted indexed addressing is needed more seldom, but
               | indexed addressing, i.e. register + register, is needed
               | in every loop that accesses memory.
               | 
               | There are 2 ways of programming a loop that addresses
               | memory with a minimum of instructions.
               | 
               | One way, which is preferable e.g. on Intel/AMD, is to
               | reuse the loop counter as the index into the data
               | structure that is accessed, so each load/store needs a
               | base register + index register addressing, which is
               | missing in RISC-V.
               | 
               | The second way, which is preferable e.g. on POWER and
               | which is also available on ARM, is to use an addressing
               | mode with auto-update, where the offset used in loads or
               | stores is added into the base register. This is also
               | missing in RISC-V.
               | 
                | Because neither of the 2 methods works in RISC-V with a
               | minimum number of instructions, like in all other CPUs,
               | all such loops, which are very frequent, need pairs of
               | instructions in RISC-V, corresponding to single
               | instructions in the other CPUs.
        
               | brucehoult wrote:
               | A big difference here is that the RISC-V instructions are
               | usually all 16 bits in size while the Aarch64 and POWER
               | instructions are all 32 bits in size. So the code size is
               | the same.
               | 
               | Also, high performance Aarch64 and POWER implementations
               | are likely to be splitting those instructions into two
               | decoupled uops in the back end.
               | 
               | Performance-critical loops are unrolled on all ISAs to
               | minimise loop control overhead and also to allow
               | scheduling instructions to allow for the several cycle
               | latency of loads from even L1 cache. When you do that,
               | indexed addressing and auto-update addressing are still
               | doing both operations for every load or store which, as
               | well as being a lot of operations, introduces sequential
               | dependency between the instructions. The RISC-V way
               | allows the use of simple load/store with offset -- all of
               | which are independent of each other -- with one merged
               | update of each pointer at the end of the loop. POWER and
               | Aarch64 compilers for high performance microarchitectures
               | use the RISC-V structure for unrolled loops anyway.
               | 
               | So indexed addressing and auto-update addressing give no
               | advantage for code size, and don't help performance at
               | the high end.
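
A toy model of the unrolling argument above: with auto-update addressing every load also performs a pointer update, whereas with base-plus-offset loads in an unrolled loop the pointer is bumped once per unrolled iteration. The sketch only counts pointer updates; it says nothing about real hardware timing, and the function names and the unroll factor of 4 are assumptions for illustration.

```python
def sum_auto_update(data):
    """Each load is paired with its own pointer update (auto-update style)."""
    total, ptr, pointer_updates = 0, 0, 0
    while ptr < len(data):
        total += data[ptr]
        ptr += 1  # address update tied to every single load
        pointer_updates += 1
    return total, pointer_updates


def sum_unrolled_offsets(data, unroll=4):
    """Loads use ptr + constant offset; one merged pointer bump per iteration."""
    if len(data) % unroll != 0:
        raise ValueError("sketch assumes length is a multiple of unroll")
    total, ptr, pointer_updates = 0, 0, 0
    while ptr < len(data):
        # These loads depend only on ptr, not on each other.
        for offset in range(unroll):
            total += data[ptr + offset]
        ptr += unroll  # single merged update at the end of the body
        pointer_updates += 1
    return total, pointer_updates


data = list(range(16))
print(sum_auto_update(data))
print(sum_unrolled_offsets(data))
```

Both functions compute the same sum, but the second performs a quarter as many pointer updates at this unroll factor, which is the saving the comment attributes to plain offset addressing.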
        
               | snvzz wrote:
               | >in which case the code size advantage of ARMv8 vs.
               | RISC-V would increase significantly.
               | 
               | Many things could be said about ARMv8, but that it has
                | good code size is not one of them. It does, in fact, have
               | abysmal code density. Both RISC-V and x86-64 produce
               | significantly smaller binaries. For RISC-V, we're talking
               | about a 20% reduction of size.
               | 
               | There's a wealth of papers on this, but you can verify
               | this trivially yourself, by either compiling binaries for
               | different architectures from the same sources, or
               | comparing binaries in Linux distributions that support
               | RISC-V and ARM.
               | 
               | >where combined compare-and-branch instructions could be
               | added, and with a larger branch offset range than in
               | RISC-V
               | 
               | If your argument is that ARMv8 could get better over
               | time, I hate to be the bearer of bad news. ARMv9 code
               | density isn't any better.
               | 
               | >and the worst is the lack of indexed addressing, which
               | frequently requires 2 RISC-V instructions instead of 1
               | ARM instruction.
               | 
               | These patterns are standardized, and they become one
               | instruction after fusion.
               | 
               | RISC-V, unlike the previous generation of ISAs, was
               | thoroughly designed with hindsight on fusion. The
               | simplest microarchitectures can of course omit it
               | altogether, but the cost of fusion in RISC-V is low; I
               | have seen it quoted at 400 gates.
        
               | brucehoult wrote:
               | Instruction fusion is a possibility for the future, which
               | has been discussed academically, but no one implements it
               | at present. I'm not sure anyone will -- it's too much
               | complexity for simple cores, and not needed for big OoO
               | cores.
               | 
                | The one fusion implementation I'm aware of is the SiFive
                | 7-series combining a conditional branch that jumps
                | forward over exactly one instruction with that
                | instruction. It turns the pair into predicated execution.
               | 
               | I agree with everything else. In particular the code
               | density. Anyone can download Ubuntu or Fedora images for
               | the same release for amd64, arm64, and riscv64. Mount
               | them and run "size" on any selection of binaries you
               | want. The RISC-V ones are consistently and significantly
               | smaller than the other two, with arm64 the biggest.
        
               | pohl wrote:
               | _Ever wonder why M1 needs such huge L1 cache? Well, now
               | you know._
               | 
               | I'm not sure I follow this, but it reminds me to ask:
               | does RISC-V allow for designs to have both efficiency &
               | performance cores like the ARM big.LITTLE concept? Has
               | anyone made one yet?
        
               | brucehoult wrote:
               | Of course you can do it. SiFive has been allowing
               | customers to configure core complexes with a mixture of
               | different core types for years -- for example mixing U84
                | cores with U74 or U54. If you want to do a big.LITTLE
               | thing with transferring a running program from one core
               | type to another that's just a software thing -- and using
               | cores with the same ISA but different microarchitecture.
               | 
               | To date the examples of this that have been shipped to
               | the public have used cores with similar
               | microarchitecture, but a different set of extensions.
               | 
               | For example the U54-MC in the HiFive Unleashed and in the
               | Microsemi Polarfire SoC FPGAs use four U54 cores plus one
               | E51 core for "real time" tasks. The E51 doesn't have an
               | FPU or MMU or Supervisor mode. The U74-MC in the HiFive
               | Unmatched is similar.
               | 
               | Alibaba's ICE SoC, which you may have seen videos of
               | running Android, has two C910 Out-of-Order cores (similar
               | to ARM A72/A73) implementing RV64GC, and a third C910
               | core that also has a vector processing unit with two
               | pipes with 256 bit vector ALU each, plus 128 bit vector
               | load and store pipes.
        
               | [deleted]
        
           | fartcannon wrote:
           | So I guess we should expect to hear a lot of FUD about RISC-V
           | over the coming years.
        
             | marcodiego wrote:
             | No need to wait. Already happened in 2018:
             | https://www.theregister.com/2018/07/10/arm_riscv_website/
             | 
             | https://www.extremetech.com/wp-
             | content/uploads/2018/07/arm-r...
        
               | snvzz wrote:
               | And it is how many learned about RISC-V's existence.
               | 
               | It will be a PR disaster long remembered. One for the
               | textbooks.
        
             | snvzz wrote:
             | This is a real possibility, albeit a sad one.
             | 
             | No amount of FUD will save ARM. Only pivoting into a
             | different business model could.
        
               | duskwuff wrote:
               | Honestly, ARM is fine. They're no longer the only game in
               | town, but they've still got a huge head start.
        
               | snvzz wrote:
               | They'll be fine if they focus on their microarchitectures
               | rather than the ISA (where IMHO they've already lost),
               | and make the process for obtaining a license much more
               | streamlined; I've heard it takes no less than 18 months
                | of long negotiations to license anything from ARM. That's
               | not sustainable now that there's competition.
        
               | duskwuff wrote:
               | That's already where their focus is. Most of ARM's
               | customers are licensing specific cores from ARM, not the
               | ISA as a whole.
        
       | jaas wrote:
       | Who exactly are the customers for this chip?
        
       ___________________________________________________________________
       (page generated 2021-12-02 23:01 UTC)