[HN Gopher] New SiFive RISC-V core P650 with 40% IPC increase
___________________________________________________________________

Author : FullyFunctional
Score  : 131 points
Date   : 2021-12-02 16:21 UTC (6 hours ago)

(HTM) web link (www.sifive.com)
(TXT) w3m dump (www.sifive.com)

| snvzz wrote:
| Some context: RISC-V Summit is next week, and RISC-V International has just approved a batch of important extensions[0]. With these extensions, RISC-V is not missing anything relative to the ARM and x86 ISAs in terms of functionality.
|
| I expect a lot of tape-outs to happen this month, as core vendors were probably holding off until the announced ratifications, for fear of last-minute changes. Next year is going to be exciting.
|
| [0]: https://riscv.org/announcements/2021/12/riscv-ratifies-15-ne...

| [deleted]

| socialdemocrat wrote:
| That is great news! Is there any friendly intro/coverage anywhere of the new vector extension?
|
| I am curious about the final design. It would be interesting to hear how people think it compares with ARM's Scalable Vector Extension.

| snvzz wrote:
| There have been a few talks on the topic. They're archived on e.g. YouTube.
|
| I like it. It's fairly simple and clean, yet powerful.
|
| There was also some discussion here on HN months ago about an article comparing the RISC-V V extension and ARM SVE.
|
| The article itself got several things wrong about V, but the discussion[0] was interesting.
|
| [0] https://news.ycombinator.com/item?id=27063748

| [deleted]

| monocasa wrote:
| I wouldn't say RISC-V isn't missing anything. The lack of add/subtract with carry is an issue for efficient runtime of many JITed languages like JavaScript.
|
| That being said, I don't think it's the worst thing in the world like some do. The focus now should be on compiled code, since JITs by definition can make runtime decisions based on whether some future extension that fixes this deficiency exists or not.
| The J extension has stalled for the moment, but with these other extensions ratified there should hopefully be more bandwidth available.

| teruakohatu wrote:
| Can't vendors making desktop/mobile-class CPUs detect the equivalent pattern and optimize it in microcode or silicon?
|
| Or is that what we are trying to get away from?

| monocasa wrote:
| Maybe, but it's a leap, IMO. The equivalent patterns are 3x as long, and modify tons of architecturally visible state for their intermediate results, which leaves more work for those combined instructions to do.
|
| The complaint is valid, IMO, and would have shown up in the filtration test they used to come up with ops if they had been working with JITs too, rather than just with what's in AOT code.

| socialdemocrat wrote:
| Anyone able to put this in context? How fast are these cores compared to various ARM, Intel and AMD cores? At what level can they compete?

| sanxiyn wrote:
| > With a projected score of 11+ SPECInt2006/GHz, the SiFive Performance P650 brings RISC-V into a new category of high-end computing applications.
|
| 11+ SPECInt2006/GHz is comparable to Apple's Icestorm microarchitecture. Apple's Firestorm microarchitecture is roughly 2x better at 22 SPECInt2006/GHz.

| Symmetry wrote:
| How impressive that number is rather depends on how many GHz they're managing. In general, the slower you design your core to clock, the faster you can make all your caches. Plus, the slower you clock your core, designed in or not, the lower the number of clock cycles it takes to talk to main memory.

| pantalaimon wrote:
| Mind you that raw core performance is not everything; memory bandwidth and caches are crucial to make sure the CPU isn't waiting for data all the time.

| sanxiyn wrote:
| Yes, but SPECint includes all such effects. As long as SPECint benchmarks (such as GCC) are representative of your workload, it works fine.

| tlb wrote:
| I trust that the Apple benchmarks include all such effects.
| I'm less convinced that the RISC-V "projections" include them. SPECint2006 is supposed to be measured with real memory and an OS. Per-GHz numbers can't accurately reflect main memory latency, since its speed doesn't scale with the CPU clock.

| spear wrote:
| Right, and "per GHz" numbers are also not very useful because you can't just crank up the GHz when you need performance. Even with the same process technology, you can't assume different microarchitectures will max out at the same frequency.

| sebow wrote:
| If I recall correctly, the SiFive Unmatched is still pretty slow compared to ARM (https://www.phoronix.com/scan.php?page=article&item=hifive-u...). Now, this board is not the one in question (P650), but we'll have to watch for upcoming benchmarks (for which I recommend Phoronix).
|
| Obviously you can't even think about comparing it further with Intel & AMD, but when you look at the history of something like ARM (which I believe is 30-40 years old), RISC-V has come a long way pretty fast, and the good thing is that it's a solid choice for the future due to being open.

| sebow wrote:
| Sweet. Are there any resources on transitioning/migrating, or on the differences between x86_64 and RISC-V? Or are the ISAs so drastically different that it's better to just dive in head-first?

| bruce343434 wrote:
| > With a projected score of 11+ SPECInt2006/GHz
|
| That seems to imply a certain integer arithmetic performance, but I wonder what the floating point performance is. They could have just said "X flops".
|
| Comparing to other benchmarks at [1], I have no idea, because they all have denormalized results, so totals, rather than per GHz per core. Nice reporting.
|
| How fast is this thing? Pentium? First-gen i3? Current-gen Ryzen 5? The fact that they are being so obtuse about it leads me to believe performance isn't great.
|
| [1] https://www.spec.org/cgi-bin/osgresults?conf=cint2006;op=dum...
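
The per-GHz arithmetic discussed above can be sketched as follows. This is only a rough illustration: the 11 and 22 SPECInt2006/GHz figures are the ones quoted in the thread, while every clock speed below is a hypothetical placeholder, since nothing in the thread states what the P650 will actually clock at. The linear scaling is also an assumption, and other commenters point out that it does not hold in practice because main memory latency does not scale with the core clock.

```python
# Per-GHz figures as quoted in the thread.
P650_PER_GHZ = 11.0       # SiFive's projected "11+ SPECInt2006/GHz"
FIRESTORM_PER_GHZ = 22.0  # rough Apple Firestorm figure cited in the thread


def total_score(per_ghz: float, clock_ghz: float) -> float:
    """Naive total score: per-GHz score times clock (assumes linear scaling)."""
    return per_ghz * clock_ghz


def clock_to_match(target_total: float, per_ghz: float) -> float:
    """Clock in GHz needed to reach target_total under the same naive assumption."""
    return target_total / per_ghz


if __name__ == "__main__":
    # HYPOTHETICAL clock speeds, chosen only to show the arithmetic.
    for clock_ghz in (2.0, 3.0):
        score = total_score(P650_PER_GHZ, clock_ghz)
        print(f"P650-class core at {clock_ghz} GHz: about {score:.0f}")

    # Clock a P650-class core would need to equal a Firestorm-class core
    # running at a hypothetical 3.0 GHz.
    target = total_score(FIRESTORM_PER_GHZ, 3.0)
    print(f"Clock needed to match: {clock_to_match(target, P650_PER_GHZ)} GHz")
```

Multiplying out like this is what makes a per-GHz figure comparable to the totals listed on spec.org, which is also why the missing clock speed matters so much for judging the announcement.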

| wmf wrote:
| I'd compare it to an Atom "efficiency" core.

| marcodiego wrote:
| Faster than ARM Cortex-A77: https://www.phoronix.net/image.php?id=2021&image=sifive_p650... Performance is comparable to Apple's Icestorm architecture, the 'efficiency' cores in the M1. Considering the Cortex-A710 is the fastest ARM core currently available and its successor will only be available next year, SiFive is just a few years away from real competition starting in an arena currently dominated by ARM.
|
| This will be beautiful to watch.

| [deleted]

| zozbot234 wrote:
| It will be interesting to see a comparison on power efficiency as well as performance. RISC-V implementations have shown a pretty sizeable advantage wrt. power use in the past, and we don't quite know how this advantage holds up in these larger, performance-focused designs.

| dmitrygr wrote:
| > just a few years before real competition starts
|
| Are you assuming the competition will just sit and do nothing?

| GhettoComputers wrote:
| "Good enough" matters more than benchmarks. They can make supercomputers but it doesn't matter to someone who wants a $100 computer.

| dmitrygr wrote:
| All the RISC-V thingies I see today are decidedly not $100. I do see plenty of ARM designs running Linux for under $10, though.

| baybal2 wrote:
| For the first time, this is something genuinely interesting from the RISC-V crowd.

| danielEM wrote:
| Once it gets to the shelves at a reasonable price, I will be happy to work with/on it.
|
| Curious how the IP pricing compares to ARM's in this case, and how much I would need to put on top of it to tape out my own batch of processors.

| snvzz wrote:
| The license to the ISA itself is free.
|
| There are several vendors besides SiFive offering RISC-V cores for licensing. There are even some OSHW cores that can be freely used.
|
| Even if we choose to ignore the technical prowess of being a true 5th generation RISC ISA built with hindsight no other ISA has, what's IMHO a big deal in RISC-V is the mere availability of this market of cores.
|
| It poses a threat to ARM's business model, where ARM licenses cores and the ISA, but nobody other than ARM can license cores to others.

| Teknoman117 wrote:
| As far as OSHW cores go, it's so very nice to be able to throw something together in Verilog and be able to inherit a compiler and not be trampling on someone else's copyright...

| dmitrygr wrote:
| > built with hindsight no other ISA has
|
| Why do all the RISC-V fans conveniently ignore AArch64 when they make statements like this? It was in fact a completely clean new design, based on hindsight, by people who know what they are doing, and with no legacy cruft.

| FullyFunctional wrote:
| I'm a fan of RISC-V, but the freedom is a large part of it. AArch64 _is_ a very well designed ISA and _clearly_ has a lot of benefit of hindsight. The load pair/store pair instructions, the addressing modes, the fixed 32-bit instruction size, etc. It all really helps. I suspect that Apple was actively part of designing it.
|
| I think, however, that RISC-V isn't that much worse, and because of the freedom we will almost certainly see more implementations of RISC-V. I'd be watching Tenstorrent, SiFive, Rivos, Esperanto, and maybe Alibaba/T-Head.

| brucehoult wrote:
| AArch64 obviously _isn't_ a completely clean sheet design. It was constrained by having to execute on the same CPU pipelines as 32 bit code, at least for the first decade or so. And the 32 bit mode has to perform well. There are tens of millions of Raspberry Pi 3s and 4s (and later model Pi 2s) which have 64 bit CPUs but have never seen a 64 bit instruction in their lives. Android phones have been supporting both 32 and 64 bit apps for a long time.
|
| The "by people who know what they are doing" thing is just pure FUD. Sure, ARM employs some competent people, but no more so than IBM, Intel, AMD or the various members of RISC-V International.

| snvzz wrote:
| >Why do all the RISC-V fans conveniently ignore AArch64 when they make statements like this? It was in fact a completely clean new design, based on hindsight, by people who know what they are doing, and with no legacy cruft.
|
| AArch64 seems poorly designed to me.
|
| ARMv7 had Thumb, but for some reason ARMv8 did not incorporate any lessons from that. As a result, code density is bad; ARMv8 binaries are huge.
|
| ARMv9, to be available in chips next year, is just a higher profile of required extensions, and does nothing to fix that.
|
| Ever wonder why M1 needs such a huge L1 cache? Well, now you know.
|
| Considering ARMv9 will be competing against RVA22, I don't have much hope for ARM.

| dmitrygr wrote:
| > for some reason ARMv8 did not incorporate any lessons from that.
|
| I used to think so too, until I asked some more knowledgeable people about it. Turns out the lesson _IS_ that not having it is better. Fixed-size instructions make decoding significantly simpler, making it much easier to build very wide front ends.

| brucehoult wrote:
| A little easier, not much easier. A number of organisations are making very wide RISC-V implementations, and one has already published how their decoder works. It's modular, with each block looking at 48 bits of code (the first 16 overlapping with the previous block) and decoding either two 16 bit instructions, or one aligned 32 bit instruction, or one misaligned 32 bit instruction with a following 16 bit instruction, or one misaligned 32 bit instruction followed by an ignored start of another misaligned 32 bit instruction.
|
| You can put as many of these modules side by side as you want.
| There is a serial dependency between them in that each block has to tell the next block whether its last 16 bits are the start of a misaligned 32 bit instruction or not. That could become an issue with really, really wide decoders, but for something decoding e.g. 16 bytes at a time (4 to 8 instructions) it's not an issue.
|
| There is a trade-off between a little bit of decoder complexity and a lot of improved code density -- but nowhere near to the same extent as, say, x86.

| adrian_b wrote:
| ARMv8 code density is quite good for a fixed-length ISA and is of course much better than that of RISC-V.
|
| RISC-V has only one good feature for code density, the combined compare-and-branch instructions, but even this feature was designed poorly, because it does not have all the kinds of compare-and-branch that are needed, e.g. if you want safe code that checks for overflows, the number of required instructions and the code size explode. Only unsafe code, without run-time checks, can have an acceptable size in RISC-V.
|
| ARMv8 has adequate unused space in the branch opcode map, where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V, in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.
|
| While the combined compare-and-branch instructions of RISC-V are good for code density, because branches are very frequent, the rest of the ISA is bad, and the worst part is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.

| brucehoult wrote:
| I'm not sure how you missed RISC-V's big feature for code density -- the "C" extension, giving it arbitrarily mixed 16 and 32 bit opcodes.
|
| I've heard of that feature before somewhere else. It gave the company that invented it unparalleled code density in their 32 bit systems and propelled them to the heights of success in mobile devices. What was their name?
| Wait .. oh, yes ... ARM.
|
| Why they forgot this in their 64 bit ISA is a mystery. The best theory I can come up with is that they thought the industry had shaken out and amd64 was the only competition they were going to have, ever. Aarch64 does indeed have very good code density for a fixed-length 32 bit opcode ISA, and comes very close to matching amd64. They may have thought that was going to be good enough.
|
| Note: the RISC-V "C" extension is technically optional, but the only CPU cores I know of that don't implement it are academic toys, student projects, and tiny cores for use in FPGAs where they are running programs with only a few hundred instructions in them. Once you get over even maybe 1 KB of code it's cheaper in resources to implement "C" than to provide more program storage.

| zozbot234 wrote:
| The thing with lack of shifted indexed addressing is that it just might not matter all that much beyond toy examples. Address calculations can generally be folded in with other code, particularly in loops which are a common case. So it's only rarely that you actually need those extra instructions.

| adrian_b wrote:
| Shifted indexed addressing is needed more seldom, but indexed addressing, i.e. register + register, is needed in every loop that accesses memory.
|
| There are 2 ways of programming a loop that addresses memory with a minimum of instructions.
|
| One way, which is preferable e.g. on Intel/AMD, is to reuse the loop counter as the index into the data structure that is accessed, so each load/store needs a base register + index register addressing, which is missing in RISC-V.
|
| The second way, which is preferable e.g. on POWER and which is also available on ARM, is to use an addressing mode with auto-update, where the offset used in loads or stores is added into the base register. This is also missing in RISC-V.
|
| Because neither of the 2 methods works in RISC-V with a minimum number of instructions, like in all other CPUs, all such loops, which are very frequent, need pairs of instructions in RISC-V, corresponding to single instructions in the other CPUs.

| brucehoult wrote:
| A big difference here is that the RISC-V instructions are usually all 16 bits in size while the AArch64 and POWER instructions are all 32 bits in size. So the code size is the same.
|
| Also, high performance AArch64 and POWER implementations are likely to be splitting those instructions into two decoupled uops in the back end.
|
| Performance-critical loops are unrolled on all ISAs to minimise loop control overhead and also to allow instructions to be scheduled around the several-cycle latency of loads from even L1 cache. When you do that, indexed addressing and auto-update addressing are still doing both operations for every load or store, which, as well as being a lot of operations, introduces sequential dependencies between the instructions. The RISC-V way allows the use of simple load/store with offset -- all of which are independent of each other -- with one merged update of each pointer at the end of the loop. POWER and AArch64 compilers for high performance microarchitectures use the RISC-V structure for unrolled loops anyway.
|
| So indexed addressing and auto-update addressing give no advantage for code size, and don't help performance at the high end.

| snvzz wrote:
| >in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.
|
| Many things could be said about ARMv8, but that it has good code size is not one of them. It does, in fact, have abysmal code density. Both RISC-V and x86-64 produce significantly smaller binaries. For RISC-V, we're talking about a 20% reduction in size.
|
| There's a wealth of papers on this, but you can verify this trivially yourself, by either compiling binaries for different architectures from the same sources, or comparing binaries in Linux distributions that support RISC-V and ARM.
|
| >where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V
|
| If your argument is that ARMv8 could get better over time, I hate to be the bearer of bad news. ARMv9 code density isn't any better.
|
| >and the worst is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.
|
| These patterns are standardized, and they become one instruction after fusion.
|
| RISC-V, unlike the previous generation of ISAs, was thoroughly designed with hindsight on fusion. The simplest microarchitectures can of course omit it altogether, but the cost of fusion in RISC-V is low; I have seen it quoted at 400 gates.

| brucehoult wrote:
| Instruction fusion is a possibility for the future, which has been discussed academically, but no one implements it at present. I'm not sure anyone will -- it's too much complexity for simple cores, and not needed for big OoO cores.
|
| The one fusion implementation I'm aware of is the SiFive 7-series, which fuses a conditional branch that jumps forward over exactly one instruction with the instruction it skips. It turns the instruction pair into predicated execution.
|
| I agree with everything else. In particular the code density. Anyone can download Ubuntu or Fedora images for the same release for amd64, arm64, and riscv64. Mount them and run "size" on any selection of binaries you want. The RISC-V ones are consistently and significantly smaller than the other two, with arm64 the biggest.

| pohl wrote:
| _Ever wonder why M1 needs such huge L1 cache?
| Well, now you know._
|
| I'm not sure I follow this, but it reminds me to ask: does RISC-V allow for designs to have both efficiency & performance cores like the ARM big.LITTLE concept? Has anyone made one yet?

| brucehoult wrote:
| Of course you can do it. SiFive has been allowing customers to configure core complexes with a mixture of different core types for years -- for example mixing U84 cores with U74 or U54. If you want to do a big.LITTLE thing with transferring a running program from one core type to another, that's just a software thing -- and a matter of using cores with the same ISA but different microarchitecture.
|
| To date the examples of this that have been shipped to the public have used cores with similar microarchitecture, but a different set of extensions.
|
| For example, the U54-MC in the HiFive Unleashed and in the Microsemi PolarFire SoC FPGAs uses four U54 cores plus one E51 core for "real time" tasks. The E51 doesn't have an FPU or MMU or Supervisor mode. The U74-MC in the HiFive Unmatched is similar.
|
| Alibaba's ICE SoC, which you may have seen videos of running Android, has two C910 out-of-order cores (similar to ARM A72/A73) implementing RV64GC, and a third C910 core that also has a vector processing unit with two pipes with a 256 bit vector ALU each, plus 128 bit vector load and store pipes.

| [deleted]

| fartcannon wrote:
| So I guess we should expect to hear a lot of FUD about RISC-V over the coming years.

| marcodiego wrote:
| No need to wait. Already happened in 2018: https://www.theregister.com/2018/07/10/arm_riscv_website/
|
| https://www.extremetech.com/wp-content/uploads/2018/07/arm-r...

| snvzz wrote:
| And it is how many learned about RISC-V's existence.
|
| It will be a PR disaster long remembered. One for the textbooks.

| snvzz wrote:
| This is a real possibility, albeit a sad one.
|
| No amount of FUD will save ARM. Only pivoting into a different business model could.

| duskwuff wrote:
| Honestly, ARM is fine. They're no longer the only game in town, but they've still got a huge head start.

| snvzz wrote:
| They'll be fine if they focus on their microarchitectures rather than the ISA (where IMHO they've already lost), and make the process for obtaining a license much more streamlined; I've heard it takes no less than 18 months of long negotiations to license anything from ARM. That's not sustainable now that there's competition.

| duskwuff wrote:
| That's already where their focus is. Most of ARM's customers are licensing specific cores from ARM, not the ISA as a whole.

| jaas wrote:
| Who exactly are the customers for this chip?
___________________________________________________________________
(page generated 2021-12-02 23:01 UTC)