[HN Gopher] "Risc V greatly underperforms" ___________________________________________________________________ "Risc V greatly underperforms" Author : oxxoxoxooo Score : 187 points Date : 2021-12-02 18:51 UTC (4 hours ago) (HTM) web link (gmplib.org) (TXT) w3m dump (gmplib.org) | snvzz wrote: | I don't think they even tried to read the ISA spec documents. If | they did, they would have found that the rationale for most of | these decisions is solid: Evidence was considered, all the | factors were weighted, and decisions were made accordingly. | | But ultimately, the gist of their argument is this: | | >Any task will require more Risc V instructions that any | contemporary instruction set. | | Which is easy to verify as utter nonsense. There's not even a | need to look at the research, which shows RISC-V as the clear | winner in code density. It is enough to grab any Linux | distribution that supports RISC-V and look at the size of the | binaries across architectures. | theresistor wrote: | > I don't think they even tried to read the ISA spec documents. | If they did, they would have found that the rationale for most | of these decisions is solid: Evidence was considered, all the | factors were weighted, and decisions were made accordingly. | | It's perfectly possible to have read the spec and _disagree_ | with the rationale provided. RISC-V is in fact the outlier | among ISAs in many of these design decisions, so there 's a | heavy burden of proof to demonstrate that making the contrary | decisions in many cases was the right call. | | > Which is easy to verify as utter nonsense. There's not even a | need to look at the research, which shows RISC-V as the clear | winner in code density. It is enough to grab any Linux | distribution that supports RISC-V and look at the size of the | binaries across architectures. | | This doesn't seem to be true when you actually do an apples-to- | apples comparison. 
| | Take as an example the build of Bash in Debian Sid | (https://packages.debian.org/sid/shells/bash). I chose this | because I'm pretty confident there's no functional or build-dependency difference that will be relevant here. Other | examples like the Linux kernel are harder to compare because | the code in question is different across architectures. I saw | the same trend in the GCC package, so it's not an isolated | example.
| riscv64 installed size: 6,157.0 kB
| amd64 installed size: 6,450.0 kB
| arm64 installed size: 6,497.0 kB
| armhf installed size: 6,041.0 kB
| | RV64 is outperforming the other 64-bit architectures, but | underperforming 32-bit ARM. This is consistent with | expectations: amd64 has a size penalty due to REX bytes, arm64 | got rid of compressed instructions to enable higher | performance, and armhf (32-bit) has smaller constants embedded | in the binary. | | Compressed instructions definitely _do_ work for making code | smaller, and that's part of why arm32 has been very successful | in the embedded space, and why that space hasn't been rushing | to adopt arm64. For arm32, however, compressed instructions | proved to be a limiting factor on high-performance | implementation, and arm64 moved away from them because of it. | _Maybe_ that's due to some particular limitations of arm32's | compressed instructions that RISC-V compressed instructions | won't suffer from, but that remains to be proven. | mianos wrote: | Probably because, like most applications, that one does not | have a lot of wide multiplications. It is hard not to turn | this point into an insult aimed at the OP. | btdmaster wrote: | Unfortunately, it seems that, at least for gmp, the shared | objects balloon in comparison to all other architectures. It | is about three times bigger (6,000 kB instead of 2,000 kB): | https://packages.debian.org/sid/libgmp-dev. I am hopeful that | this may improve with extensions, though I know little about | the details.
| jepler wrote:
|       text    data  bss     dec    hex  filename
|     311218    2284   36  313538  4c8c2  arm-linux-gnueabihf/libgmp.so.10
|     374878    4328   56  379262  5c97e  riscv64-linux-gnu/libgmp.so.10
|     480289    4624   56  484969  76669  aarch64-linux-gnu/libgmp.so.10
|     511604    4720   72  516396  7e12c  x86_64-linux-gnu/libgmp.so.10
| | Strange, that's not what I see. | xiphias2 wrote: | A company creating embedded RISC-V CPUs has also added some | extra instruction set extensions that conflict with the | floating-point instructions, though. | adrian_b wrote: | The size of the files can be very misleading, because a large | part of the files can be filled with various tables with | additional information, with strings, with debugging | information, with empty spaces left for alignment to page | boundaries and so on. So the size of the installed files is | not necessarily correlated with the code size. | | To compare the code sizes, you need tools like "size", | "readelf" etc., and the data given by the tools should still | be studied, to see how much of the code sections really | contain code. | | I have never yet seen a program where the RISC-V | variant is smaller than the ARMv8 or Intel/AMD variant, and I | doubt very much that such a program can exist. Except for the | branches, where RISC-V frequently needs only 4 bytes instead | of 5 bytes for Intel/AMD or 8 bytes for ARMv8, for all the | other instructions it is very frequent to need 8 bytes for | RISC-V instead of 4 bytes for ARMv8. | | Moreover, choosing compiler options like -fsanitize for | RISC-V increases the number of instructions dramatically, | because there is no hardware support for things like overflow | detection. | justinpombrio wrote: | So (i) the research on RISC-V that shows it has dense code | is bunk, and (ii) the fact that it compiles to a smaller | binary is irrelevant, and (iii) it sounds like you're | saying in advance that it might also have smaller code | section sizes within the binary but that's irrelevant too.
| | And yet you're quite confident that RISC-V has poor code | density. So you clearly have a source of knowledge that | others don't. If it's a blog/article/research, could you | share a link? If it's personal experimentation, you should | write a blog post; I would totally read that. | pierrebai wrote: | Re-read what was written. He is saying exactly that the | RISCV code size is larger, but to see it you need the | right tool used the right way to actually look at the | code, not debug info, constant sections, etc. | snvzz wrote: | I would hope there's something more to his reasoning. | | There are tables showing values for just the code, with | RISC-V beating aarch64 and x86-64 by an ample margin, in | this very discussion. | ant6n wrote: | Perhaps thumb2 makes an 8-wide decode much harder. Plus, then | you can't have 32 registers instead of 16. | akiselev wrote: | _> RISC-V is in fact the outlier among ISAs in many of these | design decisions, so there's a heavy burden of proof to | demonstrate that making the contrary decisions in many cases | was the right call._ | | Genuinely asking, _why_? Do we think RISC-V should, or even | _could_, try to compete against the AMD/Intel/ARM behemoths | on their playing field? Obviously ISAs are a low-level detail | and far removed from the end product, but it feels like the | architectural decisions we are "stuck with" today are | inextricably intertwined with their contemporary market | conditions and historical happenstance. It feels like all the | experimental architectures that lost to x86/ARM (including | Intel's own) were simply too much too soon, before ubiquitous | internet and the open source culture could establish itself. | We've now got companies using genetic algorithms to optimize | ICs and people making their own semiconductors in the 100s of | microns range in their garages - maybe it's time to rethink | some things!
| | (EE in a past life but little experience designing ICs so I | feel like I'm talking out of my rear end) | lonjil wrote: | > Genuinely asking, why? Do we think RISC-V should, or even | could, try to compete against the AMD/Intel/ARM behemoths | on their playing field? | | Well, it's exactly what many RISC-V folks are trying to do. | There's news about a new high-performance RISC-V core on | the HN front page right now! | | > but it feels like the architectural decisions we are | "stuck with" today are inextricably intertwined with their | contemporary market conditions and historical happenstance. | It feels like all the experimental architectures that lost | to x86/ARM (including Intel's own) were simply too much too | soon, | | I just want to note that ARM64 was a mostly clean break | from prior versions of ARM. Basically a clean-slate design | started in the late 2000s. It's a modern design built with | the same hindsight and approximate market conditions | available to the designers of RISC-V. | audunw wrote: | Yeah, I'm not sure he takes into consideration compressed | instructions, which can be used anywhere, rather than being a | separate mode like Thumb on ARM. | | Fusing instructions isn't just theoretical either. I'm pretty | sure it is or will be a common optimisation for CPUs aiming for | high performance. How exactly are two easily-fused 16-bit | instructions worse than one 32-bit one? Is there really a | practical difference other than the name of the instruction(s)? | | At the same time, the reduced transistor count you get from a | simpler instruction set is not a benefit to be just dismissed | either. I'm starting to see RISC-V cores being put all over the | place in complex microcontrollers, because they're so damn | cheap, yet have very decent performance. I know a guy | developing a RISC-V core.
He was involved with the proposal for | a couple of instructions that would put the code density above | Thumb for most code, and the performance of his core was better | than Cortex-M0 at a similar or smaller gate count. I'm not sure | if the instructions were added to the standard or not, though. | | Even for high-performance CPUs, there's a case to be made for | requiring fewer transistors for the base implementation. It | makes it easier to make low-power, low-leakage cores for the | heterogeneous architectures (big.LITTLE, M1, etc.) which are | becoming so popular. | robert_foss wrote: | So how would you suggest rewriting their example in less than | 6 instructions for RISC-V? x86/ARM both have instructions that | include the carry operation for long additions, and only | require 2 instructions. | jolmg wrote: | I don't even see the issue. RISC-V is supposed to be a RISC-type ISA. It's in the very name. That it takes more | instructions when compared to a CISC-type ISA like x86 is | completely normal. | | https://en.wikipedia.org/wiki/Reduced_instruction_set_comput... | theresistor wrote: | The argument for RISC instructions (in high-performance | architectures) is that the faster decode makes up for the | increase in instruction count. The problem is that a faster | decode has a practical ceiling on how much faster it's | going to make your processor, and it's much lower than 3x. | If your workload is bottlenecked on an inner loop that got | 3x larger in instruction count, no 15% improvement in | decode performance is going to save you. | userbinator wrote: | Larger caches won't help much either; there's an old | article I remember that compares the efficiency of | various ARM, x86, and one MIPS CPU, and while x86 and ARM | were neck-and-neck, the MIPS was dead last in all the | comparisons despite having more cache than the others. | RISC-V is very similar to MIPS.
| snvzz wrote: | Larger caches, as seen in Apple's M1 L1, are one of many | tools to deal with bad code density. | | RISC-V might, at first glance, look similar to MIPS, but | it leads in code density among the 64-bit architectures. | User23 wrote: | > [RISC-V] leads in code density among the 64-bit | architectures. | | You keep baldly asserting this in virtually all of your | very many replies here, with a vague appeal to your own | authority, but you haven't shown anything. Given that the | submission is precisely an example of bad code density, | if you're really here in the service of intellectual | curiosity then please show instead of just telling. | jolmg wrote: | I don't know what the design goals of RISC-V were, but I | would guess performance is not the key goal or at least | not the only goal. It makes more sense that ease of | implementation is a more important goal, if they want to | make adoption easy. That's another argument for favoring | RISC over CISC. | monocasa wrote: | If that's the case, you can always stick a uop cache in | after the decoder. | snvzz wrote: | The number of instructions matters much less if they can be | fused into more complex instructions before execution. | | RISC-V was designed with hindsight on fusion, thus it has | more opportunities for doing it, and doing it at a lower | cost. | | And, due to the very high code density RISC-V has, the | decoder can do its job while not having to look at a huge | window. | theresistor wrote: | To everyone who's saying "But macro-fusion!" in response, | see my comment here: | https://news.ycombinator.com/item?id=29421107 | rbanffy wrote: | I don't think there is anything preventing the processor | from fusing those instructions into a single operation once they are | decoded. | adrian_b wrote: | Instruction fusion is the magical rescue invoked by all | those who believe that the RISC-V ISA is well designed. | | Instruction fusion has no effect on code size, but only on | execution speed.
| | For example, RISC-V has combined compare-and-branch | instructions, while the Intel/AMD ISA does not have such | instructions; however, all Intel & AMD CPUs fuse the compare-and-branch instruction pairs. | | So there is no speed difference, but the separate compare | and branch instructions of Intel/AMD remain longer, at 5 | bytes, instead of the 4 bytes of RISC-V. | | Unfortunately for RISC-V, this is the only example | favorable for it, because for a large number of ARM or | Intel/AMD instructions RISC-V needs a pair of instructions | or even more instructions. | | Fusing instructions will not help RISC-V with the code | density, but it is the only way available for RISC-V to | match the speed of other CPUs. | | Even if instruction fusion can enable an adequate speed, | implementing such decoders is more expensive than | implementing decoders for an ISA that does not need | instruction fusion for the same performance. | audunw wrote: | > Even if instruction fusion can enable an adequate | speed, implementing such decoders is more expensive than | implementing decoders for an ISA that does not need | instruction fusion for the same performance | | I'm very skeptical that a RISC-V decoder would be much | more complex than an x86 one, even with instruction | fusion. For the simpler fusion pairs, decoding the fused | instructions wouldn't be more complex than matching some | of the crazy instruction encodings in x86. | | For ARM I'm not so sure, but RISC-V does have very | significant instruction decoding benefits over ARM too, | so my guess would be that they'd be similar enough. | snvzz wrote: | >Unfortunately for RISC-V, this is the only example | favorable for it, because for a large number of ARM or | Intel/AMD instructions RISC-V needs a pair of | instructions or even more instructions. | | Yet, as many pointed out to you already, RISC-V has the | highest code density of all contemporary 64-bit | architectures. And aarch64, which you seem to like, is | beyond bad.
| | >but it is the only way available for RISC-V to match the | speed of other CPUs. | | Higher code density and the lack of flags help the decoder a | great deal. This means it is far cheaper for RISC-V to keep | execution units well fed. It also enables smaller caches | and consequently higher clock speeds. It's great for | performance. | | This, if anything, makes RISC-V the better ISA. | | >Even if instruction fusion can enable an adequate speed, | implementing such decoders is more expensive than | implementing decoders for an ISA that does not need | instruction fusion for the same performance | | Grasping at straws. RISC-V has been designed for fusion, | from the get-go. The cost of doing fusion with it has | been quoted to be as low as 400 gates. This is something | you've been told elsewhere in the discussion, but that | you chose to ignore, for reasons unknown. | avianes wrote: | I see that you are pretty active here in debunking anti-RISC-V attacks, thanks for that! There are a bunch of | poor criticisms about RISC-V. | | > This is something you've been told elsewhere in the | discussion, but that you chose to ignore, for reasons | unknown. | | I would call it RISC-V bashing. | | Everyone loves to hate RISC-V, probably because it's new | and heavily hyped. | | It is really common to see irrelevant and uninformed | criticism of RISC-V. The article, which seems to be | enjoyed by the HN audience, literally says: "I believe | that an average computer science student could come up | with a better instruction set than Risc V in a single | term project". How can anyone say such a thing about a | collaborative project of more than 10 years, fed by many | scientific works and projects and many companies in the | industry? | | I do not mean that RISC-V is perfect; there are some | points which are sources of debate (e.g. favoring a vector | extension rather than classic SIMD is a source of | interesting discussion).
But on HN I would appreciate | reading better analyses and more interesting discussions. | socialdemocrat wrote: | You ignore compressed instructions on RISC-V, which 64-bit | ARM does not have. | | And if you compare 32-bit CPUs, then RISC-V has twice as | many registers, reducing the number of instructions needed | to read from and write to memory. | | RISC-V branching takes less space, and so do vector | instructions. There are many cases like that, and they add | up; the end result is that RISC-V has the densest ISA in all | studies when using compressed instructions. | Dylan16807 wrote: | > Even if instruction fusion can enable an adequate | speed, implementing such decoders is more expensive than | implementing decoders for an ISA that does not need | instruction fusion for the same performance | | On the other hand, just splitting up x86 instructions is | very expensive, and decoding in general takes a lot of | work before you even start to do fancy tricks. | lordnacho wrote: | How does the instruction fusion work? It seems to be | mentioned in the article and by a couple of other | commenters. | volta83 wrote: | The CPU executes the two (or more) dependent instructions | "as if" they were one, e.g., in 1 cycle. | | The CPU has a frontend, which has a decoder, which is the | part that "reads" the program instructions. When it | "sees" a certain pattern, like "instruction x to register r | followed by instruction y consuming r", it can treat this | "as if" it was a single instruction if the CPU has | hardware for executing that single instruction (even if | the ISA doesn't have a name for that instruction). | | This allows the people that build the CPU to choose | whether this is something they want to add hardware for. | If they don't, this runs in e.g. 2 cycles, but if they do, | then it runs in 1. A server CPU might want to pay the | cost of running it in 1 cycle, but a microcontroller CPU | might not. | vitno wrote: | Any != All.
There is a difference between synthetic | benchmarks and real-world test cases. | api wrote: | So this person found a pathological case for the RISC-V | instruction set? | adrian_b wrote: | This is not a pathological case, it is normal operation. | | A computer is supposed to compute, but the RISC-V ISA | does not provide everything that is needed for all the | kinds of computations that exist. | | The 2 most annoying missing features are the lack of | support for multi-word operations, which are needed to | compute with numbers larger than 64 bits, and the lack of | support for detecting overflow in operations with | standard-size integers. | | If you either want larger integers or safe computations | with normal integers, the number of RISC-V instructions | needed for implementation is very large compared to any | other ISA. | | While there are people who do a lot of computations with | large numbers, even the other users need such operations | every day. Large-number computations are needed at the | establishment of any Internet connection, for the key | exchange. For software developers, many compilers, e.g. | gcc (which uses precisely libgmp), do computations with | large numbers during compilation, for various kinds of | optimizations related to the handling of constants in the | code, e.g. for sub-expression extraction or for operation | complexity lowering. | | So libgmp or another equivalent large-number library might | be used every time some project is compiled, and also | every time you click on a new link in a browser. | | So this case is not at all pathological, except in the | vision of the RISC-V designers who omitted support for | this case. | | That was a good decision for an ISA intended only for | teaching or for embedded computers, but it becomes a bad | decision when someone wants to use RISC-V outside those | domains, e.g. for general-purpose personal computers.
| panick21_ wrote: | > A computer is supposed to compute, but the RISC-V ISA | does not provide everything that is needed for all the | kinds of computations that exist. | | This is nonsense. You can still do everything you need. | It's just that in some cases the code size is a bit bigger | or smaller. | | And with compressed instructions the difference is not | nearly as big; if you add fusion, the difference is | marginal. | | So really it's not a pathological case, it's a 'slightly | worse' case, and even that is hard to prove in the real | world given the other benefits RISC-V brings that | compensate. | | And we could find 'slightly worse' cases in the opposite | direction if we went looking for them. | | If you gave 2 equally skilled teams 100M and told them to | make the best possible personal computer chip, I would | bet on the RISC-V team winning 90 times out of 100. | waterhouse wrote: | (RISC-V fan here) This is a real-world use case. GMP is a | library for handling huge integers, and adding two huge | integers is one of the operations it performs, and the way | to do that is to add-with-carry one long sequence of word-sized integers into another. It's not synthetic; it's | extremely specialized, but real. | smoldesu wrote: | I don't think you're supposed to. The compiler handles that | stuff; ideally RISC-V is just another compilation target. | masklinn wrote: | Did you misunderstand the issue entirely? | | The context here is the implementation of one of the inner | loops of a high-performance infinite-precision arithmetic | library (GMP); in RISCV the loop has 3x the instruction | count it has in competing architectures. | | "The compiler" is not relevant; this is by design stuff | that the compiler is not supposed to touch, because it's | unlikely to have the necessary understanding to get it as | tight and efficient as possible. | adrian_b wrote: | I am sorry, but saying that RISC-V is a winner in code density | is beyond ridiculous.
| | I am familiar with many tens of instruction sets, from the | first vacuum-tube computers to all the important | instruction sets that are still in use, and there is no doubt | that RISC-V requires more instructions and a larger code size | than almost all of them, for doing any task. | | Even the hard-to-believe "research" results published by RISC-V | developers have always shown worse code density than ARM; the | so-called better results were for the compressed extension, not | for the normal encoding. | | Moreover, the results for RISC-V are hugely influenced by the | programming language and the compiler options that are chosen. | RISC-V has an acceptable code size only for unsafe code; if the | programming language or the compiler options require run-time | checks to ensure safe behavior, then the RISC-V code size | increases enormously, while for other CPUs it barely changes. | | The RISC-V ISA has only 1 good feature for code size, the | combined compare-and-branch instructions. Because there | typically is 1 branch for every 6 to 8 instructions, using 1 | instruction instead of 2 saves a lot. | | Except for this good feature, the rest of the ISA is full of | bad features, which frequently require at least 2 instructions | instead of 1 instruction on any other CPU, e.g. the lack of | indexed addressing, which is needed in any loop that must | access some aggregate data structure, in order to be able to | implement the loop with a minimum number of instructions. | zamadatix wrote: | > Except for this good feature, the rest of the ISA is full | of bad features | | What are your thoughts on the way RISC V handled the | compressed instruction subset? | brandmeyer wrote: | It only addresses a subset of the available registers. | Small revisions in a function which change the number of | live variables will suddenly and dramatically change the | compressibility of the instructions.
| | Higher-level languages rely heavily on inlining to reduce | their abstraction penalty. Profiles which were taken from | the Linux kernel and (checks notes...) _Dhrystone_ are not | representative of code from higher-level languages. | | 3/4 of the available prefix instruction space was consumed | by the 16-bit extension. There have been a couple of | proposals showing that even better density could be | achieved using only 1/2 the space instead of 3/4, but they | were struck down in order to maintain backwards | compatibility. | hajile wrote: | It's not too surprising. Load, store, move, add, subtract, | shift, branch, jump. These are definitely the most common | instructions used. | | Put it side-by-side with Thumb and it also looks pretty | similar (Thumb has a multiply instruction IIRC). | | Put it side-by-side with short x86 instructions, accounting | for the outdated ones, and the list is pretty similar (down | to having 8 registers). | | All in all, when old and new instruction sets are taking | the same approach, you can be reasonably sure it's not the | absolute worst choice. | orra wrote: | > the so-called better results were for the compressed | extension, not for the normal encoding. | | Ignoring RISC-V's compressed encoding seems a rather | artificial restriction. | snvzz wrote: | You seem to be making your whole argument around some facts | which you got wrong. The central points of your argument are | often used in FUD, thus they are definitely worth tackling | here. | | >Even the hard-to-believe "research" results published by | RISC-V developers have always shown worse code density than | ARM | | The code size advantage of RISC-V is not artificial academic | bullshit. It is real, it is huge, and it is trivial to | verify. Just build any non-trivial application from source | with a common compiler (such as GCC or LLVM's clang) and | compare the sizes you get. Or look at the sizes of binaries | in Linux distributions.
| | >the so-called better results were for the compressed | extension, not for the normal encoding. | | The C extension can be used anywhere, as long as the CPU | supports the extension; most RISC-V profiles require it. This | is in stark contrast with ARMv7's Thumb, which was a literal | separate CPU mode. Effort was put into making this very cheap | for the decoder. | | The common patterns where the number of instructions is larger | are made irrelevant by fusion. RISC-V has been thoroughly | designed with fusion in mind, and is unique in this regard. | It is within its rights in calling itself the 5th-generation | RISC ISA because of this, even if everything else is ignored. | | Fusion will turn most of these "2 instructions instead of one" | cases into effectively one instruction from the execution | unit's perspective. There are opportunities everywhere for | fusion; the patterns are designed in. The cost of fusion on | RISC-V is also very low, often quoted as 400 gates, allowing | even simpler microarchitectures to implement it. | theresistor wrote: | > This is in stark contrast with ARMv7's Thumb, which was a | literal separate CPU mode. | | This is disingenuous. arm32's Thumb-2 (which has been | around since 2003) supports both 16-bit and 32-bit | instructions in a single mode, making it directly | comparable to RV32C. | snvzz wrote: | Your statement does not run counter to the one quoted. | | Thumb-2 is better designed than Thumb was, but it is | still a separate CPU mode. | | And it got far less use than it deserved, because of | this. It doesn't do everything, and switching has a | significant cost. This cost is in contrast with RISC-V's | C extension. | lonjil wrote: | The ARMv8-M profile is Thumb-only, so on ARM | microcontroller platforms there is no switching at all, | and it does do everything, or at least everything you | might want to do on a microcontroller, and it has of course | gotten a very large amount of use, considering how widely | deployed those cores are.
| Dylan16807 wrote: | Is thumb-only particularly good for density, compared to | being able to mix instruction sizes? | lonjil wrote: | Thumb has both 16-bit and 32-bit instructions. | Dylan16807 wrote: | Oh, you meant thumb _and_ thumb-2. | lonjil wrote: | "thumb-2" isn't really a thing. It's just an informal | name from when more instructions were being added to | thumb. It's still just thumb. | Taniwha wrote: | The main distinction is that the 16-bit RISCV-C ISA maps | exactly to existing 32-bit RISCV instructions; its | implementation only occurs in the decode pipe stage. | bpye wrote: | The C extension is just that: an extension. A RISC-V core | with the C extension should still support the long encoding | as well. There is no 16-bit variant specified, only 32, 64 | and 128. | | There is an E version of the ISA with a reduced register | set, but this is a separate thing. | brucehoult wrote: | You are mixing up integer register size and instruction | length. | | RISC-V has variants with 32-bit, 64-bit, or (not yet | fully specified or implemented) 128-bit registers. | | RISC-V has instructions of 32-bit length and, optionally | but almost universally, 16-bit length. | dragontamer wrote: | > The RISC-V ISA has only 1 good feature for code size, the | combined compare-and-branch instructions. Because there | typically is 1 branch for every 6 to 8 instructions, using 1 | instruction instead of 2 saves a lot. | | Which isn't really a big advantage, because ARM and x86 | macro-op fuse those instructions together. (That is, those | 2 instructions are decoded and executed as 1 macro-op in | practice.) | | cmp/jnz on x86 is about 4 bytes as well. So 4 bytes on x86 | vs 4 bytes on RISC-V, and 1 macro-op on x86 vs 1 instruction | on RISC-V. | | So they're equal in practice. | | ----- | | ARM is 8 bytes, but macro-op decoded. So 1 macro-op on ARM, | but 8 bytes used up.
| theresistor wrote: | ARM64 has cbz/tbz compare-and-branch instructions that | cover many common cases in a single 4-byte instruction as | well. | audunw wrote: | > I am sorry but saying that RISC-V is a winner in code | density is beyond ridiculous. | | You have no idea what you're talking about. I've worked on | designs with both ARM and RISC-V cores. The RISC-V core | outperforms the ARM core, with a smaller gate count, and has | similar or higher code density in real-world code, depending | on the extensions supported. The only way you get much lower | code density is without the C extension, but I haven't seen | it left unimplemented in a real-world commercial core, and if | it was, I'm sure it was because of a benefit (FPGAs | sometimes use ultra-simple cores for some tasks, and don't | always care about instruction throughput or density). | | It should be said that my experience is in embedded, so yes, | it's unsafe code. But the embedded use case is also the most | mature. I wouldn't be surprised if extensions that help with | safer programming languages were added for desktop/server-class | CPUs, if they haven't been already (I haven't followed the | development of the spec that closely recently). | voz_ wrote: | Textbook example of the kind of hostility and close-mindedness | that is creeping into our beloved site. Why are | we dick measuring? Why are we comparing experience like | this? So much "I" "I" "I"... | | I have no horse in the technical race here, but I certainly | am put off from reading what should be an intellectually | stimulating discussion by the nature of replies like this. | flatiron wrote: | Oh no. We don't maintain this site with these types of | comments no matter your feelings. It's the internet. | Don't get heated! | snvzz wrote: | It was likely instigated by its parent trying to inflate | themselves by citing some credentials, to try to | give their voice more weight.
| | All of it, pretty sad, but I believe we should focus on | the technical arguments and try to put everything else | aside in order to steer the discussion somewhere | more useful. | dataflow wrote: | >> RISC-V has an acceptable code size only for unsafe code | | > You have no idea what you're talking about. | | > It should be said that my experience is in embedded, so | yes, it's unsafe code. | | Just going based off your reply it certainly sounds like | they had at least _some_ idea what they were talking about? | In which case omitting that sentence would probably help. | [deleted] | okl wrote: | A few years ago, I designed my own ISA. At the time I investigated | design decisions in lots of ISAs and compared them. Nothing | in the RISC-V instruction set stood out to me the way, | for example, the SuperH instruction set did, which is remarkably well | designed. | | Edit: Don't get me wrong, I don't think RISC-V is "garbage" or | anything like that. I just think it could have been better. But | of course, most of an architecture's value comes from its | ecosystem and the time spent optimizing and tailoring | everything... | AlotOfReading wrote: | My memories of SuperH are a bit different. Yeah, it's cleaner | than ARM, but the delay slots, hardware division, and the tiny | register file, among other things, made life unnecessarily difficult. A | lot of those design decisions didn't hold up well over time. | okl wrote: | Interesting! From which perspective? Implementing the ISA, | compiler or applications? Did you write machine language by hand or | use a compiler? | AlotOfReading wrote: | Mainly system level and higher, but a bit of all three, I | suppose. I was helping reverse engineer a customized SH | chip and ended up implementing a small VM and optimized | system libraries/utilities afterwards. Most of the time was | spent in assembly, with some machine code and C on either | side. | okl wrote: | Thanks for your insight. | audunw wrote: | Never heard of SuperH.
I see it has branch delay slots, which | is a seemingly clever but terrible idea. It's one of the | reasons RISC-V quickly overtook OpenRISC in popularity, I think. | | Not having anything that stands out is perhaps a good thing. | Being "clever" with the ISA tends to bite you when implementing | OoO superscalar cores. | Teknoman117 wrote: | A bit of a computer history question: I have never looked at the | ISA of the Alpha (referenced in the post), but RISC-V has always | struck me as being nearly identical to (early) MIPS, just without | the HI and LO registers for multiply results and with the addition of | variable-length instruction support, even if the core ISA doesn't | use it. | | MIPS didn't have a flag register either and depended on a | dedicated zero register and slt instructions (set if less than) | dfox wrote: | The no-flags-at-all part is clearly inspired by Alpha, including | the rationale of flags being detrimental to OoO implementation. | | MIPS is a classical RISC design that was not designed to be | OoO-friendly at all and is simply designed for ease of | straightforward pipelined implementation. The reason why it | does not have flags probably simply comes down to the | observation that you don't need flags for C. | userbinator wrote: | Yes, that's exactly my thought every time this comes up; RISC-V | is likely to displace MIPS everywhere performance doesn't | matter, but it'll have a hard time competing with ARM or x86 on | that. | [deleted] | okl wrote: | I bet this article on RISC-V's genealogy will be interesting for | you: https://live-risc-v.pantheonsite.io/wp-content/uploads/2016/... | cpeterso wrote: | Andrew Waterman's thesis ("Design of the RISC-V Instruction | Set Architecture") has a very approachable comparison of | RISC-V to MIPS, SPARC, Alpha, ARMv7, ARMv8, OpenRISC, and | x86: | | https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...
| socialdemocrat wrote: | RISC-V is an opinionated architecture and that is always going to | get some people fired up. Any technology that aims for simplicity | has to make hard choices and trade-offs. It isn't hard to | complain about missing instructions when there are fewer than 100 | of them. Meanwhile nobody will complain about ARM64 missing | instructions because it has about 1000 of them. | | Therein lies the problem. Nobody ever goes out guns blazing | complaining about too many instructions, despite the fact that | complexity has its own downsides. | | RISC-V has been designed aggressively to have a minimal ISA to | leave plenty of room to grow, and to require a minimal number of | transistors for a minimal solution. | | Should this be a showstopper down the road, then there will be | plenty of space to add an extension that fixes this problem. | Meanwhile embedded systems paying a premium for transistors are | not going to have to pay for these extra instructions, as only 47 | instructions have to be implemented in a minimal solution. | throwaway19937 wrote: | TL;DR RISC-V doesn't have add with carry. | | I'm not a fan of the RISC-V design but the presence or absence of | this instruction doesn't make it a terrible architecture. | stephencanon wrote: | _For the purposes of implementing multi-word arithmetic_, which | is Torbjorn's whole deal, it kind of does. (Also the actual | post subject is "greatly underperforms"). | FullyFunctional wrote: | It's meaningless to look at the code in the absence of an | implementation and conclude anything about the performance. | He doesn't know what the performance is. Having six | instructions vs. two does not mean one is 3X faster than the | other. It means nothing at all. | stephencanon wrote: | We know enough about the implementation of current RISC-V | cores to conclude that they won't be remotely competitive | on this one narrow (yet fairly high-impact for some | workloads) task.
Is it _possible_ to design a core that is | competitive on this workload even when handicapped by a | limited ISA? Yes, definitely. Have any RISC-V designers | shown any interest in doing so yet? No. | [deleted] | aappleby wrote: | The author seems to be assuming that the designers have never | thought about this corner case. | sanxiyn wrote: | No, the author is arguing this is not a corner case but a | central case. I tend to agree. | kayamon wrote: | "Gee no carry flag how will we cope?" | pcwalton wrote: | Doesn't RISC-V have an add-with-carry instruction as part of the | vector extension? I see it listed here: | https://github.com/riscv/riscv-v-spec/releases/tag/v1.0 | monocasa wrote: | Afaict that's only for operations on the vector register file. | Most of the complaints about the lack of addc/subc are around | how they're heavily used in JITs for languages that want to | speculatively optimize multi-precision arithmetic into the | integer register file for their regular integer ops. | JavaScript, a lot of Lisps, the MLs all fit into that space. | pcwalton wrote: | Sure, but this email is in the context of GMP, which should | be using the vector extension, no? | monocasa wrote: | I don't think so; most of the users I know of for the | integer side of GMP are compilers/runtimes. An apt rdepends | on the gmp packages in Ubuntu only shows stuff like ocaml, | and I know gcc vendors it. | | Edit: Another place you see this kind of arithmetic is | crypto, but those specific use cases (Diffie-Hellman, RSA, | a few others) don't tend to be vectorized. You have one op | you're trying to work through with large integers, and | there's the carry dependency on each partial op. The | carry-dependent crypto algorithms aren't typically vectorisable. | Shadonototra wrote: | Who changed the title? | | Moderators, where are you? | CalChris wrote: | TL;DR My code snippet results in bloated code for | RISC-V RV64I. | | I'm not sure how bloated it is.
All of those instructions will | compress [1]. | | [1] https://riscv.org/wp-content/uploads/2015/05/riscv-compresse... | | It's slower on RISC-V but not a lot on a superscalar. The x86 and | ARMv8 snippets have 2 cycles of latency. The RISC-V has 4 cycles | of latency:
        1. add  t0, a4, a6    add  t1, a5, a7
        2. sltu t6, t0, a4    sltu t2, t1, a5
        3. add  t4, t1, t6    sltu t3, t4, t1
        4. add  t6, t2, t3
| | I'm not getting _terrible_ from this. | Koffiepoeder wrote: | CPU performance increases nowadays are often measured in single- | digit percentages because the margins have become so thin. Doubling | the cycles is a 100% increase. You can call that not so | bloated, but I think many people would beg to differ. | | On the other hand I take this article with a grain of salt | anyhow, since it only discusses a single example. I think we | would need a lot more optimized assembly snippet comparisons to | make meaningful conclusions (and even then there could be | author selection bias). | snvzz wrote: | The article's approach to arguing against RISC-V is fairly | childish. | | >"here's this snippet, it takes more instructions on RISC-V, | thus RISC-V bad" | | Is pretty much what it's saying. An actual argument about ISA | design would weigh the cost this has against the advantages of | not having flags, provide a body of evidence and draw | conclusions from it. But, of course, that would be much | harder to do. | | What's comparatively easy, and what they should have done, however, | is to read the ISA specification. Alongside the decisions | that were made, there's a rationale to support them. Most of | these choices, particularly the ones often quoted in FUD | as controversial or bad, have a wealth of papers, backed by | plentiful evidence, behind them. | marcodiego wrote: | So, how meaningful is the "projected score of 11+ | SPECInt2006/GHz" as claimed here: | https://www.sifive.com/press/sifive-raises-risc-v-performanc... ?
| bell-cot wrote: | Rather than glib hand-waving in front of the chalkboard... are | there a decent piece or few of RISC-V hardware which could | actually be compared to non-RISC-V hardware with similar budgets | (for design work, transistor count, etc.) - to see how things | work out when running substantial pieces of decently-compiled | code? | yjftsjthsd-h wrote: | The original title was "Risc V greatly underperforms", which | seems like a far more defensible and less inflammatory claim than | "Risc V is a terrible architecture", which was picked from the | actual message but still isn't the title. | gary_0 wrote: | I almost skipped this thread because of the flamebait title. | This is a debate over CPU instruction set performance details, | nobody is going to die. | yjftsjthsd-h wrote: | In fairness, this is Hacker News; flame wars^w^w respectful | but intense debate over editors, operating systems, and, yes, | ISA details, is somewhat expected. (Although, yes, I'm not | sure that I would get too worked up about this particular | detail; even if the stated claim is 100% true and | unmitigated, it means _some_ kinds of code will have | potentially bigger binaries. I understand a math library | person caring, I don't think I care.) | dang wrote: | Flamewars are definitely not expected - they're against the | rules and something we try to dampen in every way we know. | | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... | | https://news.ycombinator.com/newsguidelines.html | rbanffy wrote: | > I understand a math library person caring, I don't think | I care. | | Not wasting much sleep on this one. Not sure there's | anything in the spec that stops implementations from | recognizing the two instructions and fusing them into a | single atomic operation for the backends to deal with. | It'll occupy more space in the L1 cache, but that's it.
| Dylan16807 wrote: | I would say that "underperforms" is indefensible from such a | simple analysis that doesn't touch IPC. "Terrible" is at least | openly an opinion. | dang wrote: | Fixed now. Thanks! | xondono wrote: | Experimenting with RISC-V is one of those things I keep | postponing. | | For those who are more versed, is this really a general problem? | | I was under the impression that the real bottleneck is memory, | and things like this would be fixed in real applications through | out-of-order execution, and that it paid off having simpler | instructions because compilers had more freedom to rearrange | things. | fwsgonzo wrote: | RISC-V is completely fine, heavily based on research and well | thought out. It does have pros and cons like any other | architecture, and for what it does well, it does it really | well! | nynx wrote: | If this really is an issue, I imagine RISC-V could easily get an | extension for adding/subtracting/etc SIMD vectors together in a | way that would expand to the capabilities of the underlying processor | without requiring hardcoding. | kayamon wrote: | It already has this. | nynx wrote: | Yes, the SIMD extension has the flexible vectors thing. I | don't think it has a way to treat SIMD vectors as bigints. | kelnos wrote: | > _My conclusion is that Risc V is a terrible architecture._ | | Kinda stopped reading here. It's a pretty arrogant hot take. I | don't know this guy, maybe he's some sort of ISA expert. But it | strains credulity that after all this time and work put into it, | RISC-V is a "terrible architecture". | | My expectation here is that RISC-V requires some inefficient | instruction sequences in some corners somewhere (and one of these | corners happens to be OP's pet use case), but by and large things | are fine. | | And even then, I don't think that's clear. You're not going to | determine performance just by looking at a stream of instructions | on modern CPUs.
Hell, it's really hard to compare streams of | instructions from _different ISAs_. | SavantIdiot wrote: | A bit off topic, but when did a DWORD implicitly become 64 bits? | robertlagrant wrote: | I heard the bird was DWORD. | fhood wrote: | Oh wow, everybody else is debating the specific intricacies of | the design decisions, and I'm here wondering why you would | complain about not enough instructions in an architecture with | "RISC" in the name. | adrian_b wrote: | The RISC idea was to not include in the ISA instructions so | complex that they would require a multi-cycle implementation. | | The minimum duration of the clock cycle of a modern CPU is | essentially determined by the duration of a 64-bit integer | addition/subtraction, because such operations need a latency of | only 1 clock cycle to be useful. | | Operations that are more complex than 64-bit integer | addition/subtraction, e.g. integer multiplications or | floating-point operations, need multiple cycles, but they are pipelined | so that their throughput remains at 1 per cycle. | | So 64-bit addition/subtraction is certainly expected to be | included in any RISC ISA. | | The hardware adders used for addition/subtraction provide, at a | negligible additional cost, 2 extra bits, carry and overflow, | which are needed for operations with large integers and for | safe operations with 64-bit integers. | | The problem is that the RISC-V ISA does not offer access to | those 2 bits, and generating them in software requires a very | large cost in execution time and in lost energy in comparison | with generating them in hardware. | | I do not see any relationship between these bits and the RISC | concepts; omitting them does not simplify the hardware, but it | makes the software more complex and inefficient. | samstave wrote: | So a few decades ago.. I knew a guy who was one of the chief | designers of RISC procs at MIPS | | Yeah - he was addicted to prostitutes...
| | This guy was doing amazing engineering work and we talked at | length about designing a system for basically what became | rack-mount trays. | | But he was so distracted by his addiction to prostitutes... | jasonhansel wrote: | One thing that bothers me: RISC-V seems to use up a lot of the | available instruction set space with "HINT" instructions that | nobody has (yet) found a use for. Is it anticipated that all of | the available HINTs will actually be used, or is the hope that | the compressed version of the instruction set will avoid the | wasted space? | jpfr wrote: | The idea is to use the compressed instruction extension. Then two | adjacent instructions can be handled like a single "fat" | instruction with a special-case implementation. | | That allows more flexibility for CPU designs to optimize | transistor count vs speed vs energy consumption. | | This guy clearly did not look at the stated rationale for the | design decisions of RISC-V. | theresistor wrote: | Compressed instructions and macro-fusion aren't magical | solutions. It's not always possible to convince the compiler to | generate the magical sequence required, and it actually makes | high-performance implementations (wide superscalar) more | difficult thanks to the variable width decoding. | | Beyond that, compressed instructions are _not_ a 1:1 substitute | for more complex instructions, because a pair of compressed | instructions cannot have any fields that cross the 16-bit | boundary. This means you can't recover things like larger | load/store offsets. | | Additionally, you can't discard architectural state changes due | to the first instruction. If you want to fuse an address | computation with a load, you still have to write the new | address to the register destination of the address computation. | If you want to perform clever fusion for carry propagation, you | still have to perform all of the GPR writes.
This is work that | a more complex instruction simply wouldn't have to perform, and | again it complicates a high performance implementation. | panick21_ wrote: | Part of the idea is to create standard ways to do certain | things and then hope compiler writers generate code | accordingly. That will allow more chip designers to | take advantage of those if they want to. | | They spent a lot of time and effort on making sure the | decoding is pretty good and useful for high performance | implementations. | | RISC-V is designed for very small and very large systems. At | some point some tradeoffs need to be made, but these are very | reasonable and most of the time not a huge problem. | | For the really specialized cases where you simply can't live | with those extra instructions, those will be added to the | standard and then some profiles will include them and others | not. If those instructions are really as vital as those that | want them claim, they will find their way into many profiles. | | Saying RISC-V is 'terrible' because of those choices is not a | fair way of evaluating it. | userbinator wrote: | _RISC-V is designed for very small and very large systems_ | | That's exactly the problem --- there is no one-size-fits-all | when it comes to instruction set design. | panick21_ wrote: | There is a trade-off, but there is overall far more value | in having it be unified. | | The trade-offs are mostly very small or nonexistent once | you consider the standard extensions that different use | cases will have. | | Overall, having a unified open instruction set is far | better than hand-designing many different instruction | sets just to get a marginal improvement. Some really | extreme applications might require that, but for the most | part the whole industry could do just fine with RISC-V. | Both on the low and on the high end, and in fact better | than most of the alternatives, all things considered.
| | If integer overflow checking really is the be-all and end-all, and | RISC-V cannot be successful without it, it | will be added and it will be pulled into all the | profiles. If it is not actually that relevant, then it | won't. If it is very useful for some verticals and not | others, it will be in those profiles and not in others. | jpfr wrote: | In the context of gmp, people write architecture-specific | assembly for the inner loop anyway. | | Besides that, you raise good points on sources of complexity. | I'm waiting for the benchmarks once such developments have | been incorporated. Everything else is guesswork. | audunw wrote: | > and it actually makes high-performance implementations | (wide superscalar) more difficult thanks to the variable | width decoding. | | More difficult than x86? We're talking about a damn simple | variable width decoding here. | | I could imagine RISC-V with the C extension being more tricky | than 64-bit ARM. Maybe. | | > and again it complicates a high performance implementation. | | But so much of the rationale behind the design of RISC-V is | to simplify high performance implementation in other ways. So | the big question is what the net effect is. | | The other big question is whether extensions will be added to | optimise for desktop/server workloads by the time RISC-V CPUs | penetrate that market significantly. | okl wrote: | The sweet spot seems to be 16-bit instructions with 32/64-bit | registers. With 64-bit registers you need some clever way to | load your immediates, e.g., like the shift/offset in ARM | instructions. | msbarnett wrote: | He literally addressed this, albeit obliquely, in the message: | | > I have heard that Risc V proponents say that these problems | are known and could be fixed by having the hardware fuse | dependent instructions. Perhaps that could lessen the | instruction set shortcomings, but will it fix the 3x worse | performance for cases like the one outlined here?
| | Macro-fusion can _to some extent_ offset the weak instruction | set, but you're never going to get an integer-multiple | speedup out of it, given the complexity of inter-op | architectural state changes that have to be preserved and the | instruction boundary limitations involved; it's never going to | offset a 3x blowup in instruction count in a tight loop. | socialdemocrat wrote: | Fusing 3 instructions is not unusual, and those could also have | been compressed. Thus you have no more microcode to execute | and only 50% more cache usage rather than 300%. | alerighi wrote: | Even if you do so, the program size is still bigger, and it | consumes more disk, RAM and most importantly cache space. | Wasting cache on multiple instructions when on another | architecture it's done by only one doesn't make particular | sense to me. | | Also, it's said that x86 is bad because the instructions are | reorganized and translated inside the CPU. But it seems | that you are proposing the same: a CPU that preprocesses the | instructions and fuses some into a single one (the opposite | of what x86 does). At that point, it seems to me that what x86 | does makes more sense: have a ton of instructions (and thus | smaller programs and thus more code that can fit in cache) and | split them, rather than having a ton of instructions (wasting | cache space) only for the CPU to combine them into a single one | (a thing that a compiler can also do). | Buttons840 wrote: | How many cache misses are for program instructions, versus | data misses? | anarazel wrote: | IME icache misses are a frequent bottleneck. There's plenty of | code where all the time is spent in one tight inner loop | and thus the icache is not a constraint, but there's also a | lot of cases with a much flatter profile, where icache | misses suddenly become a serious constraint. | alerighi wrote: | Depends on the application.
But even if they are few, it's | not a good reason to have them just to get a nicer | instruction set, which, if you are not writing assembly by | hand (and nobody does these days), doesn't give you any | benefit. | | Also, don't reason with the desktop or server use case in | mind, where you have TB of disk and code size doesn't | matter. RISC-V is meant to be used also for embedded | systems (in fact its use nowadays is only in these | systems), where usually code size matters more than | performance (i.e. you typically compile with -Os). In these | situations more instructions mean more flash space wasted, | meaning you can fit less code. | rbanffy wrote: | > which, if you are not writing assembly by | hand (and nobody does these days), doesn't give you any | benefit. | | An elegant architecture is easier to reason about. | Compilers will make fewer wrong decisions, fewer bugs | will be introduced, and fewer workarounds will need to be | implemented. An architecture that's simple to reason | about is an invaluable asset. | socialdemocrat wrote: | x86 also does macro fusion. The difference is RISC-V was designed | for compressed instructions and fusion from the get-go; x86 | bolted this on. | | Anyway, what you gain from this is a very simple ISA, which | helps tool writers, those who implement hardware, and | academia for teaching and research. | | How do the insanely complex x86 instructions help anyone? | ksec wrote: | The unwritten rule of HN: | | You do not criticise The Rusted Holy Grail and the Riscy Silver | Bullet. | rbanffy wrote: | Or, if you do, you'd better be absolutely right, or people will | tear your argument to shreds. | sosodev wrote: | Why do these half-baked slam pieces always make it to the top of | HN? | jgilias wrote: | Many people upvote things not necessarily because they agree | with them, but rather to bump it in hopes that someone with | good insights will chime in in the comments section.
| | This especially applies to potentially controversial things. | bob1029 wrote: | I think the reason is that it ultimately encourages deep and | thoughtful conversation. If nothing controversial were ever | proposed, the motivation for participating and "proving others | wrong" would be lessened. It might not be the healthiest way, but I | certainly find myself putting a lot more thought into my | comments if it's a contrary point or in some broader | controversial context. | | Overall, I feel HN is most fun when a lot of people are in | disagreement but also operating in good faith. | boibombeiro wrote: | Standing our ground, especially when we are wrong, helps us learn | a lot more about the subject. | chillingeffect wrote: | If for no other reason than to quickly formulate | counterarguments. Next time at some meeting or other | get-together, if someone pipes up with an anti-RISC comment, most | people won't be able to quickly refute it. But having had this | discussion here, we're inoculated and able to respond with | intelligence and experience. | okl wrote: | That sounds like you make up your mind first, then look for | arguments that support your position. I'd rather see the | arguments before I come to conclusions. | Avamander wrote: | I want to see what people say against it. | rm445 wrote: | Yeah. I'm not qualified to judge the quality of an instruction | set, but this writer destroyed all credibility with me by | claiming that an undergraduate could design a better | architecture (than this enormous collective effort) in a term. | It's right up there with claiming you could create Spotify in a | weekend or whatever. | gary_0 wrote: | The entire post is full of hyperbole, but the example they | show looks like a legitimate complaint. | rbanffy wrote: | I designed an ISA (and a CPU) as an undergrad, and I assure | you that, while it was very cool (stack-oriented, ASM looked | like Forth), it'd have horrendous performance these days.
| AnIdiotOnTheNet wrote: | You say that as though Design By Committee isn't a thing. | Tuna-Fish wrote: | Except that the problem with RISC-V isn't even design by | committee. Even the most dysfunctional committee would | probably not fall into the pits that RISC-V managed to. The | most credible explanation for its misfeatures I've heard | is just plain bad taste and overly rigid adherence to | principle over practicality by its original designers. | meepmorp wrote: | It's not a "slam piece," it's an email from a listserv, sent | two months ago. Someone realized it'd be catnip for people on | HN and posted it. | rbanffy wrote: | It's not even a rock-solid critique... | [deleted] | Symmetry wrote: | I think talking about ISAs as better or worse than one another is | often a bad idea for the same reason that arguing about whether C | or Python is better is a bad idea. Different ISAs are used for | different purposes. We can point to some specific things as | almost always being bad in the modern world, like branch delay | slots or the way the C preprocessor works, but even then, for | widely employed languages or ISAs, there was a point to it when it | was created. | | RISC-V has a number of places it's employed where it makes an | excellent fit. First of all, academia. For an undergrad | building the netlist for their first processor or a grad student | doing their first out-of-order processor, RISC-V's simplicity is | great for the pedagogical purpose. For a researcher trying to | experiment with better branch prediction techniques, having a | standard high-ish performance open source design they can take | and modify with their ideas is immensely helpful.
And many | companies in the real world with their eyes on the bottom line | like having an ISA where you can add instructions that happen to | accelerate your own particular workload, where you can use a | standard compiler framework outside your special assembly inner | loops, and where you don't have to spend transistors on features | you don't need. | | I'm not optimistic about RISC-V's widescale adoption as an | application processor. If I were going to start designing an open | source processor in that space I'd probably start with IBM's now- | open Power ISA. But there are so many more niches in the world | than just that, and RISC-V is _already_ a success in some of them. | okl wrote: | Branch delay slots are an artifact of a simple pipeline without | speculation. There's nothing inherently "bad" about them. | pm215 wrote: | If you're designing a single CPU that definitely has a simple | pipeline, branch delay slots are maybe justifiable. If you're | designing an architecture which you hope will eventually be | used by many CPU designs which might have a variety of design | approaches, then delay slots are pretty bad, because every | future CPU that _isn't_ a simple non-speculating pipeline | will have to do extra work to fake up the behaviour. This is | an example of a general principle, which is that it's usually | a mistake to let microarchitectural details leak into the | architecture -- they quickly go stale and then both hw and sw | have to carry the burden of them. | oneplane wrote: | All of the discussions about instruction sets and "mine is better | than yours" or "anyone else could do better in a small amount of | time" are useless, considering that those arguments, if true, haven't | actually resulted in any free ISA being available broadly, | embraced broadly, with hardware implementing that ISA being | available.
| | It doesn't matter how great something else could be in theory if | it doesn't exist or doesn't meet the same scale and mindshare (or | adoption). | dragontamer wrote: | Hmmm... I think this argument is solid. Albeit biased from GMP's | perspective, but bignums are used all the time in RSA / ECC, and | probably other common tasks, so maybe it's important enough to | analyze at this level. | | 2 instructions to work with 64 bits, maybe 1 more instruction / | macro-op for the compare-and-jump back up to a loop, and 1 more | instruction for a loop counter of some kind? | | So we're looking at ~4 instructions per 64 bits on ARM/x86, but | ~9 instructions on RISC-V. | | The loop will be performed in parallel in practice, however, due to | out-of-order / superscalar execution, so the discussion inside | the post (2 instructions on x86 vs 7 instructions on RISC-V) | probably is the closest to the truth. | | ---------- | | Question: is ~2 clock ticks per 64 bits really the ideal? I don't | think so. It seems to me that bignum arithmetic is easily SIMD-able. | Carries are NOT accounted for in x86 AVX or ARM NEON | instructions, so x86, ARM, and RISC-V are probably on equal footing there. | | I don't know exactly how to write a bignum addition loop in AVX | off the top of my head. But I'd assume it'd be similar to the | 7 instructions listed here, except... using 256-bit AVX registers | or 512-bit AVX512 registers. | | So 7 instructions to perform 512 bits of bignum addition is | ~73 bits per clock cycle, far superior in speed to the 32 bits | per clock cycle from add + adc (the 64-bit code with implicit | condition codes). | | AVX512 is uncommon, but AVX (256-bit) is common on x86 at least, | leading to ~36 bits per clock tick. | | ---------- | | ARM has SVE, which is ambiguous (sometimes 128 bits, sometimes | 512 bits). RISC-V has a bunch of competing vector instructions. | | .......... | | Ultimately, I'm not convinced that the add + adc methodology here | is best anymore for bignums.
With a wide-enough vector, it seems | more important to bring forth big 256-bit or 512-bit vector | instructions for this use case? | | EDIT: How many bits is the typical bignum? I think add+adc | probably is best for 128, 256, or maybe even 512-bits. But moving | up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say | without me writing code, but just a hunch). | | 2048-bit RSA is the common bignum, right? Any other bignums that | are commonly used? EDIT2: Now that I think of it, addition isn't | the common operation in RSA, but instead multiplication (and | division which is based on multiplication). | Teknoman117 wrote: | Can you treat the whole vector register as a single bignum on | x86? If so, I totally missed that. | dragontamer wrote: | No. | | Which is why I'm sure add / adc will still win at 128-bits, | or 256-bits. | | The main issue is that the vector-add instructions are | missing carry-out entirely, so recreating the carry will be | expensive. But with a big enough number, that carry | propagation is parallelizable in log2(n), so a big enough | bignum (like maybe 1024-bits) will probably be more efficient | for SIMD. | Taniwha wrote: | So this is one tiny corner of the ISA, not something that makes | ALL instruction sequences longer - essentially RISCV has no | condition codes (they're a bit of an architectural nightmare for | everyone doing any more than the simplest CPUs, they make every | instruction potentially have dependencies or anti-dependencies | with every other). 
| | It's a trade off - and the one that's been made is one that makes | it possible to make ALL instructions a little faster at the | expense of one particular case that isn't used much - that's how | you do computer architecture, you look at the whole, not just one | particular case | | RISCV also specifies a 128-bit variant that is of course FASTER | than these examples | monocasa wrote: | > they're a bit of an architectural nightmare for everyone | doing any more than the simplest CPUs, they make every | instruction potentially have dependencies or anti-dependencies | with every other | | It doesn't have to be _that_ bad. As long as condition flags | are all written at once (or are essentially banked like | PowerPC) the dependency issue can go away because they're | renamed and their results aren't dependent on previous data. | | Now, of course, instructions that only update some condition | flags and preserve others are the devil. | sanxiyn wrote: | RISC-V designers optimized for C and found the overflow flag isn't | used much and got rid of it. It was the wrong choice: the overflow | flag is used a lot for JavaScript and any language with | arbitrary-precision integers (including GMP, the topic of OP). | kannanvijayan wrote: | It kind of chafed when I excitedly read the ISA docs and | found that overflow testing was cumbersome. | | That said, I think it's less of an issue these days for JS | implementors in particular. It might have mattered more back | in the day when pure JS carried a lot of numeric compute load | and there weren't other options. These days it's better to | stow that compute code in wasm and get predictable reliable | performance and move on. | | The big pain points in perf optimization for JS are objects | and their representation, functions and their various | type-specializations.
| | Another factor is that JS impls use int32s as their internal | integer representation, so there should be some relatively | straightforward approach involving lifting to int64s and | testing the high half for overflow. | | Still kind of cumbersome. | | There are similar issues in existing ISAs. NaN-boxing for | example uses high bits to store type info for boxed values. | Unboxing boxed values on amd64 involves loading an 8-byte | constant into a free register and then using that to mask out | the type. The register usage is mandatory because you can't | use 64-bit values as immediates. | | I remember trying to reduce code size and improve perf (and | save a scratch register) by turning that into a left-shift | right-shift sequence involving no constants, but that led to | the code executing measurably slower as it introduced data | dependencies. | formerly_proven wrote: | > It kind of chafed when I excitedly read the ISA docs and | found that overflow testing was cumbersome. | | It just feels backwards to me to _increase_ the cost of | these checks in a time where we have realized that | unchecked arithmetic is not a good idea in general. | aidenn0 wrote: | Over just the time I've been aware of things, there's been a | constant positive feedback loop of "checked overflow isn't | used by software, so CPU designers make it less performant" | followed by "Checked overflow is less performant so software | uses it less." | | I wish there was a way out. | | Language features are also often implemented at least partly | because they can be done efficiently on the premiere hardware | for the language. Then new hardware can make such features | hard to implement. | | WASM implemented return values in a way that was different | from register hardware, and it makes efficient codegen of | Common Lisp more challenging. 
This was brought to the | attention of the committee while WASM was still in flux, and | they (perhaps rightfully) decided CL was insufficiently | important to change things. | | I'm sure that people brought up the overflow situation to the | RISC-V designers, and it was similarly dismissed. It's just | unfortunate that legacy software is such a big driver of CPU | features, as that's a race towards lowest-common-denominator | hardware. | audunw wrote: | Which is exactly the right trade-off for embedded CPUs, where | RISC-V is most popular right now. | | If desktop/server-class RISC-V CPUs become more common, it's | not unreasonable to think they'll add an extension that | covers the needs of managed/higher-level languages. | | Even for server-class CPUs you could argue that you | absolutely want this extension to be optional, as you can | design more efficient CPUs for datacenters/supercomputers | where you know what kind of code you'll be running. | [deleted] | roca wrote: | Also Rust applications are increasingly going to be built | with integer overflow checking enabled, e.g. Android's Rust | components are going to ship with integer overflow checking. | And unlike say GMP, that poses a potential code density | problem because we're not talking about inner loops that can | be effectively cached, it's code bloat smeared across the | entire binary. | Taniwha wrote: | Yeah but the code required for an overflow check is just | one extra instruction (3 rather than 2) | masklinn wrote: | For generalised signed addition, the overhead is 3 | instructions _per addition_. It can be one in specific | contexts where more is known about the operands (e.g. | addition of immediates). | | It's always 1 in x64/ARM64 as they have built-in support | for overflow.
| Taniwha wrote: | you have to include the branch instruction too in any | comparison | zozbot234 wrote: | They provide recommended insn sequences for overflow checking | as commentary to the ISA specification, and this enables | efficient implementation in hardware. | throwaway81523 wrote: | > They provide recommended insn sequences for overflow | checking as commentary to the ISA specification, and this | enables efficient implementation in hardware. | | I would like to see some benchmarks of this efficient | implementation in hardware, even simulated hardware, | compared against conventional architectures. | | Even for C, it's a recurring source of bugs and | vulnerabilities that int overflow goes undetected. What we | really need is an overflow trap like the one in IEEE | floating point. RISC-V went the opposite direction. | adrian_b wrote: | Any hardware adder provides almost for free the overflow | detection output (at less than the cost of an extra bit, so | less than 1/64 of a 64-bit adder). | | So anyone who thinks about an efficient hardware | implementation would expose the overflow bit to the | software. | | A hardware implementation that requires multiple additions | to provide the complete result of a single addition can be | called in many ways, but certainly not "efficient". | user-the-name wrote: | > RISCV also specifies a 128-bit variant that is of course | FASTER than these examples | | Is it actually implemented on any hardware? | ncmncm wrote: | No. Mentioning it is only meant to distract. | TheCondor wrote: | Is there a semi-competitive Risc-V core implemented | anywhere? | | It all seems hypothetical to me now; fast cores would fuse | the instructions together, so instruction count alone isn't | adequate for the original evaluation of the ISA. Now I'm | not sure that there are any that really do that... | theresistor wrote: | This isn't an isolated case. RISC-V makes the same basic | tradeoff (simplicity above all else) across the board.
You can | see this in the (lack of) addressing modes, compare-and-branch, | etc. | | Where this really bites you is in workloads dominated by tight | loops (image processing, cryptography, HPC, etc). While a | microarchitecture may be more efficient thanks to simpler | instructions (ignoring the added complexity of compressed | instructions and macro-fusion, the usual suggested fixes...), | it's not going to be 2-3x faster, so it's never going to | compensate for a 2-3x larger inner loop. | lottospm wrote: | I'm not an expert on ISA and CPU internals, but an X86 | instruction is not just "an instruction" anymore. Afaik, | since the P6 arch Intel is using a fancy decoder to translate | x86/-64 CISC into an internal RISC ISA (up to 4 u-ops per | CISC instruction) and that internal ISA could be quite close | to the RISC-V ISA for all I know. | | Instruction decoding and memory ordering can be a bit of | nightmare on CISC ISAs and fewer macro-instructions are not | automatically a win. I guess we'll eventually see in | benchmarks. | | Even though Intel has had decades to refine their CPUs I'm | quite excited to see where RISC-V is going. | theresistor wrote: | As someone who _is_ an expert on ISA and CPU internals, | this meme of "X86 has an internal RISC" is an over- | simplification that obscures reality. Yes, it decodes | instructions into micro-ops. No, micro-ops are not "quite | close to the RISC-V ISA". | | Macro fusion definitely has a place in microarchitecture | performance, especially when you have to deal with a legacy | ISA. RISC-V makes the _very unusual_ choice of depending on | it for performance, when most ISAs prefer to fix the | problem upstream. | jasonhansel wrote: | Indeed. Also not an expert, but relying on macro-op | fusion in hardware is tricky IIRC since different | implementors will (likely) choose different macro-ops, | resulting in strange performance differences between | otherwise-identical chips. 
| | Of course, you could start documenting "official" macro- | ops that implementations should support, but at that | point you're pretty much inventing a new ISA... | seoaeu wrote: | RISC-V _does_ document "official" macro-ops that | implementations are encouraged to support. | lonjil wrote: | Most commonly used x86_64 instructions decode to only 1 or | 2 uops, thus often also just as "complex" as the original | instructions. | Tuna-Fish wrote: | > but an X86 instruction is not just "an instruction" | anymore. | | This is technically true but not really. Decoding into many | instructions is mainly used for compatibility with the | crufty parts of the x86 spec. In general, for anything | other than rmw or locking a competent compiler or assembly | writer will only very rarely emit instructions that compile | to more than one uop. The way the frontend works, | microcoded instructions are extraordinarily slow on real | cpus. | | Modern x86 is basically a risc with a very complex decode, | few extra useful complex operations tacked on, and piles | and piles of old moldy cruft that no-one should ever touch. | monocasa wrote: | X86 doesn't have to go to microcode to have multiple uOPs | for an instruction. Most uarchs can spit out three or | four uOPs per instruction before having to resort to the | microcode ROM. Basically instructions that would only | need one microcode ROM row in a purely microcoded design | can be spit out of the fast decoders. | nickez wrote: | For those use cases you typically have specialised hardware | or an FPGA. | [deleted] | vadfa wrote: | So when h266 or whatever comes out you can't watch video | anymore because your cpu can't decode it in software even | if it tried? | lvh wrote: | An FPGA can be reprogrammed, and we do really do this for | standards with better longevity than video standards | (e.g. cryptographic ones like AES and SHA). 
For standards | like video codecs, we just use GPUs instead, which I | assume is what OP had in mind for "specialized hardware" | (specialization can still be pretty general :-)). | plorkyeran wrote: | Hardware video decoding is done by a single-purpose chip | on the graphics card (or dedicated hardware inside the | GPU), not via software running on the GPU. Adding support | for a new video codec requires buying a new video card | which supports that codec. | classichasclass wrote: | So do what Power does: most instructions that update the | condition flags can do so optionally (except for instructions | like stdcx. or cmpd where they're meaningless without it, and | corner oddballs like andi.). For that matter, Power treats | things like overflow and carry as separate from the condition | register (they go in a special purpose register), so you can | issue an instruction like addco. or just a regular add with no | flags, and the condition register actually is divided into | eight, so you can operate on separate fields without | dependencies. | jabl wrote: | ARM also does something similar; many instructions have a flag | bit specifying whether flags should be updated or not. It | doesn't have the multiple flag registers of POWER though. | crest wrote: | Which at least back in the day neither the IBM compilers | nor GCC 2.x - 4.x made much use of. I've seen only a few | hand-optimized assembler routines get decent use out of them. | Easy-to-fuse pairs are probably a good compromise for carry | calculation, e.g. add + a carry instruction. That would get | rid of one of the additional dependencies, but it would | take a three-operand addition or fusing two additions to | get rid of the second RISC-V-specific dependency. And while | GMP isn't unimportant it is still a niche use case that's | probably not worth throwing that many hardware resources at | to fix the ISA limitations in the uarch.
| crest wrote: | IIRC few Power(PC) cores really split the condition register | nibbles into 8 renamable registers and while Power(PC) | includes everything (including at least two spare kitchen | sinks) only a few instructions can pick _which_ condition | register nibble to update. Most integer instructions can only | update cr0 and floating point instructions cr1. On the other | hand you can do nice hacks with the cornucopia of available | bitwise operations on condition register bits and | it's one of the architectures where (floating point) | comparisons return the full set of results (less, equal, | greater, _unordered_). | adrian_b wrote: | On POWER, all the comparison instructions can store their | result in any of the 8 sets of flags. The conditional | branches can use any flag from any set. | | The arithmetic instructions, e.g. addition or | multiplication, do not encode a field for where to store | the flags, so they use, like you said, an implicit | destination, which is still different for integer and | floating-point. | | In large out-of-order CPUs, with flag register renaming, | this is no longer so important, but in 1990, when POWER was | introduced, the multiple sets of flags were a great | advance, because they enabled the parallel execution of | many instructions even in CPUs much simpler than today. | | Besides POWER, the 64-bit ARMv8 also provides most of the | 14 predicates that exist for a partial order relation. For | some weird reason, the IEEE FP standard requires only 12 of | the 14 predicates, so ARM implemented just those 12, even | if they have 14 encodings, by using duplicate encodings for | a pair of predicates. | | I consider this stupid, because there would not have been | any additional cost to gate correctly the missing predicate | pair, even if it is indeed one that is only seldom needed | (distinguishing between less-or-greater and | equal-or-unordered).
| wbl wrote: | You can opt in to generating and propagating conditions and | rename the predicates as well. | bitwize wrote: | It's not a tiny corner. People do arithmetic with carry all the | time. Arbitrary precision arithmetic is more common than you | think. Congratulations, RISC-V: you've not only slowed down | every bignum implementation in existence, but all those extra | instructions to compute carry will blow the I$ faster, | potentially slowing down any code that relies on a bignum | implementation as well. | dlsa wrote: | I noticed high and low in there so those code snippets look like | 32 bit code, at least to me. | | Is that even a fair comparison given the arm and x86 versions | used as examples of "better" were 64 bit? | | If we're really comparing 32 and 64 and complaining that 32 bit | uses more instructions than 64, perhaps we should dig out the 4 | bit processors and really sharpen the pitchforks. Alternatively, | we could simply not. Comparing apples to oranges doesn't really | help. | | From the article: | | _Let's look at some examples of how Risc V underperforms._ | | _First, addition of a double-word integer with carry-out:_
|
|     add  t0, a4, a6  // add low words
|     sltu t6, t0, a4  // compute carry-out from low add
|     add  t1, a5, a7  // add hi words
|     sltu t2, t1, a5  // compute carry-out from high add
|     add  t4, t1, t6  // add carry to high result
|     sltu t3, t4, t1  // compute carry-out from the carry add
|     add  t6, t2, t3  // combine carries
|
| _Same for 64-bit arm:_
|
|     adds x12, x6, x10
|     adcs x13, x7, x11
|
| _Same for 64-bit x86:_
|
|     add %r8, %rax
|     adc %r9, %rdx
| adrian_b wrote: | The comparison is completely fair, because on RISC-V there is | no better way to generate the carries required for computations | with large integers. You cannot generate a carry with a 64-bit | addition, because it is lost and you cannot store it.
| | You should take into account that the libgmp authors have a | huge amount of experience in implementing operations with large | integers on a very large number of CPU architectures, i.e. on | all architectures supported by gcc, and for most of those | architectures libgmp has been the fastest during many years, or | it still is the fastest. | jhallenworld wrote: | What if the multi-precision code is written in C? | | You can detect the carry of (a+b) in C branch-free with: | | ((a&b) | ((a|b) & ~(a+b))) >> 31 | | So a 64-bit add from 32-bit words in C is: | | f_low = a_low + b_low | c_high = ((a_low & b_low) | ((a_low | b_low) & ~f_low)) >> 31 | f_high = a_high + b_high + c_high | | So for RISC-V in gcc 8.2.0 with -O2 -S -c:
|
|     add  a1,a3,a2
|     or   a5,a3,a2
|     not  a7,a1
|     and  a5,a5,a7
|     and  a3,a3,a2
|     or   a5,a5,a3
|     srli a5,a5,31
|     add  a4,a4,a6
|     add  a4,a4,a5
|
| But for ARM I get (with gcc 9.3.1):
|
|     add ip, r2, r1
|     orr r3, r2, r1
|     and r1, r1, r2
|     bic r3, r3, ip
|     orr r3, r3, r1
|     lsr r3, r3, #31
|     add r2, r2, lr
|     add r2, r2, r3
|
| It's shorter because ARM has bic. Neither one figures out to use | carry-related instructions. | | Ah! But! There is a gcc builtin, __builtin_uadd_overflow(), that | replaces the first two C lines above: | | c_high = __builtin_uadd_overflow(a_low, b_low, &f_low); | | So with this:
|
| RISC-V:
|     add  a3,a4,a3
|     sltu a4,a3,a4
|     add  a5,a5,a2
|     add  a5,a5,a4
|
| ARM:
|     adds  r2, r3, r2
|     movcs r1, #1
|     movcc r1, #0
|     add   r3, r3, ip
|     add   r3, r3, r1
|
| RISC-V is faster.. | | EDIT: CLANG has one better: __builtin_addc(). | | f_low = __builtin_addcl(a_low, b_low, 0, &c); | f_high = __builtin_addcl(a_high, b_high, c, &junk); | | x86:
|     addl 8(%rdi), %eax
|     adcl 4(%rdi), %ecx
|
| ARM:
|     adds w8, w8, w10
|     add  w9, w11, w9
|     cinc w9, w9, hs
|
| RISC-V:
|     add  a1, a4, a5
|     add  a6, a2, a3
|     sltu a2, a2, a3
|     add  a6, a6, a2
| volta83 wrote: | > RISC-V is faster.. | | I find it funny that you fall into the same pitfall the author | did. | | Faster on which CPU?
| | The author doesn't measure on any CPU, so here there are dozens | of people hypothesizing whether fusion happens or not, and what | the impact is. | brutal_chaos_ wrote: | > Faster on which CPU? | | Perhaps faster means fewer instructions in this instance? | Considering number of instructions is what has been | discussed. | jhallenworld wrote: | All other things equal, you would prefer smaller code for | better cache use. | brucehoult wrote: | Note that with the newly-ratified B extension, RISC-V has BIC | (called ANDN) as well as ORN and XNOR. | | In addition to the actual ALU instructions doing the add with | carry, for bignums it's important to include the load and store | instructions. Even in L1 cache it's typically 2 or 3 or 4 | cycles to do the load, which makes one or two extra | instructions for the arithmetic less important. Once you get to | bignums large enough to stream from RAM (e.g. calculating pi to | a few billion digits) it's completely irrelevant. ___________________________________________________________________ (page generated 2021-12-02 23:01 UTC)