[HN Gopher] "Risc V greatly underperforms" ___________________________________________________________________ "Risc V greatly underperforms" Author : oxxoxoxooo Score : 187 points Date : 2021-12-02 18:51 UTC (4 hours ago) (HTM) web link (gmplib.org) (TXT) w3m dump (gmplib.org) | snvzz wrote: | I don't think they even tried to read the ISA spec documents. If | they did, they would have found that the rationale for most of | these decisions is solid: Evidence was considered, all the | factors were weighted, and decisions were made accordingly. | | But ultimately, the gist of their argument is this: | | >Any task will require more Risc V instructions that any | contemporary instruction set. | | Which is easy to verify as utter nonsense. There's not even a | need to look at the research, which shows RISC-V as the clear | winner in code density. It is enough to grab any Linux | distribution that supports RISC-V and look at the size of the | binaries across architectures. | theresistor wrote: | > I don't think they even tried to read the ISA spec documents. | If they did, they would have found that the rationale for most | of these decisions is solid: Evidence was considered, all the | factors were weighted, and decisions were made accordingly. | | It's perfectly possible to have read the spec and _disagree_ | with the rationale provided. RISC-V is in fact the outlier | among ISAs in many of these design decisions, so there 's a | heavy burden of proof to demonstrate that making the contrary | decisions in many cases was the right call. | | > Which is easy to verify as utter nonsense. There's not even a | need to look at the research, which shows RISC-V as the clear | winner in code density. It is enough to grab any Linux | distribution that supports RISC-V and look at the size of the | binaries across architectures. | | This doesn't seem to be true when you actually do an apples-to- | apples comparison. 
| | Take as an example the build of Bash in Debian Sid | (https://packages.debian.org/sid/shells/bash). I chose this | because I'm pretty confident there's no functional or build-dependency difference that will be relevant here. Other | examples like the Linux kernel are harder to compare because | the code in question is different across architectures. I saw | the same trend in the GCC package, so it's not an isolated | example.
| riscv64 installed size: 6,157.0 kB
| amd64 installed size: 6,450.0 kB
| arm64 installed size: 6,497.0 kB
| armhf installed size: 6,041.0 kB
| | RV64 is outperforming the other 64-bit architectures, but | underperforming 32-bit ARM. This is consistent with | expectations: amd64 has a size penalty due to REX bytes, arm64 | got rid of compressed instructions to enable higher | performance, and armhf (32-bit) has smaller constants embedded | in the binary. | | Compressed instructions definitely _do_ work for making code | smaller, and that's part of why arm32 has been very successful | in the embedded space, and why that space hasn't been rushing | to adopt arm64. For arm32, however, compressed instructions | proved to be a limiting factor on high-performance | implementation, and arm64 moved away from them because of it. | _Maybe_ that's due to some particular limitations of arm32's | compressed instructions that RISC-V compressed instructions | won't suffer from, but that remains to be proven. | mianos wrote: | Probably because, like most applications, that one does not | have a lot of wide multiplications. It is hard not to turn | this point into an insult aimed at the OP. | btdmaster wrote: | Unfortunately, it seems that, at least for gmp, the shared | objects balloon in comparison to all other architectures. It | is about three times bigger (6,000 kB instead of 2,000 kB): | https://packages.debian.org/sid/libgmp-dev. I am hopeful that | this may improve with extensions, though I know little about | the details.
| jepler wrote:
|       text    data  bss     dec    hex  filename
|     311218    2284   36  313538  4c8c2  arm-linux-gnueabihf/libgmp.so.10
|     374878    4328   56  379262  5c97e  riscv64-linux-gnu/libgmp.so.10
|     480289    4624   56  484969  76669  aarch64-linux-gnu/libgmp.so.10
|     511604    4720   72  516396  7e12c  x86_64-linux-gnu/libgmp.so.10
| | Strange, that's not what I see. | xiphias2 wrote: | A company creating embedded RISC-V CPUs has also added some | extra instruction set extensions that conflict with the | floating-point instructions, though. | adrian_b wrote: | The size of the files can be very misleading, because a large | part of the files can be filled with various tables with | additional information, with strings, with debugging | information, with empty spaces left for alignment to page | boundaries and so on. So the size of the installed files is | not necessarily correlated with the code size. | | To compare the code sizes, you need tools like "size", | "readelf" etc., and the data given by the tools should still | be studied, to see how much of the code sections really | contain code. | | I have never yet seen a program where the RISC-V | variant is smaller than the ARMv8 or Intel/AMD variant, and I | doubt very much that such a program can exist. Except for the | branches, where RISC-V frequently needs only 4 bytes instead | of 5 bytes for Intel/AMD or 8 bytes for ARMv8, for all the | other instructions it is very frequent to need 8 bytes for | RISC-V instead of 4 bytes for ARMv8. | | Moreover, choosing compiler options like -fsanitize for | RISC-V increases the number of instructions dramatically, | because there is no hardware support for things like overflow | detection. | justinpombrio wrote: | So (i) the research on RISC-V that shows it has dense code | is bunk, and (ii) the fact that it compiles to a smaller | binary is irrelevant, and (iii) it sounds like you're | saying in advance that it might also have smaller code | section sizes within the binary but that's irrelevant too.
| | And yet you're quite confident that RISC-V has poor code | density. So you clearly have a source of knowledge that | others don't. If it's a blog/article/research, could you | share a link? If it's personal experimentation, you should | write a blog post; I would totally read that. | pierrebai wrote: | Re-read what was written. He is saying exactly that the | RISCV code size is larger, but to see it you need the | right tool used the right way to actually look at the | code, not debug info, constant sections, etc. | snvzz wrote: | I would hope there's something more to his reasoning. | | There are tables showing values for just the code, with | RISC-V beating aarch64 and x86-64 by an ample margin, in | this very discussion. | ant6n wrote: | Perhaps thumb2 makes an 8-wide decode much harder. Plus, then | you can't have 32 registers instead of 16. | akiselev wrote: | _> RISC-V is in fact the outlier among ISAs in many of these | design decisions, so there's a heavy burden of proof to | demonstrate that making the contrary decisions in many cases | was the right call._ | | Genuinely asking, _why_? Do we think RISC-V should, or even | _could_, try to compete against the AMD/Intel/ARM behemoths | on their playing field? Obviously ISAs are a low-level detail | and far removed from the end product, but it feels like the | architectural decisions we are "stuck with" today are | inextricably intertwined with their contemporary market | conditions and historical happenstance. It feels like all the | experimental architectures that lost to x86/ARM (including | Intel's own) were simply too much too soon, before ubiquitous | internet and the open source culture could establish itself. | We've now got companies using genetic algorithms to optimize | ICs and people making their own semiconductors in the 100s of | microns range in their garages - maybe it's time to rethink | some things!
| | (EE in a past life but little experience designing ICs so I | feel like I'm talking out of my rear end) | lonjil wrote: | > Genuinely asking, why? Do we think RISC-V should, or even | could, try to compete against the AMD/Intel/ARM behemoths | on their playing field? | | Well, it's exactly what many RISC-V folks are trying to do. | There's news about a new high-performance RISC-V core on | the HN front page right now! | | > but it feels like the architectural decisions we are | "stuck with" today are inextricably intertwined with their | contemporary market conditions and historical happenstance. | It feels like all the experimental architectures that lost | to x86/ARM (including Intel's own) were simply too much too | soon, | | I just want to note that ARM64 was a mostly clean break | from prior versions of ARM. Basically a clean-slate design | started in the late 2000s. It's a modern design built with | the same hindsight and approximate market conditions | available to the designers of RISC-V. | audunw wrote: | Yeah, I'm not sure he takes into consideration compressed | instructions, which can be used anywhere, rather than being a | separate mode like Thumb on ARM. | | Fusing instructions isn't just theoretical either. I'm pretty | sure it is or will be a common optimisation for CPUs aiming for | high performance. How exactly are two easily-fused 16-bit | instructions worse than one 32-bit one? Is there really a | practical difference other than the name of the instruction(s)? | | At the same time, the reduced transistor count you get from a | simpler instruction set is not a benefit to be just dismissed | either. I'm starting to see RISC-V cores being put all over the | place in complex microcontrollers, because they're so damn | cheap, yet have very decent performance. I know a guy | developing a RISC-V core.
He was involved with the proposal for | a couple of instructions that would put the code density above | Thumb for most code, and the performance of his core was better | than Cortex-M0 at a similar or smaller gate count. I'm not sure | if the instructions were added to the standard or not, though. | | Even for high-performance CPUs, there's a case to be made for | requiring fewer transistors for the base implementation. It | makes it easier to make low-power, low-leakage cores for the | heterogeneous architectures (big.LITTLE, M1, etc.) which are | becoming so popular. | robert_foss wrote: | So how would you suggest rewriting their example in less than | 6 instructions for RISC-V? x86/ARM both have instructions that | include the carry operation for long additions, and only | require 2 instructions. | jolmg wrote: | I don't even see the issue. RISC-V is supposed to be a RISC-type ISA. It's in the very name. That it takes more | instructions when compared to a CISC-type ISA like x86 is | completely normal. | | https://en.wikipedia.org/wiki/Reduced_instruction_set_comput... | theresistor wrote: | The argument for RISC instructions (in high-performance | architectures) is that the faster decode makes up for the | increase in instruction count. The problem is that a faster | decode has a practical ceiling on how much faster it's | going to make your processor, and it's much lower than 3x. | If your workload is bottlenecked on an inner loop that got | 3x larger in instruction count, no 15% improvement in | decode performance is going to save you. | userbinator wrote: | Larger caches won't help much either; there's an old | article I remember that compares the efficiency of | various ARM, x86, and one MIPS CPU, and while x86 and ARM | were neck-and-neck, the MIPS was dead last in all the | comparisons despite having more cache than the others. | RISC-V is very similar to MIPS.
| snvzz wrote: | Larger caches, as seen in Apple's M1 L1, are one of many | tools to deal with bad code density. | | RISC-V might, at first glance, look similar to MIPS, but | it leads in code density among the 64-bit architectures. | User23 wrote: | > [RISC-V] leads in code density among the 64-bit | architectures. | | You keep baldly asserting this in virtually all of your | very many replies here, with a vague appeal to your own | authority, but you haven't shown anything. Given that the | submission is precisely an example of bad code density, | if you're really here in the service of intellectual | curiosity then please show instead of just telling. | jolmg wrote: | I don't know what the design goals of RISC-V were, but I | would guess performance is not the key goal or at least | not the only goal. It makes more sense that ease of | implementation is a more important goal, if they want to | make adoption easy. That's another argument for favoring | RISC over CISC. | monocasa wrote: | If that's the case, you can always stick a uop cache in | after the decoder. | snvzz wrote: | The number of instructions matters much less if they can be | fused into more complex instructions before execution. | | RISC-V was designed with hindsight on fusion, thus it has | more opportunities for doing it, and doing it at a lower | cost. | | And, due to the very high code density RISC-V has, the | decoder can do its job while not having to look at a huge | window. | theresistor wrote: | To everyone who's saying "But macro-fusion!" in response, | see my comment here: | https://news.ycombinator.com/item?id=29421107 | rbanffy wrote: | I don't think there is anything preventing the processor | from fusing those instructions into a single operation once they are | decoded. | adrian_b wrote: | Instruction fusion is the magical rescue invoked by all | those who believe that the RISC-V ISA is well designed. | | Instruction fusion has no effect on code size, but only on | execution speed.
| | For example, RISC-V has combined compare-and-branch | instructions, while the Intel/AMD ISA does not have such | instructions; however, all Intel & AMD CPUs fuse the compare-and-branch instruction pairs. | | So there is no speed difference, but the separate compare | and branch instructions of Intel/AMD remain longer, at 5 | bytes, instead of the 4 bytes of RISC-V. | | Unfortunately for RISC-V, this is the only example | favorable for it, because for a large number of ARM or | Intel/AMD instructions RISC-V needs a pair of instructions | or even more instructions. | | Fusing instructions will not help RISC-V with the code | density, but it is the only way available for RISC-V to | match the speed of other CPUs. | | Even if instruction fusion can enable an adequate speed, | implementing such decoders is more expensive than | implementing decoders for an ISA that does not need | instruction fusion for the same performance. | audunw wrote: | > Even if instruction fusion can enable an adequate | speed, implementing such decoders is more expensive than | implementing decoders for an ISA that does not need | instruction fusion for the same performance | | I'm very skeptical that a RISC-V decoder would be much | more complex than an x86 one, even with instruction | fusion. For the simpler fusion pairs, decoding the fused | instructions wouldn't be more complex than matching some | of the crazy instruction encodings in x86. | | For ARM I'm not so sure, but RISC-V does have very | significant instruction decoding benefits over ARM too, | so my guess would be that they'd be similar enough. | snvzz wrote: | >Unfortunately for RISC-V, this is the only example | favorable for it, because for a large number of ARM or | Intel/AMD instructions RISC-V needs a pair of | instructions or even more instructions. | | Yet, as many pointed out to you already, RISC-V has the | highest code density of all contemporary 64-bit | architectures. And aarch64, which you seem to like, is | beyond bad.
| | >but it is the only way available for RISC-V to match the | speed of other CPUs. | | Higher code density and the lack of flags help the decoder a | great deal. This means it is far cheaper for RISC-V to keep | execution units well fed. It also enables smaller caches | and consequently higher clock speeds. It's great for | performance. | | This, if anything, makes RISC-V the better ISA. | | >Even if instruction fusion can enable an adequate speed, | implementing such decoders is more expensive than | implementing decoders for an ISA that does not need | instruction fusion for the same performance | | Grasping at straws. RISC-V has been designed for fusion, | from the get-go. The cost of doing fusion with it has | been quoted to be as low as 400 gates. This is something | you've been told elsewhere in the discussion, but that | you chose to ignore, for reasons unknown. | avianes wrote: | I see that you are pretty active here in debunking anti-RISC-V attacks, thanks for that! There are a bunch of | poor criticisms about RISC-V. | | > This is something you've been told elsewhere in the | discussion, but that you chose to ignore, for reasons | unknown. | | I would call it RISC-V bashing. | | Everyone loves to hate RISC-V, probably because it's new | and heavily hyped. | | It is really common to see irrelevant and uninformed | criticism of RISC-V. The article, which seems to be | enjoyed by the HN audience, literally says: "I believe | that an average computer science student could come up | with a better instruction set than Risc V in a single | term project". How can anyone say such a thing about a | collaborative project of more than 10 years, fed by many | scientific works and projects and many companies in the | industry? | | I do not mean that RISC-V is perfect; there are some | points which are sources of debate (e.g. favoring a vector | extension rather than classic SIMD is a source of | interesting discussion).
But on HN I would appreciate | reading better analyses and more interesting discussions. | socialdemocrat wrote: | You ignore compressed instructions on RISC-V, which 64-bit | ARM does not have. | | And if you compare 32-bit CPUs, then RISC-V has twice as | many registers, reducing the number of instructions needed | to read from and write to memory. | | RISC-V branching takes less space, and so do vector | instructions. There are many cases like that, and they add | up; the end result is that RISC-V has the densest ISA in all | studies when using compressed instructions. | Dylan16807 wrote: | > Even if instruction fusion can enable an adequate | speed, implementing such decoders is more expensive than | implementing decoders for an ISA that does not need | instruction fusion for the same performance | | On the other hand, just splitting up x86 instructions is | very expensive, and decoding in general takes a lot of | work before you even start to do fancy tricks. | lordnacho wrote: | How does the instruction fusion work? It seems to be | mentioned in the article and by a couple of other | commenters. | volta83 wrote: | The CPU executes the two (or more) dependent instructions | "as if" they were one, e.g., in 1 cycle. | | The CPU has a frontend, which has a decoder, which is the | part that "reads" the program instructions. When it | "sees" a certain pattern, like "instruction x to register r | followed by instruction y consuming r", it can treat this | "as if" it was a single instruction if the CPU has | hardware for executing that single instruction (even if | the ISA doesn't have a name for that instruction). | | This allows the people that build the CPU to choose | whether this is something they want to add hardware for. | If they don't, this runs in e.g. 2 cycles, but if they do, | then it runs in 1. A server CPU might want to pay the | cost of running it in 1 cycle, but a microcontroller CPU | might not. | vitno wrote: | Any != All.
There is a difference between synthetic | benchmarks and real-world test cases. | api wrote: | So this person found a pathological case for the RISC-V | instruction set? | adrian_b wrote: | This is not a pathological case, it is normal operation. | | A computer is supposed to compute, but the RISC-V ISA | does not provide everything that is needed for all the | kinds of computations that exist. | | The 2 most annoying missing features are the lack of | support for multi-word operations, which are needed to | compute with numbers larger than 64 bits, and the lack of | support for detecting overflow in operations with | standard-size integers. | | If you either want larger integers or safe computations | with normal integers, the number of RISC-V instructions | needed for implementation is very large compared to any | other ISA. | | While there are people who do a lot of computations with | large numbers, even the other users need such operations | every day. Large-number computations are needed at the | establishment of any Internet connection, for the key | exchange. For software developers, many compilers, e.g. | gcc (which uses precisely libgmp), do computations with | large numbers during compilation, for various kinds of | optimizations related to the handling of constants in the | code, e.g. for sub-expression extraction or for operation | complexity lowering. | | So libgmp or another equivalent large-number library might | be used every time some project is compiled, and also | every time you click on a new link in a browser. | | So this case is not at all pathological, except in the | vision of the RISC-V designers who omitted support for | this case. | | That was a good decision for an ISA intended only for | teaching or for embedded computers, but it becomes a bad | decision when someone wants to use RISC-V outside those | domains, e.g. for general-purpose personal computers.
| panick21_ wrote: | > A computer is supposed to compute, but the RISC-V ISA | does not provide everything that is needed for all the | kinds of computations that exist. | | This is nonsense. You can still do everything you need. | It's just that in some cases the code size is a bit bigger | or smaller. | | And with compressed instructions the difference is not | nearly as big; if you add fusion, the difference is | marginal. | | So really it's not a pathological case, it's a 'slightly | worse' case, and even that is hard to prove in the real | world given the other benefits RISC-V brings that | compensate. | | And we could find 'slightly worse' cases in the opposite | direction if we went looking for them. | | If you gave 2 equally skilled teams 100M and told them to | make the best possible personal computer chip, I would | bet on the RISC-V team winning 90 times out of 100. | waterhouse wrote: | (RISC-V fan here) This is a real-world use case. GMP is a | library for handling huge integers, and adding two huge | integers is one of the operations it performs, and the way | to do that is to add-with-carry one long sequence of word-sized integers into another. It's not synthetic; it's | extremely specialized, but real. | smoldesu wrote: | I don't think you're supposed to. The compiler handles that | stuff; ideally RISC-V is just another compilation target. | masklinn wrote: | Did you misunderstand the issue entirely? | | The context here is the implementation of one of the inner | loops of a high-performance infinite-precision arithmetic | library (GMP); in RISCV the loop has 3x the instruction | count it has in competing architectures. | | "The compiler" is not relevant; this is by design stuff | that the compiler is not supposed to touch, because it's | unlikely to have the necessary understanding to get it as | tight and efficient as possible. | adrian_b wrote: | I am sorry, but saying that RISC-V is a winner in code density | is beyond ridiculous.
| | I am familiar with many tens of instruction sets, from the | first vacuum-tube computers to all the important | instruction sets that are still in use, and there is no doubt | that RISC-V requires more instructions and a larger code size | than almost all of them, for doing any task. | | Even the hard-to-believe "research" results published by RISC-V | developers have always shown worse code density than ARM; the | so-called better results were for the compressed extension, not | for the normal encoding. | | Moreover, the results for RISC-V are hugely influenced by the | programming language and the compiler options that are chosen. | RISC-V has an acceptable code size only for unsafe code; if the | programming language or the compiler options require run-time | checks to ensure safe behavior, then the RISC-V code size | increases enormously, while for other CPUs it barely changes. | | The RISC-V ISA has only 1 good feature for code size, the | combined compare-and-branch instructions. Because there | typically is 1 branch for every 6 to 8 instructions, using 1 | instruction instead of 2 saves a lot. | | Except for this good feature, the rest of the ISA is full of | bad features, which frequently require at least 2 instructions | instead of 1 instruction on any other CPU, e.g. the lack of | indexed addressing, which is needed in any loop that must | access some aggregate data structure, in order to be able to | implement the loop with a minimum number of instructions. | zamadatix wrote: | > Except for this good feature, the rest of the ISA is full | of bad features | | What are your thoughts on the way RISC V handled the | compressed instruction subset? | brandmeyer wrote: | It only addresses a subset of the available registers. | Small revisions in a function which change the number of | live variables will suddenly and dramatically change the | compressibility of the instructions.
| | Higher-level languages rely heavily on inlining to reduce | their abstraction penalty. Profiles which were taken from | the Linux kernel and (checks notes...) _Dhrystone_ are not | representative of code from higher-level languages. | | 3/4 of the available prefix instruction space was consumed | by the 16-bit extension. There have been a couple of | proposals showing that even better density could be | achieved using only 1/2 the space instead of 3/4, but they | were struck down in order to maintain backwards | compatibility. | hajile wrote: | It's not too surprising. Load, store, move, add, subtract, | shift, branch, jump. These are definitely the most common | instructions used. | | Put it side-by-side with Thumb and it also looks pretty | similar (Thumb has a multiply instruction IIRC). | | Put it side-by-side with short x86 instructions, accounting | for the outdated ones, and the list is pretty similar (down | to having 8 registers). | | All in all, when old and new instruction sets are taking | the same approach, you can be reasonably sure it's not the | absolute worst choice. | orra wrote: | > the so-called better results were for the compressed | extension, not for the normal encoding. | | Ignoring RISC-V's compressed encoding seems a rather | artificial restriction. | snvzz wrote: | You seem to be making your whole argument around some facts | which you got wrong. The central points of your argument are | often used in FUD, thus they are definitely worth tackling | here. | | >Even the hard-to-believe "research" results published by | RISC-V developers have always shown worse code density than | ARM | | The code size advantage of RISC-V is not artificial academic | bullshit. It is real, it is huge, and it is trivial to | verify. Just build any non-trivial application from source | with a common compiler (such as GCC or LLVM's clang) and | compare the sizes you get. Or look at the sizes of binaries | in Linux distributions.
| | >the so-called better results were for the compressed | extension, not for the normal encoding. | | The C extension can be used anywhere, as long as the CPU | supports the extension; most RISC-V profiles require it. This | is in stark contrast with ARMv7's Thumb, which was a literal | separate CPU mode. Effort was put into making this very cheap | for the decoder. | | The common patterns where the number of instructions is larger | are made irrelevant by fusion. RISC-V has been thoroughly | designed with fusion in mind, and is unique in this regard. | It is within its rights in calling itself the 5th-generation | RISC ISA because of this, even if everything else is ignored. | | Fusion will turn most of these "2 instructions instead of one" | cases into effectively one instruction from the execution | unit's perspective. There are opportunities everywhere for | fusion; the patterns are designed in. The cost of fusion on | RISC-V is also very low, often quoted as 400 gates, allowing | even simpler microarchitectures to implement it. | theresistor wrote: | > This is in stark contrast with ARMv7's Thumb, which was a | literal separate CPU mode. | | This is disingenuous. arm32's Thumb-2 (which has been | around since 2003) supports both 16-bit and 32-bit | instructions in a single mode, making it directly | comparable to RV32C. | snvzz wrote: | Your statement does not run counter to the one quoted. | | Thumb-2 is better designed than Thumb was, but it is | still a separate CPU mode. | | And it got far less use than it deserved, because of | this. It doesn't do everything, and switching has a | significant cost. This cost is in contrast with RISC-V's | C extension. | lonjil wrote: | The ARMv8-M profile is Thumb-only, so on ARM | microcontroller platforms there is no switching at all, | and it does do everything, or at least everything you | might want to do on a microcontroller, and it has of course | gotten a very large amount of use, considering how widely | deployed those cores are.
| Dylan16807 wrote: | Is thumb-only particularly good for density, compared to | being able to mix instruction sizes? | lonjil wrote: | Thumb has both 16-bit and 32-bit instructions. | Dylan16807 wrote: | Oh, you meant thumb _and_ thumb-2. | lonjil wrote: | "thumb-2" isn't really a thing. It's just an informal | name from when more instructions were being added to | thumb. It's still just thumb. | Taniwha wrote: | The main distinction is that the 16-bit RISCV-C ISA maps | exactly to existing 32-bit RISCV instructions; its | implementation only occurs in the decode pipe stage. | bpye wrote: | The C extension is just that: an extension. A RISC-V core | with the C extension should still support the long encoding | as well. There is no 16-bit variant specified, only 32, 64 | and 128. | | There is an E version of the ISA with a reduced register | set, but this is a separate thing. | brucehoult wrote: | You are mixing up integer register size and instruction | length. | | RISC-V has variants with 32-bit, 64-bit, or (not yet | fully specified or implemented) 128-bit registers. | | RISC-V has instructions of 32-bit length and, optionally | but almost universally, 16-bit length. | dragontamer wrote: | > The RISC-V ISA has only 1 good feature for code size, the | combined compare-and-branch instructions. Because there | typically is 1 branch for every 6 to 8 instructions, using 1 | instruction instead of 2 saves a lot. | | Which isn't really a big advantage, because ARM and x86 | macro-op fuse those instructions together. (That is, those | 2 instructions are decoded and executed as 1 macro-op in | practice.) | | cmp/jnz on x86 is about 4 bytes as well. So 4 bytes on x86 | vs 4 bytes on RISC-V, and 1 macro-op on x86 vs 1 instruction | on RISC-V. | | So they're equal in practice. | | ----- | | ARM is 8 bytes, but macro-op decoded. So 1 macro-op on ARM, | but 8 bytes used up.
| theresistor wrote: | ARM64 has cbz/tbz compare-and-branch instructions that | cover many common cases in a single 4-byte instruction as | well. | audunw wrote: | > I am sorry but saying that RISC-V is a winner in code | density is beyond ridiculous. | | You have no idea what you're talking about. I've worked on | designs with both ARM and RISC-V cores. The RISC-V core | outperforms the ARM core, with a smaller gate count, and has | similar or higher code density in real-world code, depending | on the extensions supported. The only way you get much lower | code density is without the C extension, but I haven't seen | it left unimplemented in a real-world commercial core, and if | it was, I'm sure it was because of a benefit (FPGAs | sometimes use ultra-simple cores for some tasks, and don't | always care about instruction throughput or density). | | It should be said that my experience is in embedded, so yes, | it's unsafe code. But the embedded use case is also the most | mature. I wouldn't be surprised if extensions that help with | safer programming languages were added for desktop/server-class | CPUs, if they haven't been already (I haven't followed the | development of the spec that closely recently). | voz_ wrote: | Textbook example of the kind of hostility and close-mindedness | that is creeping into our beloved site. Why are | we dick measuring? Why are we comparing experience like | this? So much "I" "I" "I"... | | I have no horse in the technical race here, but I certainly | am put off from reading what should be an intellectually | stimulating discussion by the nature of replies like this. | flatiron wrote: | Oh no. We don't maintain this site with these types of | comments no matter your feelings. It's the internet. | Don't get heated! | snvzz wrote: | It was likely instigated by its parent trying to inflate | themselves by citing some credentials, to try to | give their voice more weight.
| | All of it, pretty sad, but I believe we should focus on | the technical arguments and try to put everything else | aside in order to steer the discussion somewhere | more useful. | dataflow wrote: | >> RISC-V has an acceptable code size only for unsafe code | | > You have no idea what you're talking about. | | > It should be said that my experience is in embedded, so | yes, it's unsafe code. | | Just going based off your reply it certainly sounds like | they had at least _some_ idea what they were talking about? | In which case omitting that sentence would probably help. | [deleted] | okl wrote: | A few years ago, I designed my own ISA. At the time I investigated | design decisions in lots of ISAs and compared them. Nothing | in the RISC-V instruction set stood out to me the way, | for example, the SuperH instruction set did, which is remarkably well | designed. | | Edit: Don't get me wrong, I don't think RISC-V is "garbage" or | anything like that. I just think it could have been better. But | of course, most of an architecture's value comes from its | ecosystem and the time spent optimizing and tailoring | everything... | AlotOfReading wrote: | My memories of SuperH are a bit different. Yeah, it's cleaner | than ARM, but the delay slots, hardware division, and the tiny | register file, among other things, made life unnecessarily difficult. A | lot of those design decisions didn't hold up well over time. | okl wrote: | Interesting! From which perspective? Implementing the ISA, | compiler or applications? Did you write machine language by hand or | use a compiler? | AlotOfReading wrote: | Mainly system level and higher, but a bit of all three, I | suppose. I was helping reverse engineer a customized SH | chip and ended up implementing a small VM and optimized | system libraries/utilities afterwards. Most of the time was | spent in assembly, with some machine code and C on either | side. | okl wrote: | Thanks for your insight. | audunw wrote: | Never heard of SuperH.
I see it has branch delay slots, which | is a seemingly clever but terrible idea. It's one of the | reasons RISC-V quickly overtook OpenRISC in popularity, I think. | | Not having anything that stands out is perhaps a good thing. | Being "clever" with the ISA tends to bite you when implementing | OoO superscalar cores. | Teknoman117 wrote: | A bit of a computer history question: I have never looked at the | ISA of the Alpha (referenced in the post), but RISC-V has always | struck me as being nearly identical to (early) MIPS, just without | the HI and LO registers for multiply results and with the addition of | variable-length instruction support, even if the core ISA doesn't | use it. | | MIPS didn't have a flag register either and depended on a | dedicated zero register and slt instructions (set if less than) | dfox wrote: | The no-flags-at-all part is clearly inspired by Alpha, including | the rationale of flags being detrimental to OoO implementation. | | MIPS is a classical RISC design that was not designed to be | OoO-friendly at all and is simply designed for ease of | straightforward pipelined implementation. The reason why it | does not have flags probably simply comes down to the | observation that you don't need flags for C. | userbinator wrote: | Yes, that's exactly my thought every time this comes up; RISC-V | is likely to displace MIPS everywhere performance doesn't | matter, but it'll have a hard time competing with ARM or x86 on | that. | [deleted] | okl wrote: | I bet this article on RISC-V's genealogy will be interesting for | you: https://live-risc-v.pantheonsite.io/wp-content/uploads/2016/... | cpeterso wrote: | Andrew Waterman's thesis ("Design of the RISC-V Instruction | Set Architecture") has a very approachable comparison of | RISC-V to MIPS, SPARC, Alpha, ARMv7, ARMv8, OpenRISC, and | x86: | | https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...
| socialdemocrat wrote: | RISC-V is an opinionated architecture and that is always going to | get some people fired up. Any technology that aims for simplicity | has to make hard choices and trade-offs. It isn't hard to | complain about missing instructions when there are fewer than 100 | of them. Meanwhile nobody will complain about ARM64 missing | instructions because it has about 1000 of them. | | Therein lies the problem. Nobody ever goes out guns blazing | complaining about too many instructions, despite the fact that | complexity has its own downsides. | | RISC-V has been designed aggressively to have a minimal ISA to | leave plenty of room to grow, and to require a minimal number of | transistors for a minimal solution. | | Should this be a showstopper down the road, then there will be | plenty of space to add an extension that fixes this problem. | Meanwhile embedded systems paying a premium for transistors are | not going to have to pay for these extra instructions, as only 47 | instructions have to be implemented in a minimal solution. | throwaway19937 wrote: | TL;DR RISC-V doesn't have add with carry. | | I'm not a fan of the RISC-V design but the presence or absence of | this instruction doesn't make it a terrible architecture. | stephencanon wrote: | _For the purposes of implementing multi-word arithmetic_, which | is Torbjorn's whole deal, it kind of does. (Also the actual | post subject is "greatly underperforms"). | FullyFunctional wrote: | It's meaningless to look at the code in the absence of an | implementation and conclude anything about the performance. | He doesn't know what the performance is. Having six | instructions vs. two does not mean one is 3X faster than the | other. It means nothing at all. | stephencanon wrote: | We know enough about the implementation of current RISC-V | cores to conclude that they won't be remotely competitive | on this one narrow (yet fairly high-impact for some | workloads) task.
Is it _possible_ to design a core that is | competitive on this workload even when handicapped by a | limited ISA? Yes, definitely. Have any RISC-V designers | shown any interest in doing so yet? No. | [deleted] | aappleby wrote: | The author seems to be assuming that the designers have never | thought about this corner case. | sanxiyn wrote: | No, the author is arguing this is not a corner case but a | central case. I tend to agree. | kayamon wrote: | "Gee no carry flag how will we cope?" | pcwalton wrote: | Doesn't RISC-V have an add-with-carry instruction as part of the | vector extension? I see it listed here: | https://github.com/riscv/riscv-v-spec/releases/tag/v1.0 | monocasa wrote: | Afaict that's only for operations on the vector register file. | Most of the complaints about the lack of addc/subc are around | how they're heavily used in JITs for languages that want to | speculatively optimize multi-precision arithmetic into the | integer register file for their regular integer ops. | JavaScript, a lot of Lisps, the MLs all fit into that space. | pcwalton wrote: | Sure, but this email is in the context of GMP, which should | be using the vector extension, no? | monocasa wrote: | I don't think so; most of the users I know of for the | integer side of GMP are compilers/runtimes. An apt rdepends | on the gmp packages in Ubuntu only shows stuff like ocaml, | and I know gcc vendors it. | | Edit: Another place you see this kind of arithmetic is | crypto, but those specific use cases (Diffie-Hellman, RSA, | a few others) don't tend to be vectorized. You have one op | you're trying to work through with large integers, and | there's the carry dependency on each partial op. The | carry-dependent crypto algorithms aren't typically vectorisable. | Shadonototra wrote: | Who changed the title? | | Moderators, where are you? | CalChris wrote: | TL;DR My code snippet results in bloated code for | RISC-V RV64I. | | I'm not sure how bloated it is.
All of those instructions will | compress [1]. | | [1] https://riscv.org/wp-content/uploads/2015/05/riscv-compresse... | | It's slower on RISC-V but not a lot on a superscalar. The x86 and | ARMv8 snippets have 2 cycles of latency. The RISC-V has 4 cycles | of latency:
        1. add  t0, a4, a6    add  t1, a5, a7
        2. sltu t6, t0, a4    sltu t2, t1, a5
        3. add  t4, t1, t6    sltu t3, t4, t1
        4. add  t6, t2, t3
| | I'm not getting _terrible_ from this. | Koffiepoeder wrote: | CPU performance increases nowadays are often measured in single- | digit percentages because the margins have become so thin. Doubling | the cycles is a 100% increase. You can call that not so | bloated, but I think many people would beg to differ. | | On the other hand I take this article with a grain of salt | anyhow, since it only discusses a single example. I think we | would need a lot more optimized assembly snippet comparisons to | make meaningful conclusions (and even then there could be | author selection bias). | snvzz wrote: | The article's approach to arguing against RISC-V is fairly | childish. | | >"here's this snippet, it takes more instructions on RISC-V, | thus RISC-V bad" | | Is pretty much what it's saying. An actual argument about ISA | design would weigh the cost this has against the advantages of | not having flags, provide a body of evidence and draw | conclusions from it. But, of course, that would be much | harder to do. | | What's comparatively easy, and what they should have done, however, | is to read the ISA specification. Alongside the decisions | that were made, there's a rationale to support them. Most of | these choices, particularly the ones often quoted in FUD | as controversial or bad, have a wealth of papers, backed by | plentiful evidence, behind them. | marcodiego wrote: | So, how meaningful is the "projected score of 11+ | SPECInt2006/GHz" as claimed here: | https://www.sifive.com/press/sifive-raises-risc-v-performanc... ?
| bell-cot wrote: | Rather than glib hand-waving in front of the chalkboard... are | there a decent piece or few of RISC-V hardware which could | actually be compared to non-RISC-V hardware with similar budgets | (for design work, transistor count, etc.) - to see how things | work out when running substantial pieces of decently-compiled | code? | yjftsjthsd-h wrote: | The original title was "Risc V greatly underperforms", which | seems like a far more defensible and less inflammatory claim than | "Risc V is a terrible architecture", which was picked from the | actual message but still isn't the title. | gary_0 wrote: | I almost skipped this thread because of the flamebait title. | This is a debate over CPU instruction set performance details, | nobody is going to die. | yjftsjthsd-h wrote: | In fairness, this is Hacker News; flame wars^w^w respectful | but intense debate over editors, operating systems, and, yes, | ISA details, is somewhat expected. (Although, yes, I'm not | sure that I would get too worked up about this particular | detail; even if the stated claim is 100% true and | unmitigated, it means _some_ kinds of code will have | potentially bigger binaries. I understand a math library | person caring, I don't think I care.) | dang wrote: | Flamewars are definitely not expected - they're against the | rules and something we try to dampen in every way we know. | | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... | | https://news.ycombinator.com/newsguidelines.html | rbanffy wrote: | > I understand a math library person caring, I don't think | I care. | | Not wasting much sleep on this one. Not sure there's | anything in the spec that stops implementations from | recognizing the two instructions and fusing them into a | single atomic operation for the backends to deal with. | It'll occupy more space in the L1 cache, but that's it.
| Dylan16807 wrote: | I would say that "underperforms" is indefensible from such a | simple analysis that doesn't touch IPC. "Terrible" is at least | openly an opinion. | dang wrote: | Fixed now. Thanks! | xondono wrote: | Experimenting with RISC-V is one of those things I keep | postponing. | | For those who are more versed, is this really a general problem? | | I was under the impression that the real bottleneck is memory, | and things like this would be fixed in real applications through | out-of-order execution, and that it paid off having simpler | instructions because compilers had more freedom to rearrange | things. | fwsgonzo wrote: | RISC-V is completely fine, heavily based on research and well | thought out. It does have pros and cons like any other | architecture, and for what it does well, it does it really | well! | nynx wrote: | If this really is an issue, I imagine RISC-V could easily get an | extension for adding/subtracting/etc SIMD vectors together in a | way that would expand to the capabilities of the underlying processor | without requiring hardcoding. | kayamon wrote: | It already has this. | nynx wrote: | Yes, the SIMD extension has the flexible vectors thing. I | don't think it has a way to treat SIMD vectors as bigints. | kelnos wrote: | > _My conclusion is that Risc V is a terrible architecture._ | | Kinda stopped reading here. It's a pretty arrogant hot take. I | don't know this guy, maybe he's some sort of ISA expert. But it | strains credulity that after all this time and work put into it, | RISC-V is a "terrible architecture". | | My expectation here is that RISC-V requires some inefficient | instruction sequences in some corners somewhere (and one of these | corners happens to be OP's pet use case), but by and large things | are fine. | | And even then, I don't think that's clear. You're not going to | determine performance just by looking at a stream of instructions | on modern CPUs.
Hell, it's really hard to compare streams of | instructions from _different ISAs_. | SavantIdiot wrote: | A bit off topic, but when did a DWORD implicitly become 64 bits? | robertlagrant wrote: | I heard the bird was DWORD. | fhood wrote: | Oh wow, everybody else is debating the specific intricacies of | the design decisions, and I'm here wondering why you would | complain about not enough instructions in an architecture with | "RISC" in the name. | adrian_b wrote: | The RISC idea was to not include in the ISA instructions so | complex that they would require a multi-cycle implementation. | | The minimum duration of the clock cycle of a modern CPU is | essentially determined by the duration of a 64-bit integer | addition/subtraction, because such operations need a latency of | only 1 clock cycle to be useful. | | Operations that are more complex than 64-bit integer | addition/subtraction, e.g. integer multiplications or | floating-point operations, need multiple cycles, but they are pipelined | so that their throughput remains at 1 per cycle. | | So 64-bit addition/subtraction is certainly expected to be | included in any RISC ISA. | | The hardware adders used for addition/subtraction provide, at a | negligible additional cost, 2 extra bits, carry and overflow, | which are needed for operations with large integers and for | safe operations with 64-bit integers. | | The problem is that the RISC-V ISA does not offer access to | those 2 bits, and generating them in software requires a very | large cost in execution time and in lost energy in comparison | with generating them in hardware. | | I do not see any relationship between these bits and the RISC | concepts; omitting them does not simplify the hardware, but it | makes the software more complex and inefficient. | samstave wrote: | So a few decades ago.. I knew a guy who was one of the chief | designers of RISC procs at MIPS | | Yeah - he was addicted to prostitutes...
| | This guy was doing amazing engineering work and we talked at | length about designing a system for basically what became | rack-mount trays. | | But he was so distracted by his addiction to prostitutes... | jasonhansel wrote: | One thing that bothers me: RISC-V seems to use up a lot of the | available instruction set space with "HINT" instructions that | nobody has (yet) found a use for. Is it anticipated that all of | the available HINTs will actually be used, or is the hope that | the compressed version of the instruction set will avoid the | wasted space? | jpfr wrote: | The idea is to use the compressed instruction extension. Then two | adjacent instructions can be handled like a single "fat" | instruction with a special-case implementation. | | That allows more flexibility for CPU designs to optimize | transistor count vs speed vs energy consumption. | | This guy clearly did not look at the stated rationale for the | design decisions of RISC-V. | theresistor wrote: | Compressed instructions and macro-fusion aren't magical | solutions. It's not always possible to convince the compiler to | generate the magical sequence required, and it actually makes | high-performance implementations (wide superscalar) more | difficult thanks to the variable width decoding. | | Beyond that, compressed instructions are _not_ a 1:1 substitute | for more complex instructions, because a pair of compressed | instructions cannot have any fields that cross the 16-bit | boundary. This means you can't recover things like larger | load/store offsets. | | Additionally, you can't discard architectural state changes due | to the first instruction. If you want to fuse an address | computation with a load, you still have to write the new | address to the register destination of the address computation. | If you want to perform clever fusion for carry propagation, you | still have to perform all of the GPR writes.
This is work that | a more complex instruction simply wouldn't have to perform, and | again it complicates a high performance implementation. | panick21_ wrote: | Part of the idea is to create standard ways to do certain | things and then hope compiler writers generate code | accordingly. That will allow more chip designers to | take advantage of those if they want to. | | They spent a lot of time and effort on making sure the | decoding is pretty good and useful for high performance | implementations. | | RISC-V is designed for very small and very large systems. At | some point some tradeoffs need to be made, but these are very | reasonable and most of the time not a huge problem. | | For the really specialized cases where you simply can't live | with those extra instructions, those will be added to the | standard and then some profiles will include them and others | not. If those instructions are really as vital as those that | want them claim, they will find their way into many profiles. | | Saying RISC-V is 'terrible' because of those choices is not a | fair way of evaluating it. | userbinator wrote: | _RISC-V is designed for very small and very large systems_ | | That's exactly the problem --- there is no one-size-fits-all | when it comes to instruction set design. | panick21_ wrote: | There is a trade-off, but there is overall far more value | in having it be unified. | | The trade-offs are mostly very small or nonexistent once | you consider the standard extensions that different use | cases will have. | | Overall, having a unified open instruction set is far | better than hand-designing many different instruction | sets just to get a marginal improvement. Some really | extreme applications might require that, but for the most | part the whole industry could do just fine with RISC-V. | Both on the low and on the high end, and in fact better | than most of the alternatives, all things considered.
| | If integer overflow checking really is the be-all and end-all, and | RISC-V cannot be successful without it, it | will be added and it will be pulled into all the | profiles. If it is not actually that relevant, then it | won't. If it is very useful for some verticals and not | others, it will be in those profiles and not in others. | jpfr wrote: | In the context of gmp, people write architecture-specific | assembly for the inner loop anyway. | | Besides that, you raise good points on sources of complexity. | I'm waiting for the benchmarks once such developments have | been incorporated. Everything else is guesswork. | audunw wrote: | > and it actually makes high-performance implementations | (wide superscalar) more difficult thanks to the variable | width decoding. | | More difficult than x86? We're talking about a damn simple | variable width decoding here. | | I could imagine RISC-V with the C extension being more tricky | than 64-bit ARM. Maybe. | | > and again it complicates a high performance implementation. | | But so much of the rationale behind the design of RISC-V is | to simplify high performance implementation in other ways. So | the big question is what the net effect is. | | The other big question is whether extensions will be added to | optimise for desktop/server workloads by the time RISC-V CPUs | penetrate that market significantly. | okl wrote: | The sweet spot seems to be 16-bit instructions with 32/64-bit | registers. With 64-bit registers you need some clever way to | load your immediates, e.g., like the shift/offset in ARM | instructions. | msbarnett wrote: | He literally addressed this, albeit obliquely, in the message: | | > I have heard that Risc V proponents say that these problems | are known and could be fixed by having the hardware fuse | dependent instructions. Perhaps that could lessen the | instruction set shortcomings, but will it fix the 3x worse | performance for cases like the one outlined here?
| | Macro-fusion can _to some extent_ offset the weak instruction | set, but you're never going to get an integer-multiple | speedup out of it, given the complexity of inter-op | architectural state changes that have to be preserved and the | instruction boundary limitations involved; it's never going to | offset a 3x blowup in instruction count in a tight loop. | socialdemocrat wrote: | Fusing 3 instructions is not unusual, and those could also have | been compressed. Thus you have no more microcode to execute | and only 50% more cache usage rather than 300%. | alerighi wrote: | Even if you do so, the program size is still bigger, and it | consumes more disk, RAM and most importantly cache space. | Wasting cache on multiple instructions when on another | architecture it's done by only one doesn't make particular | sense to me. | | Also, it's said that x86 is bad because the instructions are | reorganized and translated inside the CPU. But it seems | that you are proposing the same: a CPU that preprocesses the | instructions and fuses some into a single one (the opposite | of what x86 does). At that point, it seems to me that what x86 | does makes more sense: have a ton of instructions (and thus | smaller programs and thus more code that can fit in cache) and | split them, rather than having a ton of instructions (wasting | cache space) only for the CPU to combine them into a single one | (a thing that a compiler can also do). | Buttons840 wrote: | How many cache misses are for program instructions, versus | data misses? | anarazel wrote: | IME icache misses are a frequent bottleneck. There's plenty of | code where all the time is spent in one tight inner loop | and thus the icache is not a constraint, but there's also a | lot of cases with a much flatter profile, where icache | misses suddenly become a serious constraint. | alerighi wrote: | Depends on the application.
But even if they are few, it's | not a good reason to have them just to get a nicer | instruction set, which, if you are not writing assembly by | hand (and nobody does these days), doesn't give you any | benefit. | | Also, don't reason with the desktop or server use case in | mind, where you have TB of disk and code size doesn't | matter. RISC-V is meant to be used also for embedded | systems (in fact its use nowadays is only in these | systems), where usually code size matters more than | performance (i.e. you typically compile with -Os). In these | situations more instructions mean more flash space wasted, | meaning you can fit less code. | rbanffy wrote: | > which, if you are not writing assembly by | hand (and nobody does these days), doesn't give you any | benefit. | | An elegant architecture is easier to reason about. | Compilers will make fewer wrong decisions, fewer bugs | will be introduced, and fewer workarounds will need to be | implemented. An architecture that's simple to reason | about is an invaluable asset. | socialdemocrat wrote: | x86 also does macro fusion. The difference is RISC-V was designed | for compressed instructions and fusion from the get-go; x86 | bolted this on. | | Anyway, what you gain from this is a very simple ISA, which | helps tool writers, those who implement hardware, and | academia for teaching and research. | | How do the insanely complex x86 instructions help anyone? | ksec wrote: | The unwritten rule of HN: | | You do not criticise The Rusted Holy Grail and the Riscy Silver | Bullet. | rbanffy wrote: | Or, if you do, you'd better be absolutely right, or people will | tear your argument to shreds. | sosodev wrote: | Why do these half-baked slam pieces always make it to the top of | HN? | jgilias wrote: | Many people upvote things not necessarily because they agree | with them, but rather to bump it in hopes that someone with | good insights will chime in in the comments section.
| | This especially applies to potentially controversial things. | bob1029 wrote: | I think the reason is that it ultimately encourages deep and | thoughtful conversation. If nothing controversial were ever | proposed, the motivation for participating and "proving others | wrong" would be lessened. It might not be the healthiest way, but I | certainly find myself putting a lot more thought into my | comments if it's a contrary point or in some broader | controversial context. | | Overall, I feel HN is most fun when a lot of people are in | disagreement but also operating in good faith. | boibombeiro wrote: | Standing our ground, especially when we are wrong, helps us learn | a lot more about the subject. | chillingeffect wrote: | If for no other reason than to quickly formulate | counterarguments. Next time at some meeting or other | get-together, if someone pipes up with an anti-RISC comment, most | people won't be able to quickly refute it. But having had this | discussion here, we're inoculated and able to respond with | intelligence and experience. | okl wrote: | That sounds like you make up your mind first, then look for | arguments that support your position. I'd rather see the | arguments before I come to conclusions. | Avamander wrote: | I want to see what people say against it. | rm445 wrote: | Yeah. I'm not qualified to judge the quality of an instruction | set, but this writer destroyed all credibility with me by | claiming that an undergraduate could design a better | architecture (than this enormous collective effort) in a term. | It's right up there with claiming you could create Spotify in a | weekend or whatever. | gary_0 wrote: | The entire post is full of hyperbole, but the example they | show looks like a legitimate complaint. | rbanffy wrote: | I designed an ISA (and a CPU) as an undergrad, and I assure | you that, while it was very cool (stack-oriented, ASM looked | like Forth), it'd have horrendous performance these days.
| AnIdiotOnTheNet wrote: | You say that as though Design By Committee isn't a thing. | Tuna-Fish wrote: | Except that the problem with RISC-V isn't even design by | committee. Even the most dysfunctional committee would | probably not fall into the pits that RISC-V managed to. The | most credible explanation for its misfeatures I've heard | is just plain bad taste and overly rigid adherence to | principle over practicality by its original designers. | meepmorp wrote: | It's not a "slam piece," it's an email from a listserv, sent | two months ago. Someone realized it'd be catnip for people on | HN and posted it. | rbanffy wrote: | It's not even a rock-solid critique... | [deleted] | Symmetry wrote: | I think talking about ISAs as better or worse than one another is | often a bad idea for the same reason that arguing about whether C | or Python is better is a bad idea. Different ISAs are used for | different purposes. We can point to some specific things as | almost always being bad in the modern world, like branch delay | slots or the way the C preprocessor works, but even then, for | widely employed languages or ISAs, there was a point to it when it | was created. | | RISC-V has a number of places it's employed where it makes an | excellent fit. First of all, academia. For an undergrad | building the netlist for their first processor or a grad student | doing their first out-of-order processor, RISC-V's simplicity is | great for the pedagogical purpose. For a researcher trying to | experiment with better branch prediction techniques, having a | standard high-ish performance open source design they can take | and modify with their ideas is immensely helpful.
And many | companies in the real world with their eyes on the bottom line | like having an ISA where you can add instructions that happen to | accelerate your own particular workload, where you can use a | standard compiler framework outside your special assembly inner | loops, and where you don't have to spend transistors on features | you don't need. | | I'm not optimistic about RISC-V's widescale adoption as an | application processor. If I were going to start designing an open | source processor in that space I'd probably start with IBM's now- | open Power ISA. But there are so many more niches in the world | than just that, and RISC-V is _already_ a success in some of them. | okl wrote: | Branch delay slots are an artifact of a simple pipeline without | speculation. There's nothing inherently "bad" about them. | pm215 wrote: | If you're designing a single CPU that definitely has a simple | pipeline, branch delay slots are maybe justifiable. If you're | designing an architecture which you hope will eventually be | used by many CPU designs which might have a variety of design | approaches, then delay slots are pretty bad, because every | future CPU that _isn't_ a simple non-speculating pipeline | will have to do extra work to fake up the behaviour. This is | an example of a general principle, which is that it's usually | a mistake to let microarchitectural details leak into the | architecture -- they quickly go stale and then both hw and sw | have to carry the burden of them. | oneplane wrote: | All of the discussions about instruction sets and "mine is better | than yours" or "anyone else could do better in a small amount of | time" are useless, considering that those arguments, if true, haven't | actually resulted in any free ISA being available broadly, | embraced broadly, with hardware implementing that ISA being | available.
| | It doesn't matter how great something else could be in theory if | it doesn't exist or doesn't meet the same scale and mindshare (or | adoption). | dragontamer wrote: | Hmmm... I think this argument is solid. Albeit biased from GMP's | perspective, but bignums are used all the time in RSA / ECC, and | probably other common tasks, so maybe it's important enough to | analyze at this level. | | 2 instructions to work with 64 bits, maybe 1 more instruction / | macro-op for the compare-and-jump back up to a loop, and 1 more | instruction for a loop counter of some kind? | | So we're looking at ~4 instructions per 64 bits on ARM/x86, but | ~9 instructions on RISC-V. | | The loop will be performed in parallel in practice, however, due to | out-of-order / superscalar execution, so the discussion inside | the post (2 instructions on x86 vs 7 instructions on RISC-V) | probably is the closest to the truth. | | ---------- | | Question: is ~2 clock ticks per 64 bits really the ideal? I don't | think so. It seems to me that bignum arithmetic is easily SIMD-able. | Carries are NOT accounted for in x86 AVX or ARM NEON | instructions, so x86, ARM, and RISC-V are probably on equal footing there. | | I don't know exactly how to write a bignum addition loop in AVX | off the top of my head. But I'd assume it'd be similar to the | 7 instructions listed here, except... using 256-bit AVX registers | or 512-bit AVX512 registers. | | So 7 instructions to perform 512 bits of bignum addition is | ~73 bits per clock cycle, far superior in speed to the 32 bits | per clock cycle from add + adc (the 64-bit code with implicit | condition codes). | | AVX512 is uncommon, but AVX (256-bit) is common on x86 at least, | leading to ~36 bits per clock tick. | | ---------- | | ARM has SVE, which is ambiguous (sometimes 128 bits, sometimes | 512 bits). RISC-V has a bunch of competing vector instructions. | | .......... | | Ultimately, I'm not convinced that the add + adc methodology here | is best anymore for bignums.
With a wide-enough vector, it seems | more important to bring forth big 256-bit or 512-bit vector | instructions for this use case? | | EDIT: How many bits is the typical bignum? I think add+adc | probably is best for 128, 256, or maybe even 512-bits. But moving | up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say | without me writing code, but just a hunch). | | 2048-bit RSA is the common bignum, right? Any other bignums that | are commonly used? EDIT2: Now that I think of it, addition isn't | the common operation in RSA, but instead multiplication (and | division which is based on multiplication). | Teknoman117 wrote: | Can you treat the whole vector register as a single bignum on | x86? If so, I totally missed that. | dragontamer wrote: | No. | | Which is why I'm sure add / adc will still win at 128-bits, | or 256-bits. | | The main issue is that the vector-add instructions are | missing carry-out entirely, so recreating the carry will be | expensive. But with a big enough number, that carry | propagation is parallelizable in log2(n), so a big enough | bignum (like maybe 1024-bits) will probably be more efficient | for SIMD. | Taniwha wrote: | So this is one tiny corner of the ISA, not something that makes | ALL instruction sequences longer - essentially RISCV has no | condition codes (they're a bit of an architectural nightmare for | everyone doing any more than the simplest CPUs, they make every | instruction potentially have dependencies or anti-dependencies | with every other). 
| | It's a trade off - and the one that's been made is one that makes | it possible to make ALL instructions a little faster at the | expense of one particular case that isn't used much - that's how | you do computer architecture, you look at the whole, not just one | particular case | | RISCV also specifies a 128-bit variant that is of course FASTER | than these examples | monocasa wrote: | > they're a bit of an architectural nightmare for everyone | doing any more than the simplest CPUs, they make every | instruction potentially have dependencies or anti-dependencies | with every other | | It doesn't have to be _that_ bad. As long as condition flags | are all written at once (or are essentially banked like | PowerPC) the dependency issue can go away because they're | renamed and their results aren't dependent on previous data. | | Now, of course, instructions that only update some condition | flags and preserve others are the devil. | sanxiyn wrote: | RISC-V designers optimized for C and found the overflow flag isn't | used much and got rid of it. It was the wrong choice: the overflow | flag is used a lot for JavaScript and any language with | arbitrary-precision integers (including GMP, the topic of OP). | kannanvijayan wrote: | It kind of chafed when I excitedly read the ISA docs and | found that overflow testing was cumbersome. | | That said, I think it's less of an issue these days for JS | implementors in particular. It might have mattered more back | in the day when pure JS carried a lot of numeric compute load | and there weren't other options. These days it's better to | stow that compute code in wasm and get predictable reliable | performance and move on. | | The big pain points in perf optimization for JS are objects | and their representation, functions and their various | type-specializations.
| | Another factor is that JS impls use int32s as their internal | integer representation, so there should be some relatively | straightforward approach involving lifting to int64s and | testing the high half for overflow. | | Still kind of cumbersome. | | There are similar issues in existing ISAs. NaN-boxing for | example uses high bits to store type info for boxed values. | Unboxing boxed values on amd64 involves loading an 8-byte | constant into a free register and then using that to mask out | the type. The register usage is mandatory because you can't | use 64-bit values as immediates. | | I remember trying to reduce code size and improve perf (and | save a scratch register) by turning that into a left-shift | right-shift sequence involving no constants, but that led to | the code executing measurably slower as it introduced data | dependencies. | formerly_proven wrote: | > It kind of chafed when I excitedly read the ISA docs and | found that overflow testing was cumbersome. | | It just feels backwards to me to _increase_ the cost of | these checks in a time where we have realized that | unchecked arithmetic is not a good idea in general. | aidenn0 wrote: | Over just the time I've been aware of things, there's been a | constant positive feedback loop of "checked overflow isn't | used by software, so CPU designers make it less performant" | followed by "Checked overflow is less performant so software | uses it less." | | I wish there was a way out. | | Language features are also often implemented at least partly | because they can be done efficiently on the premiere hardware | for the language. Then new hardware can make such features | hard to implement. | | WASM implemented return values in a way that was different | from register hardware, and it makes efficient codegen of | Common Lisp more challenging. 
This was brought to the | attention of the committee while WASM was still in flux, and | they (perhaps rightfully) decided CL was insufficiently | important to change things. | | I'm sure that people brought up the overflow situation to the | RISC-V designers, and it was similarly dismissed. It's just | unfortunate that legacy software is such a big driver of CPU | features, as that's a race towards lowest-common-denominator | hardware. | audunw wrote: | Which is exactly the right trade-off for embedded CPUs, where | RISC-V is most popular right now. | | If desktop/server-class RISC-V CPUs become more common, it's | not unreasonable to think they'll add an extension that | covers the needs of managed/higher-level languages. | | Even for server-class CPUs you could argue that you | absolutely want this extension to be optional, as you can | design more efficient CPUs for datacenters/supercomputers | where you know what kind of code you'll be running. | [deleted] | roca wrote: | Also Rust applications are increasingly going to be built | with integer overflow checking enabled, e.g. Android's Rust | components are going to ship with integer overflow checking. | And unlike say GMP, that poses a potential code density | problem because we're not talking about inner loops that can | be effectively cached, it's code bloat smeared across the | entire binary. | Taniwha wrote: | Yeah but the code required for an overflow check is just | one extra instruction (3 rather than 2) | masklinn wrote: | For generalised signed addition, the overhead is 3 | instructions _per addition_. It can be one in specific | contexts where more is known about the operands (e.g. | addition of immediates). | | It's always 1 in x64/ARM64 as they have built-in support | for overflow.
| Taniwha wrote: | you have to include the branch instruction too in any | comparison | zozbot234 wrote: | They provide recommended insn sequences for overflow checking | as commentary to the ISA specification, and this enables | efficient implementation in hardware. | throwaway81523 wrote: | > They provide recommended insn sequences for overflow | checking as commentary to the ISA specification, and this | enables efficient implementation in hardware. | | I would like to see some benchmarks of this efficient | implementation in hardware, even simulated hardware, | compared against conventional architectures. | | Even for C, it's a recurring source of bugs and | vulnerabilities that int overflow goes undetected. What we | really need is an overflow trap like the one in IEEE | floating point. RISC-V went the opposite direction. | adrian_b wrote: | Any hardware adder provides almost for free the overflow | detection output (at less than the cost of an extra bit, so | less than 1/64 of a 64-bit adder). | | So anyone who thinks about an efficient hardware | implementation would expose the overflow bit to the | software. | | A hardware implementation that requires multiple additions | to provide the complete result of a single addition can be | called in many ways, but certainly not "efficient". | user-the-name wrote: | > RISCV also specifies a 128-bit variant that is of course | FASTER than these examples | | Is it actually implemented on any hardware? | ncmncm wrote: | No. Mentioning it is only meant to distract. | TheCondor wrote: | Is there a semi-competitive Risc-V core implemented | anywhere? | | It all seems hypothetical to me now; fast cores would fuse | the instructions together, so instruction count alone isn't | adequate for the original evaluation of the ISA. Now I'm | not sure that there are any that really do that... | theresistor wrote: | This isn't an isolated case. RISC-V makes the same basic | tradeoff (simplicity above all else) across the board.
You can | see this in the (lack of) addressing modes, compare-and-branch, | etc. | | Where this really bites you is in workloads dominated by tight | loops (image processing, cryptography, HPC, etc). While a | microarchitecture may be more efficient thanks to simpler | instructions (ignoring the added complexity of compressed | instructions and macro-fusion, the usual suggested fixes...), | it's not going to be 2-3x faster, so it's never going to | compensate for a 2-3x larger inner loop. | lottospm wrote: | I'm not an expert on ISA and CPU internals, but an X86 | instruction is not just "an instruction" anymore. Afaik, | since the P6 arch Intel is using a fancy decoder to translate | x86/-64 CISC into an internal RISC ISA (up to 4 u-ops per | CISC instruction) and that internal ISA could be quite close | to the RISC-V ISA for all I know. | | Instruction decoding and memory ordering can be a bit of | nightmare on CISC ISAs and fewer macro-instructions are not | automatically a win. I guess we'll eventually see in | benchmarks. | | Even though Intel has had decades to refine their CPUs I'm | quite excited to see where RISC-V is going. | theresistor wrote: | As someone who _is_ an expert on ISA and CPU internals, | this meme of "X86 has an internal RISC" is an over- | simplification that obscures reality. Yes, it decodes | instructions into micro-ops. No, micro-ops are not "quite | close to the RISC-V ISA". | | Macro fusion definitely has a place in microarchitecture | performance, especially when you have to deal with a legacy | ISA. RISC-V makes the _very unusual_ choice of depending on | it for performance, when most ISAs prefer to fix the | problem upstream. | jasonhansel wrote: | Indeed. Also not an expert, but relying on macro-op | fusion in hardware is tricky IIRC since different | implementors will (likely) choose different macro-ops, | resulting in strange performance differences between | otherwise-identical chips. 
| | Of course, you could start documenting "official" macro- | ops that implementations should support, but at that | point you're pretty much inventing a new ISA... | seoaeu wrote: | RISC-V _does_ document "official" macro-ops that | implementations are encouraged to support. | lonjil wrote: | Most commonly used x86_64 instructions decode to only 1 or | 2 uops, thus often also just as "complex" as the original | instructions. | Tuna-Fish wrote: | > but an X86 instruction is not just "an instruction" | anymore. | | This is technically true but not really. Decoding into many | instructions is mainly used for compatibility with the | crufty parts of the x86 spec. In general, for anything | other than rmw or locking a competent compiler or assembly | writer will only very rarely emit instructions that compile | to more than one uop. The way the frontend works, | microcoded instructions are extraordinarily slow on real | cpus. | | Modern x86 is basically a risc with a very complex decode, | few extra useful complex operations tacked on, and piles | and piles of old moldy cruft that no-one should ever touch. | monocasa wrote: | X86 doesn't have to go to microcode to have multiple uOPs | for an instruction. Most uarchs can spit out three or | four uOPs per instruction before having to resort to the | microcode ROM. Basically instructions that would only | need one microcode ROM row in a purely microcoded design | can be spit out of the fast decoders. | nickez wrote: | For those use cases you typically have specialised hardware | or an FPGA. | [deleted] | vadfa wrote: | So when h266 or whatever comes out you can't watch video | anymore because your cpu can't decode it in software even | if it tried? | lvh wrote: | An FPGA can be reprogrammed, and we do really do this for | standards with better longevity than video standards | (e.g. cryptographic ones like AES and SHA). 
For standards | like video codecs, we just use GPUs instead, which I | assume is what OP had in mind for "specialized hardware" | (specialization can still be pretty general :-)). | plorkyeran wrote: | Hardware video decoding is done by a single-purpose chip | on the graphics card (or dedicated hardware inside the | GPU), not via software running on the GPU. Adding support | for a new video codec requires buying a new video card | which supports that codec. | classichasclass wrote: | So do what Power does: most instructions that update the | condition flags can do so optionally (except for instructions | like stdcx. or cmpd where they're meaningless without it, and | corner oddballs like andi.). For that matter, Power treats | things like overflow and carry as separate from the condition | register (they go in a special purpose register), so you can | issue an instruction like addco. or just a regular add with no | flags, and the condition register actually is divided into | eight, so you can operate on separate fields without | dependencies. | jabl wrote: | ARM also does something similar; many instructions have a flag | bit specifying whether flags should be updated or not. It | doesn't have the multiple flag registers of POWER though. | crest wrote: | Which at least back in the day neither the IBM compilers | nor GCC 2.x - 4.x made much use of. I've seen only a few | hand-optimized assembler routines get decent use out of them. | Easy-to-fuse pairs are probably a good compromise for carry | calculation, e.g. add + a carry instruction. That would get | rid of one of the additional dependencies, but it would | take a three-operand addition or fusing two additions to | get rid of the second RISC-V-specific dependency. And while | GMP isn't unimportant it is still a niche use case that's | probably not worth throwing that many hardware resources at | to fix the ISA limitations in the uarch.
| crest wrote: | IIRC few Power(PC) cores really split the condition register | nibbles into 8 renamable registers and while Power(PC) | includes everything (including at least two spare kitchen | sinks) only a few instructions can pick _which_ condition | register nibble to update. Most integer instructions can only | update cr0 and floating point instructions cr1. On the other | hand you can do nice hacks with the cornucopia of available | bitwise operations on condition register bits and | it's one of the architectures where (floating point) | comparisons return the full set of results (less, equal, | greater, _unordered_). | adrian_b wrote: | On POWER, all the comparison instructions can store their | result in any of the 8 sets of flags. The conditional | branches can use any flag from any set. | | The arithmetic instructions, e.g. addition or | multiplication, do not encode a field for where to store | the flags, so they use, like you said, an implicit | destination, which is still different for integer and | floating-point. | | In large out-of-order CPUs, with flag register renaming, | this is no longer so important, but in 1990, when POWER was | introduced, the multiple sets of flags were a great | advance, because they enabled the parallel execution of | many instructions even in CPUs much simpler than today. | | Besides POWER, the 64-bit ARMv8 also provides most of the | 14 predicates that exist for a partial order relation. For | some weird reason, the IEEE FP standard requires only 12 of | the 14 predicates, so ARM implemented just those 12, even | if they have 14 encodings, by using duplicate encodings for | a pair of predicates. | | I consider this stupid, because there would not have been | any additional cost to gate correctly the missing predicate | pair, even if it is indeed one that is only seldom needed | (distinguishing between less-or-greater and | equal-or-unordered).
| wbl wrote: | You can opt in to generating and propagating conditions and | rename the predicates as well. | bitwize wrote: | It's not a tiny corner. People do arithmetic with carry all the | time. Arbitrary precision arithmetic is more common than you | think. Congratulations, RISC-V: you've not only slowed down | every bignum implementation in existence, but all those extra | instructions to compute carry will blow the I$ faster, | potentially slowing down any code that relies on a bignum | implementation as well. | dlsa wrote: | I noticed high and low in there so those code snippets look like | 32 bit code, at least to me. | | Is that even a fair comparison given the arm and x86 versions | used as examples of "better" were 64 bit? | | If we're really comparing 32 and 64 and complaining that 32 bit | uses more instructions than 64, perhaps we should dig out the 4 | bit processors and really sharpen the pitchforks. Alternatively, | we could simply not. Comparing apples to oranges doesn't really | help. | | From the article: | | _Let's look at some examples of how Risc V underperforms._ | | _First, addition of a double-word integer with carry-out:_
|
|     add  t0, a4, a6  // add low words
|     sltu t6, t0, a4  // compute carry-out from low add
|     add  t1, a5, a7  // add hi words
|     sltu t2, t1, a5  // compute carry-out from high add
|     add  t4, t1, t6  // add carry to high result
|     sltu t3, t4, t1  // compute carry-out from the carry add
|     add  t6, t2, t3  // combine carries
|
| _Same for 64-bit arm:_
|
|     adds x12, x6, x10
|     adcs x13, x7, x11
|
| _Same for 64-bit x86:_
|
|     add %r8, %rax
|     adc %r9, %rdx
| adrian_b wrote: | The comparison is completely fair, because on RISC-V there is | no better way to generate the carries required for computations | with large integers. You cannot generate a carry with a 64-bit | addition, because it is lost and you cannot store it.
| | You should take into account that the libgmp authors have a | huge amount of experience in implementing operations with large | integers on a very large number of CPU architectures, i.e. on | all architectures supported by gcc, and for most of those | architectures libgmp has been the fastest during many years, or | it still is the fastest. | jhallenworld wrote: | What if the multi-precision code is written in C? | | You can detect the carry of (a+b) in C branch-free with: | | ((a&b) | ((a|b) & ~(a+b))) >> 31 | | So a 64-bit add from 32-bit words in C is: | | f_low = a_low + b_low | c_high = ((a_low & b_low) | ((a_low | b_low) & ~f_low)) >> 31 | f_high = a_high + b_high + c_high | | So for RISC-V in gcc 8.2.0 with -O2 -S -c:
|
|     add  a1,a3,a2
|     or   a5,a3,a2
|     not  a7,a1
|     and  a5,a5,a7
|     and  a3,a3,a2
|     or   a5,a5,a3
|     srli a5,a5,31
|     add  a4,a4,a6
|     add  a4,a4,a5
|
| But for ARM I get (with gcc 9.3.1):
|
|     add ip, r2, r1
|     orr r3, r2, r1
|     and r1, r1, r2
|     bic r3, r3, ip
|     orr r3, r3, r1
|     lsr r3, r3, #31
|     add r2, r2, lr
|     add r2, r2, r3
|
| It's shorter because ARM has bic. Neither one figures out to use | carry-related instructions. | | Ah! But! There is a gcc builtin, __builtin_uadd_overflow(), that | replaces the first two C lines above: | | c_high = __builtin_uadd_overflow(a_low, b_low, &f_low); | | So with this:
|
| RISC-V:
|     add  a3,a4,a3
|     sltu a4,a3,a4
|     add  a5,a5,a2
|     add  a5,a5,a4
|
| ARM:
|     adds  r2, r3, r2
|     movcs r1, #1
|     movcc r1, #0
|     add   r3, r3, ip
|     add   r3, r3, r1
|
| RISC-V is faster.. | | EDIT: CLANG has one better: __builtin_addc(). | | f_low = __builtin_addcl(a_low, b_low, 0, &c); | f_high = __builtin_addcl(a_high, b_high, c, &junk); | | x86:
|     addl 8(%rdi), %eax
|     adcl 4(%rdi), %ecx
|
| ARM:
|     adds w8, w8, w10
|     add  w9, w11, w9
|     cinc w9, w9, hs
|
| RISC-V:
|     add  a1, a4, a5
|     add  a6, a2, a3
|     sltu a2, a2, a3
|     add  a6, a6, a2
| volta83 wrote: | > RISC-V is faster.. | | I find it funny that you fall into the same pitfall the author | did. | | Faster on which CPU?
| | The author doesn't measure on any CPU, so here there are dozens | of people hypothesizing whether fusion happens or not, and what | the impact is. | brutal_chaos_ wrote: | > Faster on which CPU? | | Perhaps faster means fewer instructions in this instance? | Considering number of instructions is what has been | discussed. | jhallenworld wrote: | All other things equal, you would prefer smaller code for | better cache use. | brucehoult wrote: | Note that with the newly-ratified B extension, RISC-V has BIC | (called ANDN) as well as ORN and XNOR. | | In addition to the actual ALU instructions doing the add with | carry, for bignums it's important to include the load and store | instructions. Even in L1 cache it's typically 2 or 3 or 4 | cycles to do the load, which makes one or two extra | instructions for the arithmetic less important. Once you get to | bignums large enough to stream from RAM (e.g. calculating pi to | a few billion digits) it's completely irrelevant. ___________________________________________________________________ (page generated 2021-12-02 23:01 UTC)