[HN Gopher] How many x86 instructions are there? (2016)
       ___________________________________________________________________
        
       How many x86 instructions are there? (2016)
        
       Author : sandinmyjoints
       Score  : 111 points
       Date   : 2021-04-21 13:03 UTC (9 hours ago)
        
 (HTM) web link (fgiesen.wordpress.com)
 (TXT) w3m dump (fgiesen.wordpress.com)
        
       | hnmullany wrote:
        | Now ask how many unique x86 instructions standard compilers
        | actually emit...
        
       | FpUser wrote:
        | I feel a sudden urge to write some assembly for fun. I haven't
        | done it for at least a couple of years, I think.
        
         | carbonguy wrote:
         | The assembly bug bit me again a few months back, but instead of
         | writing it I had a grand time hand-disassembling some ARMv4T
         | binaries - scratched something of the same itch as solving a
         | sudoku.
        
         | neurostimulant wrote:
         | Then you might like TIS-100 [1] and Shenzhen I/O [2].
         | 
         | [1]
         | https://store.steampowered.com/app/370360/TIS100/?curator_cl...
         | 
         | [2]
         | https://store.steampowered.com/app/504210/SHENZHEN_IO/?curat...
        
           | deelowe wrote:
            | I've been getting a lot of enjoyment out of 6502/Z80
            | projects like the RC2014 and Ben Eater's breadboard 6502
            | kit/videos.
           | There's also nand2tetris, if you prefer a more academic
           | approach.
        
         | snvzz wrote:
         | You'll feel less miserable if you target a modern RISC
         | architecture such as RISC-V or a cleaner CISC architecture such
         | as the venerable 68000.
         | 
         | Whenever bored, I read or watch chibiakumas tutorials. There's
         | no such thing as knowing too many assemblers.
        
         | magicalhippo wrote:
          | Then, when checking, the results are half the speed of what
          | the compiler spits out and the fun is gone. At least that's
          | what happens to me...
        
           | kingsuper20 wrote:
           | I've always been able to beat the compiler, and that's
           | usually after trying to optimize using C. Admittedly, it's a
           | whole lot harder to understand what's fast than it used to
            | be. Access to SSE has its own benefits.
           | 
           | It's been a problem (optimizing) for some time though. I
           | remember it being some work to beat the compiler on the
           | i960CA. OTOH, I seem to remember the i860 being not-so-great
           | and for sure the TI C80 C compiler was downright awful (per
           | usual for DSPs).
        
             | magicalhippo wrote:
             | Back in the Pentium 1 and earlier days I could beat the
             | compiler. But then it got hard.
             | 
             | And it changes so often, instructions that are fast on one
             | CPU are not so fast on the next one, and vice versa.
             | 
             | Not to mention branch prediction and out-of-order execution
             | makes it very difficult to meaningfully benchmark. Is my
             | code really faster, or just seems like it because some
             | address got better aligned or similar.
             | 
             | I've gotten significant speed gains in certain projects by
             | simply replacing certain hand-optimized assembly in
             | libraries (ie not my code) with the plain C code
             | equivalent. The assembly was probably faster 10-15 years
             | ago, but not anymore...
        
               | mhh__ wrote:
               | I've started injecting code fragments into spectre
               | exploits as a way of seeing whether the CPU likes them or
               | not.
        
               | kingsuper20 wrote:
               | >I've gotten significant speed gains in certain projects
               | by simply replacing certain hand-optimized assembly in
               | libraries (ie not my code) with the plain C code
               | equivalent.
               | 
               | That's an interesting point, plus there's the portability
               | issue.
               | 
                | My own approach, in breadcrumbs of legacy code for this
                | kind of innerloopish stuff, has been to write a
                | straightforward 'C' implementation (and time it), an
                | optimized 'C' version (which itself can depend on the
                | processor used), and a hand-tuned assembly version where
                | really needed.
               | 
               | It allows you to back out of the tricky stuff plus acts
               | as a form of documentation.
        
             | zsmi wrote:
              | One should never lose to the compiler; after all, you can
              | see its output and it can't see yours.
             | 
             | Also, the programmer can "cheat" by doing things the
             | compiler would consider invalid but are known to be ok
             | given the larger context of the application.
             | 
             | The restrict keyword in c gives an example of how one can
             | optimize code by knowing the larger context. https://cellpe
             | rformance.beyond3d.com/articles/2006/05/demyst...
             | 
             | The problem is the ROI is usually pretty bad as these
             | assumptions rarely hold as the code evolves, in my
             | experience, and the optimization usually only lasts for
             | finite (sometimes shockingly short) amount of time. i.e. OS
             | changes, hardware changes, memory changes, etc. etc. etc.
        
           | f00zz wrote:
           | That's why it's more fun to try to optimize for size
        
           | FpUser wrote:
           | For this case all fun is in the process ;)
        
           | im3w1l wrote:
           | Last time I wrote assembly, and it was a long while ago, it
            | was way faster. But let's be honest, 95% of it was doing
            | manual buffering on top of OS APIs rather than using the C
            | stdlib. And the other 5% came from skipping itoa calls, by
            | doing arithmetic directly on the string representation.
           | 
            | I think this is why assembler can often be faster. Not
            | because I'm better than a compiler, but because the
            | structure of the language nudges you into faster
            | approaches.
        
             | magicalhippo wrote:
             | Good point, though I usually found that if I then go back
             | and restructure my high-level code, the compiler beat my
             | ASM again.
        
       | TwoBit wrote:
        | ARM has all these variations which make it seem as complicated
        | as x86, but they are distinct variations, and future CPUs can,
        | for example, drop 16-bit Thumb fairly cleanly.
        
         | klelatti wrote:
         | I think that one of the things that distinguishes Arm from
         | Intel (for good or bad) is that Arm _has_ left behind a lot of
         | the legacy. aarch64 has no Thumb, Jazelle etc
        
         | api wrote:
          | It's _way_ easier to determine instruction length on ARM.
          | It's usually fixed. That eliminates a lot of brute-force
          | thrashing
         | that X86 decoders have to do. It doesn't impact transistor
         | count all that much on a huge modern CPU but it saves a decent
         | amount of power. It's one of the things that factors into why
         | ARM is so power efficient.
         | 
         | ARM has also been willing to drop older optional legacy stuff
         | like Java oriented instructions that almost nobody used and
         | Thumb. X86 supports nearly all legacy opcodes, even things like
         | MMX and other obsolete vector operations that modern programs
         | never use.
        
           | thechao wrote:
           | It's not that variable length is expensive, it's that
           | variable length _the way Intel does it_ is expensive. For
           | instance -- not that this is a good idea -- you could burn
            | the top 2b to mark instructions as 2/4/6/8 bytes (or
           | whatever) in length. Then you can have your variable-width-
           | cake-and-eat-your-fast-decode-too.
        
             | richardwhiuk wrote:
             | Variable length is always going to restrict the parralelism
             | of your instruction decode (for a given chip area/power/etc
             | cost).
        
               | thechao wrote:
               | Only because a single instruction would take the "space"
               | of multiple instructions in the I$-fetch-to-decode. The
               | idea with variable-length encodings is that, for example,
               | an 8B encoding does _more than twice the work_ of a 4B
               | encoding, so you lose out on the instruction slot, but
               | win on the work done.
               | 
               | I mean ... that's the theory.
        
             | bogomipz wrote:
             | Could you elaborate - what is about the Intel design that
             | makes the decode so inefficient? Is "2b" bits here?
             | 
             | Are there examples of ISA or chips that handle variable
             | length instruction encoding efficiently?
        
               | pkaye wrote:
               | Due to how the instruction set evolved, for the Intel x86
               | architecture, you have to look at a lot of bits of the
               | instruction stream to determine where the next
                | instruction starts. To execute multiple instructions
                | per clock cycle, you also have to decode multiple
                | instructions per clock cycle.
               | 
               | I think this old Intel patent talks about one of their
               | decoder implementations:
               | 
               | https://patents.google.com/patent/US5758116A/en
        
               | Symmetry wrote:
                | Intel x86 isn't self-synchronizing. In theory you have to
               | decode every byte of the instruction stream that came
               | before to be sure where the boundaries of an instruction
               | are. Normally it stabilizes after a while in real world
               | instruction streams but you can craft malicious
               | instruction streams which yield two different valid sets
               | of instructions depending on whether you start reading
               | them at an offset or not.
               | 
               | Contrast that to something like utf8 where that isn't
               | possible.
        
               | saagarjha wrote:
               | Note that random x86 code is usually "eventually self-
                | synchronizing", which is useful if you're trying to
               | disassemble a blob at a random offset.
        
             | atq2119 wrote:
             | > you could burn the top 2b to mark instructions as 2/4/6/8
             | bytes (or whatever) in length.
             | 
             | FWIW, this is exactly what RISC-V does.
        
             | akira2501 wrote:
             | > you could burn the top 2b to mark
             | 
              | Which seems reasonable... but you just burned 3/4 of the
              | single-opcode instruction space, which may not be worth
              | it for most general-purpose loads.
        
               | bogomipz wrote:
               | Would you mind elaborating on the math of how the "top
               | 2b" ends up burning 3/4 of the single opcode instruction
               | space?
        
       | gbrown_ wrote:
       | How about undocumented instructions? Sandsifter[1] is an
       | interesting project and the video from BlackHat[2] is a good
       | watch. There's also a previous discussion of it on HN[3].
       | 
       | [1] https://github.com/Battelle/sandsifter
       | 
       | [2] https://www.youtube.com/watch?v=KrksBdWcZgQ
       | 
       | [3] https://news.ycombinator.com/item?id=18179212
        
         | da_big_ghey wrote:
          | Yes, and sandsifter continues to find more; recently,
          | undocumented microcode-modifying instructions were found
          | with it.
        
         | sandinmyjoints wrote:
         | Very cool project! Thanks for the pointer.
        
       | snvzz wrote:
       | The only real answer is: Too Many.
       | 
       | This complexity is pushed down to operating systems, compilers,
        | assemblers, debuggers. It ends up causing brutal human-time
        | overhead throughout the chain, and its cost effectively
        | prevents security and high-assurance work.
       | 
       | This more than justifies moving away from x86 into RISC
       | architectures, such as the rising open and royalty-free RISC-V.
        
         | saagarjha wrote:
         | There is complexity, but it's not nearly as large as you claim
         | it is, nor does RISC magically resolve the issues you've
         | brought up.
        
         | WalterBright wrote:
         | Back in the 70s, a friend of mine said he loved the Z80
         | instruction set the best. I replied that it was the most
         | complicated and kludgy one. He laughed and said the complexity
         | allowed his code to shine above others and he got paid more.
         | 
         | Of course, these days the Z80 instruction set looks trivial :-)
        
         | deepspace wrote:
         | > This complexity is pushed down to operating systems,
         | compilers, assemblers, debuggers.
         | 
         | Is that really true though? In my experience, the amount of
         | complexity in a layered system is roughly constant, and if you
         | make one layer simpler, you have to make another more complex.
         | 
          | I would expect that a compiler for a RISC architecture, for
          | example, needs to do a lot of work figuring out which set of
          | 'simple' instructions encodes an intent most efficiently. It
          | also needs to deal with instruction scheduling and other
          | issues that a CISC compiler does not care about.
        
       | mobilio wrote:
       | In short - endless battle between RISC or CISC.
        
         | Symmetry wrote:
         | These days RISC is mostly about regular encoding of
         | instructions which don't present challenges to doing precise
          | interrupts on a deeply pipelined machine, rather than just
          | having few instructions _per se_.
        
           | bogomipz wrote:
           | What do you mean by "precise" interrupts here? Do some types
           | of interrupts cause worse pipeline stalls than other types?
           | Are the interrupt handlers in RISC faster because of more
           | efficient decoding? Is that the issue?
        
             | zsmi wrote:
             | This is the classic definition
             | 
             | https://dl.acm.org/doi/10.1109/12.4607
             | 
             | "An interrupt is precise if the saved process state
             | corresponds to a sequential model of program execution in
             | which one instruction completes before the next begins. In
             | a pipelined processor, precise interrupts are difficult to
             | implement because an instruction may be initiated before
             | its predecessors have completed."
        
             | Symmetry wrote:
              | No, it's just that the fact that RISC didn't have any
              | load-op-store instructions made it easier to pipeline
              | them back in the day, while still being able to handle
              | interrupts as if they executed one instruction at a
              | time: one instruction had fully executed while the
              | following one hadn't executed at all. That's possible to
              | do with CISC architectures, of course; we do it all the
              | time these days. But back in the day, being able to
              | implement that easily-ish in a reasonable number of
              | transistors was a big factor in early RISC processors
              | having so much better performance than the CISCs of the
              | time.
             | 
              | In the modern day it mostly just means that you can
              | implement fancy out-of-order shenanigans with somewhat
              | fewer engineer-years than non-RISC ISAs.
        
               | bogomipz wrote:
               | Thanks these are both very helpful and good bits of
               | historical perspective as well. Cheers.
        
         | kingsuper20 wrote:
         | Oh well, there's RISC and there's RISC.
         | 
         | I always think of the AMD 29000.
        
           | mobilio wrote:
           | Or i860 & i960?
        
       | johnklos wrote:
       | More than 1,500! Holy cow!
       | 
        | While one way to look at CISC is that instructions for
        | everything may be slow in early models but can be
        | significantly improved in silicon over time, I genuinely
        | wonder how much silicon is spent on instructions so rarely
        | used they'd be better done in software.
       | 
       | Or to ask another way: how many instructions are in billions of
       | x86 cores that rarely if ever get used? Hmmm...
        
         | dreamcompiler wrote:
          | I'd also be curious to discover how many distinct x86
          | instructions gcc can even emit. I expect the answer is "a
          | lot fewer than all of them."
        
           | saagarjha wrote:
           | Probably a couple hundred at most, and for common programs
           | several dozen.
        
           | Someone wrote:
           | If you cheat: all of them. GCC supports in-line assembly.
           | 
           | Of course, that's not an interesting observation.
           | 
           | It gets (somewhat) interesting when you realize that one may
           | inadvertently include in-line assembly. Do you call including
           | system headers that happen to contain inline assembly
           | cheating? Including headers for a standard C library?
           | Compiling with old-fashioned compiler flags (for example to
           | generate FPU instructions for the x87)?
        
           | kingsuper20 wrote:
           | I mostly just think of the test set they must make use of at
           | Intel. It makes my head hurt. Maybe you end up with a wad of
           | legacy code so large that no one knows how it really works.
           | That ends up being the real definition of the part.
        
         | thechao wrote:
         | It's really more like 6000, if you do the accounting right.
          | And, as someone who just ate a 2x (customer-facing) slowdown
         | porting a piece of x86 assembly to ARM, because that "one nifty
         | instruction" was missing, I'm going to say: we should keep all
         | of 'em.
        
           | mhh__ wrote:
           | Which one?
        
             | thechao wrote:
             | _mm256_movemask_epi8, i.e., the "fast lexer" instruction.
             | That instruction takes ~3c (depending on uarch), and the
             | ARM equivalent (7-8 instructions) takes ~5-6c (depending on
             | uarch). It's just _annoying_.
        
               | ConcernedCoder wrote:
               | This instruction does quite a bit of leg work, and it
                | becomes obvious why a RISC architecture would need 7-8
               | instructions to do the same, see: https://software.intel.
               | com/sites/landingpage/IntrinsicsGuide...
        
               | irdc wrote:
               | This looks like a task for gorc: https://five-
               | embeddev.com/riscv-bitmanip/draft/bext.html#gen...
               | 
               | (granted, that's only 32/64 bits, but still...)
        
         | CodeArtisan wrote:
         | AMD Zen dropped a few instructions sets from the previous
         | architecture.
         | 
         | FMA4
         | https://en.wikipedia.org/wiki/FMA_instruction_set#FMA4_instr...
         | 
         | XOP https://en.wikipedia.org/wiki/XOP_instruction_set
         | 
         | TBM
         | https://en.wikipedia.org/wiki/Bit_manipulation_instruction_s...
        
       | davidkuhta wrote:
       | > To not leave you hanging: Intel has an official x86
       | encoder/decoder library called XED. According to Intel's XED, as
       | of this writing, there are 1503 defined x86 instructions
       | ("iclasses" in XED lingo), from AAA to XTEST (this includes AMD-
       | specific extensions too, by the way). Straightforward, right?
       | 
       | Hopefully this will have either saved you a click or validated
       | your time in reading the article.
        
         | siltpotato wrote:
         | Well, it's the first line. Probably doesn't count as click
         | bait.
        
         | anyfoo wrote:
         | Curious, does anyone actually care about the actual number
         | primarily? I thought pretty much everyone who clicks on an
         | article with that title would do so because they are interested
         | in the insights gathered when getting to that number.
        
           | wglb wrote:
           | If you are writing a disassembler or binary program decoder,
           | such a number will help you be sure that you enumerate all
           | the instructions.
        
             | ketralnis wrote:
             | I doubt most people reading the article coming from HN are
             | writing disassemblers, and all such people would have to
             | read it anyway because the number itself isn't sufficient
             | to validate that you've enumerated all of them (because as
             | my sibling points out, it's more complicated than that).
             | The specific number is the least interesting part.
        
             | account4mypc wrote:
              | I think if you were writing such a program, this article
              | would show that using such a number is a much more
              | complicated idea than it sounds.
        
               | Out_of_Characte wrote:
                | Enumerating all instructions would be useful to check
                | whether you can decode all legal instructions.
        
               | saagarjha wrote:
               | The issue is that the list of all legal instructions is
               | hard to define.
        
         | carbonguy wrote:
         | For me the article was well worth it; where else but in ISA
         | discussions can you find gems like the following?
         | 
         | > _Does a non-instruction that is non-defined and unofficially
         | guaranteed to non-execute exactly as if it had never been in
         | the instruction set to begin with count as an x86 instruction?
         | For that matter, does UD2 itself, the defined undefined
         | instruction, count as an instruction?_
        
           | sandinmyjoints wrote:
           | This was why I posted it -- I learned a lot more than the
           | answer to the title.
        
       | xony wrote:
        | 6000? The cache can't effectively handle the burden of the
        | decoder propagation delays. Compiler writers might not be
        | using even 20% of the total instructions. Except for backward
        | compatibility, it's a useless architecture.
        
       | optimalsolver wrote:
       | Alternate instructions here :-)
       | 
       | https://paws.kettering.edu/~jhuggins/humor/opcodes.html
        
       ___________________________________________________________________
       (page generated 2021-04-21 23:01 UTC)