[HN Gopher] How many x86 instructions are there? (2016) ___________________________________________________________________ How many x86 instructions are there? (2016) Author : sandinmyjoints Score : 111 points Date : 2021-04-21 13:03 UTC (9 hours ago) (HTM) web link (fgiesen.wordpress.com) (TXT) w3m dump (fgiesen.wordpress.com) | hnmullany wrote: | Now ask how many unique x86 instructions do standard compilers | emit... | FpUser wrote: | I feel a sudden urge to write some assembly for fun. Have not done | it for at least a couple of years, I think. | carbonguy wrote: | The assembly bug bit me again a few months back, but instead of | writing it I had a grand time hand-disassembling some ARMv4T | binaries - scratched something of the same itch as solving a | sudoku. | neurostimulant wrote: | Then you might like TIS-100 [1] and Shenzhen I/O [2]. | | [1] | https://store.steampowered.com/app/370360/TIS100/?curator_cl... | | [2] | https://store.steampowered.com/app/504210/SHENZHEN_IO/?curat... | deelowe wrote: | I've been getting a lot of enjoyment out of 6502/z80 projects | like rc2015 and Ben Eater's breadboard 6502 kit/videos. | There's also nand2tetris, if you prefer a more academic | approach. | snvzz wrote: | You'll feel less miserable if you target a modern RISC | architecture such as RISC-V or a cleaner CISC architecture such | as the venerable 68000. | | Whenever bored, I read or watch chibiakumas tutorials. There's | no such thing as knowing too many assemblers. | magicalhippo wrote: | Then when checking, the results are half the speed of what the | compiler spits out and the fun is gone. At least that's what | happens to me... | kingsuper20 wrote: | I've always been able to beat the compiler, and that's | usually after trying to optimize using C. Admittedly, it's a | whole lot harder to understand what's fast than it used to | be. Access to SSE has its own benefits. | | It's been a problem (optimizing) for some time though.
I | remember it being some work to beat the compiler on the | i960CA. OTOH, I seem to remember the i860 being not-so-great | and for sure the TI C80 C compiler was downright awful (per | usual for DSPs). | magicalhippo wrote: | Back in the Pentium 1 and earlier days I could beat the | compiler. But then it got hard. | | And it changes so often, instructions that are fast on one | CPU are not so fast on the next one, and vice versa. | | Not to mention branch prediction and out-of-order execution | make it very difficult to meaningfully benchmark. Is my | code really faster, or does it just seem like it because some | address got better aligned or similar? | | I've gotten significant speed gains in certain projects by | simply replacing certain hand-optimized assembly in | libraries (i.e. not my code) with the plain C code | equivalent. The assembly was probably faster 10-15 years | ago, but not anymore... | mhh__ wrote: | I've started injecting code fragments into spectre | exploits as a way of seeing whether the CPU likes them or | not. | kingsuper20 wrote: | >I've gotten significant speed gains in certain projects | by simply replacing certain hand-optimized assembly in | libraries (i.e. not my code) with the plain C code | equivalent. | | That's an interesting point, plus there's the portability | issue. | | My own breadcrumbs of legacy code for this kind of | innerloopish stuff have been to write a straightforward | 'C' implementation (and time it), an optimized 'C' | version (which itself can depend on the processor used), | and a hand-tuned assembly version where really needed. | | It allows you to back out of the tricky stuff plus acts | as a form of documentation. | zsmi wrote: | One should never lose to the compiler; after all, you can | see its output and it can't see yours. | | Also, the programmer can "cheat" by doing things the | compiler would consider invalid but are known to be ok | given the larger context of the application.
| | The restrict keyword in C gives an example of how one can | optimize code by knowing the larger context. https://cellperformance.beyond3d.com/articles/2006/05/demyst... | | The problem is the ROI is usually pretty bad, as these | assumptions rarely hold as the code evolves, in my | experience, and the optimization usually only lasts for a | finite (sometimes shockingly short) amount of time. i.e. OS | changes, hardware changes, memory changes, etc. etc. etc. | f00zz wrote: | That's why it's more fun to try to optimize for size | FpUser wrote: | For this case all fun is in the process ;) | im3w1l wrote: | Last time I wrote assembly, and it was a long while ago, it | was way faster. But let's be honest, 95% of it was doing | manual buffering on top of OS APIs rather than using the C stdlib. | And the other 5% was from skipping itoa calls, by doing | arithmetic directly on the string representation. | | I think this is why assembler can often be faster. Not | because I'm better than a compiler. But because the structure | of the language nudges you into faster approaches. | magicalhippo wrote: | Good point, though I usually found that if I then go back | and restructure my high-level code, the compiler beat my | ASM again. | TwoBit wrote: | ARM has all these variations which make it seem as complicated as | x86, but they are distinct variations and future CPUs can for | example drop 16-bit Thumb fairly cleanly. | klelatti wrote: | I think that one of the things that distinguishes Arm from | Intel (for good or bad) is that Arm _has_ left behind a lot of | the legacy. aarch64 has no Thumb, Jazelle, etc. | api wrote: | It's _way_ easier to determine instruction length on ARM. It's | usually fixed. That eliminates a lot of brute force thrashing | that X86 decoders have to do. It doesn't impact transistor | count all that much on a huge modern CPU but it saves a decent | amount of power. It's one of the things that factors into why | ARM is so power efficient.
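zsmi's `restrict` point upthread can be illustrated with a minimal C99 sketch (the function and names here are hypothetical, not from the linked article): by promising the compiler that the two pointers never alias, it can keep loads in registers across the loop instead of re-reading memory after every store.

```c
#include <stddef.h>

/* Hypothetical example of the restrict optimization discussed above:
 * restrict tells the compiler dst and src never alias, so src[i] need
 * not be reloaded after the store to dst[i]. Violating the promise is
 * undefined behavior, which is why such optimizations rot as code
 * evolves. */
void scale(float *restrict dst, const float *restrict src,
           float factor, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}
```

Without `restrict`, a conforming compiler must assume `dst` might overlap `src` and generate the extra reloads, which is exactly the "larger context" knowledge a hand-assembly programmer exploits for free.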
| | ARM has also been willing to drop older optional legacy stuff | like Java-oriented instructions that almost nobody used and | Thumb. X86 supports nearly all legacy opcodes, even things like | MMX and other obsolete vector operations that modern programs | never use. | thechao wrote: | It's not that variable length is expensive, it's that | variable length _the way Intel does it_ is expensive. For | instance -- not that this is a good idea -- you could burn | the top 2b to mark instructions as 2/4/6/8 bytes (or | whatever) in length. Then you can have your variable-width-cake-and-eat-your-fast-decode-too. | richardwhiuk wrote: | Variable length is always going to restrict the parallelism | of your instruction decode (for a given chip area/power/etc | cost). | thechao wrote: | Only because a single instruction would take the "space" | of multiple instructions in the I$-fetch-to-decode. The | idea with variable-length encodings is that, for example, | an 8B encoding does _more than twice the work_ of a 4B | encoding, so you lose out on the instruction slot, but | win on the work done. | | I mean ... that's the theory. | bogomipz wrote: | Could you elaborate - what is it about the Intel design that | makes the decode so inefficient? Is "2b" bits here? | | Are there examples of ISAs or chips that handle variable | length instruction encoding efficiently? | pkaye wrote: | Due to how the instruction set evolved, for the Intel x86 | architecture, you have to look at a lot of bits of the | instruction stream to determine where the next | instruction starts. To execute multiple instructions per | clock cycle you also have to decode multiple | instructions per clock cycle. | | I think this old Intel patent talks about one of their | decoder implementations: | | https://patents.google.com/patent/US5758116A/en | Symmetry wrote: | Intel x86 isn't self-synchronizing.
In theory you have to | decode every byte of the instruction stream that came | before to be sure where the boundaries of an instruction | are. Normally it stabilizes after a while in real-world | instruction streams, but you can craft malicious | instruction streams which yield two different valid sets | of instructions depending on whether you start reading | them at an offset or not. | | Contrast that to something like UTF-8, where that isn't | possible. | saagarjha wrote: | Note that random x86 code is usually "eventually | self-synchronizing", which is useful if you're trying to | disassemble a blob at a random offset. | atq2119 wrote: | > you could burn the top 2b to mark instructions as 2/4/6/8 | bytes (or whatever) in length. | | FWIW, this is exactly what RISC-V does. | akira2501 wrote: | > you could burn the top 2b to mark | | Which seems reasonable... but you just burned 3/4 of the | single opcode instruction space, which may not be worth it | for most general-purpose loads. | bogomipz wrote: | Would you mind elaborating on the math of how the "top | 2b" ends up burning 3/4 of the single opcode instruction | space? | gbrown_ wrote: | How about undocumented instructions? Sandsifter[1] is an | interesting project and the video from BlackHat[2] is a good | watch. There's also a previous discussion of it on HN[3]. | | [1] https://github.com/Battelle/sandsifter | | [2] https://www.youtube.com/watch?v=KrksBdWcZgQ | | [3] https://news.ycombinator.com/item?id=18179212 | da_big_ghey wrote: | Yes, and sandsifter keeps finding more; recent undocumented | microcode-modifying instructions were found with it. | sandinmyjoints wrote: | Very cool project! Thanks for the pointer. | snvzz wrote: | The only real answer is: Too Many. | | This complexity is pushed down to operating systems, compilers, | assemblers, debuggers. It ends up causing brutal human time | overhead throughout the chain, and its cost effectively prevents | security and high assurance.
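The "top 2b" exchange upthread can be made concrete with a small sketch. This is a hypothetical 2-bit length-tag encoding in the spirit of thechao's proposal, not any real ISA's format; it also shows the arithmetic behind akira2501's "burned 3/4" remark.

```c
#include <stdint.h>

/* Hypothetical length-tag scheme (not a real ISA): the top two bits
 * of an instruction's first byte select its total length.
 * 00 -> 2 bytes, 01 -> 4, 10 -> 6, 11 -> 8.
 * Decode is trivial and needs only one byte of lookahead. */
static unsigned insn_length(uint8_t first_byte) {
    return 2u * ((unsigned)(first_byte >> 6) + 1u);
}

/* The cost akira2501 points out: only first bytes tagged 00 can start
 * a shortest-format instruction, so of 256 possible first bytes only
 * 256/4 = 64 remain for that format -- 3/4 of the space is spent on
 * the length tag. */
enum { SHORT_FORMAT_OPCODES = 256 / 4 };
```

RISC-V's actual scheme differs in detail (it uses the low-order bits of the instruction), but the trade-off is the same: fast, parallel length determination in exchange for opcode-space density.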
| | This more than justifies moving away from x86 into RISC | architectures, such as the rising open and royalty-free RISC-V. | saagarjha wrote: | There is complexity, but it's not nearly as large as you claim | it is, nor does RISC magically resolve the issues you've | brought up. | WalterBright wrote: | Back in the 70s, a friend of mine said he loved the Z80 | instruction set the best. I replied that it was the most | complicated and kludgy one. He laughed and said the complexity | allowed his code to shine above others and he got paid more. | | Of course, these days the Z80 instruction set looks trivial :-) | deepspace wrote: | > This complexity is pushed down to operating systems, | compilers, assemblers, debuggers. | | Is that really true though? In my experience, the amount of | complexity in a layered system is roughly constant, and if you | make one layer simpler, you have to make another more complex. | | I would expect that a compiler for a RISC architecture, for | example, needs to do a lot of work figuring out which set of | 'simple' instructions encode an intent most efficiently. It | also needs to deal with instruction scheduling and other issues | that a CISC compiler does not care about. | mobilio wrote: | In short - endless battle between RISC and CISC. | Symmetry wrote: | These days RISC is mostly about regular encoding of | instructions which don't present challenges to doing precise | interrupts on a deeply pipelined machine, rather than just a few | instructions _per se_. | bogomipz wrote: | What do you mean by "precise" interrupts here? Do some types | of interrupts cause worse pipeline stalls than other types? | Are the interrupt handlers in RISC faster because of more | efficient decoding? Is that the issue?
| zsmi wrote: | This is the classic definition: | | https://dl.acm.org/doi/10.1109/12.4607 | | "An interrupt is precise if the saved process state | corresponds to a sequential model of program execution in | which one instruction completes before the next begins. In | a pipelined processor, precise interrupts are difficult to | implement because an instruction may be initiated before | its predecessors have completed." | Symmetry wrote: | No, it's just that RISC didn't have any load-op-store | instructions, which made it easier to pipeline them back | in the day while still being able to handle interrupts as | if they executed one instruction at a time, with a clear | boundary where one instruction had executed while the | following one hadn't executed at all. That's possible to do | with CISC architectures, of course; we do it all the time | these days. But back in the day, being able to implement | that easily-ish in a reasonable number of transistors was a | big factor in early RISC processors having so much better | performance than the CISCs of the time. | | In the modern day it mostly just means that you can | implement fancy out-of-order shenanigans with a bit fewer | engineer-years than non-RISC ISAs. | bogomipz wrote: | Thanks, these are both very helpful and good bits of | historical perspective as well. Cheers. | kingsuper20 wrote: | Oh well, there's RISC and there's RISC. | | I always think of the AMD 29000. | mobilio wrote: | Or i860 & i960? | johnklos wrote: | More than 1,500! Holy cow! | | While having instructions for everything that are slow in early | models but can be significantly improved in silicon over time is | one way to look at CISC, I genuinely wonder how much silicon is | spent on instructions that are so rarely used they'd be better in | software. | | Or to ask another way: how many instructions are in billions of | x86 cores that rarely if ever get used? Hmmm...
| dreamcompiler wrote: | I'd also be curious to discover how many distinct x86 | instructions gcc can even emit. I expect the answer is "a lot | less than all of them." | saagarjha wrote: | Probably a couple hundred at most, and for common programs | several dozen. | Someone wrote: | If you cheat: all of them. GCC supports in-line assembly. | | Of course, that's not an interesting observation. | | It gets (somewhat) interesting when you realize that one may | inadvertently include in-line assembly. Do you call including | system headers that happen to contain inline assembly | cheating? Including headers for a standard C library? | Compiling with old-fashioned compiler flags (for example to | generate FPU instructions for the x87)? | kingsuper20 wrote: | I mostly just think of the test set they must make use of at | Intel. It makes my head hurt. Maybe you end up with a wad of | legacy code so large that no one knows how it really works. | That ends up being the real definition of the part. | thechao wrote: | It's really more like 6000, if you do the accounting right. | And, as someone who just ate a 2x (customer-facing) slowdown | porting a piece of x86 assembly to ARM, because that "one nifty | instruction" was missing, I'm going to say: we should keep all | of 'em. | mhh__ wrote: | Which one? | thechao wrote: | _mm256_movemask_epi8, i.e., the "fast lexer" instruction. | That instruction takes ~3c (depending on uarch), and the | ARM equivalent (7-8 instructions) takes ~5-6c (depending on | uarch). It's just _annoying_. | ConcernedCoder wrote: | This instruction does quite a bit of legwork, and it | becomes obvious why a RISC architecture would need 7-8 | instructions to do the same, see: https://software.intel.com/sites/landingpage/IntrinsicsGuide... | irdc wrote: | This looks like a task for gorc: https://five-embeddev.com/riscv-bitmanip/draft/bext.html#gen... | | (granted, that's only 32/64 bits, but still...)
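For readers unfamiliar with the intrinsic thechao mentions, here is a portable scalar sketch of what `_mm256_movemask_epi8` computes -- the semantics only, not the single-instruction hardware version:

```c
#include <stdint.h>

/* Scalar model of _mm256_movemask_epi8's semantics: gather the most
 * significant bit of each of 32 bytes into one 32-bit mask. In the
 * "fast lexer" pattern, a SIMD compare first sets each byte to 0xFF
 * or 0x00 depending on whether it matches a delimiter; this movemask
 * step then turns that into a bitmask a single bit-scan can search. */
static uint32_t movemask_epi8_scalar(const uint8_t bytes[32]) {
    uint32_t mask = 0;
    for (int i = 0; i < 32; i++)
        mask |= (uint32_t)(bytes[i] >> 7) << i;
    return mask;
}
```

The cross-lane bit gather is what makes this awkward to replicate on architectures without an equivalent instruction, hence the 7-8 instruction ARM sequence mentioned above.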
| CodeArtisan wrote: | AMD Zen dropped a few instruction sets from the previous | architecture. | | FMA4 | https://en.wikipedia.org/wiki/FMA_instruction_set#FMA4_instr... | | XOP https://en.wikipedia.org/wiki/XOP_instruction_set | | TBM | https://en.wikipedia.org/wiki/Bit_manipulation_instruction_s... | davidkuhta wrote: | > To not leave you hanging: Intel has an official x86 | encoder/decoder library called XED. According to Intel's XED, as | of this writing, there are 1503 defined x86 instructions | ("iclasses" in XED lingo), from AAA to XTEST (this includes | AMD-specific extensions too, by the way). Straightforward, right? | | Hopefully this will have either saved you a click or validated | your time in reading the article. | siltpotato wrote: | Well, it's the first line. Probably doesn't count as | clickbait. | anyfoo wrote: | Curious, does anyone actually care about the actual number | primarily? I thought pretty much everyone who clicks on an | article with that title would do so because they are interested | in the insights gathered when getting to that number. | wglb wrote: | If you are writing a disassembler or binary program decoder, | such a number will help you be sure that you enumerate all | the instructions. | ketralnis wrote: | I doubt most people reading the article coming from HN are | writing disassemblers, and all such people would have to | read it anyway because the number itself isn't sufficient | to validate that you've enumerated all of them (because as | my sibling points out, it's more complicated than that). | The specific number is the least interesting part. | account4mypc wrote: | I think if you were writing such a program, this article | would show that using such a number is a much more | complicated idea than it sounds | Out_of_Characte wrote: | Enumerating all instructions would be useful to check | whether you can decode all legal instructions.
| saagarjha wrote: | The issue is that the list of all legal instructions is | hard to define. | carbonguy wrote: | For me the article was well worth it; where else but in ISA | discussions can you find gems like the following? | | > _Does a non-instruction that is non-defined and unofficially | guaranteed to non-execute exactly as if it had never been in | the instruction set to begin with count as an x86 instruction? | For that matter, does UD2 itself, the defined undefined | instruction, count as an instruction?_ | sandinmyjoints wrote: | This was why I posted it -- I learned a lot more than the | answer to the title. | xony wrote: | 6000? The cache can't effectively handle the burden of the | decoder propagation delays. Compiler writers might not be using | even 20% of the total instructions. Except for backward | compatibility, it's a useless architecture. | optimalsolver wrote: | Alternate instructions here :-) | | https://paws.kettering.edu/~jhuggins/humor/opcodes.html ___________________________________________________________________ (page generated 2021-04-21 23:01 UTC)