[HN Gopher] The Intel 80376 - A legacy-free i386 with a twist (2010)
___________________________________________________________________

The Intel 80376 - A legacy-free i386 with a twist (2010)

Author : anyfoo
Score  : 61 points
Date   : 2022-08-13 06:13 UTC (2 days ago)

(HTM) web link (www.pagetable.com)
(TXT) w3m dump (www.pagetable.com)

| allenrb wrote:
| I'd forgotten about the 80376, but it gets at a question I've
| occasionally had over the last few years. Why have we not seen
| a "modernized" x86 CPU that strips out everything pre-AMD64?
| The answer seems likely to be one or both of:
|
| 1. There are more users of legacy modes than is obvious to us
| on HN.
|
| 2. The gains in terms of gates saved, critical paths shortened,
| and power consumption lowered just don't amount to much.
|
| My guess is that #2 is the dominant factor. If there were
| actually significant gains to be had on a "clean"(er) x86
| design, we'd see it in the market regardless of #1.
| klelatti wrote:
| x86 did appear in one context where legacy compatibility is
| likely to have been a much smaller issue (or a non-issue?) and
| where efficiencies (e.g. in power consumption) would have been
| even more valuable - that's on mobile, running Android.
|
| The fact that a cleaned-up version wasn't used there would seem
| to support your hypothesis.
| tenebrisalietum wrote:
| So I learned about the "hidden" x86 mode called XuCode
| (https://www.intel.com/content/www/us/en/developer/articles/t...)
| - which is x86 binary code placed into RAM by the microcode and
| then "called" by the microcode for certain instructions -
| particularly SGX ones, if I'm remembering correctly.
|
| Wild speculative guess: it's entirely possible some of the
| pre-AMD64 stuff is actually used internally by modern Intel and
| AMD CPUs to implement complex instructions.
| kmeisthax wrote:
| Oh boy, we've gone all the way back to Transmeta Code
| Morphing Software. What "ring" does this live on now? Ring
| -4? :P
|
| Jokes aside, I doubt XuCode would use pre-AMD64 stuff;
| microcode is lower-level than that. The pre-AMD64 stuff is
| already handled with sequences of microcode operations
| because it's not really useful for modern applications[0].
| It's entirely possible for microcode to implement other
| instructions too, and that's what XuCode is doing[1].
|
| The real jank is probably hiding in early boot and SMM,
| because you need to both jump to modern execution modes for
| client or server machines _and_ downgrade to BIOS
| compatibility for all those industrial deployments that want
| to run ancient software and OSes on modern machines.
|
| [0] The last time I heard someone even talk about x86
| segmentation, it was as part of enforcing the Native Client
| inner sandbox.
|
| [1] Hell, there's no particular reason why you can't have a
| dual-mode CPU with separate decoders for ARM and x86 ISAs. As
| far as I'm aware, however, such a thing does not exist...
| though evidently at one point AMD was intending on shipping
| Ryzen CPUs with ARM decoders in them.
| Macha wrote:
| As the article points out, modern x86 CPUs boot up in 16-bit
| mode, then get transferred into 32-bit mode, then 64-bit mode.
| So right out of the gate such a CPU is not compatible with
| existing operating systems, and now you have a non-compatible
| architecture. Sure, Intel could easily add support to GRUB and
| push Microsoft to do it for new Windows media, but that won't
| help the existing install base. Intel tried launching a
| non-compatible CPU once, it was Itanium, and it didn't go so
| well for them.
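|
| (An aside on that mode dance: the 16 -> 32 -> 64 transitions
| themselves are firmware-land assembly, but you can at least ask
| the CPU from user code whether long mode exists at all. A
| minimal sketch, assuming GCC or Clang on x86; CPUID leaf
| 0x80000001 reports the "LM" long-mode flag in EDX bit 29:)
|
|     #include <cpuid.h>
|     #include <stdio.h>
|
|     int main(void) {
|         unsigned eax, ebx, ecx, edx;
|         /* Extended feature leaf; EDX bit 29 ("LM") is
|            long-mode (AMD64) support. */
|         if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)
|             && (edx & (1u << 29)))
|             puts("long mode (64-bit) supported");
|         else
|             puts("32-bit only");
|         return 0;
|     }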
|
| Plus I'm sure there are crazy DRM rootkits that depend on these
| implementation details.
|
| Also, AMD has already experimented with not-quite-PC-compatible
| x86 setups in the consoles. As the fail0verflow talk about
| Linux on PS4 emphasised, the PS4 is x86, but not a PC. So
| despite having built an x86 CPU with somewhat less legacy, AMD
| didn't seem to think it worthwhile bringing it to a more
| general-purpose platform.
|
| Also, AMD/Intel/VIA are the only companies with the licenses to
| produce x86, and you'd need both Intel and AMD to sign off on
| licensing x64 to someone new.
| messe wrote:
| > As the article points out, modern x86 CPUs boot up in 16-bit
| mode, then get transferred into 32-bit mode, then 64-bit mode.
| So right out of the gate such a CPU is not compatible with
| existing operating systems, and now you have a non-compatible
| architecture
|
| Except that a modern OS is booted in UEFI mode, meaning that
| the steps of going from 16 -> 32 -> 64-bit mode are all handled
| by the firmware, not the kernel or the bootloader. The OS
| kernel will only (at most) switch to 32-bit compatibility mode
| (a submode of long mode, not protected mode) when it needs to
| run 32-bit apps, otherwise staying in 64-bit mode 100% of the
| time.
| anyfoo wrote:
| Yeah. Long mode is a bit of a line in the sand, leaving
| behind many features that were kept for compatibility
| (segmentation, vm86...). It came at a time when, fortunately,
| the mainstream OSes had enough abstraction that software no
| longer had to be written for what was effectively the bare
| metal, with DOS almost being more of a "software bootloader
| with filesystem services".
| [deleted]
| anyfoo wrote:
| > Intel tried launching a non-compatible CPU once, it was
| Itanium, it didn't go so well for them.
|
| That may be only secondary, though. Itanium simply failed to
| deliver on its performance promises and be competitive. The
| compiler was supposed to effectively perform instruction
| scheduling itself, and writing such a compiler turned out to
| be more difficult than anticipated.
| FullyFunctional wrote:
| I've seen this a lot, but IMO the truth is slightly
| different: the assumption behind EPIC was that a compiler
| _could_ do the scheduling, which turned out to be
| _impossible_. The EPIC effort's roots go way back, but still
| I don't understand how they failed to foresee the
| ever-growing tower of caches, which unavoidably leads to a
| crazy wide latency range for loads (300-400+ cycles) and
| which in turn is why we now have these very deep OoO
| machines. (Tachyum's Prodigy appears to be repeating the
| EPIC mistake with very limited but undisclosed reordering.)
|
| OoO EPIC has been suggested (I recall an old comp.arch
| posting by an Intel architect) but never got green-lit. I
| assume they had bet so much on the compiler assumption that
| the complexity would have killed it.
|
| It's really a shame, because EPIC did get _some_ things
| right. The compiler absolutely can make the front end's
| life easier by making dependences more explicit (though I
| would do it differently) and by making control transfers
| much easier to deal with (the 128-bit block alone saves 4
| bits in all BTB entries, etc.). On balance, IA-64 was a
| committee-designed train wreck, piling on way too much
| complexity, and it failed both as a brainiac and as a speed
| demon.
|
| Disclaimer: I have an Itanic space heater that I
| occasionally boot up for the chuckle - and then shut down
| before the hearing damage gets permanent.
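|
| (That "crazy wide latency range for loads" is easy to see even
| from user code. A toy pointer-chasing sketch in C - the sizes
| here are arbitrary choices and the absolute numbers vary wildly
| by machine - walking the same ring of pointers laid out
| sequentially vs. randomly:)
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <time.h>
|
|     #define N (1 << 22)  /* ~32 MiB of pointers: bigger than L2 */
|
|     static long long ns_now(void) {
|         struct timespec ts;
|         clock_gettime(CLOCK_MONOTONIC, &ts);
|         return ts.tv_sec * 1000000000LL + ts.tv_nsec;
|     }
|
|     static volatile size_t sink;
|
|     /* Each load depends on the previous one, so latency adds up. */
|     static long long chase(const size_t *next, long long steps) {
|         long long t0 = ns_now();
|         size_t i = 0;
|         for (long long k = 0; k < steps; k++)
|             i = next[i];
|         sink = i;  /* keep the chain from being optimized away */
|         return ns_now() - t0;
|     }
|
|     int main(void) {
|         size_t *next = malloc(N * sizeof *next);
|         if (!next) return 1;
|         /* Sequential ring: the prefetcher hides nearly all latency. */
|         for (size_t i = 0; i < N; i++) next[i] = (i + 1) % N;
|         printf("sequential: %lld ns/load\n",
|                chase(next, 4LL * N) / (4LL * N));
|         /* Random single-cycle permutation (Sattolo's algorithm):
|            nearly every load misses the caches. */
|         for (size_t i = 0; i < N; i++) next[i] = i;
|         for (size_t i = N - 1; i > 0; i--) {
|             size_t j = (size_t)rand() % i, t = next[i];
|             next[i] = next[j]; next[j] = t;
|         }
|         printf("random:     %lld ns/load\n",
|                chase(next, N) / (long long)N);
|         free(next);
|         return 0;
|     }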
| klelatti wrote:
| > Intel tried launching a non-compatible CPU once, it was
| Itanium, it didn't go so well for them.
|
| More than once. iAPX432, if anything, went worse.
| anyfoo wrote:
| Yeah, but that, again, was for far worse reasons than just
| not being "compatible". In fact, iAPX432 was exceptionally
| bad. Intel's i960, for example, fared much better in the
| embedded space (where PC compatibility did not matter).
| klelatti wrote:
| Indeed, and in fairness I don't think the 432 was ever
| intended as a PC CPU replacement, whilst Itanium was
| designed to replace some x86 servers.
|
| As an aside, I'm still astonished that the 432 and Itanium
| got as far as they did, with so much cash spent on them
| without conclusive proof that performance would be
| competitive. Seems like a prerequisite for projects of this
| size.
| rodgerd wrote:
| Think about the disruption that Apple caused when they moved
| from supporting 32-bit x86 to deprecating it - there was a
| great deal of angst, and that's on a vertically-integrated
| platform that is relatively young (yes, I know that NeXT is
| old, but macOS isn't, really). Now imagine that on Windows - a
| much older platform, with much higher expectations of
| backwards compat. It would be a bloodbath of user rage.
|
| More importantly, though, backward compat has been Intel's
| moat for a long, long time. Intel have been trying to get
| people to buy non-16-bit-compat processors for literally 40
| years! They've tried introducing a lot of mildly (StrongARM -
| technically a buy, I suppose - i860/i960) and radically (i432,
| Itanium) innovative processors, and they've all been
| indifferent or outright failures in the marketplace.
|
| The market has been really clear on this: it doesn't care for
| Intel graphics cards or SSDs or memory, and it hates non-x86
| Intel processors. Intel stays in business by shipping
| 16-bit-compatible processors.
| rwmj wrote:
| Quite a lot of modern Arm 64-bit processors have dropped
| 32-bit (i.e. ARMv7) support. Be careful what you wish for,
| though! It's still useful to be able to run 32-bit i386 code
| at a decent speed occasionally. Even on my Linux systems I
| still have hundreds of *.i686.rpms installed.
| jleahy wrote:
| The 64-bit ARM instructions were designed in a way that made
| supporting both modes in parallel very expensive from a
| silicon perspective. In contrast, AMD were very clever with
| AMD64 and designed it such that very little additional
| silicon area was required to add it.
| danbolt wrote:
| I feel as though a lot of the consumer value of x86+Windows
| comes from its wide library of software and compatibility.
|
| > than is obvious to us on HN
|
| I think your average HNer is more likely to interact with
| Linux/Mac workstations or servers, where binary compatibility
| isn't as necessary.
| unnah wrote:
| Instruction decoding is a bottleneck for x86 these days: the
| Apple M1 can do 8-wide decode, Intel just managed to reach
| 6-wide in Alder Lake, and AMD's Zen 3 only has a 4-wide
| decoder. One would think that dropping legacy 16-bit and
| 32-bit instructions would enable simpler and more efficient
| instruction decoders in future x86 versions.
| amluto wrote:
| Sadly not. x86_64's encoding is extremely similar to the
| legacy encodings. AIUI the fundamental problem is that x86 is
| a variable-length encoding, so a fully parallel decoder needs
| to decode at guessed offsets, many of which will be wrong.
| ARM64 instructions are aligned.
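|
| (To make the serial dependence concrete, here's a toy length
| decoder for a tiny x86 subset in C - hedged: a real decoder
| handles a dozen-plus prefixes, ModRM/SIB bytes, escape maps,
| and much more. Note how the operand-size prefix changes the
| length of the immediate, so you only learn where instruction
| N+1 starts after fully examining instruction N:)
|
|     #include <stdio.h>
|     #include <stddef.h>
|
|     /* Returns the length of the instruction at buf[pos],
|        or 0 if it's outside our toy subset. */
|     static size_t insn_len(const unsigned char *buf, size_t pos) {
|         size_t len = 0;
|         int opsize16 = 0;
|         if (buf[pos] == 0x66) {  /* operand-size prefix */
|             opsize16 = 1;
|             len++; pos++;
|         }
|         unsigned char op = buf[pos];
|         if (op == 0x90 || op == 0xC3)    /* nop / ret */
|             return len + 1;
|         if (op >= 0xB8 && op <= 0xBF)    /* mov reg, imm */
|             return len + 1 + (opsize16 ? 2 : 4);  /* imm16 : imm32 */
|         return 0;
|     }
|
|     int main(void) {
|         /* nop; mov eax, 1; mov ax, 1; ret */
|         const unsigned char code[] = {
|             0x90,
|             0xB8, 0x01, 0x00, 0x00, 0x00,
|             0x66, 0xB8, 0x01, 0x00,
|             0xC3,
|         };
|         size_t pos = 0;
|         while (pos < sizeof code) {
|             size_t n = insn_len(code, pos);
|             if (!n) break;
|             printf("insn at offset %zu: %zu bytes\n", pos, n);
|             pos += n;  /* next start known only after this decode */
|         }
|         return 0;
|     }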
|
| Dumping legacy features would be great for all kinds of
| reasons, but not this particular reason.
| hyperman1 wrote:
| This suggests another way forward: re-encode the existing
| opcodes with new, more regular byte sequences, e.g. 32 bits
| per instruction, with some escape for e.g. 64-bit constants.
| You'd have to redo the backend of the assembler, but most of
| the compiler and optimization wisdom could be reused as-is.
| Of course, this breaks backward compatibility completely, so
| the high-performance mode could only be unlocked by
| recompiles.
| colejohnson66 wrote:
| That was Itanium, and it failed for a variety of reasons,
| one of which was a compatibility layer that sucked. You
| _can't_ get rid of x86's backwards compatibility. Intel and
| AMD have done their best by using vector prefixes (like VEX
| and EVEX)[a] that massively simplify decoding, but there's
| only so much that can be done.
|
| People get caught up in the variable-length issue that x86
| has, and then claim that the M1 beats x86 because of that.
| Sure, decoding ARM instructions is easier than decoding x86,
| but the variable-length aspect is handled in the
| predecode/cache stage, not the actual decoder. The decoder,
| when it reaches an instruction, already knows where the
| various bits of info are.
|
| The RISC vs CISC debate is useless today. The M1's big
| advantage comes from the memory ordering model (and other
| things)[0], not the instruction format. Apple actually had
| to create a special mode for the M1 (for Rosetta 2) that
| enforces the x86 ordering model (TSO with load forwarding),
| and native performance is slightly worse when doing so.
|
| [0]:
| https://twitter.com/ErrataRob/status/1331735383193903104
|
| [a]: There are also others that predate AVX (VEX), such as
| the 0F38 prefix group, consisting only of opcodes that have
| a ModR/M byte and no immediate, and the 0F3A prefix group
| being the same, but with an 8-bit immediate.
| ShroudedNight wrote:
| I thought the critical failure of Itanium was that a priori
| VLIW scheduling turned out to be a non-starter, at least as
| far as doing so efficiently goes.
| atq2119 wrote:
| The entire approach is misguided for single-threaded
| performance. It turns out that out-of-order execution is
| pretty important for a number of things, perhaps most
| importantly dealing with variable memory instruction
| latencies (cache hits at various points in the hierarchy
| vs. misses). A compiler simply cannot statically predict
| those well enough.
| Dylan16807 wrote:
| > That was Itanium
|
| What? No. Itanium was a vastly, wildly different
| architecture.
| FullyFunctional wrote:
| Two factual mistakes:
|
| * IA-64 failed primarily because it failed to deliver the
| promised performance. x86 compatibility isn't and wasn't
| essential to success (behold the success of Arm, for
| example).
|
| * The M1's advantage has almost nothing to do with the weak
| memory model; it has to do with everything: wider, deeper,
| faster (memory). The ISA being Arm64 also helps in many
| ways. The variable-length x86 instructions can be dealt
| with via predecoding, sure, to an extent, but that
| lengthens the pipeline, which hurts the branch-mispredict
| penalty, which absolutely matters.
| kmeisthax wrote:
| The M1 doesn't have a special mode for Rosetta. _All code_
| is executed with x86 TSO on the M1's application
| processors. How do I know this?
|
| Well, did you know Apple ported Rosetta 2 to Linux? You can
| get it by running a Linux VM on macOS.
| It does not require any kernel changes to support in VMs,
| and if you extract the binary to run it on Asahi Linux, it
| works just fine too. None of the Asahi team did _anything_
| to support x86 TSO. Rosetta also works just fine in m1n1's
| hypervisor mode, which exists specifically to log all
| hardware access to detect these sorts of things. If there
| _is_ a hardware toggle for TSO, it's either part of the
| chicken bits (and thus enabled all the time anyway) or
| turned on by iBoot (and thus enabled before any user code
| runs).
|
| Related point: Hector Martin just upstreamed a patch to
| Linux that fixes a memory ordering bug in workqueues that's
| been around since before Linux had Git history. He also
| found a bug in some ARM litmus tests that he was using to
| validate whether or not they were implemented correctly.
| Both of those happened purely because the M1 and M2 are so
| hilariously wide and speculative that they trigger memory
| reorders no other CPU would.
| messe wrote:
| I'm sorry, but please cite some sources, because this
| contradicts everything that's been said about the M1's x86
| emulation that I've read so far.
|
| > Well, did you know Apple ported Rosetta 2 to Linux? You
| can get it by running a Linux VM on macOS. It does not
| require any kernel changes to support in VMs, and if you
| extract the binary to run it on Asahi Linux, it works just
| fine too. None of the Asahi team did anything to support
| x86 TSO. Rosetta also works just fine in m1n1's hypervisor
| mode, which exists specifically to log all hardware access
| to detect these sorts of things. If there is a hardware
| toggle for TSO, it's either part of the chicken bits (and
| thus enabled all the time anyway) or turned on by iBoot
| (and thus enabled before any user code runs).
|
| Apple tells you to attach a special volume/FS to your Linux
| VM in order for Rosetta to work. When such a volume is
| attached, it runs the VM in TSO mode. As simple as that.
|
| The Rosetta binary itself doesn't know whether or not TSO
| is enabled, so it's not surprising that it runs fine under
| Asahi. As marcan42 himself said on Twitter[1], most x86
| applications will run fine even without TSO enabled. You're
| liable to run into edge cases in heavily multithreaded code
| though.
|
| [1]:
| https://twitter.com/marcan42/status/1534054757421432833
|
| > Both of those happened purely because M1 and M2 are so
| hilariously wide and speculative that they trigger memory
| reorders no other CPU would.
|
| In other words, they're not constantly running in TSO mode?
| Because if they were, why would they trigger such
| re-orders?
|
| EDIT: I've just run a modified version of the following
| test program[2] (removing the references to the tso_enable
| sysctl, which requires an extension), both native and under
| Rosetta.
|
| Running natively, it fails after ~3500 iterations. Under
| Rosetta, it completes the entire test successfully.
|
| [2] https://github.com/losfair/tsotest/
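|
| (For the curious, the heart of such a litmus test looks
| something like the following - a minimal C11 sketch of the
| classic "message passing" pattern, not the actual tsotest
| code. Under x86-TSO the reader can never observe flag == 1
| but data == 0; a weakly-ordered Arm core running these
| relaxed atomics can. Caveat: the compiler may also reorder
| relaxed accesses, so build with low optimization if you want
| to test the CPU rather than the compiler:)
|
|     #include <pthread.h>
|     #include <stdatomic.h>
|     #include <stdio.h>
|
|     static atomic_int data, flag;
|
|     static void *writer(void *arg) {
|         atomic_store_explicit(&data, 42, memory_order_relaxed);
|         atomic_store_explicit(&flag, 1, memory_order_relaxed);
|         return arg;
|     }
|
|     int main(void) {
|         for (long i = 0; i < 100000; i++) {
|             atomic_store_explicit(&data, 0, memory_order_relaxed);
|             atomic_store_explicit(&flag, 0, memory_order_relaxed);
|             pthread_t t;
|             pthread_create(&t, NULL, writer, NULL);
|             /* spin until the writer raises the flag */
|             while (atomic_load_explicit(&flag,
|                                         memory_order_relaxed) == 0)
|                 ;
|             int d = atomic_load_explicit(&data,
|                                          memory_order_relaxed);
|             pthread_join(t, NULL);
|             if (d == 0) {
|                 printf("reorder observed at iteration %ld\n", i);
|                 return 1;
|             }
|         }
|         puts("no reordering observed (TSO-like, or just lucky)");
|         return 0;
|     }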
| anyfoo wrote:
| > M1 doesn't have a special mode for Rosetta. All code is
| executed with x86 TSO on M1's application processors.
|
| That's not true. (And doesn't your last paragraph
| contradict it already?)
|
| You might just have figured out that most stuff will run
| fine (or appear to run fine for a long time) when TSO
| isn't enabled.
| messe wrote:
| I have no idea why you are being downvoted. You are
| entirely correct.
| anyfoo wrote:
| Thanks, I was puzzled as well. The downvotes seem to have
| stopped, though.
| umanwizard wrote:
| If you're inventing a completely incompatible ISA, why not
| just use ARM64 at that point?
| anamax wrote:
| Perhaps because you don't want to commit to ARM
| compatibility AND licensing fees.
| gabereiser wrote:
| For someone who is interested in bare metal, can you explain
| the significance of this? Is this how much data a CPU can
| handle simultaneously? Via instructions from the kernel?
| anyfoo wrote:
| It means how many instructions the CPU can decode at the same
| time - roughly, "figure out what they mean and dispatch what
| they actually have to _do_ to the functional units of the CPU
| which will perform the work of the instruction". It is not
| directly how much data a superscalar CPU can handle in
| parallel, but it still plays a role, in the sense that there
| is a certain number of functional units available in the CPU,
| and if you cannot keep those busy with decoded instructions,
| they lie around unused. So too narrow a decoder can be one of
| the bottlenecks in optimal CPU usage (but note that, as a
| sibling commenter mentioned, the complexity of the
| instructions/architecture is also important; e.g. a single
| CISC instruction may keep things pretty busy by itself).
|
| Whether the instructions come from the kernel or from
| userspace does not matter at all; they all go through the
| same decoder and functional units. The kernel/userspace
| differentiation is a higher-level concept.
| monocasa wrote:
| It's more complex than that, if you'll excuse the pun.
| Instructions on CISC cores aren't 1-to-1 with RISC
| instructions, and tend to encode quite a few more micro-ops.
| Something like inc dword [rbp+16] is one instruction, but
| would be a minimum of three micro-ops (and would be three
| RISC instructions as well).
|
| Long story short, this isn't really the bottleneck, or we'd
| see more simple decoders at the tail end of the decode
| window.
| mhh__ wrote:
| Decode-bound performance issues are actually pretty rare. x86
| is quite dense.
| johnklos wrote:
| > The 80376 doesn't do paging.
|
| Wait - what? How is that even possible? Do they simply _not_
| have an MMU? That makes it unsuitable for both old OSes and
| for new OSes. No wonder it was so uncommon.
| anyfoo wrote:
| It calls itself an "embedded" processor, so it likely just
| wasn't meant to run PC OSes. According to Wikipedia at least,
| Intel did not even expect the 286 to be used in PCs, but in
| things such as PBXs. And ARM Cortex-M still doesn't have an
| MMU either; for some applications you can just do without.
| Especially because both the 286 and this 376 beast did have
| segmentation, which could subsume some of the need for an MMU
| (separated address spaces, if you don't need paging and are
| content with e.g. buddy allocation for dividing available
| memory among tasks).
| [deleted]
| marssaxman wrote:
| It is very common for embedded-systems processors not to
| have an MMU. If it ran any OS at all, it would likely have
| been some kind of RTOS.
| blueflow wrote:
| Using paging/virtual memory is not a requirement for an OS,
| even if all currently popular OSes make use of it.
|
| Intel CPUs before the 80286 did not have an MMU, either.
| anyfoo wrote:
| Did they even call it an MMU already? The 286 only had
| segmentation, which arguably is just an addressing mode. It
| introduced descriptors that had to be resolved, but that
| happened when selecting the descriptor (i.e. when loading
| the segment register), where a hidden base, limit, and
| permission "cache" was updated. Unlike paging, where things
| are resolved when accessing memory.
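|
| (A toy model of that hidden descriptor cache in C - hedged:
| it ignores privilege rings, access rights, expand-down
| segments, and the 286's narrower widths, and all the names
| and values are made up. The point is that the expensive
| table lookup happens once, at segment-register load, and
| each later access is just a limit check plus an add:)
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     struct descriptor { uint32_t base; uint16_t limit; };
|     struct seg_cache  { uint32_t base; uint16_t limit; };
|
|     /* Descriptor table; entry 1 describes an example segment. */
|     static const struct descriptor gdt[8] = {
|         [1] = { .base = 0x10000, .limit = 0x7FFF },
|     };
|
|     /* The table walk happens here, once, at load time;
|        base/limit are copied into the hidden "cache". */
|     static struct seg_cache load_segreg(uint16_t selector) {
|         const struct descriptor *d = &gdt[selector >> 3];
|         return (struct seg_cache){ d->base, d->limit };
|     }
|
|     /* Per-access translation: limit check + add, no table walk,
|        unlike paging, which translates on every access. */
|     static uint32_t linear_addr(struct seg_cache seg, uint16_t off) {
|         if (off > seg.limit) {
|             fprintf(stderr, "#GP: offset %#x > limit %#x\n",
|                     (unsigned)off, (unsigned)seg.limit);
|             return 0;  /* a real CPU raises a fault instead */
|         }
|         return seg.base + off;
|     }
|
|     int main(void) {
|         struct seg_cache ds = load_segreg(1 << 3);  /* index 1 */
|         printf("linear = %#x\n", (unsigned)linear_addr(ds, 0x1234));
|         linear_addr(ds, 0xFFFF);  /* beyond limit -> "#GP" */
|         return 0;
|     }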
___________________________________________________________________
(page generated 2022-08-15 23:00 UTC)