[HN Gopher] The Intel 80376 - A legacy-free i386 with a twist (2...
       ___________________________________________________________________
        
       The Intel 80376 - A legacy-free i386 with a twist (2010)
        
       Author : anyfoo
       Score  : 61 points
       Date   : 2022-08-13 06:13 UTC (2 days ago)
        
 (HTM) web link (www.pagetable.com)
 (TXT) w3m dump (www.pagetable.com)
        
       | allenrb wrote:
        | I'd forgotten about the 80376, but it gets at a question I've
       | occasionally had over the last few years. Why have we not seen a
       | "modernized" x86 CPU that strips out everything pre-AMD64? The
       | answer seems likely to be one or both of:
       | 
       | 1. There are more users of legacy modes than is obvious to us on
       | HN.
       | 
        | 2. The gains in terms of gates saved, critical paths shortened,
        | and lower power consumption just don't amount to much.
       | 
       | My guess is that #2 is the dominant factor. If there were
       | actually significant gains to be had on a "clean"(er) x86 design,
       | we'd see it in the market regardless of #1.
        
         | klelatti wrote:
          | x86 did appear in one context where legacy compatibility was
          | likely to have been a much smaller issue (or a non-issue?) and
          | where efficiencies (e.g. power consumption) would have been
          | even more valuable - namely, mobile devices running Android.
          | 
          | The fact that a cleaned-up version wasn't used there would
          | seem to support your hypothesis.
        
         | tenebrisalietum wrote:
         | So I learned about the "hidden" x86 mode called XuCode (https:/
         | /www.intel.com/content/www/us/en/developer/articles/t...) -
         | which is x86 binary code placed into RAM by the microcode and
         | then "called" by the microcode for certain instructions -
         | particularly SGX ones if I'm remembering correctly.
         | 
         | Wild speculative guess: It's entirely possible some of the pre-
         | AMD64 stuff is actually internally used by modern Intel and AMD
         | CPUs to implement complex instructions.
        
           | kmeisthax wrote:
           | Oh boy, we've gone all the way back to Transmeta Code
           | Morphing Software. What "ring" does this live on now? Ring
           | -4? :P
           | 
           | Jokes aside, I doubt XuCode would use pre-AMD64 stuff;
           | microcode is lower-level than that. The pre-AMD64 stuff is
           | already handled with sequences of microcode operations
           | because it's not really useful for modern applications[0].
           | It's entirely possible for microcode to implement other
           | instructions too, and that's what XuCode is doing[1].
           | 
           | The real jank is probably hiding in early boot and SMM,
           | because you need to both jump to modern execution modes for
           | client or server machines _and_ downgrade to BIOS
           | compatibility for all those industrial deployments that want
           | to run ancient software and OSes on modern machines.
           | 
           | [0] The last time I heard someone even talk about x86
           | segmentation, it was as part of enforcing the Native Client
           | inner sandbox.
           | 
           | [1] Hell, there's no particular reason why you can't have a
           | dual-mode CPU with separate decoders for ARM and x86 ISAs. As
           | far as I'm aware, however, such a thing does not exist...
           | though evidently at one point AMD was intending on shipping
           | Ryzen CPUs with ARM decoders in them.
        
         | Macha wrote:
         | As the article points out, modern x86 CPUs boot up in 16 bit
         | mode, then get transferred into 32 bit mode, then 64 bit mode.
          | So right out of the gate such a CPU is not compatible with
          | existing operating systems, and now you have a non-compatible
          | architecture. Sure, Intel could easily add support to GRUB and
         | push Microsoft to do it for new Windows media, but that won't
         | help the existing install base. Intel tried launching a non-
         | compatible CPU once, it was Itanium, it didn't go so well for
         | them.
         | 
         | Plus I'm sure there's crazy DRM rootkits that depend on these
         | implementation details.
         | 
          | Also, AMD has already experimented with not-quite-PC-compatible
          | x86 setups in the consoles. As the fail0verflow talk about
          | Linux on PS4 emphasised, the PS4 is x86, but not a PC. So
          | despite building an x86 CPU with somewhat less legacy, AMD
          | didn't seem to think it worthwhile to bring it to a more
          | general-purpose platform.
         | 
         | Also AMD/Intel/VIA are the only companies with the licenses to
         | produce x86, and you'd need both Intel and AMD to sign off on
         | licensing x64 to someone new.
        
           | messe wrote:
           | > As the article points out, modern x86 CPUs boot up in 16
           | bit mode, then get transferred into 32 bit mode, then 64 bit
            | mode. So right out of the gate such a CPU is not compatible
            | with existing operating systems, and now you have a
            | non-compatible architecture
           | 
           | Except that a modern OS is booted in UEFI mode meaning that
            | the steps of going from 16 -> 32 -> 64-bit mode are all
           | handled by the firmware, not the kernel or the bootloader.
           | The OS kernel will only (at most) switch to 32-bit
            | compatibility mode (a submode of long mode, not protected
           | mode) when it needs to run 32-bit apps, otherwise staying in
           | 64-bit mode 100% of the time.
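            | 
            | To make the "submode" point concrete, here is a minimal
            | sketch (my own, with assumed descriptor values): in long
            | mode the CPU picks 64-bit vs. 32-bit compatibility execution
            | per code segment via the L and D bits of the GDT descriptor,
            | so the kernel never has to drop back to legacy protected
            | mode to run 32-bit userspace:
            | 
            |   #include <stdint.h>
            | 
            |   /* Flat GDT sketch: the two code descriptors differ */
            |   /* only in the L (long) and D (default size) bits.  */
            |   static const uint64_t gdt[] = {
            |       0x0000000000000000ULL, /* null descriptor       */
            |       0x00209A0000000000ULL, /* 64-bit kernel code,   */
            |                              /* L=1, D=0              */
            |       0x00CFFA000000FFFFULL, /* 32-bit compat user    */
            |                              /* code, L=0, D=1        */
            |   };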
        
             | anyfoo wrote:
             | Yeah. Long mode is a bit of a line in the sand, leaving
             | many features behind that were kept for compatibility
              | (segmentation, vm86...). It came at a time when,
             | fortunately, the mainstream OSes had enough abstraction so
             | that software did not have to be written for what
             | effectively was the bare metal anymore, with DOS almost
             | being more of a "software bootloader with filesystem
             | services".
        
           | [deleted]
        
           | anyfoo wrote:
           | > Intel tried launching a non-compatible CPU once, it was
           | Itanium, it didn't go so well for them.
           | 
           | That may be only secondary, though. Itanium simply failed to
           | deliver performance promises and be competitive. The compiler
           | was supposed to effectively perform instruction scheduling
           | itself, and writing such a compiler turned out more difficult
           | than anticipated.
        
             | FullyFunctional wrote:
              | I've seen this a lot, but IMO the truth is slightly
              | different: the assumption behind EPIC was that a compiler
              | _could_ do the scheduling, which turned out to be
              | _impossible_. The EPIC effort's roots go way back, but I
              | still don't understand how they failed to foresee the
              | ever-growing tower of caches, which unavoidably leads to a
              | crazy wide latency range for loads (3 to 400+ cycles),
              | which in turn is why we now have these very deep OoO
              | machines. (Tachyum's Prodigy appears to be repeating the
              | EPIC mistake with very limited but undisclosed reordering.)
             | 
             | OoO EPIC has been suggested (I recall an old comp.arch
             | posting by an Intel architect) but never got green-lit. I
             | assume they had bet so much on compiler assumption that the
             | complexity would have killed it.
             | 
              | It's really a shame because EPIC did get _some_ things
              | right. The compiler absolutely can make the front-end's
              | life easier by making dependences more explicit (though I
              | would do it differently) and by making control transfers
              | much easier to deal with (the 128-bit block alone saves 4
              | bits in all BTB entries, etc.). On balance, IA-64 was a
              | committee-designed train wreck, piling on way too much
              | complexity, and it failed both as a brainiac and as a
              | speed demon.
              | 
              | Disclaimer: I have an Itanic space heater that I
              | occasionally boot up for the chuckle - and then shut it
              | down before the hearing damage gets permanent.
        
           | klelatti wrote:
           | > Intel tried launching a non-compatible CPU once, it was
           | Itanium, it didn't go so well for them.
           | 
           | More than once. iAPX432 if anything went worse.
        
             | anyfoo wrote:
             | Yeah, but that, again, was for far worse reasons than just
             | not being "compatible". In fact, iAPX432 was exceptionally
             | bad. Intel's i960 for example fared much better in the
             | embedded space (where PC-compatibility did not matter).
        
               | klelatti wrote:
                | Indeed, and I suppose in fairness I don't think the 432
                | was ever intended as a PC CPU replacement, whilst Itanium
                | was designed to replace some x86 servers.
                | 
                | As an aside, I'm still astonished that the 432 and
                | Itanium got as far as they did, with so much cash spent
                | on them, without conclusive proof that performance would
                | be competitive. That seems like a prerequisite for
                | projects of this size.
        
         | rodgerd wrote:
         | Think about the disruption that Apple caused when they moved
         | from 32-bit x86 being supported to deprecating it - there was a
         | great deal of angst, and that's on a vertically-integrated
         | platform that is relatively young (yes, I know that NeXT is
         | old, but MacOS isn't, really). Now imagine that on Windows - a
         | much older platform, with much higher expectations for
         | backwards compat. It would be a bloodbath of user rage.
         | 
         | More importantly, though, backward compat has been Intel's moat
         | for a long, long time. Intel have been trying to get people to
         | buy non-16-bit-compat processors for literally 40 years!
         | They've tried introducing a lot of mildly (StrongARM -
         | technically a buy I suppose, i860/i960) and radically (i432,
         | Itanium) innovative processors, and they've all been
         | indifferent or outright failures in the marketplace.
         | 
         | The market has been really clear on this: it doesn't care for
         | Intel graphics cards or SSDs or memory, it hates non-x86 Intel
         | processors. Intel stays in business by shipping 16-bit-
         | compatible processors.
        
         | rwmj wrote:
         | Quite a lot of modern Arm 64 bit processors have dropped 32 bit
         | (ie. ARMv7) support. Be careful what you wish for though! It's
         | still useful to be able to run 32 bit i386 code at a decent
         | speed occasionally. Even on my Linux systems I still have
         | hundreds of *.i686.rpms installed.
        
           | jleahy wrote:
           | The 64-bit ARM instructions were designed in a way that made
           | supporting both modes in parallel very expensive from a
           | silicon perspective. In contrast AMD were very clever with
           | AMD64 and designed it such that very little additional
           | silicon area was required to add it.
        
         | danbolt wrote:
         | I feel as though a lot of the consumer value of x86+Windows
         | comes from its wide library of software and compatibility.
         | 
         | > than is obvious to us on HN
         | 
         | I think your average HNer is more likely to interact with
         | Linux/Mac workstations or servers, where binary compatibility
         | isn't as necessary.
        
         | unnah wrote:
         | Instruction decoding is a bottleneck for x86 these days: Apple
         | M1 can do 8-wide decode, Intel just managed to reach 6-wide in
         | Alder Lake, and AMD Zen 3 only has a 4-wide decoder. One would
         | think that dropping legacy 16-bit and 32-bit instructions would
         | enable simpler and more efficient instruction decoders in
         | future x86 versions.
        
           | amluto wrote:
           | Sadly not. x86_64's encoding is extremely similar to the
           | legacy encodings. AIUI the fundamental problem is that x86 is
           | a variable-length encoding, so a fully parallel decoder needs
           | to decode at guessed offsets, many of which will be wrong.
            | ARM64 instructions are fixed-length (4 bytes) and aligned.
           | 
           | Dumping legacy features would be great for all kinds of
           | reasons, but not this particular reason.
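            | 
            | A toy illustration of that point (mine; the decode helpers
            | are hypothetical stand-ins, not a real API): with a fixed
            | 4-byte ISA every decode slot knows its start offset up
            | front, while on x86 the Nth offset depends on the lengths
            | of the previous N-1 instructions, so a wide decoder has to
            | guess offsets and discard the wrong ones:
            | 
            |   #include <stddef.h>
            |   #include <stdint.h>
            | 
            |   void decode_arm64_insn(uint32_t word);       /* sketch */
            |   void decode_x86_insn(const uint8_t *p);      /* sketch */
            |   size_t x86_insn_length(const uint8_t *p);    /* sketch */
            | 
            |   /* Fixed width: all offsets (i * 4) are known up    */
            |   /* front, so the iterations can proceed in          */
            |   /* parallel in hardware.                            */
            |   void decode_fixed(const uint32_t *code, size_t n) {
            |       for (size_t i = 0; i < n; i++)
            |           decode_arm64_insn(code[i]);
            |   }
            | 
            |   /* Variable width: each start offset depends on the */
            |   /* previous instruction's length - a serial chain   */
            |   /* unless you speculate on where instructions begin */
            |   /* and throw away the bad guesses.                  */
            |   void decode_x86(const uint8_t *code, size_t len) {
            |       for (size_t off = 0; off < len;
            |            off += x86_insn_length(code + off))
            |           decode_x86_insn(code + off);
            |   }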
        
             | hyperman1 wrote:
             | This suggests another way forward: Re-encode the existing
             | opcodes with new, more regular byte sequences. E.g. 32 bits
              | / instruction, with some escape for e.g. 64-bit constants.
             | You'll have to redo the backend of the assembler, but most
             | of the compiler and optimization wisdom can be reused as-
             | is. Of course, this breaks backward compatibility
             | completely so the high performance mode can only be
             | unlocked for recompiles.
        
               | colejohnson66 wrote:
               | That was Itanium, and it failed for a variety of reasons;
               | one of which was a compatibility layer that sucked. You
                | _can't_ get rid of x86's backwards compatibility. Intel
               | and AMD have done their best by using vector prefixes
               | (like VEX and EVEX)[a] that massively simplify decoding,
               | but there's only so much that can be done.
               | 
               | People get caught up in the variable length issue that
               | x86 has, and then claim that M1 beats x86 because of
               | that. Sure, decoding ARM instructions is easier than x86,
               | but the variable length aspect is handled in the
                | predecode/cache stage, not the actual decoder. The
                | decoder, when it reaches an instruction, already knows
                | where the various bits of info are.
               | 
               | The RISC vs CISC debate is useless today. M1's big
               | advantage comes from the memory ordering model (and other
               | things)[0], not the instruction format. Apple actually
               | had to create a special mode for the M1 (for Rosetta 2)
               | that enforces the x86 ordering model (TSO with load
               | forwarding), and native performance is slightly worse
               | when doing so.
               | 
               | [0]:
               | https://twitter.com/ErrataRob/status/1331735383193903104
               | 
               | [a]: There's also others that predate AVX (VEX) such as
               | the 0F38 prefix group consisting only of opcodes that
               | have a ModR/M byte and no immediate, and the 0F3A prefix
               | being the same, but with an 8 bit immediate.
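                | 
                | For a concrete flavour (my own example, worth
                | double-checking against the SDM): the VEX form folds
                | the 0F escape and prefix information into one uniform
                | prefix, which is part of what makes it friendlier to
                | decode.
                | 
                |   /* addps  xmm0, xmm1 - legacy SSE encoding  */
                |   const unsigned char sse[] = {0x0F, 0x58, 0xC1};
                |   /* vaddps xmm0, xmm0, xmm1 - 2-byte VEX     */
                |   const unsigned char vex[] = {0xC5, 0xF8, 0x58, 0xC1};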
        
               | ShroudedNight wrote:
                | I thought the critical failure of Itanium was that a
                | priori VLIW scheduling turned out to be a non-starter, at
                | least as far as doing so efficiently goes.
        
               | atq2119 wrote:
               | The entire approach is misguided for single-threaded
               | performance. It turns out that out-of-order execution is
                | pretty important for a number of things, perhaps most
               | importantly dealing with variable memory instruction
               | latencies (cache hits at various points in the hierarchy
               | vs. misses). A compiler simply cannot statically predict
               | those well enough.
        
               | Dylan16807 wrote:
               | > That was Itanium
               | 
               | What? No. Itanium was a vastly, wildly different
               | architecture.
        
               | FullyFunctional wrote:
               | Two factual mistakes:
               | 
                | * IA-64 failed primarily because it failed to deliver the
                | promised performance. x86 compatibility isn't and wasn't
                | essential to success (behold the success of Arm, for
                | example).
                | 
                | * M1's advantage has almost nothing to do with the weak
                | memory model; rather, it has to do with everything else:
                | wider, deeper, faster (memory). The ISA being Arm64 also
                | helps in many ways. The variable-length x86 instructions
                | can be dealt with via predecoding, sure, to an extent,
                | but that lengthens the pipeline, which hurts the branch
                | mispredict penalty, which absolutely matters.
        
               | kmeisthax wrote:
               | M1 doesn't have a special mode for Rosetta. _All code_ is
                | executed with x86 TSO on M1's application processors.
               | How do I know this?
               | 
               | Well, did you know Apple ported Rosetta 2 to Linux? You
               | can get it by running a Linux VM on macOS. It does not
               | require any kernel changes to support in VMs, and if you
               | extract the binary to run it on Asahi Linux, it works
               | just fine too. None of the Asahi team did _anything_ to
                | support x86 TSO. Rosetta also works just fine in m1n1's
               | hypervisor mode, which exists specifically to log all
               | hardware access to detect these sorts of things. If there
                | _is_ a hardware toggle for TSO, it's either part of the
               | chicken bits (and thus enabled all the time anyway) or
               | turned on by iBoot (and thus enabled before any user code
               | runs).
               | 
               | Related point: Hector Martin just upstreamed a patch to
               | Linux that fixes a memory ordering bug in workqueues
               | that's been around since before Linux had Git history. He
               | also found a bug in some ARM litmus tests that he was
               | using to validate whether or not they were implemented
               | correctly. Both of those happened purely because M1 and
               | M2 are so hilariously wide and speculative that they
               | trigger memory reorders no other CPU would.
        
               | messe wrote:
               | I'm sorry, but please cite some sources, because this
               | contradicts everything that's been said about M1's x86
               | emulation that I've read so far.
               | 
               | > Well, did you know Apple ported Rosetta 2 to Linux? You
               | can get it by running a Linux VM on macOS. It does not
               | require any kernel changes to support in VMs, and if you
               | extract the binary to run it on Asahi Linux, it works
               | just fine too. None of the Asahi team did anything to
               | support x86 TSO. Rosetta also works just fine in m1n1's
               | hypervisor mode, which exists specifically to log all
               | hardware access to detect these sorts of things. If there
               | is a hardware toggle for TSO, it's either part of the
               | chicken bits (and thus enabled all the time anyway) or
               | turned on by iBoot (and thus enabled before any user code
               | runs).
               | 
               | Apple tells you to attach a special volume/FS to your
               | linux VM in order for Rosetta to work. When such a volume
               | is attached, it runs the VM in TSO mode. As simple as
               | that.
               | 
                | The Rosetta binary itself doesn't know whether or not TSO
                | is enabled, so it's not surprising that it runs fine under
               | Asahi. As marcan42 himself said on twitter[1], most x86
               | applications will run fine even without TSO enabled.
               | You're liable to run into edge cases in heavily
               | multithreaded code though.
               | 
               | [1]:
               | https://twitter.com/marcan42/status/1534054757421432833
               | 
               | > Both of those happened purely because M1 and M2 are so
               | hilariously wide and speculative that they trigger memory
               | reorders no other CPU would.
               | 
               | In other words, they're not constantly running in TSO
               | mode? Because if they were, why would they trigger such
               | re-orders?
               | 
               | EDIT: I've just run a modified version of the following
               | test program[2] (removing the references to the
               | tso_enable sysctl which requires an extension), both
               | native and under Rosetta.
               | 
               | Running natively, it fails after ~3500 iterations. Under
               | Rosetta, it completes the entire test successfully.
               | 
               | [2] https://github.com/losfair/tsotest/
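                | 
                | For anyone curious, the gist of such a test is a
                | message-passing litmus; here is a rough C11 sketch of
                | my own (much cruder than the linked tsotest, and the
                | reorder may take many runs to show up):
                | 
                |   #include <pthread.h>
                |   #include <stdatomic.h>
                |   #include <stdio.h>
                | 
                |   #define R memory_order_relaxed
                | 
                |   static atomic_int x, y, r0, r1;
                | 
                |   /* writer: two stores, no CPU fence between them */
                |   static void *writer(void *a) {
                |       atomic_store_explicit(&x, 1, R);
                |       atomic_signal_fence(memory_order_seq_cst);
                |       atomic_store_explicit(&y, 1, R);
                |       return a;
                |   }
                | 
                |   /* reader: loads them back in the reverse order */
                |   static void *reader(void *a) {
                |       int fy = atomic_load_explicit(&y, R);
                |       atomic_signal_fence(memory_order_seq_cst);
                |       int fx = atomic_load_explicit(&x, R);
                |       atomic_store_explicit(&r0, fy, R);
                |       atomic_store_explicit(&r1, fx, R);
                |       return a;
                |   }
                | 
                |   int main(void) {
                |       for (int i = 0; i < 1000000; i++) {
                |           atomic_store(&x, 0);
                |           atomic_store(&y, 0);
                |           pthread_t a, b;
                |           pthread_create(&a, NULL, writer, NULL);
                |           pthread_create(&b, NULL, reader, NULL);
                |           pthread_join(a, NULL);
                |           pthread_join(b, NULL);
                |           /* y == 1 but x == 0 is forbidden under */
                |           /* TSO (no store-store or load-load     */
                |           /* reordering), but legal on a weakly   */
                |           /* ordered core.                        */
                |           if (atomic_load(&r0) && !atomic_load(&r1)) {
                |               printf("reorder at iter %d\n", i);
                |               return 1;
                |           }
                |       }
                |       puts("no reorder observed");
                |       return 0;
                |   }
                | 
                | Note the fences are compiler-only barriers
                | (atomic_signal_fence emits no CPU fence), so whether
                | the forbidden outcome shows up depends purely on the
                | hardware memory model in effect.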
        
               | anyfoo wrote:
               | > M1 doesn't have a special mode for Rosetta. All code is
               | executed with x86 TSO on M1's application processors.
               | 
               | That's not true. (And doesn't your last paragraph
               | contradict it already?)
               | 
               | You might just have figured out that most stuff will run
               | fine (or appear to run fine for a long time) when TSO
               | isn't enabled.
        
               | messe wrote:
               | I have no idea why you are being downvoted. You are
               | entirely correct.
        
               | anyfoo wrote:
               | Thanks, I was puzzled as well. The downvotes seem to have
               | stopped, though.
        
               | umanwizard wrote:
               | If you're inventing a completely incompatible ISA, why
               | not just use ARM64 at that point?
        
               | anamax wrote:
               | Perhaps because you don't want to commit to ARM
                | compatibility AND licensing fees.
        
           | gabereiser wrote:
           | To someone who is interested in bare metal, can you explain
           | the significance of this? Is this how much data a CPU can
           | handle simultaneously? Via instructions from the kernel?
        
             | anyfoo wrote:
              | It means how many instructions the CPU can decode at the
              | same time - roughly, to "figure out what they mean and
              | dispatch what they actually have to _do_ to the functional
              | units of the CPU which will perform the work of the
              | instruction". It is not directly how much data a
              | superscalar CPU can handle in parallel, but it still plays
              | a role, in the sense that there is a limited number of
              | functional units available in the CPU, and if you cannot
              | keep them busy with decoded instructions, they lie around
              | unused. So too narrow a decoder can be one of the
              | bottlenecks in optimal CPU usage (but note how, as a
              | sibling commenter mentioned, the complexity of the
              | instructions/architecture is also important, e.g. a single
              | CISC instruction may keep things pretty busy by itself).
             | 
             | Whether the instructions come from the kernel or from
             | userspace does not matter at all, they all go through the
             | same decoder and functional units. The kernel/userspace
             | differentiation is a higher level concept.
        
           | monocasa wrote:
            | It's more complex than that, if you'll excuse the pun.
            | Instructions on CISC cores aren't 1-to-1 with RISC
            | instructions, and tend to encode quite a few more micro-ops.
            | Something like inc dword [rbp+16] is one instruction, but
            | would be a minimum of three micro-ops (and would be three
            | RISC instructions as well).
            | 
            | Long story short, this isn't really the bottleneck, or we'd
            | see more simple decoders on the tail end of the decode
            | window.
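            | 
            | A quick sketch of that split in C terms (mine, using a
            | pointer argument rather than [rbp+16]):
            | 
            |   /* One x86 instruction: inc dword [rdi] */
            |   void bump(int *p) {
            |       int tmp = *p;   /* uop 1 / RISC: load  */
            |       tmp += 1;       /* uop 2 / RISC: add   */
            |       *p = tmp;       /* uop 3 / RISC: store */
            |   }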
        
           | mhh__ wrote:
            | Decode-bound performance issues are actually pretty rare. x86
            | is quite dense.
        
       | johnklos wrote:
       | > The 80376 doesn't do paging.
       | 
       | Wait - what? How is that even possible? Do they simply _not_ have
       | an MMU? That makes it unsuitable for both old OSes and for new
       | OSes. No wonder it was so uncommon.
        
         | anyfoo wrote:
          | It bills itself as an "embedded" processor, so it likely just
          | wasn't meant to run PC OSes. According to Wikipedia at least,
          | Intel did not even expect the 286 to be used for PCs, but for
          | things such as PBXs. And ARM Cortex-M still doesn't have an MMU
          | either; for some applications you can just do without one.
          | Especially because both the 286 and this 376 beast did have
          | segmentation, which could subsume some of the need for an MMU
          | (separate address spaces, if you don't need paging and are
          | content with e.g. buddy allocation for dividing available
          | memory among tasks).
        
           | [deleted]
        
         | marssaxman wrote:
         | It is very common for embedded systems processors not to have
         | an MMU. If it ran any OS at all it would likely have been some
         | kind of RTOS.
        
         | blueflow wrote:
         | Using paging/virtual memory is not a requirement for an OS,
         | even when all currently popular OSes make use of it.
         | 
         | Intel CPUs before the 80286 did not have an MMU, either.
        
           | anyfoo wrote:
           | Did they even call it an MMU already? The 286 only had
           | segmentation, which arguably is just an addressing mode. It
           | introduced descriptors that had to be resolved, but that
           | happened when selecting the descriptor (i.e. when loading the
           | segment register), where a hidden base, limit, and permission
           | "cache" was updated. Unlike paging, where things are resolved
           | when accessing memory.
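            | 
            | Roughly, a sketch of what that hidden per-segment-register
            | state amounts to (my own field names, not Intel's):
            | 
            |   #include <stdint.h>
            | 
            |   /* Filled from the descriptor table once, when the    */
            |   /* selector is loaded; later accesses only check this */
            |   /* cached copy, with no table walk per memory access. */
            |   struct seg_desc_cache {
            |       uint32_t base;    /* 24-bit on the 286  */
            |       uint32_t limit;
            |       uint8_t  access;  /* present, DPL, type */
            |   };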
        
       ___________________________________________________________________
       (page generated 2022-08-15 23:00 UTC)