[HN Gopher] What's new in CPUs since the 80s? (2015)
       ___________________________________________________________________
        
       What's new in CPUs since the 80s? (2015)
        
       Author : snvzz
       Score  : 140 points
       Date   : 2022-04-20 06:11 UTC (2 days ago)
        
 (HTM) web link (danluu.com)
 (TXT) w3m dump (danluu.com)
        
       | susrev wrote:
        | The vast expanses of text with no formatting rules always make
        | it hard for me to follow along. I added some simple rules that
        | make it much easier to read.
       | 
        | p { max-width: 1000px; text-align: center; margin-left: auto;
        | margin-right: auto }
       | 
       | body { text-align: center }
       | 
       | MUCH easier (code snippets = toast)
        
         | jimjambw wrote:
         | What about reader mode?
        
       | jcadam wrote:
       | My Ryzen 9 is just a bit more performant than the MOS6502 my
       | Apple ][ was rocking back in the 80s.
       | 
        | Ok... it also has things like a built-in FPU, multiple cores,
        | cache, pipelining, branch prediction, more address space, and
        | more registers, and it's manufactured with a significantly
        | better (and smaller) process.
        
       | phendrenad2 wrote:
       | Nice article. Kind of low-hanging fruit though. A comparison
       | between CPUs in 2022 vs CPUs in 2002 would be much more
       | interesting. ;)
        
         | xmprt wrote:
         | Not really low hanging fruit if the last time you studied CPU
         | design was in a university course. I personally found a lot of
         | the information pretty interesting.
        
       | nimbius wrote:
       | my favourite addition since the 80s has been the unrelenting,
       | unquestioned, ubiquitous and permanent inclusion of numerous
       | iterations of poorly planned and executed management engine
       | frameworks designed to completely ablate the user from the
       | experience of general computing in the service of perpetual and
       | mandatory DRM and security theatricality. the best aspect of this
        | new feature is that not only is your processor effectively
        | indistinguishable from a rented pressure washer, but on a long
        | enough
       | timeline the indelible omniscient slavemaster to which your IO is
       | subservient can and will always find itself hacked. One of the
       | biggest features missing from 80s processors was the ability to
       | watch a C level cog from a multinational conglomerate squirm in
       | an overstarched tom ford three piece as tech journalists
       | methodically connect the dots between a corpulent scumbag and a
       | hastily constructed excuse to hobble user freedoms at the behest
       | of the next minions movie to arrive finally at a conclusion that
       | takes said chipmakers stock to the barber.
       | 
        | oh and chips come with light up fans and crap now but there's
        | no open standard on how to control the light color so everyone
        | just leaves it in Liberace disco mode so it's like a wonderful
        | little rainbow coloured pride parade is marching through my
        | case.
        
         | Kuinox wrote:
         | Is that a GPT3 generated text ? It must be.
        
         | Epiphany21 wrote:
         | >oh and chips come with light up fans and crap now but theres
         | no open standard on how to control the light color so everyone
         | just leaves it in Liberace disco mode
         | 
         | This is why I continue to pay a premium for workstation/server
         | grade hardware even when I'm assembling the system myself.
        
         | hengheng wrote:
         | Are you alright
        
       | kccqzy wrote:
       | > With substantial hardware effort, Google was able to avoid
       | interference, but additional isolation features could allow this
       | to be done at higher efficiency with less effort.
       | 
       | This is surprising to me. Running any workload that's not your
       | own will trash the CPU caches and will make your workload slower.
       | 
       | Consider for example your performance sensitive code has nothing
       | to do for the next 500 microseconds. If the core runs some other
       | best effort work, it _will_ trash the CPU caches, so that after
       | that 500 microseconds, even when that other work is immediately
       | preempted by the kernel, your performance sensitive code is now
       | dealing with a cold cache.
        
       | ant6n wrote:
       | 2015? I think there's a date missing on the page.
        
         | MBCook wrote:
          | Near the end in one section the author provides an update
         | and refers to it being 2016, a year since they wrote the
         | article.
        
       | gotaquestion wrote:
       | The section on power really understates the complexity.
        | Throttling didn't appear until the mid-90s, as coarse chip-wide
        | clock gating. Voltage/frequency scaling appeared a few years
       | later (gradual P-state transitions). Then power control units
       | monitored key activity signals and could not only scale the
       | voltage, but estimate power and target specific blocks (e.g.,
       | turning off L1 D$).
       | 
       | There are some more details in there but that's the main gist.
       | The power control unit is its own operating system!
        
       | Aardwolf wrote:
       | From the top of my head (before reading the article):
       | 
       | caches, pipelining, branch prediction, memory protections, SIMD,
        | floating point at all, hyper-threading, multi-core, needing
        | cooling fins, let alone active cooling
       | 
       | I wonder how much I've forgotten
        
         | yvdriess wrote:
         | All of those already existed by the 80s.
        
           | ithkuil wrote:
           | US patent for the technology behind hyper-threading was
           | granted to Kenneth Okin at Sun Microsystems in November 1994
        
             | code_biologist wrote:
             | I don't want to dismiss hyper-threading as trite -- it's
             | not, especially in implementation, but it is pretty
             | obvious.
             | 
             | Prior to 1994 the CPU-memory speed delta wasn't so bad that
             | you needed to cover for stalled execution units constantly.
             | Looking at the core clock vs FSB of 1994 Intel chips is a
             | great throwback! [1] Then CPU speed exploded relative to
             | memory, as was probably anticipated by forward looking CPU
             | architects in 1994.
             | 
             | With slow memory there are a few obvious changes you make
             | to the degree you need to cover for load stalls: 1) OoO
             | execution 2) data prefetching 3) find other computation
              | (that likely has its own memory stalls) to interleave.
              | The thread level is a pretty obvious granularity at which
              | to interleave work, even if it's deeply non-trivial to
              | actually implement.
             | 
             | Performance oriented programmers have always had to think
             | about memory access patterns. Not new since the 80s to need
             | to be friendly to your architecture there.
             | 
             | [1] https://en.wikipedia.org/wiki/Pentium#Pentium
        
             | formerly_proven wrote:
             | CDC 6600 ran ten threads on one processor in a way that
             | seems a lot like the Niagara T1 on paper.
        
         | varjag wrote:
         | Most of that (possibly all) existed by 1980s. The Z80 in my
         | Spectrum had no heatsink ;)
        
         | [deleted]
        
         | amelius wrote:
         | The riddance of segmented memory.
        
           | infogulch wrote:
           | Have you heard of this newfangled device called a Graphics
           | Processing Unit and "VRAM"?
        
             | amelius wrote:
             | Yeah, but that's GPU, not CPU. Hopefully we will see
             | similar progress there in the next 40 years.
        
           | vardump wrote:
            | The alternative wasn't that great either. Having just 16
            | address bits, with code and data stuck into the same 64 kB
            | RAM area, was a lot worse (the 8051 was like that, for
            | example). Or 65816-style banks, ugh.
           | 
           | If you had to have _just 16_ address bits, having code (CS),
           | stack (SS), data (DS), extra (ES), etc. segments was actually
           | pretty nice. Memory copying and scanning operations were
           | natural without needing to swap bank in the innermost loop.
           | 
           | Of course if you could afford 32-bit addressing, there's no
           | comparison. Flat memory space is the best option, but I don't
           | think it came for free.
        
           | fpoling wrote:
           | The segmented memory may come back to provide a cheap way to
           | sandbox code within a process.
        
       | VyseofArcadia wrote:
       | > Even though incl is a single instruction, it's not guaranteed
       | to be atomic. Internally, incl is implemented as a load followed
        | by an add followed by a store.
       | 
       | I've heard the joke that RISC won because modern CISC processors
       | are just complex interfaces on top of RISC cores, but there
       | really is some truth to that, isn't there?
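The decomposition quoted above is why two concurrent increments can lose an update; a deterministic Python sketch of the unlucky interleaving (illustrative only; on real x86 a `lock` prefix makes the increment atomic):

```python
# Simulate a shared memory cell and the three micro-ops behind a
# non-atomic increment: load, add, store.
counter = {"v": 0}

def load(cell):
    return cell["v"]

def store(cell, value):
    cell["v"] = value

# Two logical threads both try to increment, but their micro-ops
# interleave in an unlucky order:
a = load(counter)       # thread 1 reads 0
b = load(counter)       # thread 2 reads 0 before thread 1 writes back
store(counter, a + 1)   # thread 1 writes 1
store(counter, b + 1)   # thread 2 also writes 1 -- one increment is lost

assert counter["v"] == 1  # would be 2 if the increment were atomic
```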
        
         | terafo wrote:
         | > _but there really is some truth to that, isn 't there?_
         | 
         | There was an Arm version of AMD's Zen, it was called K12, but
         | it never made it to the market since AMD had to choose their
         | bets very carefully back then.
        
       | [deleted]
        
       | gchadwick wrote:
       | Interesting point about L1 cache sizes and the relationship
       | between page size
       | 
       | > Also, first-level caches are usually limited by the page size
       | times the associativity of the cache. If the cache is smaller
       | than that, the bits used to index into the cache are the same
        | regardless of whether you're looking at the virtual address or
       | the physical address, so you don't have to do a virtual to
       | physical translation before indexing into the cache. If the cache
       | is larger than that, you have to first do a TLB lookup to index
       | into the cache (which will cost at least one extra cycle), or
       | build a virtually indexed cache (which is possible, but adds
       | complexity and coupling to software). You can see this limit in
       | modern chips. Haswell has an 8-way associative cache and 4kB
       | pages. Its l1 data cache is 8 * 4kB = 32kB.
       | 
       | Having helped build the virtually indexed cache of the arm A55 I
       | can confirm it's a complete nightmare and I can see why Intel and
       | AMD have kept to the L1 data cache limit required to avoid it.
       | 
       | Interestingly Apple may have gone down the virtually indexed
       | route (or possibly some other cunning design corner) for the M1
        | with their 128 kB data cache. However I believe they
        | standardized on 16k pages, which would still allow physical
        | indexing with an 8-way associative cache. So what do they do
        | when they're running x86 code with 4k pages? Do they drop 75%
        | of their L1 cache to maintain physical indexing? Do they
        | aggressively try to merge the x86 4k pages into 16k pages with
        | some slow back-up when they can't do that? Maybe they've gone
        | with some special purpose hardware support for emulating x86 4k
        | pages on their 16k page architecture. Or have they just
        | implemented a virtually indexed cache?
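The limit quoted above is easy to verify numerically; a small sketch, assuming 64-byte cache lines:

```python
def index_bits(cache_bytes, ways, line_bytes=64):
    """Address bits needed to pick a cache line: line offset + set index."""
    sets = cache_bytes // (ways * line_bytes)
    return (line_bytes - 1).bit_length() + (sets - 1).bit_length()

PAGE_OFFSET_BITS_4K = 12   # 4 kB pages: low 12 bits identical in VA and PA

# Haswell-style 32 kB, 8-way L1D: all index bits fall inside the page
# offset, so the cache can be indexed before the TLB lookup finishes.
assert index_bits(32 * 1024, 8) == PAGE_OFFSET_BITS_4K

# A 64 kB, 8-way cache would need 13 index bits -- one more than a 4 kB
# page offset provides -- forcing a TLB lookup or a virtually indexed design.
assert index_bits(64 * 1024, 8) == 13

# M1-style 128 kB, 8-way with 16 kB pages (14 offset bits) fits again.
assert index_bits(128 * 1024, 8) == 14
```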
        
         | zozbot234 wrote:
         | > Do they aggressively try and merge the x86 4k pages into 16k
         | pages with some slow back-up when they can't do that?
         | 
         | This does not seem feasible because the 16k pages on ARM are
         | not "huge" pages; it's a completely different arrangement of
         | the virtual address space and page tables. The two are not
         | interoperable.
        
         | heavyset_go wrote:
         | Please add an RSS or Atom feed to your blog :)
        
       | [deleted]
        
       | jeffbee wrote:
       | 2015. A good exercise would be "What's new in CPUs since 2015?" A
       | few I can think of: branch target alignment has returned as a key
       | to achieving peak optimization, after a brief period of
        | irrelevance on x86; x86 user-space monitor/wait/pause have
        | exposed explicit power controls to user programs for the first
        | time.
       | 
       | One thing I would have added to "since the 80s" is the x86
       | timestamp counter. It really changed the way we get timing
       | information.
        
         | flakiness wrote:
          | big.LITTLE-like architecture? Even Intel has adopted that in
          | their 12th gen.
         | 
         | I believe a lot has happened around mobile and power as well.
         | Apple boasts their progress every year, and at least some of
         | them are real. But they are too secretive to talk about that. I
         | hope some competitors have written some related papers. For
         | example, the OP talks about dark silicon. What's going on
         | around it these days?
        
           | terafo wrote:
        
         | titzer wrote:
         | Spectre. It was a vulnerability before 2015, but not known
         | publicly until early 2018. It's hugely disruptive to
         | microarchitecture, particularly with crossing kernel/user space
         | boundaries, separating state between hyperthreads, etc.
        
         | dragontamer wrote:
         | L3 caches have grown monstrously.
         | 
          | The new AMD Ryzen 5800X3D has 96MB of L3 cache. This is so
          | monstrous that the 2048-entry TLB with 4kB pages can only
          | cover 8MB.
         | 
         | That's right, you run out of TLB-entries before you run out of
         | L3 cache these days. (Or you start using hugepages damn it)
         | 
         | ----------
         | 
         | I think Intel's PEXT and PDEP was introduced around 2015-era.
          | But AMD chips now execute PEXT / PDEP quickly, so it's now
         | feasible to use it on most people's modern systems (assuming
         | Zen3 or a 2015+ era Intel CPU). Obviously those instructions
         | don't exist in ARM / POWER9 world, but they're really fun to
         | experiment with.
         | 
         | PEXT / PDEP are effectively bitwise-gather and bitwise-scatter
         | instructions, and can be used to perform extremely fast and
         | arbitrary bit-permutations. I played around with them to
         | implement some relational-database operations (join, select,
         | etc. etc.) over bit-relations for the 4-coloring theorem. (Just
         | a toy to amuse myself with. A 16-bit bitset of
         | "0001_1111_0000_0000" means "(Var1 == Color4 and Var2==Color1)
          | or (Var2==Color2)".)
         | 
         | There's probably some kind of tight relational algebra /
         | automatic logic proving / binary decision diagram / stuffs that
         | you can do with PEXT/PDEP. It really seems like an unexplored
         | field.
         | 
         | ----
         | 
         | EDIT: Oh, another big one. ARMv8 and POWER9 standardized upon
         | the C++11 memory model of acquire-release. This was inevitable
         | because Java and C++ standardized upon the memory model in the
         | 00s / early 10s, so chips inevitably would be tailored for that
         | model.
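For readers without a BMI2 machine, PEXT/PDEP semantics are easy to emulate; a bit-at-a-time Python sketch (slow, but handy for experimenting):

```python
def pext(x, mask):
    """Gather the bits of x selected by mask into the low bits (like PEXT)."""
    out, j = 0, 0
    while mask:
        low = mask & -mask        # lowest set bit of mask
        if x & low:
            out |= 1 << j
        j += 1
        mask &= mask - 1          # clear that bit and continue
    return out

def pdep(x, mask):
    """Scatter the low bits of x into the positions set in mask (like PDEP)."""
    out, j = 0, 0
    while mask:
        low = mask & -mask
        if x & (1 << j):
            out |= low
        j += 1
        mask &= mask - 1
    return out

# Extract the high nibble of a byte, then deposit bits back into it.
assert pext(0b1010_1010, 0b1111_0000) == 0b1010
assert pdep(0b1010, 0b1111_0000) == 0b1010_0000
# Round trip: depositing what you extracted recovers the masked bits.
x, m = 0xBEEF, 0x0F0F
assert pdep(pext(x, m), m) == x & m
```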
        
           | seoaeu wrote:
           | > That's right, you run out of TLB-entries before you run out
           | of L3 cache these days.
           | 
            | This is more reasonable than it sounds. A TLB _miss_ can in
            | many cases be faster than an L3 cache _hit_.
        
             | jeffbee wrote:
             | It's also misleading because it has 8 cores and each of
             | them has 2048 l2 TLB entries. Altogether they can cover
             | 64MiB of memory with small pages.
        
               | dragontamer wrote:
               | But 5800x3D has 96MB of L3. So even if all 8 cores are
               | independently working on different memory addresses, you
               | still can't cover all 96MB of L3 with the TLB.
               | 
               | EDIT: Well, unless you use 2MB hugepages of course.
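The reach arithmetic in this subthread checks out; a quick sketch (core and entry counts as quoted above):

```python
KiB, MiB = 1024, 1024 * 1024

tlb_entries = 2048
per_core_reach = tlb_entries * 4 * KiB   # 4 kB pages: 8 MiB per core
assert per_core_reach == 8 * MiB

all_cores_reach = 8 * per_core_reach     # 64 MiB across 8 cores...
assert all_cores_reach == 64 * MiB
assert all_cores_reach < 96 * MiB        # ...still short of the 96 MB L3

huge_reach = tlb_entries * 2 * MiB       # 2 MiB hugepages: 4 GiB per core
assert huge_reach == 4 * 1024 * MiB
```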
        
               | jeffbee wrote:
               | That's another thing which is recent. Before Haswell, x86
               | cores had almost no huge TLB entries. IvyBridge only had
               | 32 in 2MiB mode, compared to 64 + 512 in 4KiB mode.
        
             | dragontamer wrote:
             | Are you sure? TLB misses mean a pagewalk. Sure, the
             | directory tree is probably in L3 cache, but repeatedly
             | pagewalking through L3 to find a memory address is going to
             | be slower than just fetching it from the in core TLB.
             | 
             | I know that modern cores have dedicated page walking units
             | these days, but I admit that I've never tested the speed of
             | them.
        
               | seoaeu wrote:
               | It only takes ~200KB to store page tables for 96MB of
               | address space. So the page table entries might mostly
               | stay in the L1 and L2 caches
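The ~200KB figure can be sanity-checked, assuming 8-byte x86-64 page-table entries and counting only the last-level entries:

```python
PAGE = 4096        # 4 kB pages
PTE = 8            # bytes per x86-64 page-table entry

leaf_pages = 96 * 1024 * 1024 // PAGE   # 24576 pages to map 96 MB
pte_bytes = leaf_pages * PTE            # last-level page-table entries only
assert pte_bytes == 192 * 1024          # 192 KiB -- roughly the ~200KB quoted
```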
        
         | jcranmer wrote:
         | Intel PT is another thing that's worth calling out since 2015
         | (see the other article on the front page right now,
         | https://news.ycombinator.com/item?id=31121319, for something
         | that benefits from it).
         | 
         | It does look like Hardware Lock Elision/Transactional Memory is
         | something that seems like it will be consigned to the dustbins
         | of history (again).
        
           | jeffbee wrote:
           | Intel did not ship even one working implementation of TSX, so
           | it's not like anyone will be inconvenienced that they
           | cancelled it.
        
       | luhn wrote:
       | One thing I've always wondered: How do pre-compiled binaries take
       | advantage of new instructions, if they do at all? Since the
       | compiler needs to create a binary that will work on any modern-
       | ish machine, is there a way to use new instructions without
       | breaking compatibility for older CPUs?
        
         | Jasper_ wrote:
          | Some compilers have dynamic dispatch for this; you run the
          | "cpuid" instruction and check for capability bits, and then
          | dispatch to the version you can support. Some dynamic linkers
          | can even link different versions of a function depending on
          | the CPU capabilities -- gcc has a special
          | __attribute__((__target__(...))) attribute for this.
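That dispatch pattern can be sketched in Python; the feature names and detection stub below are illustrative stand-ins (real code would execute CPUID, e.g. via GCC's `__builtin_cpu_supports`):

```python
def detect_features():
    # Stand-in for querying CPUID capability bits at startup.
    return {"sse2": True, "avx2": False}

def dot_scalar(xs, ys):
    # Baseline version that runs on any CPU.
    return sum(x * y for x, y in zip(xs, ys))

def dot_avx2(xs, ys):
    # Stand-in for a hand-vectorized version; same answer, faster in reality.
    return sum(x * y for x, y in zip(xs, ys))

# Resolve once at load time, like a GNU indirect function (ifunc) would.
FEATURES = detect_features()
dot = dot_avx2 if FEATURES.get("avx2") else dot_scalar

assert dot([1, 2, 3], [4, 5, 6]) == 32
```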
        
         | jeabays wrote:
         | Unfortunately, the answer is usually just to recompile.
        
         | runnerup wrote:
         | You'd need to recompile the binary to take advantage of new
         | instructions.
         | 
          | The compiler, or the code itself, can create branches where
          | the binary checks whether certain instructions are available
          | and, if they are not, uses a less optimal operation.
          | 
          | Basically backwards compatibility for modern binaries, but no
          | forward ability to use instructions that haven't been
          | invented yet.
         | 
         | Not all binaries are fully backwards compatible. If you're
         | missing AVX, a surprising number of games won't run. Sometimes
         | only because the launcher won't run, even though the game plays
         | without AVX.
        
           | HideousKojima wrote:
           | I've actually sometimes seen this as an argument in favor of
           | JITed languages like C# and Java, that you can take advantage
           | of newer CPU features and instructions etc. without having to
           | recompile. In practice languages that compile to native
           | binaries still win at performance, but it was interesting to
           | see it turned into a talking point.
        
             | wvenable wrote:
             | JIT languages still have a bit of a trade off.
             | 
             | But for a pure pre-compiled example there is Apple Bitcode
             | which is meant to be compiled to the destination
              | architecture before running. It's mandatory for Apple
              | watchOS apps, and when they released a watch with a
              | 64-bit CPU they just recompiled all the apps.
        
       ___________________________________________________________________
       (page generated 2022-04-22 23:00 UTC)