[HN Gopher] What's new in CPUs since the 80s? (2015)
___________________________________________________________________
What's new in CPUs since the 80s? (2015)
Author : snvzz
Score : 140 points
Date : 2022-04-20 06:11 UTC (2 days ago)
(HTM) web link (danluu.com)
(TXT) w3m dump (danluu.com)
| susrev wrote:
| The vast expanses of text with no formatting rules always make
| it hard for me to follow along. Added some simple rules that
| make it much easier to read.
|
| p { max-width: 1000px; text-align: center; margin-left: auto;
| margin-right: auto }
|
| body { text-align: center }
|
| MUCH easier (code snippets = toast)
| jimjambw wrote:
| What about reader mode?
| jcadam wrote:
| My Ryzen 9 is just a bit more performant than the MOS6502 my
| Apple ][ was rocking back in the 80s.
|
| Ok... it also has things like a built-in FPU, multiple cores,
| cache, pipelining, branch prediction, more address space, more
| registers, and is manufactured with a significantly better (and
| smaller) process.
| phendrenad2 wrote:
| Nice article. Kind of low-hanging fruit though. A comparison
| between CPUs in 2022 vs CPUs in 2002 would be much more
| interesting. ;)
| xmprt wrote:
| Not really low-hanging fruit if the last time you studied CPU
| design was in a university course. I personally found a lot of
| the information pretty interesting.
| nimbius wrote:
| my favourite addition since the 80s has been the unrelenting,
| unquestioned, ubiquitous and permanent inclusion of numerous
| iterations of poorly planned and executed management engine
| frameworks designed to completely ablate the user from the
| experience of general computing in the service of perpetual and
| mandatory DRM and security theatricality. the best aspect of
| this new feature is that not only is your processor effectively
| indistinguishable from a rented pressure washer, but on a long
| enough timeline the indelible omniscient slavemaster to which
| your IO is subservient can and will always find itself hacked.
| One of the biggest features missing from 80s processors was the
| ability to watch a C-level cog from a multinational conglomerate
| squirm in an overstarched tom ford three-piece as tech
| journalists methodically connect the dots between a corpulent
| scumbag and a hastily constructed excuse to hobble user freedoms
| at the behest of the next minions movie, to arrive finally at a
| conclusion that takes said chipmaker's stock to the barber.
|
| oh and chips come with light-up fans and crap now, but there's
| no open standard on how to control the light color, so everyone
| just leaves it in Liberace disco mode and it's like a wonderful
| little rainbow-coloured pride parade is marching through my
| case.
| Kuinox wrote:
| Is that GPT-3 generated text? It must be.
| Epiphany21 wrote:
| > oh and chips come with light up fans and crap now but theres
| no open standard on how to control the light color so everyone
| just leaves it in Liberace disco mode
|
| This is why I continue to pay a premium for workstation/server-
| grade hardware even when I'm assembling the system myself.
| hengheng wrote:
| Are you alright?
| kccqzy wrote:
| > With substantial hardware effort, Google was able to avoid
| interference, but additional isolation features could allow
| this to be done at higher efficiency with less effort.
|
| This is surprising to me. Running any workload that's not your
| own will trash the CPU caches and make your workload slower.
|
| Consider, for example, that your performance-sensitive code has
| nothing to do for the next 500 microseconds. If the core runs
| some other best-effort work, it _will_ trash the CPU caches, so
| that after those 500 microseconds, even when that other work is
| immediately preempted by the kernel, your performance-sensitive
| code is now dealing with a cold cache.
| ant6n wrote:
| 2015? I think there's a date missing on the page.
| MBCook wrote:
| Near the end, in one section, the author provides an update and
| refers to it being 2016, a year since they wrote the article.
| gotaquestion wrote:
| The section on power really understates the complexity.
| Throttling didn't appear until the mid-90s, as coarse chip-wide
| clock gating. Voltage/frequency scaling appeared a few years
| later (gradual P-state transitions). Then power control units
| monitored key activity signals and could not only scale the
| voltage, but estimate power and target specific blocks (e.g.,
| turning off the L1 D$).
|
| There are some more details in there but that's the main gist.
| The power control unit is its own operating system!
| Aardwolf wrote:
| Off the top of my head (before reading the article):
|
| caches, pipelining, branch prediction, memory protection, SIMD,
| floating point at all, hyper-threading, multi-core, needing
| cooling fins, let alone active cooling
|
| I wonder how much I've forgotten
| yvdriess wrote:
| All of those already existed by the 80s.
| ithkuil wrote:
| The US patent for the technology behind hyper-threading was
| granted to Kenneth Okin at Sun Microsystems in November 1994.
| code_biologist wrote:
| I don't want to dismiss hyper-threading as trite -- it's not,
| especially in implementation, but it is pretty obvious.
|
| Prior to 1994 the CPU-memory speed delta wasn't so bad that you
| needed to cover for stalled execution units constantly. Looking
| at the core clock vs FSB of 1994 Intel chips is a great
| throwback! [1] Then CPU speed exploded relative to memory, as
| was probably anticipated by forward-looking CPU architects in
| 1994.
|
| With slow memory there are a few obvious changes you make to
| the degree you need to cover for load stalls: 1) OoO execution,
| 2) data prefetching, 3) finding other computation (that likely
| has its own memory stalls) to interleave.
| The thread level is a pretty obvious granularity at which to
| interleave work, though deeply non-trivial to actually
| implement.
|
| Performance-oriented programmers have always had to think about
| memory access patterns. Needing to be friendly to your
| architecture there is not new since the 80s.
|
| [1] https://en.wikipedia.org/wiki/Pentium#Pentium
| formerly_proven wrote:
| The CDC 6600 ran ten threads on one processor in a way that
| seems a lot like the Niagara T1 on paper.
| varjag wrote:
| Most of that (possibly all) existed by the 1980s. The Z80 in my
| Spectrum had no heatsink ;)
| [deleted]
| amelius wrote:
| The riddance of segmented memory.
| infogulch wrote:
| Have you heard of this newfangled device called a Graphics
| Processing Unit, and "VRAM"?
| amelius wrote:
| Yeah, but that's the GPU, not the CPU. Hopefully we will see
| similar progress there in the next 40 years.
| vardump wrote:
| The alternative wasn't that great either. Having just 16
| address bits, with code and data stuck into the same 64 kB RAM
| area, was a lot worse (the 8051 was like that, for example).
| Or 65816-style banks, ugh.
|
| If you had to have _just 16_ address bits, having code (CS),
| stack (SS), data (DS), extra (ES), etc. segments was actually
| pretty nice. Memory copying and scanning operations were
| natural without needing to swap banks in the innermost loop.
|
| Of course, if you could afford 32-bit addressing, there's no
| comparison. Flat memory space is the best option, but I don't
| think it came for free.
| fpoling wrote:
| Segmented memory may come back to provide a cheap way to
| sandbox code within a process.
| VyseofArcadia wrote:
| > Even though incl is a single instruction, it's not
| guaranteed to be atomic. Internally, incl is implemented as a
| load followed by an add followed by a store.
|
| I've heard the joke that RISC won because modern CISC
| processors are just complex interfaces on top of RISC cores,
| but there really is some truth to that, isn't there?
| terafo wrote:
| > _but there really is some truth to that, isn't there?_
|
| There was an Arm version of AMD's Zen, called K12, but it
| never made it to market since AMD had to choose their bets
| very carefully back then.
| [deleted]
| gchadwick wrote:
| Interesting point about L1 cache sizes and their relationship
| with page size:
|
| > Also, first-level caches are usually limited by the page
| size times the associativity of the cache. If the cache is
| smaller than that, the bits used to index into the cache are
| the same regardless of whether you're looking at the virtual
| address or the physical address, so you don't have to do a
| virtual to physical translation before indexing into the
| cache. If the cache is larger than that, you have to first do
| a TLB lookup to index into the cache (which will cost at least
| one extra cycle), or build a virtually indexed cache (which is
| possible, but adds complexity and coupling to software). You
| can see this limit in modern chips. Haswell has an 8-way
| associative cache and 4kB pages. Its L1 data cache is 8 * 4kB
| = 32kB.
|
| Having helped build the virtually indexed cache of the Arm
| A55, I can confirm it's a complete nightmare, and I can see
| why Intel and AMD have kept to the L1 data cache limit
| required to avoid it.
|
| Interestingly, Apple may have gone down the virtually indexed
| route (or possibly some other cunning design corner) for the
| M1 with their 128 kB data cache. However, I believe they
| standardized on 16k pages, which would still allow physical
| indexing with an 8-way associative cache. So what do they do
| when they're running x86 code with 4k pages? Do they drop 75%
| of their L1 cache to maintain physical indexing?
| Do they aggressively try and merge the x86 4k pages into 16k
| pages, with some slow back-up when they can't do that? Maybe
| they've gone with some special-purpose hardware support for
| emulating x86 4k pages on their 16k-page architecture. Or have
| they indeed implemented a virtually indexed cache?
| zozbot234 wrote:
| > Do they aggressively try and merge the x86 4k pages into 16k
| pages with some slow back-up when they can't do that?
|
| This does not seem feasible, because the 16k pages on ARM are
| not "huge" pages; it's a completely different arrangement of
| the virtual address space and page tables. The two are not
| interoperable.
| heavyset_go wrote:
| Please add an RSS or Atom feed to your blog :)
| [deleted]
| jeffbee wrote:
| 2015. A good exercise would be "What's new in CPUs since
| 2015?" A few I can think of: branch target alignment has
| returned as a key to achieving peak optimization, after a
| brief period of irrelevance on x86; x86 user-space
| monitor/wait/pause has exposed explicit power controls to user
| programs for the first time.
|
| One thing I would have added to "since the 80s" is the x86
| timestamp counter. It really changed the way we get timing
| information.
| flakiness wrote:
| big.LITTLE-like architecture? Even Intel has adopted that in
| their 12th gen.
|
| I believe a lot has happened around mobile and power as well.
| Apple boasts about their progress every year, and at least
| some of it is real. But they are too secretive to talk about
| the details. I hope some competitors have written related
| papers. For example, the OP talks about dark silicon. What's
| going on around it these days?
| terafo wrote:
| titzer wrote:
| Spectre. It was a vulnerability before 2015, but not known
| publicly until early 2018. It's hugely disruptive to
| microarchitecture, particularly with crossing kernel/user
| space boundaries, separating state between hyperthreads, etc.
| dragontamer wrote:
| L3 caches have grown monstrously.
|
| The new AMD Ryzen 5800X3D has 96MB of L3 cache. This is so
| monstrous that the 2048-entry TLB with 4kB pages can only
| cover 8MB of it.
|
| That's right, you run out of TLB entries before you run out of
| L3 cache these days. (Or you start using hugepages, damn it.)
|
| ----------
|
| I think Intel's PEXT and PDEP were introduced around the 2015
| era. But AMD chips now execute PEXT / PDEP quickly too, so
| it's now feasible to use them on most people's modern systems
| (assuming Zen 3 or a 2015+ era Intel CPU). Obviously those
| instructions don't exist in the ARM / POWER9 world, but
| they're really fun to experiment with.
|
| PEXT / PDEP are effectively bitwise-gather and bitwise-scatter
| instructions, and can be used to perform extremely fast and
| arbitrary bit permutations. I played around with them to
| implement some relational-database operations (join, select,
| etc.) over bit-relations for the 4-coloring theorem. (Just a
| toy to amuse myself with. A 16-bit bitset of
| "0001_1111_0000_0000" means "(Var1 == Color4 and Var2==Color1)
| or (Var2==Color2)".)
|
| There's probably some kind of tight relational algebra /
| automated theorem proving / binary decision diagram stuff that
| you can do with PEXT/PDEP. It really seems like an unexplored
| field.
|
| ----
|
| EDIT: Oh, another big one. ARMv8 and POWER9 standardized upon
| the C++11 memory model of acquire-release. This was
| inevitable, because Java and C++ standardized their memory
| models in the 00s / early 10s, so chips would inevitably be
| tailored to that model.
| seoaeu wrote:
| > That's right, you run out of TLB-entries before you run out
| of L3 cache these days.
|
| This is more reasonable than it sounds. A TLB _miss_ can in
| many cases be faster than an L3 cache _hit_.
| jeffbee wrote:
| It's also misleading, because it has 8 cores and each of them
| has 2048 L2 TLB entries. Altogether they can cover 64MiB of
| memory with small pages.
| dragontamer wrote:
| But the 5800X3D has 96MB of L3.
| So even if all 8 cores are independently working on different
| memory addresses, you still can't cover all 96MB of L3 with
| the TLBs.
|
| EDIT: Well, unless you use 2MB hugepages of course.
| jeffbee wrote:
| That's another thing which is recent. Before Haswell, x86
| cores had almost no huge-page TLB entries. Ivy Bridge only
| had 32 in 2MiB mode, compared to 64 + 512 in 4KiB mode.
| dragontamer wrote:
| Are you sure? A TLB miss means a pagewalk. Sure, the directory
| tree is probably in L3 cache, but repeatedly pagewalking
| through L3 to find a memory address is going to be slower than
| just fetching it from the in-core TLB.
|
| I know that modern cores have dedicated page-walking units
| these days, but I admit that I've never tested their speed.
| seoaeu wrote:
| It only takes ~200KB to store page tables for 96MB of address
| space. So the page table entries might mostly stay in the L1
| and L2 caches.
| jcranmer wrote:
| Intel PT is another thing that's worth calling out since 2015
| (see the other article on the front page right now,
| https://news.ycombinator.com/item?id=31121319, for something
| that benefits from it).
|
| It does look like Hardware Lock Elision / Transactional
| Memory is something that will be consigned to the dustbin of
| history (again).
| jeffbee wrote:
| Intel did not ship even one working implementation of TSX, so
| it's not like anyone will be inconvenienced that they
| cancelled it.
| luhn wrote:
| One thing I've always wondered: How do pre-compiled binaries
| take advantage of new instructions, if they do at all? Since
| the compiler needs to create a binary that will work on any
| modern-ish machine, is there a way to use new instructions
| without breaking compatibility for older CPUs?
| Jasper_ wrote:
| Some compilers have dynamic dispatch for this; you run the
| "cpuid" instruction and check for capability bits, and then
| dispatch to the version you can support.
| Some dynamic linkers can even link in different versions of a
| function depending on the CPU capabilities -- GCC has a
| special __attribute__((target("..."))) attribute for this.
| jeabays wrote:
| Unfortunately, the answer is usually just to recompile.
| runnerup wrote:
| You'd need to recompile the binary to take advantage of new
| instructions.
|
| Alternatively, the compiler (or the code itself) can create
| branches where the binary checks whether certain instructions
| are available and, if they are not, uses a less optimal
| operation.
|
| Basically, backwards compatibility for modern binaries. But
| there is no forward ability to use future instructions that
| haven't been invented yet.
|
| Not all binaries are fully backwards compatible. If you're
| missing AVX, a surprising number of games won't run --
| sometimes only because the launcher won't run, even though the
| game plays fine without AVX.
| HideousKojima wrote:
| I've actually sometimes seen this as an argument in favor of
| JITed languages like C# and Java: that you can take advantage
| of newer CPU features and instructions without having to
| recompile. In practice, languages that compile to native
| binaries still win at performance, but it was interesting to
| see it turned into a talking point.
| wvenable wrote:
| JIT languages still have a bit of a trade-off.
|
| But for a pure pre-compiled example there is Apple Bitcode,
| which is meant to be compiled to the destination architecture
| before running. It's mandatory for Apple watchOS apps, and
| when they released a watch with a 64-bit CPU they just
| recompiled all the apps.
___________________________________________________________________
(page generated 2022-04-22 23:00 UTC)