[HN Gopher] What do RISC and CISC mean in 2020?
___________________________________________________________________
What do RISC and CISC mean in 2020?
Author : socialdemocrat
Score  : 85 points
Date   : 2020-11-20 12:13 UTC (10 hours ago)
(HTM) web link (erik-engheim.medium.com)
(TXT) w3m dump (erik-engheim.medium.com)
| seanalltogether wrote:
| So where do the power efficiency gains come into play? Are they
| a feature of ARM specifically, or of RISC in general?
| socialdemocrat wrote:
| Initially the difference between RISC and CISC processors was
| clear. Today many say there is no real difference. This story
| digs into the details to explain significant differences which
| still exist.
| StillBored wrote:
| It mostly misses the mark and just rehashes the old
| discussions. The microarchitectural concepts of both "RISC"
| designs and "CISC" designs are so similar across product lines
| as to be mostly indistinguishable. As mentioned, you have RISC
| designs using "micro ops" and microcode, and you have CISC
| designs doing 1:1 instruction-to-micro-op mapping. Both do
| various forms of cracking and fusing depending on the
| instruction. All have the same problems with branch prediction
| and speculative execution, and solve them with OoO in similar
| ways.
|
| Maybe the largest remaining difference is the strength of the
| memory model, since the size of the architectural register
| file, the complexity of addressing modes, and the other
| traditional RISC/CISC arguments are mostly pointless in the
| face of deep OoO superscalar machines doing register
| renaming/etc. from mop caches.
|
| Even then, like variable-length instructions (which, yes, exist
| on many RISCs in limited forms), this differentiation is more
| about when the ISA was designed than anything fundamental in
| the philosophy.
| [deleted]
| socialdemocrat wrote:
| The key difference between RISC and CISC is the ISA, which is
| still true. x86 has variable-length instructions which can in
| principle be very long (encodings are capped at 15 bytes in
| practice). RISC instructions are typically fixed length. Yes,
| there are exceptions, but that is how most instructions are
| designed.
|
| A RISC ISA is still designed around load/store, while e.g. x86
| has a variety of addressing modes.
|
| All these differences in the ISAs have some impact on what
| makes sense to do in the micro-architecture and how well you
| can do it. Sure, you can pipeline ANY CPU, but it will be
| easier to do so when you deal with mostly fixed-width
| instructions of quite similar complexity. On x86 there will be
| much more variety in the complexity of each instruction and you
| will be more prone to get gaps in the pipeline. As far as I
| understand, anyway.
| CodeArtisan wrote:
| I would say that the CISC philosophy is about lowering the
| complexity at the software level by raising it at the circuitry
| level, while the RISC philosophy is the inverse. These are
| philosophies you can apply not only to CPUs but also to virtual
| machines. Today ARM SoCs are adding more and more ASICs (AI,
| ray tracing, photo post-processing, GPU, ...), so instead of
| dealing with one complex ISA you now have to deal with multiple
| simpler ISAs.
| spear wrote:
| I think that's a misleading way of looking at things. There is
| no "CISC philosophy". RISC designs came out as a new way of
| doing things, and existing designs were called CISC for
| contrast. It's not like there were two schools of thought
| developed at the same time, with the CISC designs intentionally
| rejecting RISC philosophies.
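(To make the load/store point above concrete, here is a minimal
sketch: one C statement, plus the rough shape of the code a compiler
might emit for a register-memory ISA like x86-64 versus a load/store
ISA like AArch64. The instruction sequences in the comments are
illustrative, not the output of any particular compiler.)

    /* One C statement, two ISA styles. */
    void add_to_counter(long *counter, long delta) {
        *counter += delta;
        /* x86-64 (register-memory): a single instruction may read
         * memory, add, and write the result back:
         *     add qword ptr [rdi], rsi
         * AArch64 (load/store): memory is touched only by explicit
         * loads and stores, so the same statement becomes:
         *     ldr x2, [x0]
         *     add x2, x2, x1
         *     str x2, [x0]
         */
    }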
| diehunde wrote:
| If anyone wants to learn about the history behind RISC/CISC and
| much more info on the topic, I recommend listening to the David
| Patterson episode of Lex Fridman's podcast. David Patterson is
| one of the original contributors to RISC and the author of one
| of the best books on computer architecture. 2 hours of pure
| knowledge on the subject.
| klelatti wrote:
| Jim Keller on another episode of Lex Fridman's podcast is also
| excellent when explaining things like out-of-order execution.
| jcranmer wrote:
| The better explanation of RISC vs CISC is this old discussion
| from comp.arch: https://yarchive.net/comp/risc_definition.html
|
| In short, the term RISC comes from a new set of architecture
| designs in the late 80s/early 90s. CISC is not so much an
| architecture design as it is the lack of the features RISC
| adds. The major features that RISC adds are:
|
| * Avoid complex operations, which may include things such as
| multiply and divide. (Although note that "modern" RISCs have
| these nowadays.)
|
| * More registers (32 registers instead of 8 or 16). (ARM has
| 16. So does x86-64.)
|
| * Fixed-length instructions instead of variable-length.
|
| * Avoid indirect memory references or a lot of memory-accessing
| modes (note that x86 also does this).
|
| Functionally speaking, x86 itself is pretty close to RISC,
| especially in terms of how the operations themselves need to be
| implemented. The implementation benefits of RISC (especially in
| allowing pipelining) are largely applicable to x86 as well,
| since x86 really skips the problematic instructions that other
| CISCs have.
|
| > One of the key ideas of RISC was to push a lot of heavy
| lifting over to the compiler. That is still the case. Micro-ops
| cannot be re-arranged by the compiler for optimal execution.
|
| Modern compilers do use instruction scheduling to optimize
| execution, and instruction scheduling for microcoded execution
| is well understood.
|
| > Time is more critical when running micro-ops than when
| compiling. It is an obvious advantage in making it possible for
| an advanced compiler to rearrange code rather than relying on
| precious silicon to do it.
|
| All modern high-performance chips are out-of-order execution,
| because some instructions (especially memory!) take longer than
| others to execute. The "precious silicon" is silicon that has
| already been spent for that reason, whether RISC or CISC.
| Tuna-Fish wrote:
| > (ARM has 16. So does x86-64.)
|
| 64-bit ARM has 32 GPRs.
|
| > Fixed-length instructions instead of variable-length.
|
| This is the big legacy of RISC that helps "RISC" CPUs most
| against x86. M1 has 8-wide decode, with very few stages and low
| power consumption. Nothing like it can be done for an x86 CPU.
| The way modern x86 handles this is typically with a uop cache.
| But this costs a lot of power and area, and only provides full
| decode width for a relatively small pool of insns -- 4k on
| modern Zen, for example.
|
| > > One of the key ideas of RISC was to push a lot of heavy
| lifting over to the compiler. That is still the case. Micro-ops
| cannot be re-arranged by the compiler for optimal execution.
|
| > Modern compilers do use instruction scheduling to optimize
| execution, and instruction scheduling for microcoded execution
| is well understood.
|
| Compiler-level instruction scheduling is mostly irrelevant for
| modern OoO architectures.
| Most of the time the CPU is operating from mostly full queues,
| so it will be doing the scheduling over the last ~10-16
| instructions. Compilers mostly still do it out of inertia.
|
| > "precious silicon"
|
| The big difference between 25 years ago and today is indeed
| that silicon is now the opposite of precious. We have so many
| transistors available that we are looking for ways to
| effectively use more of them, rather than ways to save precious
| silicon.
| fulafel wrote:
| Aside - modern general-purpose CPUs tend to be OoO, but
| processors used for demanding computation in things like modems
| (signal processing/SDR) and GPUs (graphics) tend not to be.
| idividebyzero wrote:
| I think it depends on the definition you use. If you require
| RISC to be a load/store architecture, x86 is not even close to
| being one. Also, aarch64 is a variable-length instruction set
| and includes complex instructions (such as those to perform AES
| operations). Compiler optimizations are meant to be taken
| advantage of by all architectures, regardless of RISC/CISC.
| Tuna-Fish wrote:
| 64-bit Arm is fixed width. Modern 32-bit Arm was _not_ fixed
| width, as Thumb-2 was widely used.
| jcranmer wrote:
| Personally, I think the RISC/CISC "question" isn't really
| meaningful anymore, and it's not the right lens with which to
| compare modern architectures. Partially, this is because the
| modern prototypes of RISC and CISC--ARM/AArch64 and x86-64,
| respectively--show a lot more convergent evolution and
| blurriness than the architectures at the time the terms were
| first coined.
|
| Instead, the real question is microarchitectural. First, what
| are the actual capabilities of your ALUs, how are they
| pipelined, and how many of them are there? Next, how good are
| you at moving stuff into and out of them--the memory
| subsystem, branch prediction, reorder buffers, register
| renaming, etc. The ISA only matters insofar as it controls
| how well you can dispatch into your microarchitecture.
|
| It's important to note how many of the RISC ideas _haven't_
| caught on. The actual "small" part of the instruction set,
| for example, is discarded by modern architectures (bring on
| the MUL and DIV instructions!). Designing your ISA to let you
| avoid pipeline complexity (e.g., branch delay slots) also fell
| out of favor. The general notion of "let's push hardware
| complexity to the compiler" tends to fail because it turns
| out that hardware complexity lets you take advantage of
| dynamic opportunities that the compiler fundamentally cannot
| exploit statically.
|
| The RISC/CISC framing of the debate is unhelpful in that it
| draws people's attention to rather superficial aspects of
| processor design instead of the aspects that matter more for
| performance.
| brandmeyer wrote:
| > It's important to note how many of the RISC ideas haven't
| caught on.
|
| 2-in, 1-out didn't, either. Nowadays all floating-point
| units support 3-in, 1-out via fused multiply-add. SVE
| provides a mask argument to almost everything.
| klelatti wrote:
| Unless you're using a definition I'm not familiar with,
| aarch64 isn't a variable-length instruction set - here's
| Richard Grisenthwaite, Arm's lead architect, introducing
| ARMv8; the slide here confirms "New Fixed Length Instruction
| Set":
|
| https://youtu.be/GBeEEfmJ3NI?t=570
| idividebyzero wrote:
| I understand that they refer to it as a fixed-length
| instruction set, and that's correct; note, though, that not
| all ARMv8 instructions are 4 bytes long.
| Indeed, some instructions that appear together are fused into
| a single one, and SVE, for instance, introduces a prefix; so,
| in practice, this means that sometimes instructions can be 8
| bytes long.
| brandmeyer wrote:
| Macro-op fusion of the MOVW/MOVT family doesn't count. At
| the time of that presentation, SVE didn't exist. Even now,
| the masked move instruction in SVE can stand on its own as a
| single instruction, and sometimes it does get emitted as its
| own uop.
| klelatti wrote:
| Thanks, yes of course. I guess it's probably fair to say that
| philosophically it's fixed-length, in the way that the
| original Arm was RISC, i.e. with some very non-RISC-y
| instructions. Very different to x86 though.
| dragontamer wrote:
| The main issue with linking a discussion from 25 years ago is
| that the discussion from then is almost irrelevant in today's
| environment.
|
| The Apple M1 has over 600 reorder buffer registers (while
| Skylake and Zen are around 200ish). The 16 or 32 architectural
| registers of the ISA are pretty much irrelevant compared to the
| capabilities of the out-of-order engine on modern chips.
|
| A 200-, 300-, or 600+-register ISA is unfathomable to those
| from 1995. Not only that, but the way we got our software to
| scale to such "shadow register" sets is due to an improvement
| in compiler technology over the last 20 years.
|
| Modern compilers write code differently (aka "dependency
| cutting"). Modern chips take advantage of those dependency
| cuts, and use them to "malloc" those reorder buffer registers,
| and as a basis for out-of-order execution.
|
| While the tech for this existed back in the 90s... it wasn't
| widespread yet. Even the best posts from back then would be
| unable to predict what technologies came out 25 years into the
| future.
| temac wrote:
| If I remember correctly the M1 has around 600 reorder buffer
| entries, and I just checked: AnandTech estimates the int
| register file at around 354 entries. That's still big, but
| not 600.
| dragontamer wrote:
| Ah hah, but there are also 300 FP registers!!
|
| Okay, you got me. I somehow confused the register file with
| the reorder buffer in the above post. But I think I may
| still manage to be "technically correct" thanks to the FP
| register file (even though it's not really fair to count
| those).
| qwerty456127 wrote:
| > Avoid complex operations, which may include things such as
| multiply and divide.
|
| How can this possibly make sense? Almost every application
| multiplies and divides all the time anyway. It usually is a
| good idea to implement frequently used operations in hardware
| because a hardware implementation is always more efficient
| than a software implementation, isn't it?
| Asooka wrote:
| That was from back when CPUs didn't really have native
| division or multiplication. A mul or div would literally be
| like calling a function to do it using other arithmetic
| instructions, except the function is stored in the CPU. Which
| goes against the RISC philosophy and makes the CPU more
| complex for not much gain.
| fanf2 wrote:
| I am also a fan of John Mashey's analysis that you linked to!
| The key thing is that he counts things like instruction
| formats, addressing modes, memory ops per instruction,
| registers, and so on. There is a clear separation in the
| numbers between the RISCs and the CISCs.
|
| What stuck out to me when I first read it 25 years ago is that
| the ARM is the least RISCy RISC, and x86 is the least CISCy
| CISC.
| At that time the Pentium was killing the 68060 and many of the
| RISCs, and it seemed clear that x86 had a big advantage in its
| relatively small number of memory ops per instruction.
| qwerty456127 wrote:
| > Functionally speaking, x86 itself is pretty close to RISC,
|
| AFAIK some x86 implementations (e.g. AMD K6-3) had RISC cores
| and translation units.
| Const-me wrote:
| > But is that really true?
|
| Yes.
|
| > Microprocessors (CPUs) do very simple things
|
| Look at instructions like vfmadd132ps on AMD64, or the ARM
| equivalent VMLA.F32. Neither of them is simple.
|
| > It is part of Intel's intellectual property
|
| Patents have expiration dates. You probably can't emulate
| Intel's AVX512 because it's relatively new, but whatever
| patents covered SSE1 or SSE2 expired years ago.
|
| > If you go with x86 you have to do all that on external chips.
|
| Encryption is right there; see AES-NI or SHA.
|
| > Another core idea of RISC was pipelining
|
| I don't know whose idea it was, but the first hardware
| implementation was the Intel 80386, in 1985.
| tenebrisalietum wrote:
| This is a great article. Anyone who parrots "Intel uses RISC
| internally" when talking about CISC/RISC should be directed
| here for edification and correction.
| Analemma_ wrote:
| I'm not sure this article refutes those people? Every time I've
| heard someone say "Intel uses RISC internally", what they mean
| is that the decoding logic used to turn x86 instructions into
| uops (and thus get the benefits of RISC) takes up a fixed
| number of transistors on the die that RISC doesn't need, and
| this penalty becomes proportionally larger at lower power
| levels, hence why x86 is still a good performer on
| servers/HEDTs but got crushed in mobile. Which is pretty much
| what this article says as well.
| StillBored wrote:
| No, it doesn't explain Intel getting crushed on mobile; that
| is more a question of focus. You have to remember that those
| "big" decoders can be scaled down to a few thousand
| transistors, as seen on something like the 486, if you're
| willing to pay the performance penalty.
|
| The entire 486 was something like 1M transistors (including
| cache/mmu/fpu/etc.), which makes it smaller than pretty much
| every modern design that can run a full-blown OS like Linux.
|
| When you look at something like a modern x86 with dual
| 512-bit vector units, what you see are things consuming power
| that frequently don't exist on the smaller designs (like that
| vector unit; a modern Arm might have a dual-issue 128-bit
| NEON unit).
|
| Here is a cute graphic:
| https://en.wikipedia.org/wiki/File:Moore%27s_Law_Transistor_...
| socialdemocrat wrote:
| RISC is about the ISA, not the micro-ops. One of the points of
| RISC is to give the compiler a simpler instruction set to deal
| with. Micro-ops are invisible to the compiler. You cannot
| spend a bunch of extra compile time rearranging them in an
| optimal fashion.
|
| Micro-ops are an implementation detail you can change at any
| time. The ISA you are stuck with for a long time.
|
| Thus saying x86 is RISC-like doesn't make sense; it would
| imply that the x86 ISA is RISC-like, which it is not.
|
| The benefits of uops are separate from the benefits of RISC.
| Even a RISC processor can turn its instructions into uops. But
| you cannot break CISC instructions into as easy and steady a
| stream of uops as you can RISC instructions, which have a much
| more even level of complexity.
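(A minimal sketch of why that "steady stream" is easier with
fixed-length encodings. The helper names are hypothetical; real
decoders work on raw fetch windows, and real x86 parts predecode
boundary marks or cache decoded uops to work around the serial
dependence shown here.)

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical stand-ins for real decode logic. */
    static void decode_one(uint32_t insn) { (void)insn; }
    static size_t insn_length(const uint8_t *p) { return 1 + (p[0] & 7); }

    /* Fixed 4-byte instructions: the k-th instruction of a fetch
     * window starts at byte 4*k, so eight decoders can all start
     * at once -- the iterations are independent. */
    static void decode_fixed(const uint32_t *window, int n) {
        for (int k = 0; k < n; k++)
            decode_one(window[k]);
    }

    /* Variable-length instructions: instruction k+1 starts wherever
     * instruction k happens to end, so finding the boundaries is
     * inherently serial. */
    static void decode_variable(const uint8_t *bytes, size_t len) {
        size_t off = 0;
        while (off < len)
            off += insn_length(&bytes[off]);  /* must finish first */
    }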
| dragontamer wrote:
| With over 600 reorder buffer registers in the Apple M1
| executing deeply out-of-order code, this blog post rehashes
| decades-old arguments without actually discussing what makes
| the M1 so good.
|
| The Apple M1 has the widest architecture and the thickest
| dispatch I've seen in a while. Second only to the POWER9 SMT8
| (which had 12-uop dispatch), the Apple M1 dispatches 8 uops per
| clock cycle (while x86 designs only aim at 4 uops per clock
| tick).
|
| That's where things start. From there, those 8 dispatched
| instructions enter a very wide set of superscalar pipelines and
| strongly branch-predicted, out-of-order execution.
|
| Rehashing old arguments about "not enough registers" just
| doesn't match reality. x86 Skylake and Zen have 200+ ROB
| registers (reorder buffers), which the compiler has plenty of
| access to. The 32 ARM registers on M1 are similarly "faked",
| just a glorified interface to the 600+ reorder buffers on the
| Apple M1.
|
| The Apple M1 does NOT expose those 600+ registers directly,
| because it needs to remain compatible with old ARM code. But
| old ARM code compiled _CORRECTLY_ can still use those registers
| through a mechanism called dependency cutting. Same thing with
| x86 code. All modern assembly does this.
|
| ------
|
| "Hyperthreading" is not a CISC concept. POWER9 SMT8 can push 8
| threads onto one core, and there are ARM chips with 4 threads
| on one core. Heck, GPUs (which are probably the simplest cores
| on the market) have 10 to 20+ wavefronts per execution unit
| (!!!).
|
| Pipelining is NOT a RISC concept, not anymore. All chips today
| are pipelined: you can issue SIMD multiply-add instructions on
| x86, on both Zen3 and Intel Skylake, multiple times per clock
| tick, despite them having ~5 cycles (or was it 3 cycles? I
| forget...) of latency. All chips have pipelining.
|
| -------
|
| Skylake / Zen actually have larger caches than M1. I wouldn't
| say M1 has the cache advantage, outside of L1. Loads/stores to
| L2 cache in Skylake / Zen can be issued once per clock tick,
| though at a higher latency than L1 cache. With 256kB or 512kB
| of L2 cache, Skylake/Zen actually have ample cache.
|
| The cache discussion needs to be about the latency
| characteristics of L1. Being so much bigger, the M1's L1 cache
| is almost certainly higher latency than Skylake's or Zen's
| (especially in absolute terms, because Skylake/Zen clock at
| 4GHz+). But there are probably power-consumption benefits to
| running the L1 cache wider at 2.8GHz instead.
|
| That's the thing about cache: the bigger it is, the harder it
| is to keep fast. That's why L1 / L2 caches exist on x86: L2 can
| be huge (but higher latency), while L1 can be small but far
| lower latency. A compromise in sizes (128kB on M1) is just
| that: a compromise. It has nothing to do with CISC or RISC.
| dimtion wrote:
| Do you happen to know where to find any resources on how Apple
| managed to make the M1 so good compared to the competition?
|
| And why has this not happened before with other manufacturers?
| nialo wrote:
| The vague impression I get is that maybe the answer is
| "Because Apple's software people and chip design people are
| in the same company, they did a better job of coordinating to
| make good tradeoffs in the chip and software design."
|
| (I'm getting this from reading between the lines on Twitter, so
| it's not exactly a high-confidence guess)
| dragontamer wrote:
| > Do you happen to know where to find any resources on how
| Apple managed to make the M1 so good compared to the
| competition?
|
| If you know computer microarchitecture, the specs have been
| discussed all across the internet by now. Reorder buffers,
| execution widths, everything.
|
| If you don't know how to read those specs... well... that's a
| bit harder. I don't really know how to help ya there. Maybe
| read Agner Fog's microarchitecture manual until you understand
| the subject, and then read the M1 microarchitecture
| discussions?
|
| I do realize this is a non-answer. But... I'm not sure there's
| any way to easily understand computer microarchitecture unless
| you put in the effort to learn it.
|
| https://www.agner.org/optimize/
|
| Read Manual #3: Microarchitecture. Understand what all the
| parts of a modern CPU do. Then, when you look at something
| like the M1's design, it becomes obvious what all those parts
| are doing.
|
| > And why has this not happened before with other
| manufacturers?
|
| 1. Apple is on TSMC 5nm, and no one else can afford that yet.
| So they have the most advanced process in the world, and Apple
| pays top dollar to TSMC to ensure they're the first on the new
| node.
|
| 2. Apple has made some interesting decisions that run very
| much counter to Intel's and AMD's approaches. Intel is pushing
| wider vector units, as you might know (AVX512), and despite
| the poo-pooing of AVX512, it gets the job done. AMD's approach
| is "more cores": they have a 4-wide execution unit and are
| splitting up their chips across multiple dies now to give
| better and better multithreaded performance.
|
| Apple's decision to make an 8-wide decoder engine is a
| decision, a compromise, which will make scaling up to more
| cores more difficult. Apple's core is simply the biggest core
| on the market.
|
| Whereas AMD decided that 4-wide decode was enough (and then
| split into new cores), Apple ran the math and came to the
| opposite conclusion, pushing for 8-wide decode instead. As
| such, the M1 will achieve the best single-threaded numbers.
|
| ---------
|
| Note that Apple has also largely given up on SIMD execution.
| ARM 128-bit vectors are supported, but AVX2 from x86 land and
| AVX512 support 256-bit and 512-bit vectors respectively.
|
| As such, the M1's 128-bit-wide vectors are its weak point, and
| it shows. Apple has decided that integer performance is more
| important. It seems like Apple is using either its iGPU or
| Neural Engine for heavy math / compute applications, however.
| (The Neural Engine is a VLIW architecture, and iGPUs are of
| course just wider SIMD units in general.) So Apple's strategy
| seems to be to offload the SIMD compute to other, more
| specialized processors (still on the same SoC).
| temac wrote:
| > Apple's decision to make an 8-wide decoder engine is a
| decision, a compromise, which will make scaling up to more
| cores more difficult. Apple's core is simply the biggest core
| on the market.
|
| > Whereas AMD decided that 4-wide decode was enough (and then
| split into new cores), Apple ran the math and came to the
| opposite conclusion, pushing for 8-wide decode instead. As
| such, the M1 will achieve the best single-threaded numbers.
|
| It's not that simple. x86 is way more difficult to decode
| than ARM.
| Also, the insanely large OoO machinery probably helps a lot
| to keep the wide M1 beast occupied. Does the large L1 help? I
| don't know. Maybe a large enough L2 would be OK. And the perf
| cores do not occupy the largest area of the die. Can you do a
| very large L1 without too bad a latency impact? I guess a
| small node helps, plus maybe you keep a reasonable
| associativity and a traditional L1 lookup thanks to the larger
| pages. So I'm curious what happens with 4kB pages - it
| probably has that mode for emulation?
|
| Going specialized instead of putting large vectors in the CPU
| also makes complete sense. You want to be slow and wide to
| optimize for efficiency. Of course that's less possible for
| mainly scalar and branch-rich workloads, so you can't be as
| wide on a CPU. You still need a middle ground for your
| low-latency compute needs in the middle of your scalar code,
| and 128 bits certainly is one, especially if you can scale to
| lots of execution units (at this point I admit you could also
| support a wider size, but it shows that the impact of staying
| at 128 won't necessarily be crazy if structured like that).
| One could argue for 256, but 512 starts to be unreasonable and
| probably has a far worse impact on core size than wide
| superscalar does - or at least, even if the impact is similar
| (I'm not sure), I suspect that wide superscalar is more useful
| most of the time. It's understandable that a more CPU-oriented
| vendor will be far more interested in large vectors. Apple is
| not that -- although of course what they do for their high end
| will be extremely interesting to watch.
|
| Of course you have to solve a wide variety of problems, but
| the recent AMD approach has shown that the good old method of
| optimizing for real workloads just continues to be the way to
| go. Who cares if you have somewhat more latency in infrequent
| cases, or if int <-> fp is slower, if in the end that lets you
| optimize the structures where you reap the most benefit. Each
| design has its own history, obviously, and the mobile roots of
| the M1 are a strong influence, plus the vertical integration
| of Apple helps immensely.
|
| I want to add: even if the M1 is impressive, its end result is
| not insanely far ahead of what AMD does on 7nm. But of course
| they will continue to improve.
| klelatti wrote:
| Interested in your comment on AMD 'optimising for real
| workloads'. Presumably Apple will have been examining the
| workloads they see on their OS (and they are writing more of
| that software than AMD), so I'm not sure I see the
| distinction.
| dragontamer wrote:
| AMD's design is clearly aimed at cloud servers with 4 cores /
| 8 threads per VM.
|
| It's so obvious: 4 cores per CCX sharing an L3 cache (where
| it's inefficient to communicate with other CCXes). Like, AMD
| EPYC is so, so so, SOOO very good at it. It ain't even funny.
|
| It's like AMD started with the 4-core/8-thread VM problem and
| then designed a chip around that workload. Oh, but it can't
| talk to the 5th core very efficiently?
|
| No problem: VMs just don't really talk to other customers'
| cores that often anyway. So that's not really a disadvantage
| at all.
| temac wrote:
| I was not really thinking about Apple when writing that part,
| more about some weak details of Zen N vs. Intel that do not
| matter in the end (at least for most workloads), be it
| inter-core or intra-core.
|
| I think the logical design space is so vast now that there is
| largely enough freedom to compete even when addressing a vast
| corpus of existing software, even if said software is tuned
| for previous or competitor chips. That was already true at the
| time of the PPro; with thousands of times more transistors, it
| is even more so. And that makes it even more sad that Intel
| has been stuck on basically Skylake on their 14nm for so long.
| dragontamer wrote:
| If I were to guess what this M1 chip was designed for: it was
| for JIT-compiling, and then executing, JIT code (JavaScript
| and/or Rosetta).
| klelatti wrote:
| Thanks. I commented because my mental model was that Apple had
| a significantly easier job, with a fairly narrow set of
| significant applications to worry about - many of which they
| write - compared to a much wider base for, say, AMD's server
| CPUs.
|
| But I guess that this all pales into insignificance compared
| to the gains of going from Intel 14nm to TSMC 5nm.
| Zigurd wrote:
| This - "and Apple pays top dollar to TSMC to ensure they're
| the first on the new node" - is Tim Cook's crowning
| achievement in the way Apple combines supply-chain dominance
| with technology strategy.
|
| They do not win every bet they make (e.g. growing their own
| sapphire) but when they win it is stunning.
| fulafel wrote:
| It did happen before, with
|
| a) Apple - look at the benchmarks of Apple chips vs other ARM
| implementations from past years. The M1 is essentially the
| same SoC as the current iPad one, with more cores and memory.
|
| b) other manufacturers: there have been "wow" CPUs from time
| to time. Early MIPS chips, the Alpha's victorious period of
| the 21064/21164/21264, the Pentium Pro, AMD's K7, StrongARM
| (an Apple connection here as well), etc. Then Intel managed to
| torpedo the fragmented high-performance RISC competition and
| convinced their patrons to jump ship to the ill-fated Itanium,
| which led to a long lull in serious competition.
| ip26 wrote:
| The L1 cache size _is_ linked to the architecture though. The
| variable-length instructions of x86 mean you can fit more of
| them in an L1i of a given size. So, in short, ARM pays for
| easier decode with a larger L1i, while x86 pays more for
| decode in exchange for a smaller L1i.
|
| As a spectator it's hard to know which is the better tradeoff
| in the long run. As area gets cheaper, is a larger L1i so bad?
| Yet on the other hand, cache is ever more important as CPU
| speed outstrips memory.
|
| In a form of convergent evolution, the uop cache bridges the
| gap - x86 spends some of the area saved on the L1i here.
| cesarb wrote:
| There's another consideration: for a VIPT cache (which is
| usually the case for the L1 cache), the page size limits the
| cache size, since the cache can only be indexed by the bits
| which are not translated. For legacy reasons, the base page
| size on x86 is always 4096 bytes, so an 8-way VIPT cache is
| limited to 32768 bytes (and adding more ways is costly). On
| 64-bit ARM, the page size can be 4K, 16K, or 64K, with the
| latter being required to reach the maximum amount of physical
| memory, and since it has been that way since the beginning,
| AFAIK it's common for 64-bit ARM software to be ready for any
| of these three page sizes.
|
| I vaguely recall reading somewhere that Apple uses the 16K
| page size, which, if they use an 8-way VIPT L1 cache, would
| limit their L1 cache size to 128K.
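(The arithmetic behind those limits, as a runnable sketch; the
8-way associativity and page sizes are the ones from the comment
above.)

    #include <stdio.h>

    /* A VIPT cache is indexed only by untranslated page-offset
     * bits, so each way can hold at most one page's worth of data:
     * max cache size = page size * associativity. */
    int main(void) {
        const unsigned ways = 8;
        printf("4 KB pages:  %u KB max\n", 4 * ways);  /* 32 KB, typical x86 L1 */
        printf("16 KB pages: %u KB max\n", 16 * ways); /* 128 KB, matches the M1 figure */
        return 0;
    }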
| dragontamer wrote:
| AMD Zen 3 has 512kB of L2 cache per core, with more than
| enough bandwidth to support multiple reads per clock tick.
| Instructions can fit inside that 512kB L2 cache just fine.
|
| AMD Zen 3 has 32MB of L3 cache across 8 cores.
|
| By all accounts, Zen3 has "more cache per core" than Apple's
| M1. The question is whether AMD's (or Intel's) L1/L2 split is
| worthwhile.
|
| ---------
|
| The difference in cache is that Apple has decided on an L1
| cache that's smaller than AMD's / Intel's L2 cache, but larger
| than AMD's / Intel's L1 cache. That's it.
|
| It's a question of cache configuration: a "flatter" two-level
| cache on M1 vs a "bigger" three-level cache on Skylake / Zen.
|
| -------
|
| That's the thing: it's a very complicated question. Bigger
| caches simply have more latency. There's no way around that
| problem. That's why x86 processors have multi-tiered caches.
|
| Apple has gone against the grain, made an absurdly large L1
| cache, and skipped the intermediate cache entirely. I'm sure
| Apple's engineers have done their research on it, but there's
| nothing simple about this decision at all. I'm interested to
| see how this performs in the future (whether new bottlenecks
| will come forth).
| klelatti wrote:
| It's an interesting point. I guess ARM must have done quite a
| lot of analysis in the run-up to the launch of aarch64 in 2010
| when, with roughly a blank sheet of paper on the ISA, they
| could have decided to go for variable-length instructions for
| this reason (especially given their history with Thumb). On
| the other hand, presumably the focus was on power, given the
| immediate market, and so the simpler decode would have been
| beneficial for that reason.
| egsmi wrote:
| > With over 600 reorder buffer registers in the Apple M1
| executing deeply out-of-order code
|
| Can you provide a link to how this was determined? I did some
| searches but couldn't find anything. I'd be very interested to
| see how it was measured.
| dragontamer wrote:
| https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
|
| The reorder buffers determine how "deep" you can go out of
| order. Roughly speaking, 600+ means that an instruction from
| 600+ instructions ago can still be waiting for retirement. You
| can be "600 instructions out of order", so to speak.
|
| ----------
|
| Each time you hold a load/store out of order on a modern CPU,
| you have to store that information somewhere. The "retirement
| unit" then waits for all instructions to be put back into
| order correctly.
|
| Something like Apple's M1, with 600+ reorder buffer registers,
| will search for instruction-level parallelism up to 600
| instructions into the future before the retirement unit tells
| the rest of the core to start stalling.
|
| For a realistic example, imagine a division instruction (which
| may take 80 clock ticks to execute on AMD Zen). Should the CPU
| just wait for the divide to finish before continuing
| execution? Heck no! A modern core will execute future
| instructions out of order while waiting for the division to
| finish. As long as reorder buffer registers are available, the
| CPU can continue to search for other work to do.
|
| --------
|
| There's nothing special about Apple's retirement unit, aside
| from being ~600 entries big. Skylake and Zen are ~200 to ~300
| big, IIRC. Apple just decided they wanted a wider core, and
| therefore made one.
| egsmi wrote:
| I see how it worked. That measurement uses the 2013 technique
| published by Henry Wong.
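(For reference, the shape of that measurement, as a schematic C
sketch only - Wong's actual harness generates unrolled assembly,
and the exact choice of filler and miss instructions matters a
lot. The function name and buffer setup here are made up.)

    #include <stdint.h>

    /* Two chained, cache-missing loads separated by n_filler cheap
     * ops. While miss #1 is outstanding, the core can only reach
     * miss #2 if miss #1, the fillers, and miss #2 all fit in the
     * reorder buffer; the measured time per iteration therefore
     * jumps once n_filler exceeds roughly the ROB size. */
    uint64_t rob_probe(void **chain, int n_filler, long iters) {
        void **p = chain;   /* pointer chain through a large,
                               cache-hostile buffer */
        uint64_t acc = 0;
        for (long i = 0; i < iters; i++) {
            p = (void **)*p;                 /* miss #1 */
            for (int k = 0; k < n_filler; k++)
                acc += k;                    /* filler ops (a real
                                                harness unrolls these
                                                as independent insns) */
            p = (void **)*p;                 /* miss #2 */
        }
        return acc + (uintptr_t)p;
    }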
| I think it's probably a reasonable estimate of the instruction
| window length, but to say that's the same as the buffer size
| is making a number of architectural assumptions that I haven't
| seen any evidence to justify. I suppose in the end it doesn't
| really matter to users of the chip, though.
|
| http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
| klelatti wrote:
| I so wanted to like this article - and due credit to the author
| for trying to explain these points - but it often slides into
| comments that are potentially misleading.
|
| In particular, the use of 'the ARM ISA' (singular) with an
| allusion to Thumb at one point (aarch32) whilst mostly talking
| about the (aarch64) M1 isn't helpful (and there are other
| points too).
|
| And I think the RISC vs CISC categorisation was useful in 1990,
| but there are other, more important aspects to focus on in
| 2020.
| kevin_thibedeau wrote:
| x86 has been RISC since the Pentium Pro. There is no point
| dithering over the fine details, especially when x64 removes
| the register-pressure issues for compilers, and considering
| that ARM has a bloated ISA.
| socialdemocrat wrote:
| RISC isn't about the size of the ISA but about the type of
| instructions. RISC instructions are fixed width and have low
| complexity from a decoding and pipelining point of view.
|
| The Pentium Pro was not RISC; that is just Intel marketing
| speak. Micro-ops can be produced in a RISC CPU as well - they
| are separate from having a RISC ISA. The RISC ISA is about
| what the compiler sees and can do. The compiler cannot see the
| Pentium Pro's micro-ops. Those are hidden from the compiler.
| The compiler cannot rearrange and optimize them the way it can
| with instructions in the ISA.
| pizlonator wrote:
| This is really great, but RISC CPUs can have microcode too.
| Nothing stops them from doing that.
|
| The big diff is load/store:
|
| - Loads and stores are separate instructions in RISC and never
| implied by other ops. In CISC, you have orthogonality: most
| places that can take a register can also take a memory
| address.
|
| - Because of load/store, you need fewer bits in the
| instruction encoding for encoding operands.
|
| - Because you save bits on operands, you have more bits left
| to encode registers.
|
| - Because you have more bits to encode registers, you can have
| more architectural registers, so compilers have an easier time
| doing register allocation and emit less spill code.
|
| That might be an oversimplification since it totally skips the
| history lesson. But if we take RISC and CISC as trade-offs you
| can make today, the trade-off is as I say above and has little
| to do with pipelining or microcode. The trade-off is just:
| you're gonna have finite bits to encode shit, so if you move
| the loads and stores into their own instructions, you can have
| more registers.
| bitwize wrote:
| RISC typically means "load-store architecture": load operands
| from memory to regs, perform operations in regs only, store
| results back to memory.
|
| CISC refers to old-school, programmer-friendly,
| addressing-mode-laden ISAs. Add D0 to the address pointed to
| by A0 plus an immediate offset and store the result back to
| memory, that sort of thing.
| Aardwolf wrote:
| I guess it means the same as the difference between "VGA" and
| "SVGA" in 2020. The "super" 800x600 resolution of SVGA isn't
| really that super now.
| jleahy wrote:
| Or NTSC and PAL.
|
| I wonder if my 3440x1440 screen is NTSC or PAL?
| HideousKojima wrote: | Neither, though theoretically it could have legacy support | for both formats ___________________________________________________________________ (page generated 2020-11-20 23:00 UTC)