[HN Gopher] What do RISC and CISC mean in 2020?
       ___________________________________________________________________
        
       What do RISC and CISC mean in 2020?
        
       Author : socialdemocrat
       Score  : 85 points
       Date   : 2020-11-20 12:13 UTC (10 hours ago)
        
 (HTM) web link (erik-engheim.medium.com)
 (TXT) w3m dump (erik-engheim.medium.com)
        
       | seanalltogether wrote:
        | So where do the power efficiency gains come into play? Is this
        | a feature of ARM specifically, or of RISC in general?
        
       | socialdemocrat wrote:
       | Initially the difference between RISC and CISC processors was
       | clear. Today many say there is no real difference. This story
       | digs into the details to explain significant differences which
       | still exist.
        
         | StillBored wrote:
          | While mostly missing the mark, it just rehashes the old
          | discussions, i.e. the microarchitecture concepts of "RISC"
          | designs and "CISC" designs are so similar across product
          | lines as to be mostly meaningless. As mentioned, you have
          | RISC designs using micro-ops and microcode, and you have
          | CISC designs doing 1:1 instruction-to-micro-op mapping. Both
          | do various forms of cracking and fusing depending on the
          | instruction. All have the same problems with branch
          | prediction and speculative execution, and solve them with
          | OoO in similar ways.
         | 
          | Maybe the largest remaining difference is the strength of
          | the memory model, since the size of the architectural
          | register file, the complexity of addressing modes, and the
          | other traditional RISC/CISC arguments are mostly pointless
          | in the face of deep OoO superscalar machines doing register
          | renaming/etc. from mop caches, etc.
         | 
          | Even then, like variable-length instructions (which do exist
          | on many RISCs in limited forms), this differentiation is
          | more about when the ISA was designed than about anything
          | fundamental in the philosophy.
        
           | [deleted]
        
           | socialdemocrat wrote:
            | The key difference between RISC and CISC is the ISA, and
            | that is still true. x86 has variable-length instructions
            | that can stretch to 15 bytes once prefixes pile up, while
            | RISC instructions are typically fixed length. Yes, there
            | are exceptions, but that is how most instructions are
            | designed.
           | 
            | A RISC ISA is still designed around load/store, while e.g.
            | x86 has a variety of addressing modes.
           | 
            | All these differences in the ISAs have some impact on what
            | makes sense to do in the micro-architecture and how well
            | you can do it. Sure, you can pipeline ANY CPU, but it is
            | easier when you deal with mostly fixed-width instructions
            | of quite similar complexity. On x86 there is much more
            | variety in the complexity of each instruction, and you are
            | more prone to get gaps in the pipeline. As far as I
            | understand, anyway.
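            | 
            | A rough way to picture the decode difference (just a
            | sketch in C, nothing like a real decoder; insn_len() is a
            | made-up placeholder for a length decoder):
            | 
            |   #include <stddef.h>
            |   #include <stdint.h>
            | 
            |   /* Fixed 4-byte instructions: the start of instruction n
            |      is known immediately, so a wide decoder can slice out
            |      8 of them in parallel. */
            |   size_t fixed_offset(size_t n) { return n * 4; }
            | 
            |   /* Variable-length instructions: you only learn where
            |      instruction n+1 starts after finding the length of
            |      instruction n, so locating 8 start points is an
            |      inherently serial walk. */
            |   extern size_t insn_len(const uint8_t *p); /* placeholder */
            | 
            |   size_t variable_offset(const uint8_t *code, size_t n)
            |   {
            |       size_t off = 0;
            |       for (size_t i = 0; i < n; i++)
            |           off += insn_len(code + off); /* serial dependency */
            |       return off;
            |   }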
        
       | CodeArtisan wrote:
       | I would say that the CISC philosophy is about lowering the
       | complexity at the software level to raise it at the circuitry
       | level while the RISC philosophy is the inverse. Philosophies that
       | you can apply not only to CPUs but also to virtual machines.
        | Today ARM SoCs are adding more and more ASICs (AI, ray
        | tracing, photo post-processing, GPU, ...), so instead of
        | dealing with one complex ISA you now have to deal with
        | multiple simpler ISAs.
        
         | spear wrote:
         | I think that's a misleading way of looking at things. There is
         | no "CISC philosophy". RISC designs came out as a new way of
         | doing things and existing designs were called CISC for
         | contrast. It's not like there were two schools of thought that
         | were developed at the same time and the CISC designs
         | intentionally rejected RISC philosophies.
        
       | diehunde wrote:
       | If anyone wants to learn about the history behind RISC/CISC and
        | much more info on the topic, I recommend listening to David
        | Patterson's episode on Lex Fridman's podcast. David Patterson
        | is one of the original contributors to RISC and the author of
        | one of the best books on computer architecture. Two hours of
        | pure knowledge on the subject.
        
         | klelatti wrote:
         | Jim Keller on another episode of Lex Fridman's podcast is also
         | excellent when explaining things like out of order execution.
        
       | jcranmer wrote:
       | The better explanation of RISC v CISC is this old discussion from
       | comp.arch: https://yarchive.net/comp/risc_definition.html
       | 
        | In short, the term RISC comes from a new set of architecture
        | designs in the late 80s/early 90s. CISC is not so much an
        | architecture design as a label for the existing designs that
        | lack those features. The major features that RISC adds are:
       | 
       | * Avoid complex operations, which may include things such as
       | multiply and divide. (Although note that "modern" RISCs have
       | these nowadays).
       | 
       | * More registers (32 registers instead of 8 or 16). (ARM has 16.
       | So does x86-64.)
       | 
       | * Fixed-length instructions instead of variable-length.
       | 
        | * Avoid memory-indirect references and a profusion of memory
        | accessing forms (note that x86 also largely avoids these).
       | 
       | Functionally speaking, x86 itself is pretty close to RISC,
       | especially in terms of how the operations themselves need to be
       | implemented. The implementation benefits of RISC (especially in
       | allowing pipelining) are largely applicable to x86 as well, since
       | x86 really skips the problematic instructions that other CISCs
       | have.
       | 
       | > One of the key ideas of RISC was to push a lot of heavy lifting
       | over to the compiler. That is still the case. Micro-ops cannot be
       | re-arranged by the compiler for optimal execution.
       | 
       | Modern compilers do use instruction scheduling to optimize
       | execution, and instruction scheduling for microcoded execution is
       | well-understood.
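        | 
        | To be concrete about what scheduling means here (a toy C
        | example, not tied to any particular compiler):
        | 
        |   #include <stdint.h>
        | 
        |   /* The two multiplies are independent, so a scheduler is free
        |      to place them back to back and let their latencies overlap,
        |      instead of leaving the pipeline idle between them. */
        |   int64_t dot2(int64_t a, int64_t b, int64_t c, int64_t d)
        |   {
        |       int64_t x = a * b; /* long-ish latency */
        |       int64_t y = c * d; /* independent, can issue right away */
        |       return x + y;      /* only this needs both results */
        |   }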
       | 
       | > Time is more critical when running micro-ops than when
       | compiling. It is an obvious advantage in making it possible for
       | advance compiler to rearrange code rather than relying on
       | precious silicon to do it.
       | 
       | All modern high-performance chips are out-of-order execution,
       | because some instructions (especially memory!) take longer than
       | others to execute. The "precious silicon" is silicon that's
       | already been used for that reason, whether RISC or CISC.
        
         | Tuna-Fish wrote:
         | > (ARM has 16. So does x86-64.)
         | 
         | 64-bit ARM has 32 GPRs.
         | 
         | > Fixed-length instructions instead of variable-length.
         | 
          | This is the big legacy of RISC that helps "RISC" CPUs most
          | against x86. The M1 has 8-wide decode, with very few stages
          | and low power consumption. Nothing like it can be done for
          | an x86 CPU. The way modern x86 handles this is typically
          | with a uop cache, but that costs a lot of power and area,
          | and it only provides full decode width for a relatively
          | small pool of insns -- 4K entries on modern Zen, for
          | example.
         | 
         | > > One of the key ideas of RISC was to push a lot of heavy
         | lifting over to the compiler. That is still the case. Micro-ops
         | cannot be re-arranged by the compiler for optimal execution.
         | 
         | > Modern compilers do use instruction scheduling to optimize
         | execution, and instruction scheduling for microcoded execution
         | is well-understood.
         | 
         | Compiler-level instruction scheduling is mostly irrelevant for
         | modern OoO architectures. Most of the time the CPU is operating
         | from mostly full queues, so it will be doing the scheduling
         | from the past 10-~16 instructions. Compilers are mostly still
         | doing it out of inertia.
         | 
         | > "precious silicon"
         | 
         | The big difference from that 25 years ago to today is indeed
         | that silicon is now the opposite of precious. We have so many
         | transistors available that we are looking for ways to
         | effectively use more of them, rather than saving precious
         | silicon.
        
         | fulafel wrote:
          | Aside - modern general-purpose CPUs tend to be OoO, but
          | processors used for demanding computation in things like
          | modems (signal processing/SDR) and GPUs (graphics) tend not
          | to be.
        
         | idividebyzero wrote:
          | I think it depends somewhat on the definition you use. If
          | you require RISC to be a load/store architecture, x86 is not
          | even close to being one. Also, aarch64 is a variable-length
          | instruction set and includes complex instructions (such as
          | those to perform AES operations). Compiler optimizations are
          | meant to be taken advantage of by all architectures,
          | regardless of RISC/CISC.
        
           | Tuna-Fish wrote:
           | 64-bit Arm is fixed width. Modern 32-bit Arm was _not_ fixed
           | width, as Thumb-2 was widely used.
        
           | jcranmer wrote:
           | Personally, I think the RISC/CISC "question" isn't really
           | meaningful anymore, and it's not the right lens with which to
           | compare modern architectures. Partially, this is because the
           | modern prototypes of RISC and CISC--ARM/AArch64 and x86-64,
           | respectively--show a lot more convergent evolution and
           | blurriness than the architectures at the time the terms were
           | first coined.
           | 
           | Instead, the real question is microarchitectural. First, what
           | are the actual capabilities of your ALUs, how are they
           | pipelined, and how many of them are there? Next, how good are
           | you at moving stuff into and out of them--the memory
           | subsystem, branch prediction, reorder buffers, register
           | renaming, etc. The ISA only matters insofar as it controls
           | how well you can dispatch into your microarchitecture.
           | 
            | It's important to note how many of the RISC ideas _haven't_
           | caught on. The actual "small" part of the instruction set,
           | for example, is discarded by modern architectures (bring on
           | the MUL and DIV instructions!). Designing your ISA to let you
           | avoid pipeline complexity (e.g., branch slots) also fell out
           | of favor. The general notion of "let's push hardware
           | complexity to the compiler" tends to fail because it turns
           | out that hardware complexity lets you take advantage of
           | dynamic opportunities that the compiler fundamentally cannot
           | do statically.
           | 
           | The RISC/CISC framing of the debate is unhelpful in that it
           | draws people's attention to rather more superficial aspects
           | of processor design instead of the aspects that matter more
           | for performance.
        
             | brandmeyer wrote:
             | > It's important to note how many of the RISC ideas haven't
             | caught on.
             | 
             | 2-in, 1-out didn't, either. Nowadays all floating-point
             | units support 3-in, 1-out via fused multiply-add. SVE
             | provides a mask argument to almost everything.
        
           | klelatti wrote:
           | Unless you're using a definition I'm not familiar with
           | aarch64 isn't a variable length instruction set - here's
           | Richard Grisenthwaite Arm's lead architect introducing ARMv8
           | - the slide here confirms "New Fixed Length Instruction Set":
           | 
           | https://youtu.be/GBeEEfmJ3NI?t=570
        
             | idividebyzero wrote:
              | I understand that they refer to it as a fixed-length
              | instruction set, and that's correct, but note that not
              | every ARMv8 operation ends up 4 bytes long. Some
              | instructions that appear together are fused into a
              | single one, and SVE, for instance, introduces a prefix
              | instruction; so in practice an operation can sometimes
              | be 8 bytes long.
        
               | brandmeyer wrote:
               | Macro-op fusion of the MOVW/MOVT family doesn't count. At
               | the time of that presentation, SVE didn't exist. Even
               | now, the masked move instruction in SVE can also stand on
               | its own as a single instruction and sometimes it does get
               | emitted as its own uop.
        
               | klelatti wrote:
                | Thanks, yes of course. I guess it's probably fair to
                | say that philosophically it's fixed-length, in the way
                | that the original Arm was RISC, i.e. with some very
                | non-RISC-y instructions. Very different to x86 though.
        
         | dragontamer wrote:
          | The main issue with linking a discussion from 25 years ago
          | is that it is almost irrelevant in today's environment.
         | 
         | The Apple M1 has over 600 reorder buffer registers (while
         | Skylake and Zen are around 200ish). The 16 or 32 architectural
         | registers of the ISA are pretty much irrelevant compared to the
         | capabilities of the out-of-order engine on modern chips.
         | 
         | A 200, 300, or 600+ register ISA is unfathomable to those from
         | 1995. Not only that, but the way we got our software to scale
         | to such "shadow register" sets is due to an improvement in
         | compiler technology over the last 20 years.
         | 
         | Modern compilers write code differently (aka: "dependency
         | cutting"). Modern chips take advantage of those dependency
         | cuts, and use them to "malloc" those reorder buffer registers,
         | and as a basis for out-of-order execution.
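          | 
          | A concrete illustration of dependency cutting (just a sketch
          | in C; the second version is what a tuned library or compiler
          | output tends to look like):
          | 
          |   #include <stddef.h>
          | 
          |   /* One long dependency chain: every add must wait for the
          |      previous one, so the OoO engine has little to do. */
          |   double sum_chained(const double *a, size_t n)
          |   {
          |       double s = 0.0;
          |       for (size_t i = 0; i < n; i++)
          |           s += a[i];
          |       return s;
          |   }
          | 
          |   /* Chain cut into four independent accumulators: the
          |      renamer maps them onto separate physical registers and
          |      keeps several adds in flight at once.  (Assumes n is a
          |      multiple of 4, to keep the sketch short.) */
          |   double sum_cut(const double *a, size_t n)
          |   {
          |       double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
          |       for (size_t i = 0; i < n; i += 4) {
          |           s0 += a[i];
          |           s1 += a[i + 1];
          |           s2 += a[i + 2];
          |           s3 += a[i + 3];
          |       }
          |       return (s0 + s1) + (s2 + s3);
          |   }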
         | 
         | While the tech existed back in the 90s for this... it wasn't
         | widespread yet. Even the best posts from back then would be
         | unable to predict what technologies came out 25 years into the
         | future.
        
           | temac wrote:
            | If I remember correctly the M1 has around 600 reorder
            | buffer entries, and I just checked: Anandtech estimates
            | the int register file at around 354 entries. That's still
            | big, but not 600.
        
             | dragontamer wrote:
              | Ah hah, but there are also 300 FP registers!!
              | 
              | Okay, you got me. I somehow confused the register file
              | with the reorder buffer in the above post. But I think I
              | may still manage to be "technically correct" thanks to
              | the FP register file (even though it's not really fair
              | to count those).
        
         | qwerty456127 wrote:
         | > Avoid complex operations, which may include things such as
         | multiply and divide.
         | 
         | How can this possibly make sense? Almost every application
         | multiplies and divides all the time anyway. It usually is a
         | good idea to implement frequently used operations in hardware
         | because hardware implementation is always more efficient than
         | software implementation, isn't it?
        
           | Asooka wrote:
           | That was for back when CPUs didn't really have native
           | division or multiplication. So a mul or div would literally
           | be like calling a function to do it using other arithmetic
           | instructions, except the function is stored in the CPU. Which
           | goes against the RISC philosophy and makes the CPU more
           | complex for not much gain.
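            | 
            | Roughly, a "software multiply" is just a shift-and-add loop
            | (a simplified C sketch):
            | 
            |   #include <stdint.h>
            | 
            |   /* Multiply built from shifts, adds and branches, which is
            |      also more or less what a microcoded multiply does
            |      internally, one step per cycle. */
            |   uint32_t soft_mul(uint32_t a, uint32_t b)
            |   {
            |       uint32_t result = 0;
            |       while (b != 0) {
            |           if (b & 1)       /* low bit set: add shifted a */
            |               result += a;
            |           a <<= 1;
            |           b >>= 1;
            |       }
            |       return result;
            |   }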
        
         | fanf2 wrote:
         | I am also a fan of John Mashey's analysis that you linked to!
         | The key thing is that he counts things like instruction
         | formats, addressing modes, memory ops per instruction,
         | registers, and so on. There is a clear separation in the
         | numbers between the RISCs and the CISCs.
         | 
         | What stuck out to me when I first read it 25 years ago is that
         | the ARM is the least RISCy RISC, and x86 is the least CISCy
         | CISC. At that time the Pentium was killing the 68060 and many
         | of the RISCs, and it seemed clear that x86 had a big advantage
         | in the relatively small number of memory ops per instruction.
        
         | qwerty456127 wrote:
         | > Functionally speaking, x86 itself is pretty close to RISC,
         | 
         | AFAIK some x86 implementations (e.g. AMD K6-3) had RISC cores
         | and translation units.
        
       | Const-me wrote:
       | > But is that really true?
       | 
       | Yes.
       | 
       | > Microprocessors (CPUs) do very simple things
       | 
       | Look at the instructions like vfmadd132ps on AMD64, or the ARM
       | equivalent VMLA.F32. None of them are simple.
       | 
       | > It is part of Intel's intellectual property
       | 
       | Patents have expiration dates. You probably can't emulate Intel's
       | AVX512 because it's relatively new, but whatever patents were
       | covering SSE1 or SSE2 have expired years ago.
       | 
       | > If you go with x86 you have to do all that on external chips.
       | 
       | Encryption is right there, see AES-NI or SHA.
       | 
       | > Another core idea of RISC was pipelining
       | 
        | I don't know whose idea it was, but pipelining long predates
        | RISC; even Intel's 80386 from 1985 overlapped fetch, decode,
        | and execution.
        
       | tenebrisalietum wrote:
       | This is a great article. Anyone who parrots "Intel uses RISC
       | internally" when talking about CISC/RISC should be directed here
       | for edification and correction.
        
         | Analemma_ wrote:
         | I'm not sure this article refutes those people? Every time I've
         | heard someone say "Intel uses RISC internally", what they mean
         | is that the decoding logic used to turn x86 instructions into
          | uops (and thus get the benefits of RISC) takes up a fixed
          | number of transistors on the die that RISC doesn't need, and
          | this penalty becomes proportionally larger at lower power
          | levels, which is why x86 is still a good performer on
          | servers/HEDTs but got crushed in mobile. That's pretty much
          | what this article says as well.
        
           | StillBored wrote:
            | No, it doesn't explain Intel getting crushed on mobile;
            | that is more a question of focus. You have to remember
            | that those "big" decoders can be scaled down to a few
            | thousand transistors, as seen on something like the 486,
            | if you're willing to pay the performance penalty.
            | 
            | The entire 486 was something like 1M transistors
            | (including cache/mmu/fpu/etc.), which makes it smaller
            | than pretty much every modern design that can run a
            | full-blown OS like Linux.
            | 
            | When you look at something like a modern x86 with dual
            | 512-bit vector units, what you see are things consuming
            | power that frequently don't exist on the smaller designs
            | (like that vector unit; a modern Arm might have a
            | dual-issue 128-bit NEON unit).
           | 
            | Here is a cute graphic: https://en.wikipedia.org/wiki/File:Moore%27s_Law_Transistor_...
        
           | socialdemocrat wrote:
            | RISC is about the ISA, not the micro-ops. One of the
            | points of RISC is to give the compiler a simpler
            | instruction set to deal with. Micro-ops are invisible to
            | the compiler, so you cannot spend a bunch of extra compile
            | time rearranging them in an optimal fashion.
            | 
            | Micro-ops are an implementation detail you can change at
            | any time. The ISA you are stuck with for a long time.
            | 
            | Thus saying x86 is RISC-like doesn't make sense; it would
            | imply that the x86 ISA is RISC-like, which it is not.
            | 
            | The benefits of uops are separate from the benefits of
            | RISC. Even a RISC processor can turn its instructions into
            | uops. You cannot break CISC instructions into as easy and
            | steady a stream of uops as you can RISC instructions,
            | which have a much more even level of complexity.
        
       | dragontamer wrote:
        | With over 600 reorder buffer registers in the Apple M1
        | executing deeply out-of-order code, this blog post rehashes
        | decades-old arguments without actually discussing what makes
        | the M1 so good.
        | 
        | The Apple M1 is the widest architecture, with the thickest
        | dispatch, that I've seen in a while. Second only to the POWER9
        | SMT8 (which had 12-uop dispatch), the Apple M1 dispatches 8
        | uops per clock cycle (while x86 designs only aim at 4 uops per
        | clock tick).
       | 
        | That's where things start. From there, those 8 dispatched
        | instructions enter a very wide set of superscalar pipelines
        | with strong branch prediction and out-of-order execution.
       | 
       | Rehashing old arguments about "not enough registers" just doesn't
       | match reality. x86-Skylake and x86-Zen have 200+ ROB registers
       | (reorder-buffers), which the compiler has plenty of access to.
       | The 32 ARM registers on M1 are similarly "faked", just a
       | glorified interface to the 600+ reorder buffers on the Apple M1.
       | 
        | The Apple M1 does NOT expose those 600+ registers
        | architecturally, because it needs to remain compatible with
        | existing ARM code. But ARM code compiled _CORRECTLY_ can still
        | make use of those registers through a mechanism called
        | dependency cutting. Same thing for x86 code. All modern
        | compiler output does this.
       | 
       | ------
       | 
       | "Hyperthreading" is not a CISC concept. POWER9 SMT8 can push 8
       | threads onto one core, there are ARM chips with 4-threads on one
       | core. Heck, GPUs (which are probably the simplest cores on the
       | market) have 10 to 20+ wavefronts per execution unit (!!!).
       | 
       | Pipelining is NOT a RISC concept, not anymore. All chips today
       | are pipelined: you can execute SIMD multiply-add instructions on
       | x86 on both Zen3 and Intel Skylake multiple times per clock tick,
       | despite having ~5 cycles (or was it 3 cycles? I forget...) of
       | latency. All chips have pipelining.
       | 
       | -------
       | 
       | Skylake / Zen have larger caches than M1 actually. I wouldn't say
       | M1 has the cache advantage, outside of L1. Loads/stores in
       | Skylake / Zen to L2 cache can be issued once-per-clock tick,
       | though at a higher latency than L1 cache. With 256kB or 512kB of
       | L2 cache, Skylake/Zen actually have ample cache.
       | 
       | The cache discussion needs to be around the latency
       | characteristics of L1. By making L1 bigger, the M1 L1 cache is
       | almost certainly higher latency than Skylake/Zen (especially in
       | absolute terms, because Skylake/Zen clock at 4GHz+). But there's
       | probably power-consumption benefits to running the L1 cache wider
       | at 2.8GHz instead.
       | 
       | That's the thing about cache: the bigger it is, the harder it is
       | to keep fast. That's why L1 / L2 caches exist on x86: L2 can be
       | huge (but higher latency), while L1 can be small but far lower
       | latency. A compromise in sizes (128kB on M1) is just that: a
       | compromise. It has nothing to do with CISC or RISC.
        
         | dimtion wrote:
         | Do you happen to know where to find any resources on how Apple
         | managed to make the M1 so good compared to the competition?
         | 
         | And why this has not happened before with other manufacturers?
        
           | nialo wrote:
           | The vague impression I get is that maybe the answer is
           | "Because Apple's software people and chip design people are
           | in the same company, they did a better job of coordinating to
           | make good tradeoffs in the chip and software design."
           | 
           | (I'm getting this from reading between lines on Twitter, so
           | it's not exactly a high confidence guess)
        
           | dragontamer wrote:
           | > Do you happen to know where to find any resources on how
           | Apple managed to make the M1 so good compared to the
           | competition?
           | 
           | If you know computer microarchitecture, the specs have been
           | discussed all across the internet by now. Reorder buffers,
           | execution widths, everything.
           | 
           | If you don't know how to read those specs... well... that's a
           | bit harder. I don't really know how to help ya there. Maybe
           | read Agner Fog's microarchitecture manual until you
           | understand the subject, and then read the M1
           | microarchitecture discussions?
           | 
           | I do realize this is a non-answer. But... I'm not sure if
           | there's any way to easily understand computer
           | microarchitecture unless you put effort to learn it.
           | 
           | https://www.agner.org/optimize/
           | 
            | Read Manual #3: Microarchitecture. Understand what all
            | these parts of a modern CPU do. Then, when you look at
            | something like the M1's design, it becomes obvious what
            | all those parts are doing.
           | 
           | > And why this has not happened before with other
           | manufacturers?
           | 
           | 1. Apple is on TSMC 5nm, and no one else can afford that yet.
           | So they have the most advanced process in the world, and
           | Apple pays top-dollar to TSMC to ensure they're the first on
           | the new node.
           | 
            | 2. Apple has made some interesting decisions that run very
            | much counter to Intel's and AMD's approaches. Intel is
            | pushing wider vector units, as you might know (AVX512),
            | and despite the poo-pooing of AVX512, it gets the job
            | done. AMD's approach is "more cores": they have a 4-wide
            | decoder and are splitting their chips across multiple dies
            | to give better and better multithreaded performance.
           | 
            | Apple's decision to make an 8-wide decode engine is a
            | compromise which will make scaling up to more cores more
            | difficult. Apple's core is simply the biggest core on the
            | market.
           | 
           | Whereas AMD decided that 4-wide decode was enough (and then
           | split into new cores), Apple ran the math and came out with
           | the opposite conclusion, pushing for 8-wide decode instead.
           | As such, the M1 will achieve the best single-threaded
           | numbers.
           | 
           | ---------
           | 
           | Note that Apple has also largely given up on SIMD-execute.
           | ARM 128-bit vectors are supported, but AVX2 from x86 land and
           | AVX512 support 256-bit and 512-bit vectors respectively.
           | 
           | As such, the M1's 128-bit wide vectors are its weak point,
           | and it shows. Apple has decided that integer-performance is
           | more important. It seems like Apple is using either its iGPU
           | or Neural Engine for regular math / compute applications
           | however. (The Neural Engine is a VLIW architecture, and iGPUs
           | are of course just a wider SIMD unit in general). So Apple's
           | strategy seems to be to offload the SIMD-compute to other,
           | more specialized computers (still on the same SoC).
        
             | temac wrote:
             | > Apple's decision to make a 8-wide decoder engine is a
             | decision, a compromise, which will make scaling up to more-
             | cores more difficult. Apple's core is simply the biggest
             | core on the market.
             | 
             | > Whereas AMD decided that 4-wide decode was enough (and
             | then split into new cores), Apple ran the math and came out
             | with the opposite conclusion, pushing for 8-wide decode
             | instead. As such, the M1 will achieve the best single-
             | threaded numbers.
             | 
              | It's not that simple. x86 is way more difficult to
              | decode than ARM. Also, the insanely large OoO machinery
              | probably helps a lot to keep the wide M1 beast occupied.
              | Does the large L1 help? I don't know; maybe a large
              | enough L2 would be OK. And the perf cores do not occupy
              | the largest area of the die. Can you do a very large L1
              | without too bad a latency impact? I guess a small node
              | helps, plus maybe you keep a reasonable associativity
              | and a traditional L1 lookup thanks to the larger pages.
              | So I'm curious what happens with 4kB pages, a mode it
              | probably has for emulation.
             | 
              | Going specialized instead of putting large vectors in
              | the CPU also makes complete sense. You want to be slow
              | and wide to optimize for efficiency. Of course that's
              | less possible for mainly scalar, branch-rich workloads,
              | so you can't be as wide on a CPU. You still need a
              | middle ground for the low-latency compute needs in the
              | middle of your scalar code, and 128 bits certainly is
              | one, especially if you can scale to lots of execution
              | units (at that point I admit you could also support a
              | wider size, but it shows that the impact of staying at
              | 128 won't necessarily be crazy if structured like
              | that). One could argue for 256, but 512 starts to be
              | unreasonable and probably has a far worse impact on core
              | size than wide superscalar -- or at least, even if the
              | impact is similar (I'm not sure), I suspect that wide
              | superscalar is more useful most of the time. It's
              | understandable that a more CPU-oriented vendor will be
              | far more interested in large vectors. Apple is not that
              | -- although of course what they do for their high end
              | will be extremely interesting to watch.
             | 
              | Of course you have to solve a wide variety of problems,
              | but the recent AMD approach has shown that the good old
              | method of optimizing for real workloads just continues
              | to be the way to go. Who cares if you have somewhat more
              | latency in infrequent cases, or if int <-> fp is slower,
              | if in the end that lets you optimize the structures
              | where you reap the most benefits. Now each design has
              | its own history obviously, and the mobile roots of the
              | M1 also give it a strong influence, plus Apple's
              | vertical integration helps immensely.
             | 
              | I want to add: even if the M1 is impressive, Apple's end
              | result is not insanely far ahead of what AMD does on
              | 7nm. But of course they will continue to improve.
        
               | klelatti wrote:
               | Interested in your comment on AMD 'optimising for real
               | workloads'. Presumably, Apple will have been examining
               | the workloads they see on their OS (and they are writing
               | more of that software than AMD) so not sure I see the
               | distinction.
        
               | dragontamer wrote:
               | AMD's design is clearly designed for cloud-servers with
               | 4-cores / 8-threads per VM.
               | 
                | It's so obvious: 4 cores per CCX sharing an L3 cache
                | (which is inefficient at communicating with other
                | CCXes). Like, AMD EPYC is so, so so, SOOO very good at
                | it. It ain't even funny.
                | 
                | It's like AMD started with the 4-core/8-thread VM
                | problem and then designed a chip around that workload.
                | Oh, but it can't talk to the 5th core very
                | efficiently?
                | 
                | No problem: VMs just don't really talk to other
                | customers' cores that often anyway. So that's not
                | really a disadvantage at all.
        
               | temac wrote:
                | I was not really thinking about Apple when writing
                | that part, more about some weak details of Zen N vs.
                | Intel that do not matter in the end (at least for most
                | workloads), be it inter-core or intra-core.
                | 
                | I think the logical design space is so vast now that
                | there is largely enough freedom to compete even when
                | addressing a vast corpus of existing software, even if
                | said software is tuned for previous or competitor
                | chips. That was already true at the time of the PPro;
                | with thousands of times more transistors it is even
                | more so. And that makes it even sadder that Intel has
                | been stuck on basically Skylake on their 14nm for so
                | long.
        
               | dragontamer wrote:
                | If I were to guess what this M1 chip was designed for:
                | it was for JIT-compiling and then executing JIT code
                | (JavaScript and/or Rosetta).
        
               | klelatti wrote:
               | Thanks. I commented as my mental model was that Apple had
               | a significantly easier job with a fairly narrow set of
               | significant applications to worry about - many of which
               | they write - compared to a much wider base for say AMD's
               | server cpus.
               | 
               | But I guess that this all pales into insignificance
               | compared to the gains of going from Intel 14nm to TSMC
               | 5nm.
        
             | Zigurd wrote:
             | This "and Apple pays top-dollar to TSMC to ensure they're
             | the first on the new node" is Tim Cook's crowning
             | achievement in the way Apple combines supply chain
             | dominance with technology strategy.
             | 
             | They do not win every bet they make (e.g. growing their own
             | sapphire) but when they win it is stunning.
        
           | fulafel wrote:
           | It did happen before, with
           | 
            | a) Apple - look at the benchmarks of Apple chips vs other
            | ARM implementations from past years. The M1 is essentially
            | the same SoC as the current iPad one, with more cores and
            | memory.
            | 
            | b) With other manufacturers: there have been "wow" CPUs
            | from time to time. Early MIPS chips, the Alpha's
            | victorious period of 21064/21164/21264, Pentium Pro, AMD
            | K7, StrongARM (Apple connection here as well), etc. Then
            | Intel managed to torpedo the fragmented high-performance
            | RISC competition and convinced their patrons to jump ship
            | to the ill-fated Itanium, which led to a long lull in
            | serious competition.
        
         | ip26 wrote:
         | The L1 cache size _is_ linked to the architecture though. The
         | variable length instructions of x86 mean you can fit more of
         | them in an L1i of a given size. So, in short, ARM pays for
         | easier decode with a larger L1i, while x86 pays more for decode
         | in exchange for a smaller L1i.
         | 
         | As a spectator it's hard to know which is the better tradeoff
         | in the long run. As area gets cheaper, is a larger L1i so bad?
         | Yet on the other hand, cache is ever more important as CPU
         | speed outstrips memory.
         | 
          | In a form of convergent evolution, the uop cache bridges the
          | gap -- x86 spends some of the area saved in the L1i there.
        
           | cesarb wrote:
           | There's another consideration: for a VIPT cache (which is
           | usually the case for the L1 cache), the page size limits the
           | cache size, since it can only be indexed by the bits which
           | are not translated. For legacy reasons, the base page size on
           | x86 is always 4096 bytes, so an 8-way VIPT cache is limited
           | to 32768 bytes (and adding more ways is costly). On 64-bit
           | ARM, the page size can be either 4K, 16K, or 64K, with the
           | later being required to reach the maximum amount of physical
           | memory, and since it has been that way since the beginning,
           | AFAIK it's common for 64-bit ARM software to be ready for any
           | of these three page sizes.
           | 
           | I vaguely recall reading somewhere that Apple uses the 16K
           | page size, which if they use an 8-way VIPT L1 cache would
           | limit their L1 cache size to 128K.
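            | 
            | The arithmetic, for what it's worth (assuming the 8-way
            | figures above; sizes in bytes):
            | 
            |   #include <stdio.h>
            | 
            |   /* Max VIPT L1 size = page size x associativity, because
            |      only the untranslated page-offset bits can index it. */
            |   int main(void)
            |   {
            |       printf("%d\n", 4096 * 8);  /* 4K pages,  8-way: 32K  */
            |       printf("%d\n", 16384 * 8); /* 16K pages, 8-way: 128K */
            |       return 0;
            |   }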
        
           | dragontamer wrote:
           | AMD Zen 3 has 512kB L2 cache per-core, with more than enough
           | bandwidth to support multiple reads per clock tick.
           | Instructions can fit inside that 512kB L2 cache just fine.
           | 
           | AMD Zen 3 has 32MB L3 cache across 8-cores.
           | 
            | By all accounts, Zen3 has "more cache per core" than
            | Apple's M1. The question is whether AMD's (or Intel's)
            | L1/L2 split is worthwhile.
           | 
           | ---------
           | 
            | The difference in cache is that Apple has decided on an L1
            | cache that's smaller than AMD / Intel's L2 cache, but
            | larger than AMD / Intel's L1 cache. That's it.
            | 
            | It's a question of cache configuration: a "flatter"
            | 2-level cache on M1 vs a "bigger" 3-level cache on Skylake
            | / Zen.
           | 
           | -------
           | 
            | That's the thing: it's a very complicated question. Bigger
            | caches simply have more latency; there's no way around
            | that problem. That's why x86 processors have multi-tiered
            | caches.
           | 
           | Apple has gone against the grain, and made an absurdly large
           | L1 cache, and skipped the intermediate cache entirely. I'm
           | sure Apple engineers have done their research into it, but
           | there's nothing simple about this decision at all. I'm
           | interested to see how this performs in the future (whether
           | new bottlenecks will come forth).
        
           | klelatti wrote:
            | It's an interesting point. I guess ARM must have done
            | quite a lot of analysis in the run-up to the launch of
            | aarch64 in 2011 when, with roughly a blank sheet of paper
            | on the ISA, they could have decided to go for
            | variable-length instructions for this reason (especially
            | given their history with Thumb). On the other hand,
            | presumably the focus was on power given the immediate
            | market, and so the simpler decode would have been
            | beneficial for that reason.
        
         | egsmi wrote:
         | > With over 600 reorder buffer registers in the Apple M1
         | executing deeply out-of-order code
         | 
         | Can you provide a link to how this was determined? I did some
          | searches but couldn't find anything. I'd be very interested to
         | see how it was measured.
        
           | dragontamer wrote:
            | https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
           | 
            | The reorder buffers determine how "deep" you can go out of
            | order. Roughly speaking, 600+ means that an instruction
            | issued 600+ instructions ago can still be waiting for
            | retirement. You can be "600 instructions out of order", so
            | to speak.
           | 
           | ----------
           | 
           | Each time you hold a load/store out-of-order on a modern CPU,
           | you have to store that information somewhere. Then the
           | "retirement unit" waits for all instructions to be put back
           | into order correctly.
           | 
            | Something like Apple's M1, with 600+ reorder buffer
            | registers, will search for instruction-level parallelism
            | up to 600 instructions into the future before the
            | retirement unit tells the rest of the core to start
            | stalling.
           | 
           | For a realistic example, imagine a division instruction
           | (which may take 80-clock ticks to execute on AMD Zen). Should
           | the CPU just wait for the divide to finish before continuing
           | execution? Heck no! A modern core will out-of-order execute
           | future instructions while waiting for division to finish. As
           | long as reorder buffer registers are ready, the CPU can
           | continue to search for other work to do.
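            | 
            | In code terms, something like this toy C example:
            | 
            |   #include <stdint.h>
            | 
            |   /* The divide may take tens of cycles, but nothing after
            |      it needs q until the last line, so an OoO core keeps
            |      executing the independent work while the divide is in
            |      flight -- as long as it still has reorder-buffer
            |      entries to hand out. */
            |   int64_t example(int64_t a, int64_t b, int64_t c, int64_t d)
            |   {
            |       int64_t q = a / b; /* long-latency divide */
            |       int64_t s = c + d; /* independent work */
            |       int64_t t = c - d; /* more independent work */
            |       return q + s * t;  /* q finally needed here */
            |   }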
           | 
           | --------
           | 
           | There's nothing special about Apple's retirement unit, aside
           | from being ~600 big. Skylake and Zen are ~200 to ~300 big
           | IIRC. Apple just decided they wanted a wider core, and
           | therefore made one.
        
             | egsmi wrote:
              | I see how it was done. That measurement uses the 2013
              | technique published by Henry Wong. I think it's probably
              | a reasonable estimate of the instruction window length,
              | but to say that's the same as the buffer size is making
              | a number of architectural assumptions that I haven't
              | seen any evidence to justify. I suppose in the end it
              | doesn't really matter to users of the chip, though.
             | 
             | http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
        
       | klelatti wrote:
       | I so wanted to like this article - and due credit to the author
       | for trying to explain these points - but it often slides into
       | comments that are potentially misleading.
       | 
       | In particular the use of 'the ARM ISA' (singular) with an
       | allusion to Thumb at one point (aarch32) whilst mostly talking
       | about (aarch64) M1 isn't helpful (and there are other points
       | too).
       | 
       | And I think the RISC vs CISC categorisation was useful in 1990
       | but I think there are other more important aspects to focus on in
       | 2020.
        
       | kevin_thibedeau wrote:
       | x86 has been RISC since the Pentium Pro. There is no point
       | dithering over the fine details especially when x64 removes the
       | register pressure issues for compilers and considering that ARM
       | has a bloated ISA.
        
         | socialdemocrat wrote:
          | RISC isn't about the size of the ISA but about the type of
          | instructions. RISC instructions are fixed width and have low
          | complexity from a decoding and pipelining point of view.
          | 
          | The Pentium Pro was not RISC; that is just Intel marketing
          | speak. Micro-ops can be produced in a RISC CPU as well; that
          | is separate from having a RISC ISA. The RISC ISA is about
          | what the compiler sees and can do. The compiler cannot see
          | the Pentium Pro micro-ops. Those are hidden from the
          | compiler. The compiler cannot rearrange them and optimize
          | them the way it can with instructions in the ISA.
        
       | pizlonator wrote:
       | This is really great, but RISC CPUs can have microcode too.
       | Nothing stops them from doing that.
       | 
       | The big diff is load/store:
       | 
       | - Loads and stores are separate instructions in RISC and never
       | implied by other ops. In CISC, you have orthogonality: most
       | places that can take a register can also take a memory address.
       | 
       | - Because of load/store, you need fewer bits in the instruction
       | encoding for encoding operands.
       | 
       | - Because you save bits in operands, you can have more bits to
       | encode the register.
       | 
       | - Because you have more bits to encode the register, you can have
       | more architectural registers, so compilers have an easier time
       | doing register allocation and emit less spill code.
       | 
       | That might be an oversimplification since it totally skips the
       | history lesson. But if we take RISC and CISC as trade offs you
       | can make today, the trade off is as I say above and has little to
       | do with pipelining or microcode. The trade off is just: you gonna
       | have finite bits to encode shit, so if you move the loads and
       | stores into their own instructions, you can have more registers.
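        | 
        | To make the bit-budget point concrete, here's a toy fixed
        | 32-bit, 3-register encoding (not any real ISA, just the
        | arithmetic):
        | 
        |   #include <stdint.h>
        | 
        |   /* 32 bits = 17 bits of opcode/immediate space plus three
        |      5-bit register fields.  Five bits per field is exactly
        |      what buys you 32 architectural registers; spend those
        |      bits on memory addressing modes instead and the register
        |      file has to shrink. */
        |   static uint32_t encode(uint32_t op, uint32_t rd,
        |                          uint32_t rn, uint32_t rm)
        |   {
        |       return (op << 15) | ((rd & 31) << 10)
        |                         | ((rn & 31) << 5) | (rm & 31);
        |   }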
        
       | bitwize wrote:
       | RISC typically means "load-store architecture": load operands
       | from memory to regs, perform operations in regs only, store
       | results back to memory.
       | 
       | CISC refers to old-school, programmer-friendly, addressing-mode-
       | laden ISAs. Add D0 to the address pointed to by A0 plus an
       | immediate offset and store the result back to memory, that sort
       | of thing.
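        | 
        | In C terms, roughly: the statement below can be a single
        | read-modify-write memory instruction on a classic CISC, but
        | compiles to an explicit load, add and store on a load/store
        | machine.
        | 
        |   void add_to_slot(int *a0, int offset, int d0)
        |   {
        |       /* CISC: one add-to-memory instruction.
        |          RISC: load a0[offset], add d0, store it back. */
        |       a0[offset] += d0;
        |   }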
        
       | Aardwolf wrote:
       | I guess it means the same as the difference between "VGA" and
       | "SVGA" in 2020. The "super" 800x600 resolution of SVGA isn't
       | really that super now.
        
         | jleahy wrote:
         | Or NTSC and PAL.
         | 
         | I wonder if my 3440x1440 screen is NTSC or PAL?
        
           | HideousKojima wrote:
           | Neither, though theoretically it could have legacy support
           | for both formats
        
       ___________________________________________________________________
       (page generated 2020-11-20 23:00 UTC)