[HN Gopher] Memory access on the Apple M1 processor
       ___________________________________________________________________
        
       Memory access on the Apple M1 processor
        
       Author : luigi23
       Score  : 219 points
       Date   : 2021-01-06 17:06 UTC (5 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | lrossi wrote:
       | Shouldn't you choose the random numbers such that array[idx1] ^
       | array[idx1 + 1] are guaranteed to fall in the same cache line?
       | Assuming that it has that. Right now some accesses cross the end
       | of the cache line.
        
         | CyberRabbi wrote:
          | Technically you are correct, but it's only expected to cross
          | a cache line about 1/16 of the time (one over however many
          | ints fit in a cache line). There is an implicit assumption
          | that this is infrequent enough that it shouldn't increase
          | the average time too much, but that assumption should be
          | tested.
        
       | jayd16 wrote:
       | >our naive random-memory model
       | 
       | Doesn't everyone use the (I believe) still valid concepts of
       | latency and bandwidth?
        
         | sroussey wrote:
         | Depends on context.
         | 
         | For example, what is the bandwidth and latency when you ask for
         | the value at the same memory address in an infinite loop? And
         | how does that compare to the latency and bandwidth of a memory
         | module you buy on NewEgg?
        
           | fluffy87 wrote:
           | L1 BW.
           | 
           | When people use BW in their performance models, they don't
           | use only 1 bandwidth, but whatever combination of bandwidth
           | makes sense for the _memory access pattern_.
           | 
            | So if you are always accessing the same word, the first access
           | runs at DRAM BW, and subsequent ones at L1 BW, and any
           | meaningful performance model will take that into account.
        
         | whoisburbansky wrote:
         | The concepts are still broadly valid, the naivety being
          | referred to is the assumption that two non-adjacent memory
         | reads will be twice as slow as one memory read or two adjacent
         | reads.
        
         | wyldfire wrote:
         | How do latency and bandwidth relate to the cost model for the
         | code in the benchmark?
         | 
         | When creating the model discussed in the post, we're using it
         | to try to make a static prediction about how the code will
         | execute.
         | 
         | Note that the goal of the post is not to merely measure the
         | memory access performance, it's to understand the specific
         | microarchitecture and how it might deliver the benefits that we
         | see in benchmarks.
        
       | foota wrote:
       | Is this per core or shared between cores?
        
         | hundchenkatze wrote:
         | Per core I think, emphasis is mine.
         | 
         | > It looks like a _single_ core has about 28 levels of memory
         | parallelism, and possibly more.
        
           | foota wrote:
           | I was wondering if this might be a shared resource though,
           | since it doesn't seem they tested with multiple threads.
        
       | wrsh07 wrote:
       | Ok, summary:
       | 
       | This article lays out three scenarios: 1) accessing two random
       | elements
       | 
       | 2) accessing 3 random elements
       | 
       | 3) accessing two pairs of adjacent elements (same as (1) but also
       | the elements after each random element)
       | 
       | It then does some trivial math to use the loaded data.
       | 
       | A naive model might only consider memory accesses and might
       | assume accessing an adjacent element is free.
       | 
        | On the M1 core, this is not the case. While the naive model
       | might expect cases 1 & 3 to cost the same and case 2 to cost 50%
       | more, instead cases 2 & 3 are nearly the same (3 slightly faster)
       | and case 2 is about 50% more expensive than 1.
        
         | jayd16 wrote:
         | I don't really understand the comparison because it seems like
          | scenario 3 (2+) is doing more XORs and twice the accesses to
          | the array over the same number of iterations.
         | 
         | We have to assume these are byte arrays, yes? Or at least some
         | size that's smaller than the cache line. You would still pay
         | for the extra unaligned fetches. I don't think this is a valid
         | scenario at all, M1 or not.
         | 
          | Anyone want to run these tests on an Intel machine and let us
          | know if the author's "naive model" holds there?
        
           | wrsh07 wrote:
           | The point of the naive model is that you assume memory
           | accesses dominate
           | 
           | That is, the math part is so trivial compared to the memory
           | access that you could do a bunch of math and you would still
           | only notice a change in the number of memory accesses.
           | 
            | Also, it looks like a response to yours links their test,
            | and the naive model predicts it correctly.
        
             | jayd16 wrote:
              | I think 5% is a non-trivial difference but alright, it's a
              | much bigger difference on the M1.
              | 
              | I guess I still don't understand what's going on here.
             | 
             | Scenario 1 has two spatially close reads followed by two
             | dependent random access reads.
             | 
             | Scenario 3 (2+) has two spatially close reads, and two
             | pairs of dependent random access reads of two spatially
             | close locations.
             | 
             | Why does it follow that this is caused by a change in
             | memory access concurrency? The two required round trips
             | should dominate both on the M1 and an Intel but for some
             | reason the M1 performs worse than that. Why?
             | 
             | I can't help but feel the first snippet triggers some SIMD
             | path while the 3rd snippet fails to.
        
           | africanboy wrote:
           | I did it on an old i7 laptop
           | 
           | https://news.ycombinator.com/item?id=25661055
        
         | temac wrote:
         | > A naive model might only consider memory accesses and might
         | assume accessing an adjacent element is free.
         | 
          | Really depends on the level of naivety and the definition of
          | "free". It would be less insane to write that accessing an
          | adjacent element has negligible overhead if the data must be
          | loaded from RAM and there are some OOO bubbles in which to
          | execute the adjacent loads. If some data are in cache, the
          | "free adjacent load" claim immediately becomes less
          | probable. If the latency of a single load is already filled
          | by OOO, adding another one will obviously have an impact. If
          | the workload is highly regular you _can_ get quite chaotic
          | results when making even some trivial changes (even
          | sometimes when _aligning the .text differently_!)
         | 
         | And the proposed microbenchmark is way too simplistic: it is
         | possible that it saturates some units in some processors and
         | completely different units in others...
         | 
          | Is the impact of an extra adjacent load from RAM _likely_ to
          | be negligible in real-world workloads? Absolutely. With
         | characteristics depending on your exact model  / current freq /
         | other memory pressure at this time, etc.
        
       | willvarfar wrote:
        | A lot of commenters here are saying that Apple's advantage is
        | that it can profile the real workloads and optimise for that.
       | 
       | Well that's true and could very well be an advantage. An
       | advantage in that they did it, not in that only they have access
       | to it.
       | 
       | Intel and AMD can trivially profile real world workloads too.
       | 
       | Did they? I don't know what Apple did, but the impression I get
       | is that intel certainly hasn't.
        
       | nabla9 wrote:
        | What is the cache line size and page size on the M1?
        |         sysconf(_SC_PAGESIZE); /* posix */
       | 
       | Can you get direct processor information like LEVEL1_ICACHE_ASSOC
       | and LEVEL1_ICACHE_LINESIZE from the M1??
        
         | momothereal wrote:
         | `getconf PAGESIZE` returns 16384 on the base M1 MacBook Air.
         | 
         | The L1 cache values aren't there. The macOS `getconf` doesn't
         | support -a (listing all variables), so they may just be under a
         | different name.
         | 
         | edit: see replies for `sysctl -a` output
        
           | lilyball wrote:
           | Is it possibly exposed via sysctl, which does support a flag
           | to list all variables?
        
             | messe wrote:
              | From sysctl -a on my M1:
              |         hw.cachelinesize: 128
              |         hw.l1icachesize: 131072
              |         hw.l1dcachesize: 65536
              |         hw.l2cachesize: 4194304
              | 
              | EDIT: also, when run under Rosetta hw.cachelinesize is
              | halved:
              |         hw.cachelinesize: 64
              |         hw.l1icachesize: 131072
              |         hw.l1dcachesize: 65536
              |         hw.l2cachesize: 4194304
        
               | Nokinside wrote:
                | M1 cache lines are double what Intel, AMD, and other
                | ARM microarchitectures commonly use. That's a
                | significant difference.
        
               | [deleted]
        
               | JonathonW wrote:
                | Compared to the i9-9880H in my 16" MacBook Pro:
                |         hw.cachelinesize: 64
                |         hw.l1icachesize: 32768
                |         hw.l1dcachesize: 32768
                |         hw.l2cachesize: 262144
                |         hw.l3cachesize: 16777216
               | 
               | The M1 doubles the line size, doubles the L1 data cache
               | (i.e. same number of lines), quadruples the L1
               | instruction cache (i.e. double the lines), and has a 16x
               | larger L2 cache, but no L3 cache.
        
       | waterside81 wrote:
       | For people who know more about this stuff than me: are these
       | sorts optimizations only possible because Apple controls the
       | whole stack and can make the hardware & OS/software perfectly
       | match up with one another or is this something that Intel can do
       | but doesn't for some reasons (tradeoffs)?
        
         | viktorcode wrote:
         | There's at least two M1 optimisations targeting Apple's
         | software stack:
         | 
          | 1. Fast uncontended atomics. Speeds up reference counting,
          | which is used heavily by the Objective-C code base (and
          | Swift). The increase is massive compared to Intel.
         | 
          | 2. An x86-style total store ordering (TSO) memory mode.
          | Allows faster Arm code to be produced by Rosetta when
          | emulating x86. Without it the emulation overhead would be
          | much bigger (similar to what Microsoft is experiencing).
        
         | [deleted]
        
         | [deleted]
        
         | AnthonyMouse wrote:
         | > are these sorts optimizations only possible because Apple
         | controls the whole stack and can make the hardware &
         | OS/software perfectly match up with one another or is this
         | something that Intel can do but doesn't for some reasons
         | (tradeoffs)?
         | 
          | Interestingly it's the other way around. Apple is using
          | TSMC's 5nm process (they don't have their own fabs), which
          | is better than Intel's in-house fabs, so it's _Intel's_
          | vertical integration which is _hurting_ them compared to the
          | non-vertically integrated Apple.
         | 
         | Also, the answer to "is this only possible because of vertical
         | integration" is always _no_. Intel and Microsoft regularly
         | coordinate to make hardware and software work together. Intel
          | is one of the largest contributors to the Linux kernel, even
          | though they don't "own" it. Two companies coordinating with
         | one another can do anything they could do as an individual
         | company.
         | 
          | Sometimes the efficiency of this is lower because there are
          | communication barriers and no single chain of command. But
         | sometimes it's higher because you don't have internal politics
         | screwing everything up when the designers would be happy with
         | outsourcing to TSMC because they have a competitive advantage,
         | but the common CEO knows that would enrich a competitor and
         | trash their internal investment in their own fabs, and forces
         | the decision that leads to less competitive products.
        
           | cma wrote:
            | Not quite vertical integration, but TSMC's 5nm fabs are
            | Apple's fabs (exclusively, for a period of time).
            | 
            | During the iPod era, Toshiba's 1.8in HDD production was
            | exclusively Apple's only for music players, but Apple
            | gets all of TSMC's 5nm output for a period of time.
        
           | hinkley wrote:
           | Integration is a petri dish. It can speed up both growth and
           | decay, and it is indifferent to which one wins.
        
         | wmf wrote:
         | No, there's no cross-stack optimization here. The M1 gives very
         | high performance for all code.
        
           | qeternity wrote:
           | I think this gets lost in the fray between the "omg this is
           | magic" and then the Apple haters. The M1 is a very good chip.
           | Apple has hired an amazing team and resourced them well. But
           | from a pure hardware perspective, the M1 is quite
           | evolutionary. However the whole Apple Silicon experience is
           | revolutionary and magical due to the tight software pairing.
           | 
           | Both teams deserve huge praise for the tight coordination and
           | unreal execution.
        
             | acdha wrote:
              | I think this is part of the reason there are so many
              | people trying to find reasons to downplay it: humans love
             | the idea of "one weird trick" which makes a huge difference
             | and we sometimes find those in tech but rarely for mature
             | fields like CPU design. For many people, this is
             | unsatisfying like asking an athlete their secret, and
             | getting a response like "eat well, train a lot, don't give
             | up" with nary a shortcut in sight.
        
       | djacobs7 wrote:
       | Is the article saying that the M1 is slower than we would have
       | expected in this case?
       | 
        | My understanding, based on the article, is that on a normal
        | processor we would have expected arr[idx] + arr[idx+1] and
        | arr[idx] to take the same amount of time.
       | 
       | But the M1 is so parallelized that it goes to grab both arr[idx]
       | and arr[idx+1] separately. So we have to wait for both of those
       | two return. Meanwhile, on a less parallelized processor, we would
       | have done arr[idx] first and waited for it to return, and the
       | processor would realize that it already had arr[idx+1] without
       | having to do the second fetch.
       | 
       | Am I understanding this right?
        
         | phkahler wrote:
         | >> My understanding, based on the article, is that a normal
         | processor, we would have expected arr[idx] + arr[idx+1] and
         | arr[idx] to take the same amount of time.
         | 
          | That depends. If the two accesses are on the same cache
          | line, then yes. But since idx is random, sometimes they
          | won't be. He never says how big array[] is in elements or
          | what size each element is.
         | 
         | I thought DRAM also had the ability to stream out consecutive
         | addresses. If so then it looks like Apple could be missing out
         | here.
         | 
         | Then again, if his array fits in cache he's just measuring
         | instruction counts. His random indexes need to cover that whole
         | range too. There's not enough info to figure out what's going
         | on.
        
           | SekstiNi wrote:
           | > There's not enough info to figure out what's going on.
           | 
           | If you only look at the article this is true. However, the
           | source code is freely available:
           | https://github.com/lemire/Code-used-on-Daniel-Lemire-s-
           | blog/...
        
             | mrob wrote:
             | I tried it on my old (2009) 2.5GHz Phenom II X4 905e (GCC
             | 10.2.1 -O3, 64 bit) and got results almost perfectly
              | matching the conventional wisdom:
              |         two  : 97.4 ns
              |         two+ : 97.9 ns
              |         three: 145.8 ns
        
             | egnehots wrote:
             | TLDR: he is using a random index with a big enough array
        
             | [deleted]
        
             | africanboy wrote:
             | I ran the benchmark on my system
             | 
             | It's a 6 years old system, fastest times are in the 25ns
             | range
             | 
             | - 2-wise+ is 5% slower than 2-wise
             | 
             | - 3-wise is 46% slower than 2-wise
             | 
             | - 3-wise is 39% slower than 2-wise+
             | 
             | on the M1
             | 
             | - 2-wise+ is 40% slower than 2-wise
             | 
             | - 3-wise is 46% slower than 2-wise
             | 
             | - 3-wise is 4% slower than 2-wise+
        
               | SekstiNi wrote:
               | Interesting, I ran it on my laptop (i7-7700HQ) with the
               | following results:
               | 
               | - 2-wise+ is 19% slower than 2-wise
               | 
               | - 3-wise is 48% slower than 2-wise
               | 
               | - 3-wise is 25% slower than 2-wise+
               | 
               | However, as mentioned in the post the numbers can vary a
               | lot, and I noticed a maximum run-to-run difference of
               | 23ms on two-wise.
        
             | phkahler wrote:
              | He's only got 3 million random[] numbers. Whether that's
             | enough depends on the cache size. It also bothers me to
             | read code like this where functions take parameters (like
             | N) and never use them.
        
           | eloff wrote:
           | He mentioned it's a 1GB array, and the source code is
           | available.
        
         | jayd16 wrote:
          | It's a little confusing because they're conflating the idea that
         | you almost certainly read at least the entire word (and not a
         | single byte) at a time with the other idea that you could fetch
         | multiple words concurrently.
        
           | duskwuff wrote:
           | Any cached memory access is going to read in the entire cache
           | line -- 64 bytes on x86, apparently 128 on M1. This is true
           | across most architectures which use caches; it isn't specific
           | to M1 or ARM.
        
             | kzrdude wrote:
             | (As I learned from recent Rust concurrency changes) on
              | newer Intel, it usually fetches two cache lines, so
              | effectively 128 bytes, while AMD usually fetches 64
              | bytes. Those are the sizes they use for "cache line
              | padded" values (i.e. making sure to separate two atomics
              | by the fetch size to avoid threads invalidating the
              | cache back and forth too much).
        
             | jayd16 wrote:
             | Yes almost certainly more than the word will be read but it
             | varies by architecture. I would think almost by definition
             | no less than a word can be read so I went with that in my
             | explanation.
        
       | syntaxing wrote:
       | I'm super curious if it's true that my 8GB M1 will die quickly
       | because of the aggressive swaps. I guess time will tell.
        
         | acdha wrote:
         | FWIW, I have a 2010 MBA which was _heavily_ used for years as a
         | primary development system. The SSD only started to show signs
         | of degraded performance last year and that wasn't massive. I
         | would be quite surprised if the technology has become worse.
        
       | [deleted]
        
       | jeffbee wrote:
       | Great practical information. Nice to see people who know what
       | they are talking about putting data out there. I hope eventually
       | these persistent HN memes about M1 memory will die: that it's
       | "on-die" (it's not), that it's the only CPU using LPDDR4X-4267
       | (it's not), or that it's faster because the memory is 2mm closer
       | to the CPU (not that either).
       | 
       | It's faster because it has more microarchitectural resources. It
       | can load and store more, and it can do with a single core what an
       | Intel part needs all cores to accomplish.
        
         | titzer wrote:
         | > it can do with a single core what an Intel part needs all
         | cores to accomplish.
         | 
         | Care to explain what you mean specifically by this?
        
           | saagarjha wrote:
           | The M1 has extremely high single-core performance.
        
             | temac wrote:
             | It is not 4 times faster than an Intel core, though...
        
               | saagarjha wrote:
               | It is in memory performance, which is what I assumed was
               | being measured here.
        
               | kllrnohj wrote:
               | How are you defining memory performance and where are
               | your supporting comparisons? This article only discusses
               | the M1's behavior, and makes no comparisons to any other
               | CPU.
        
               | FabHK wrote:
               | FWIW, I ran it on a MacBook Pro (13-inch, 2019, Four
                | Thunderbolt 3 ports), 2.4 GHz Quad-Core Intel Core i5,
                | 8 GB 2133 MHz LPDDR3:
                |         two  : 49.6 ns  (x 5.5)
                |         two+ : 64.8 ns  (x 5.2)
                |         three: 72.8 ns  (x 5.6)
                | 
                | EDIT to add: above was just `cc`. Below is with `cc -O3
                | -Wall`, as in Lemire's article:
                |         two  : 62.8 ns  (x 7.1)
                |         two+ : 69.2 ns  (x 5.5)
                |         three: 95.3 ns  (x 7.3)
        
               | namibj wrote:
                | You _need_ to use -march=native because it otherwise
                | retains backwards compatibility with older x86.
        
               | [deleted]
        
               | africanboy wrote:
                | there must be something wrong there, on my late 2014
                | laptop that mounts
                |         Type: DDR4
                |         Speed: 2133 MT/s
                | 
                | I get
                |         two  : 27.1 ns (3x)
                |         two+ : 28.6 ns (2.2x)
                |         three: 39.7 ns (3x)
                | 
                | which is not much, considering this is an almost 6
                | years old system with 2x slower memory
        
             | titzer wrote:
             | Sure, and it has a very large out-of-order execution
             | engine, but it is not fundamentally different from what
             | other super scalar processors do. So I am curious what the
             | OP meant by that offhand comment.
        
               | jeffbee wrote:
               | One core of the M1 can drive the memory subsystem to the
               | rails. A single core can copy (load+store) at 60GB/s.
               | This is close to the theoretical design limit for DDR4X.
               | A single core on Tiger Lake can only hit about 34GB/s,
               | and Skylake-SP only gets about 15GB/s. So yes, it is
               | close to 4x faster.
        
               | titzer wrote:
               | Thanks for clarifying. But this isn't any fundamental
               | difference IMO. There isn't any functional limitation in
               | an Intel core that means it cannot saturate the memory
               | bandwidth from a single core, unless I am missing
               | something.
        
               | jeffbee wrote:
               | I agree, it's not fundamental. It is, in particular, not
               | that other popular myth, that it's "because ARM". It's
               | only that 1 core on an Intel chip can have N-many
               | outstanding loads and 1 core of an M1 can have M>N
               | outstanding loads.
        
         | titzer wrote:
          | Frankly, I find Lemire does oversimplified, poorly
          | controlled, back-of-the-envelope microbenchmarking all the
          | time that provides little insight beyond establishing a
          | general trend. It's sophomoric and a poor demonstration of
          | how to do well-controlled benchmarking that might yield
          | useful, repeatable, and transferable results.
        
         | foldr wrote:
         | >or that it's faster because the memory is 2mm closer to the
         | CPU (not that either)
         | 
         | Not to disagree with your overall point, but 2mm is a long way
         | when dealing with high frequency signals. You can't just
         | eyeball this and infer that it makes no difference to
         | performance or power consumption.
        
           | jeffbee wrote:
           | If it works, it works. There will be no observable
           | performance difference for DDR4 SDRAM implementations with
           | the same timing parameters, regardless of the trace length.
           | There are systems out there with 15cm of traces between the
           | memory controller pins and the DRAM chips. The only thing you
           | can say against them is they might consume more power driving
           | that trace. But you wouldn't say they are meaningfully
           | slower.
        
             | foldr wrote:
             | You can't just eyeball the PCB layout for a GHz frequency
             | circuit and say "yeah that would definitely work just the
             | same if you moved that component 2mm in this direction".
             | It's certainly possible to use longer trace lengths, but
             | that may come with tradeoffs.
             | 
             | >The only thing you can say against them is they might
             | consume more power driving that trace
             | 
             | Power consumption is really important in a laptop, and
             | Apple clearly care deeply about minimising it.
             | 
             | For all we know for sure, moving the memory closer to the
             | CPU may have been part of what's enabled Apple to run
             | higher frequency memory with acceptable (to them) power
             | draw.
        
         | sliken wrote:
          | The most impressive thing I've seen is that, when accessed
          | in a TLB-friendly fashion, the latency is around 30ns.
         | 
         | Anandtech has a graph showing this, specifically the R per RV
         | prange graph. I've verified this personally with a small
         | microbenchmark I wrote. I've not seen anything else close to
         | this memory latency.
        
           | reasonabl_human wrote:
           | Mind sharing the micro benchmark you wrote? I'm curious to
           | know how that would work
        
           | tandr wrote:
           | Sorry, what would AMD's or Intel's "latest and greatest"
           | numbers for the same be?
        
             | sliken wrote:
             | Here's the M1: https://www.anandtech.com/show/16252/mac-
             | mini-apple-m1-teste...
             | 
             | Scroll down to the latency vs size map and look at the R
             | per RV prange. That gets you 30ns or so.
             | 
             | Similar for AMD's latest/greatest the Ryzen 9 5950X:
             | https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-
             | di...
             | 
             | The same R per RV prange is in the 60ns range.
        
         | ed25519FUUU wrote:
         | In other words, it's better architecture. If anything this
         | makes it seem more impressive to me.
        
           | amelius wrote:
           | No, it's the same architecture but with different parameters.
           | 
           | It's like the difference between the situation where every
           | car uses 4 cylinders, and then Apple comes along and makes a
           | car with 5 cylinders.
        
             | kllrnohj wrote:
             | Your analogy was so close! It's Apple comes along and makes
             | an 8 cylinder engine. Since, you know, the other CPUs are
             | 4-wide decode and Apple's M1 is 8-wide decode :)
        
         | PragmaticPulp wrote:
         | I don't understand this competition to attribute the M1's speed
         | to _one_ specific change, while downplaying all of the others.
         | 
          | M1 is fast because they optimized everything across the
          | board. The speed is the cumulative result of many
          | optimizations, from the on-package memory to the memory
          | clock speed to the architecture.
        
           | adam_arthur wrote:
           | It's fast because they optimized everything across the board,
           | and also paid for exclusive access to TSMC 5nm process.
        
         | pdpi wrote:
         | This seems to be a recurring theme with the M1, and one that,
         | in a sense, actually baffles me even more than the alternative.
         | There is no "magic" at play here, it's just lots and lots of
         | raw muscle. They just seem to have a freakishly successful
         | strategy for choosing what aspects of the processor to throw
         | that muscle at.
         | 
         | Why is that strategy simultaneously remarkably efficient and
         | remarkably high-performance? What enabled/led them to make
         | those choices where others haven't?
        
           | mhh__ wrote:
           | I think it's worth saying that because AMD have only just
           | really hit their stride, Intel were under almost zero
           | pressure to improve which has really hurt them especially
           | with the process.
           | 
            | X86 definitely carries a constant-factor overhead, but if
            | Intel put their designs on 5nm they'd look pretty good too
            | - Jim Keller (when he was still there) hinted that their
            | offerings a year or so out are significantly bigger, to
            | the point that he judged it worth mentioning, so I
            | wouldn't write them off.
        
           | adam_arthur wrote:
           | Certainly Apple's processors are far ahead, but they're a
           | full process generation (5nm) ahead of their competitors.
           | They paid their way to that exclusive right through TSMC.
           | 
           | I'm sure they'll still come out ahead in benchmarks, but the
           | numbers will be much closer once AMD moves to 5nm. You
           | absolutely cannot fairly compare chips from different fab
           | generations.
           | 
            | I don't see many comments hammering this point home
            | enough... it's not as if the performance gap comes from
            | engineering efforts that are leagues ahead. Certainly some
            | of it can be attributed to that, and Apple has the
            | resources to poach any talent necessary.
        
             | GeekyBear wrote:
             | A node shrink gives you a choice of cutting power,
             | improving performance, or some mix of the two.
             | 
             | Apple appears to have taken the power reduction when they
             | moved to TSMC 5nm.
             | 
             | >The one explanation and theory I have is that Apple might
             | have finally pulled back on their excessive peak power draw
             | at the maximum performance states of the CPUs and GPUs, and
             | thus peak performance wouldn't have seen such a large jump
             | this generation, but favour more sustainable thermal
             | figures.
             | 
             | Apple's A12 and A13 chips were large performance upgrades
             | both on the side of the CPU and GPU, however one criticism
             | I had made of the company's designs is that they both
             | increased the power draw beyond what was usually
             | sustainable in a mobile thermal envelope. This meant that
             | while the designs had amazing peak performance figures, the
             | chips were unable to sustain them for prolonged periods
             | beyond 2-3 minutes. Keeping that in mind, the devices
             | throttled to performance levels that were still ahead of
             | the competition, leaving Apple in a leadership position in
             | terms of efficiency.
             | 
             | https://www.anandtech.com/show/16088/apple-
             | announces-5nm-a14...
        
             | wmf wrote:
             | From a customer's perspective it's not my problem. Everyone
             | had the opportunity to bid on that fab capacity and they
             | decided not to.
        
               | adam_arthur wrote:
               | Yeah, totally agreed. But if you read these comments,
               | they seem to be in total amazement about the performance
               | gap and not acknowledging how much of an advantage being
               | a fab generation ahead is.
               | 
               | Customers don't care, but discussion of the merits of the
               | chip should be more nuanced about this.
               | 
               | It also implies that the gap won't exist for very long,
               | as AMD will move onto 5nm soon
        
               | tandr wrote:
               | > It also implies that the gap won't exist for very long,
               | as AMD will move onto 5nm soon
               | 
               | ... yes, if there is any capacity left. Capacity for the
               | new process is a limited resource after all.
        
           | coldtea wrote:
           | > _Why is that strategy simultaneously remarkably efficient
           | and remarkably high-performance? What enabled /led them to
           | make those choices where others haven't?_
           | 
            | The very things people complain about:
           | 
           | (a) keeping a walled garden,
           | 
            | (b) moving fast and taking the platform in new directions
            | all at once,
           | 
           | (c) controlling the whole stack
           | 
           | Which means they're not beholden to compatibility with third
           | party frameworks and big players, or with their own past, and
           | thus can rely on their APIs, third party devs etc, to cater
           | to their changes to the architecture.
           | 
           | And they're not chained to the whims of the CPU vendor (as
           | the OS vendor) or the OS vendor (as the CPU vendor) either,
           | as they serve the role of both.
           | 
           | And of course they benchmarked and profiled the hell out of
           | actual systems.
        
             | jeffbee wrote:
              | Neither (a) nor (c) makes any sense; they're not
              | supported by evidence. There is no aspect of the Mac or
              | macOS that can
             | be realistically described as a "walled garden". It comes
             | with a compiler toolchain and ... well, some docs. It
             | natively runs software compiled for a foreign architecture.
             | You can do whatever you want with it. It's pretty open.
             | 
             | A "walled garden" is when there is a single source of
             | software.
        
             | anfilt wrote:
              | I'll be honest: as long as Apple keeps these walled-
              | garden shenanigans going, I'm not buying any of their
              | hardware.
        
             | fartcannon wrote:
             | They could still do all this shit without the walled
             | garden. To me, it suggests they aren't willing to compete.
             | They're anti-competitive.
        
               | marrvelous wrote:
                | With the walled garden, Apple can set enforceable
                | timelines for the software ecosystem to adapt to
                | architectural changes.
               | 
               | Remember the transition to arm64? Apple forced everything
               | on the App Store to ship universal binaries.
               | 
               | Without the App Store walled garden, software isn't
               | required to keep up to date with architectural changes.
                | Instead, keeping current is only a requirement for
                | being featured on the App Store (which would just be
                | one way to install software, not the only method).
        
               | danaris wrote:
               | Well, and on the Mac, it's _not_ the only method. The
               | walled garden here has big open gates.
               | 
                | That said, _all_ software on the Mac, post-Catalina,
                | has to be 64-bit, whether it's distributed through the
                | Mac App Store or not, because the 32-bit system
                | libraries are no longer included at all.
        
           | baybal2 wrote:
           | > There is no "magic" at play here, it's just lots and lots
           | of raw muscle. They just seem to have a freakishly successful
           | strategy for choosing what aspects of the processor to throw
           | that muscle at.
           | 
            | There is no freakishly successful strategy at play here
            | either. It's just that all previous attempts at a "fast
            | ARM" chip were rather half-hearted ("add a pipeline stage
            | here, an extra register there, widen the datapath there")
            | rather than squeezing the design to the limit.
        
           | barkingcat wrote:
            | The answer is that they have raw hard numbers from the
            | hundreds of millions of iPads/iPhones sold each year, and
            | can use the metrics from those devices to optimize the
            | next generation of devices.
            | 
            | These improvements didn't come from nowhere. They came
            | from iterations of iOS hardware.
        
           | TYPE_FASTER wrote:
           | Apple has been iterating on their proprietary mobile ARM-
           | based processors since 2010, and has gotten really good at
            | it. I would imagine that producing billions of consumer
            | devices with these chips has given them a lot of
            | experience in a shortened time frame.
           | 
            | I also wonder if having the hardware and software both
            | developed in-house is an advantage. I mean, if you're
            | developing
           | power management software for a mobile OS, and you're using a
           | 3rd-party vendor, then you read the documentation, and work
           | with the vendor if you have questions. If it's all internal,
           | you call them, and could make suggestions on future processor
           | design too based on OS usage statistics and metrics.
        
           | jandrese wrote:
           | It seems like Apple listened when people talked about how all
           | modern processors bottleneck on memory access and decided to
           | focus heavily on getting those numbers better.
           | 
           | Of course this leads to the question that if everyone in the
           | industry knew this was the issue why weren't Intel and AMD
           | pushing harder on it? They already both moved the memory
           | controller onboard so they had the opportunity to
           | aggressively optimize it like Apple has done, but instead we
           | have year after year where the memory lags behind the
           | processor in speed improvements, to the point where it is
           | ridiculous how many clock cycles a main memory access takes
           | on a modern x86 chip.
        
           | acdha wrote:
           | > What enabled/led them to make those choices where others
           | haven't?
           | 
           | Others have to some extent -- AMD is certainly not out of the
           | game -- so I'd treat this more as the question of how they've
           | been able to go more aggressively down that path. One of the
           | really obvious answers is that they control the whole stack
           | -- not just the hardware and OS but also the compilers and
           | high-level frameworks used in many demanding contexts.
           | 
           | If you're Intel or Qualcomm, you have a wider range of things
           | to support _and_ less revenue per device to support it, and
           | you are likely to have to coordinate improvements with other
           | companies who may have different priorities. Apple can
           | profile things which their users do and direct attention to
           | the right team. A company like Intel might profile something
           | and see that they can make some changes to the CPU but the
           | biggest gains would require work by a system vendor, a
           | compiler improvement, Windows/Linux kernel change, etc. --
           | they contribute a large amount of code to many open source
           | projects but even that takes time to ship and be used.
        
           | SurfingInVR wrote:
           | Something I've seen no one else mentioning: Apple's low-spec
           | tier is $1000, not $70.
        
           | [deleted]
        
           | dv_dt wrote:
           | No fighting the sales department on where to put the market
           | segmentation bottlenecks?
        
           | gameswithgo wrote:
           | I see two main things behind it:
           | 
            | 1. They are the only ones who have 5nm chips, because
            | they paid TSMC a lot for that right.
            | 
            | 2. They gave up on expandable memory, which lets them
            | solder it right next to the CPU, which likely makes it
            | easier to ship with really high clocks; and/or they just
            | spent the money it takes to get binned LPDDR4 at that
            | speed.
           | 
            | So a good CPU design, just like AMD and Intel have, but
            | one generation ahead on node size, and fast RAM. It's not
            | special low-latency RAM or anything, just clocked higher
            | than maybe any other production machine, though
            | enthusiasts sometimes clock theirs higher on desktops!
        
             | epistasis wrote:
             | > So a good cpu design, just like AMD and Intel have
             | 
             | The design seems to be very different, in that it's far far
             | wider, and supposedly has a much better branch predictor.
             | 
             | > fast ram
             | 
             | Is that a property of the RAM clock, or a function of a
             | better memory controller? The RAM certainly doesn't appear
             | to have any better latency.
        
               | gameswithgo wrote:
               | Right, latency isn't (much) affected by a higher clock
               | rate. Getting ram to run fast requires both good ram
               | chips and good controller/motherboard.
               | 
                | and yes, obviously Apple's bespoke ARM CPU is quite a
                | bit different from Zen 3 Ryzen's x86 CPU, but I'm not
                | sure it is net-better. When Zen 4 hits at 5nm I
                | expect it will perform on par with or better than the
                | M1, but we won't know till it happens!
        
             | mtgx wrote:
             | In other words: money. Throwing money at the (right)
             | problems made them better than others.
             | 
             | "But doesn't Intel have a lot of money, too?"
             | 
             | Sure, but Intel has also been running around like a
             | headless chicken this past decade (pretty much literally,
             | since Otellini left) combined with them getting very
             | complacent because they had "no real competition."
        
           | gavin_gee wrote:
            | Didn't they also make some interesting hires a few years
            | ago, like Anand from AnandTech and some other silicon
            | vets, who likely helped them design the M1 approach?
        
           | jeffbee wrote:
           | I don't have any inside-Apple perspective, but my guess is
           | having a tight feedback cycle between the profiles of their
           | own software and the abilities of their own hardware has
           | helped them greatly.
           | 
            | The reason I think so is that when I was at Google it was
            | 7 years between when we told Intel what could be helpful
            | and when
           | they shipped hardware with the feature. Also, when AMD first
           | shipped the EPYC "Naples" it was crippled by some key uarch
           | weaknesses that anyone could have pointed out if they had
           | been able to simulate realistic large programs, instead of
           | tiny and irrelevant SPEC benchmarks. If Apple is able to
           | simulate or measure their own key workloads and get the
           | improvements in silicon in a year or two they have a gigantic
           | advantage over anyone else.
        
             | martamorena2 wrote:
             | That's bizarre. As if CPU vendors were unable to run
             | "realistic" workloads. If they truly aren't, that's because
             | they are unwilling and then they are designing for failure
             | and Apple can just eat their lunch.
        
             | dcolkitt wrote:
             | Interesting point. This would suggest pretty sizable
             | synergies from the oft-rumored Microsoft acquisition of
             | Intel.
        
               | f6v wrote:
               | > Microsoft acquisition of Intel
               | 
               | Could that possibly be approved by governments?
        
               | gigatexal wrote:
               | Nope. Not a lawyer but I doubt it at all.
        
               | ChuckNorris89 wrote:
               | Microsoft doesn't need to acquire Intel, they need to do
               | what Apple did and acquire a stellar ARM design house
               | that will build a chip with x86 translation, tailored to
               | accelerate the typical workloads on Windows machines and
               | sell those chips to the likes of Dell and Lenovo and tell
               | developers _" ARM Windows is the future, x86 Windows will
               | be sunset in 5 years and no longer supported by us, start
               | porting your apps ASAP and in the mean time, try our X86
               | emulator on our ARM silicon, it works great."_
               | 
               | Microsoft proved with the XBOX and Surface series they
               | can make good hardware if they want, now they need to
               | move to chip design.
        
               | ralfd wrote:
               | Apple has at most 10% of the computer market and is just
               | one player among many. I am skeptical Microsoft with
               | their 90% dominance would or should be allowed this much
               | power over the industry.
        
         | megablast wrote:
         | You'll get people guessing, since Apple itself puts out so
         | little information.
        
         | gameswithgo wrote:
         | What other laptop ships with LPDDR4X clocked at 4267? I agree
         | though that being closer to the cpu isn't having any
         | appreciable effect on latency, but being soldered close to the
         | cpu probably does make it easier for them to hit that high
         | clock rate.
        
           | wmf wrote:
           | Tiger Lake laptops such as the XPS 13.
        
           | jeffbee wrote:
           | As WMF mentions, Tiger Lake laptops like my Razer Book have
           | the same memory. It is not appreciably closer to the CPU in
           | the Apple design. In Intel's Tiger Lake reference designs the
           | memory is also in two chips that are mounted right next to
           | the CPU.
        
             | danaris wrote:
             | And (genuine question) how do the Tiger Lake laptops
             | compare with the M1 MacBooks thus far?
        
               | skavi wrote:
               | AnandTech has decent benchmarks for both Tiger Lake [0]
               | and M1 [1].
               | 
               | [0]: https://www.anandtech.com/show/16084/intel-tiger-
               | lake-review...
               | 
               | [1]: https://www.anandtech.com/show/16252/mac-mini-
               | apple-m1-teste...
        
               | jeffbee wrote:
               | The outcome seems to depend greatly on the physical
               | design of the laptops. The elsewhere-mentioned Dell XPS
               | 13 has a particularly poor cooling design, which is why I
               | chose the Razer Book instead. Despite being marketed in a
               | very silly way to gamers only, it seems to have competent
               | mechanical design.
        
               | SAI_Peregrinus wrote:
               | Gamers are likely to run their systems with demanding
               | workloads, for hours, with a color-coded performance
               | counter (FPS stat). They'll notice if it throttles.
               | They're particularly demanding customers, and there's
               | quite a bit of competition for their money.
        
         | ksec wrote:
         | >HN memes about M1 memory will die
         | 
            | It is not only HN; it is practically the whole Internet.
            | Go around the top 20 hardware and Apple forums and you
            | see the same thing, vastly amplified by a few KOLs on
            | Twitter.
            | 
            | I don't remember ever seeing anything quite like it in
            | tech circles: people happily running around spreading
            | misinformation.
        
           | Bootvis wrote:
           | What is a KOL?
        
             | tyingq wrote:
             | "Key Opinion Leader". I think it's the new word for
             | "Influencer".
        
               | ksec wrote:
               | I am pretty sure KOL predates Influencer in modern
               | internet usage. Before that they were simply known as
                | Internet Celebrities. Maybe it is rarely used now, so
                | apologies for not explaining the acronym.
        
               | secondcoming wrote:
               | First I've heard of it!
        
               | walterbell wrote:
               | Who introduced the term KOL and bestows the K title?
        
           | jeffbee wrote:
           | Yeah, I know. There was some kid on Twitter who was trying to
           | tell me that it was the solder in an x86 machine (he actually
           | said "a Microsoft computer") that made them slower. Apple,
           | without the solder was much faster.
           | 
            | According to this person's bio they had an undergraduate
            | education in computer science ¯\_(ツ)_/¯
        
       | s800 wrote:
       | What's the precision of these ns level measurements?
        
         | mhh__ wrote:
          | The answer to that is usually very context dependent, and
          | depends on what you're measuring. As long as you look at a
          | histogram first and don't blindly calculate (say) the mean,
          | it should be obvious.
          | 
          | Two examples (slightly bigger than this, but the same
          | principles apply):
         | 
          | If you benchmark std::vector insertion, you'll see a flat
          | graph with tall spikes spaced at ratios of its reallocation
          | factor, and it scales very well. The measurements are
          | clean.
         | 
          | If, however, you do the same for a linked list you get a
          | linearly increasing graph, _but_ it's absolutely all over
          | the place because it doesn't play nice with the memory
          | hierarchy. The standard deviation for a given value of n
          | might be a hundred times worse than the vector's.
        
         | CyberRabbi wrote:
          | clock_gettime(CLOCK_REALTIME) on macOS provides nanosecond-
          | level precision.
        
           | geocar wrote:
           | I seem to recall OSX didn't used to have clock_gettime, so
           | it's news to me that it even exists -- I might have been away
           | from OSX too long.
           | 
           | Is there any performance difference between that and
           | mach_absolute_time() ?
        
             | lilyball wrote:
             | It was added some years ago, and I believe
             | mach_absolute_time is actually now implemented in terms of
             | (the implementation of) clock_gettime. The documentation on
             | mach_absolute_time now even says you should use
             | clock_gettime_nsec_np(CLOCK_UPTIME_RAW) instead.
             | 
             | macOS also has clock constants for a monotonic clock that
             | increases while sleeping (unlike CLOCK_UPTIME_RAW and
             | mach_absolute_time).
        
               | saagarjha wrote:
                | Not yet, at least :)
                | 
                |     _mach_absolute_time:
                |     00000000000012ec    pushq   %rbp
                |     00000000000012ed    movq    %rsp, %rbp
                |     00000000000012f0    movabsq $0x7fffffe00050, %rsi ## imm = 0x7FFFFFE00050
                |     00000000000012fa    movl    0x18(%rsi), %r8d
                |     00000000000012fe    testl   %r8d, %r8d
                |     0000000000001301    je      0x12fa
                |     0000000000001303    lfence
                |     0000000000001306    rdtsc
                |     0000000000001308    lfence
                |     000000000000130b    shlq    $0x20, %rdx
                |     000000000000130f    orq     %rdx, %rax
                |     0000000000001312    movl    0xc(%rsi), %ecx
                |     0000000000001315    andl    $0x1f, %ecx
                |     0000000000001318    subq    (%rsi), %rax
                |     000000000000131b    shlq    %cl, %rax
                |     000000000000131e    movl    0x8(%rsi), %ecx
                |     0000000000001321    mulq    %rcx
                |     0000000000001324    shrdq   $0x20, %rdx, %rax
                |     0000000000001329    addq    0x10(%rsi), %rax
                |     000000000000132d    cmpl    0x18(%rsi), %r8d
                |     0000000000001331    jne     0x12fa
                |     0000000000001333    popq    %rbp
                |     0000000000001334    retq
        
               | Skunkleton wrote:
               | That may be the result of inlining clock_gettime, though
               | that would imply a pretty different implementation from
               | the one I am familiar with.
               | 
                | AFAIR on x86 a fenced rdtsc is ~20 cycles. So to
                | answer the GP's question, it has a precision in the
                | few-nanoseconds range. Accuracy is a different
                | question: compare numbers from the same die, but be a
                | little more suspicious across dies.
               | 
               | No clue how this is implemented on the M1, or if the M1
               | has the same modern tsc guarantees that x86 has grown
               | over the last few generations of chips.
        
               | saagarjha wrote:
               | Yeah, clock_gettime is somewhat more complicated than
               | this. If anything, it might have an inlined
               | mach_absolute_time in it...
        
               | vlovich123 wrote:
               | I was part of the team that really pushed the kernel team
               | to add support for a monotonic clock that counts while
               | sleeping (this had been a persistent ask before just not
               | prioritized). We got it in for iOS 8 or 9. The dance you
               | otherwise have to do is not only complicated in userspace
               | on MacOS, it's expensive & full of footguns due to race
               | conditions (& requires changing the clock basis for your
               | entire app if I recall correctly).
        
             | saagarjha wrote:
              | It's new in macOS Sierra. I believe mach_absolute_time
              | is slightly faster, but not by much; both just read the
              | commpage these days to save on a syscall.
        
       | kergonath wrote:
       | It's a good introduction, but it's a bit disappointing that it
       | ends that way. I'd love to read more about what's behind the
       | figure and more technical info about how it might work.
        
       ___________________________________________________________________
       (page generated 2021-01-06 23:00 UTC)