[HN Gopher] Memory access on the Apple M1 processor ___________________________________________________________________ Memory access on the Apple M1 processor Author : luigi23 Score : 219 points Date : 2021-01-06 17:06 UTC (5 hours ago) (HTM) web link (lemire.me) (TXT) w3m dump (lemire.me) | lrossi wrote: | Shouldn't you choose the random numbers such that array[idx1] ^ | array[idx1 + 1] are guaranteed to fall in the same cache line? | Assuming that it has that. Right now some accesses cross the end | of the cache line. | CyberRabbi wrote: | Technically you are correct but it's expected to cross a cache | line 1/16 times (or however many ints there are in a cache | line). There is an implicit assumption that that is relatively | infrequent enough that it shouldn't increase the average time | too much, but that assumption should be tested. | jayd16 wrote: | >our naive random-memory model | | Doesn't everyone use the (I believe) still valid concepts of | latency and bandwidth? | sroussey wrote: | Depends on context. | | For example, what is the bandwidth and latency when you ask for | the value at the same memory address in an infinite loop? And | how does that compare to the latency and bandwidth of a memory | module you buy on NewEgg? | fluffy87 wrote: | L1 BW. | | When people use BW in their performance models, they don't | use only 1 bandwidth, but whatever combination of bandwidths | makes sense for the _memory access pattern_. | | So if you are always accessing the same word, the first access | runs at DRAM BW, and subsequent ones at L1 BW, and any | meaningful performance model will take that into account. | whoisburbansky wrote: | The concepts are still broadly valid, the naivety being | referred to is the assumption that two non-adjacent memory | reads will be twice as slow as one memory read or two adjacent | reads. | wyldfire wrote: | How do latency and bandwidth relate to the cost model for the | code in the benchmark?
| | When creating the model discussed in the post, we're using it | to try to make a static prediction about how the code will | execute. | | Note that the goal of the post is not to merely measure the | memory access performance, it's to understand the specific | microarchitecture and how it might deliver the benefits that we | see in benchmarks. | foota wrote: | Is this per core or shared between cores? | hundchenkatze wrote: | Per core I think, emphasis is mine. | | > It looks like a _single_ core has about 28 levels of memory | parallelism, and possibly more. | foota wrote: | I was wondering if this might be a shared resource though, | since it doesn't seem they tested with multiple threads. | wrsh07 wrote: | Ok, summary: | | This article lays out three scenarios: 1) accessing two random | elements | | 2) accessing 3 random elements | | 3) accessing two pairs of adjacent elements (same as (1) but also | the elements after each random element) | | It then does some trivial math to use the loaded data. | | A naive model might only consider memory accesses and might | assume accessing an adjacent element is free. | | On the Mac M1 core, this is not the case. While the naive model | might expect cases 1 & 3 to cost the same and case 2 to cost 50% | more, instead cases 2 & 3 are nearly the same (3 slightly faster) | and case 2 is about 50% more expensive than 1. | jayd16 wrote: | I don't really understand the comparison because it seems like | scenario 3 (2+) is doing more XORs and twice the accesses to | array over the same amount of iterations. | | We have to assume these are byte arrays, yes? Or at least some | size that's smaller than the cache line. You would still pay | for the extra unaligned fetches. I don't think this is a valid | scenario at all, M1 or not. | | Anyone want to run these tests on an Intel machine and let us | know if the author's "naive model" test holds there?
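For readers without the post open, the three access patterns under discussion look roughly like this. This is a minimal sketch, not Lemire's exact code: array sizing (1 GB in the original, far larger than any cache), index generation, and timing are omitted, and the random indices are assumed to leave room for the +1 neighbor.

```c
#include <stdint.h>
#include <stddef.h>

/* "two-wise": two independent random loads per iteration */
uint32_t two_wise(const uint32_t *array, const uint32_t *rnd, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        acc ^= array[rnd[i]] ^ array[rnd[i + 1]];
    return acc;
}

/* "two-wise+": the same two random loads, plus the element adjacent
 * to each one (usually on the same cache line, so "free" in the
 * naive model) */
uint32_t two_wise_plus(const uint32_t *array, const uint32_t *rnd, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        acc ^= array[rnd[i]] ^ array[rnd[i] + 1]
             ^ array[rnd[i + 1]] ^ array[rnd[i + 1] + 1];
    return acc;
}

/* "three-wise": three independent random loads per iteration */
uint32_t three_wise(const uint32_t *array, const uint32_t *rnd, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i + 2 < n; i += 3)
        acc ^= array[rnd[i]] ^ array[rnd[i + 1]] ^ array[rnd[i + 2]];
    return acc;
}
```

The naive model counts only the random loads, so it prices two-wise and two-wise+ identically; the surprise reported in the post is how far the M1 departs from that.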
| wrsh07 wrote: | The point of the naive model is that you assume memory | accesses dominate | | That is, the math part is so trivial compared to the memory | access that you could do a bunch of math and you would still | only notice a change in the number of memory accesses. | | Also it looks like the response to yours links their test and | the naive model predicts correctly | jayd16 wrote: | I think 5% is a non-trivial difference but alright, it's a | much bigger difference on the M1. | | I guess I still don't understand what's going on here. | | Scenario 1 has two spatially close reads followed by two | dependent random access reads. | | Scenario 3 (2+) has two spatially close reads, and two | pairs of dependent random access reads of two spatially | close locations. | | Why does it follow that this is caused by a change in | memory access concurrency? The two required round trips | should dominate both on the M1 and an Intel but for some | reason the M1 performs worse than that. Why? | | I can't help but feel the first snippet triggers some SIMD | path while the 3rd snippet fails to. | africanboy wrote: | I did it on an old i7 laptop | | https://news.ycombinator.com/item?id=25661055 | temac wrote: | > A naive model might only consider memory accesses and might | assume accessing an adjacent element is free. | | Really depends on the level of naivety and the definition of | "free". It would be less insane to write that: accessing an | adjacent element has a negligible overhead if the data must be | loaded from RAM and there are some OOO bubbles to execute the | adjacent loads. If some data are in cache, the free-adjacent-load | claim immediately becomes less probable. If the latency of a | single load is already filled by OOO, adding another one will | obviously have an impact. If the workload is highly regular you | _can_ get quite chaotic results when making even some trivial | changes (even sometimes when _aligning the .text differently_!)
| | And the proposed microbenchmark is way too simplistic: it is | possible that it saturates some units in some processors and | completely different units in others... | | Is the impact of an extra adjacent load from RAM _likely_ to be | negligible in real world workloads? Absolutely. With precise | characteristics depending on your exact model / current freq / | other memory pressure at this time, etc. | willvarfar wrote: | A lot of commenters here are saying that Apple's advantage is that | it can profile the real workloads and optimise for that. | | Well that's true and could very well be an advantage. An | advantage in that they did it, not in that only they have access | to it. | | Intel and AMD can trivially profile real world workloads too. | | Did they? I don't know what Apple did, but the impression I get | is that Intel certainly hasn't. | nabla9 wrote: | What is the cache line size and page size in M1? | sysconf(_SC_PAGESIZE); /* posix */ | | Can you get direct processor information like LEVEL1_ICACHE_ASSOC | and LEVEL1_ICACHE_LINESIZE from the M1? | momothereal wrote: | `getconf PAGESIZE` returns 16384 on the base M1 MacBook Air. | | The L1 cache values aren't there. The macOS `getconf` doesn't | support -a (listing all variables), so they may just be under a | different name. | | edit: see replies for `sysctl -a` output | lilyball wrote: | Is it possibly exposed via sysctl, which does support a flag | to list all variables? | messe wrote: | From sysctl -a on my M1: | hw.cachelinesize: 128 hw.l1icachesize: 131072 | hw.l1dcachesize: 65536 hw.l2cachesize: 4194304 | | EDIT: also, when run under Rosetta hw.cachelinesize is | halved: hw.cachelinesize: 64 | hw.l1icachesize: 131072 hw.l1dcachesize: 65536 | hw.l2cachesize: 4194304 | Nokinside wrote: | M1 cache lines are double what Intel, AMD and other ARM | microarchitectures commonly use. That's a significant difference.
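Pulling together the thread above: the values can be queried programmatically as well as via `getconf`/`sysctl -a`. A sketch, assuming the macOS `hw.cachelinesize` sysctl name quoted above and the glibc `_SC_LEVEL1_DCACHE_LINESIZE` extension elsewhere; only the `_SC_PAGESIZE` call is plain POSIX:

```c
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>
#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

/* VM page size via POSIX (reported as 16384 on the M1 above). */
long vm_page_size(void) {
    return sysconf(_SC_PAGESIZE);
}

/* Cache line size: macOS exposes it as the "hw.cachelinesize"
 * sysctl quoted in this thread; glibc has a sysconf extension.
 * Returns -1 if the value cannot be obtained. */
long cache_line_size(void) {
#ifdef __APPLE__
    int64_t line = 0;
    size_t len = sizeof(line);
    if (sysctlbyname("hw.cachelinesize", &line, &len, NULL, 0) == 0)
        return (long)line;
    return -1;
#else
    return sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
}
```

Note that some sandboxed or containerized environments report 0 for the cache line `sysconf`, so treat a non-positive result as "unknown" rather than an error.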
| [deleted] | JonathonW wrote: | Compared to the i9-9880H in my 16" MacBook Pro: | hw.cachelinesize: 64 hw.l1icachesize: 32768 | hw.l1dcachesize: 32768 hw.l2cachesize: 262144 | hw.l3cachesize: 16777216 | | The M1 doubles the line size, doubles the L1 data cache | (i.e. same number of lines), quadruples the L1 | instruction cache (i.e. double the lines), and has a 16x | larger L2 cache, but no L3 cache. | waterside81 wrote: | For people who know more about this stuff than me: are these | sorts of optimizations only possible because Apple controls the | whole stack and can make the hardware & OS/software perfectly | match up with one another or is this something that Intel can do | but doesn't for some reasons (tradeoffs)? | viktorcode wrote: | There are at least two M1 optimisations targeting Apple's | software stack: | | 1. Fast uncontended atomics. Speeds up reference counting which | is used heavily by Objective-C code bases (and Swift). The increase | is massive compared to Intel. | | 2. Guaranteed instruction ordering mode. Allows for faster Arm | code to be produced by Rosetta when emulating x86. Without it | emulation overhead would be much bigger (similar to what | Microsoft is experiencing). | [deleted] | [deleted] | AnthonyMouse wrote: | > are these sorts of optimizations only possible because Apple | controls the whole stack and can make the hardware & | OS/software perfectly match up with one another or is this | something that Intel can do but doesn't for some reasons | (tradeoffs)? | | Interestingly it's the other way around. Apple is using TSMC's | 5nm process (they don't have their own fabs), which is better | than Intel's in-house fabs, so it's _Intel's_ vertical | integration which is _hurting_ them compared to the | non-vertically integrated Apple. | | Also, the answer to "is this only possible because of vertical | integration" is always _no_. Intel and Microsoft regularly | coordinate to make hardware and software work together. Intel | is one of the largest contributors to the Linux kernel, even | though they don't "own" it. Two companies coordinating with | one another can do anything they could do as an individual | company. | | Sometimes the efficiency of this is lower because there are | communication barriers and there isn't a single chain of command. But | sometimes it's higher because you don't have internal politics | screwing everything up when the designers would be happy with | outsourcing to TSMC because they have a competitive advantage, | but the common CEO knows that would enrich a competitor and | trash their internal investment in their own fabs, and forces | the decision that leads to less competitive products. | cma wrote: | Not quite vertical integration, but TSMC's 5nm fabs are | Apple's fabs. (exclusively for a period of time) | | During the iPod era, Toshiba's 1.8in HD production was | exclusive to Apple only for music players, but Apple gets | all the 5nm output from TSMC for a period of time. | hinkley wrote: | Integration is a petri dish. It can speed up both growth and | decay, and it is indifferent to which one wins. | wmf wrote: | No, there's no cross-stack optimization here. The M1 gives very | high performance for all code. | qeternity wrote: | I think this gets lost in the fray between the "omg this is | magic" crowd and the Apple haters. The M1 is a very good chip. | Apple has hired an amazing team and resourced them well. But | from a pure hardware perspective, the M1 is quite | evolutionary. However the whole Apple Silicon experience is | revolutionary and magical due to the tight software pairing. | | Both teams deserve huge praise for the tight coordination and | unreal execution.
| acdha wrote: | I think this is part of the reason why there are so many | people trying to find reasons to downplay it: humans love | the idea of "one weird trick" which makes a huge difference | and we sometimes find those in tech but rarely for mature | fields like CPU design. For many people, this is | unsatisfying, like asking an athlete their secret and | getting a response like "eat well, train a lot, don't give | up" with nary a shortcut in sight. | djacobs7 wrote: | Is the article saying that the M1 is slower than we would have | expected in this case? | | My understanding, based on the article, is that on a normal | processor, we would have expected arr[idx] + arr[idx+1] and | arr[idx] to take the same amount of time. | | But the M1 is so parallelized that it goes to grab both arr[idx] | and arr[idx+1] separately. So we have to wait for both of those | to return. Meanwhile, on a less parallelized processor, we would | have done arr[idx] first and waited for it to return, and the | processor would realize that it already had arr[idx+1] without | having to do the second fetch. | | Am I understanding this right? | phkahler wrote: | >> My understanding, based on the article, is that on a normal | processor, we would have expected arr[idx] + arr[idx+1] and | arr[idx] to take the same amount of time. | | That depends. If the two accesses are on the same cache line, | then yes. But since idx is random, sometimes that will not | happen. He never says how big array[] is in elements or what | size each element is. | | I thought DRAM also had the ability to stream out consecutive | addresses. If so then it looks like Apple could be missing out | here. | | Then again, if his array fits in cache he's just measuring | instruction counts. His random indexes need to cover that whole | range too. There's not enough info to figure out what's going | on. | SekstiNi wrote: | > There's not enough info to figure out what's going on.
| | If you only look at the article this is true. However, the | source code is freely available: | https://github.com/lemire/Code-used-on-Daniel-Lemire-s- | blog/... | mrob wrote: | I tried it on my old (2009) 2.5GHz Phenom II X4 905e (GCC | 10.2.1 -O3, 64 bit) and got results almost perfectly | matching the conventional wisdom: two : | 97.4 ns two+ : 97.9 ns three: 145.8 ns | egnehots wrote: | TLDR: he is using a random index with a big enough array | [deleted] | africanboy wrote: | I ran the benchmark on my system | | It's a 6-year-old system, fastest times are in the 25ns | range | | - 2-wise+ is 5% slower than 2-wise | | - 3-wise is 46% slower than 2-wise | | - 3-wise is 39% slower than 2-wise+ | | on the M1 | | - 2-wise+ is 40% slower than 2-wise | | - 3-wise is 46% slower than 2-wise | | - 3-wise is 4% slower than 2-wise+ | SekstiNi wrote: | Interesting, I ran it on my laptop (i7-7700HQ) with the | following results: | | - 2-wise+ is 19% slower than 2-wise | | - 3-wise is 48% slower than 2-wise | | - 3-wise is 25% slower than 2-wise+ | | However, as mentioned in the post the numbers can vary a | lot, and I noticed a maximum run-to-run difference of | 23ms on two-wise. | phkahler wrote: | He's only got 3 million random[] numbers. Whether that's | enough depends on the cache size. It also bothers me to | read code like this where functions take parameters (like | N) and never use them. | eloff wrote: | He mentioned it's a 1GB array, and the source code is | available. | jayd16 wrote: | It's a little confusing because they're conflating the idea that | you almost certainly read at least the entire word (and not a | single byte) at a time with the other idea that you could fetch | multiple words concurrently. | duskwuff wrote: | Any cached memory access is going to read in the entire cache | line -- 64 bytes on x86, apparently 128 on M1. This is true | across most architectures which use caches; it isn't specific | to M1 or ARM.
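CyberRabbi's 1/16 figure from the top of the thread is easy to sanity-check against the line sizes quoted here: with 4-byte ints, a pair (idx, idx+1) straddles a line boundary exactly when idx lands in the last slot of a line. A small sketch (the line sizes are the 64-byte x86 and 128-byte M1 values discussed above, assumed as inputs):

```c
#include <stddef.h>

/* Fraction of starting indices whose adjacent pair (i, i+1)
 * crosses a cache line boundary, for a given element size.
 * Counted by brute force over one line's worth of positions. */
double crossing_fraction(size_t line_bytes, size_t elem_bytes) {
    size_t per_line = line_bytes / elem_bytes; /* elements per line */
    size_t crossing = 0;
    for (size_t i = 0; i < per_line; i++)
        if ((i + 1) % per_line == 0) /* i is the last slot of the line */
            crossing++;
    return (double)crossing / (double)per_line;
}
```

With 64-byte lines and 4-byte ints this gives 1/16, matching CyberRabbi; on the M1's 128-byte lines the adjacent load splits across lines only 1/32 of the time, which slightly favors the "adjacent element is nearly free" assumption there.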
| kzrdude wrote: | (As I learned from recent Rust concurrency changes) on | newer Intel, it usually fetches two cache lines so | effectively 128 bytes, while AMD usually fetches 64 bytes. Those are | the sizes they use for "cache line padded" values (i.e. | making sure to separate two atomics by the fetch size to | avoid threads invalidating the cache back and forth too | much). | jayd16 wrote: | Yes, almost certainly more than the word will be read, but it | varies by architecture. I would think almost by definition | no less than a word can be read so I went with that in my | explanation. | syntaxing wrote: | I'm super curious if it's true that my 8GB M1 will die quickly | because of the aggressive swaps. I guess time will tell. | acdha wrote: | FWIW, I have a 2010 MBA which was _heavily_ used for years as a | primary development system. The SSD only started to show signs | of degraded performance last year and that wasn't massive. I | would be quite surprised if the technology has become worse. | [deleted] | jeffbee wrote: | Great practical information. Nice to see people who know what | they are talking about putting data out there. I hope eventually | these persistent HN memes about M1 memory will die: that it's | "on-die" (it's not), that it's the only CPU using LPDDR4X-4267 | (it's not), or that it's faster because the memory is 2mm closer | to the CPU (not that either). | | It's faster because it has more microarchitectural resources. It | can load and store more, and it can do with a single core what an | Intel part needs all cores to accomplish. | titzer wrote: | > it can do with a single core what an Intel part needs all | cores to accomplish. | | Care to explain what you mean specifically by this? | saagarjha wrote: | The M1 has extremely high single-core performance. | temac wrote: | It is not 4 times faster than an Intel core, though... | saagarjha wrote: | It is in memory performance, which is what I assumed was | being measured here.
| kllrnohj wrote: | How are you defining memory performance and where are | your supporting comparisons? This article only discusses | the M1's behavior, and makes no comparisons to any other | CPU. | FabHK wrote: | FWIW, I ran it on a MacBook Pro (13-inch, 2019, Four | Thunderbolt 3 ports), 2.4 GHz Quad-Core Intel Core i5, 8 | GB 2133 MHz LPDDR3: two : 49.6 ns (x | 5.5) two+ : 64.8 ns (x 5.2) three: 72.8 ns | (x 5.6) | | EDIT to add: above was just `cc`. Below is with `cc -O3 | -Wall`, as in Lemire's article: two : | 62.8 ns (x 7.1) two+ : 69.2 ns (x 5.5) | three: 95.3 ns (x 7.3) | namibj wrote: | You _need_ to use -march=native because it otherwise retains | backwards compatibility to older x86. | [deleted] | africanboy wrote: | there must be something wrong there, on my late 2014 | laptop that has Type: DDR4 | Speed: 2133 MT/s | | I get two : 27.1 ns (3x) two+ | : 28.6 ns (2.2x) three: 39.7 ns (3x) | | which is not much, considering this is an almost 6-year- | old system with 2x slower memory | titzer wrote: | Sure, and it has a very large out-of-order execution | engine, but it is not fundamentally different from what | other superscalar processors do. So I am curious what the | OP meant by that offhand comment. | jeffbee wrote: | One core of the M1 can drive the memory subsystem to the | rails. A single core can copy (load+store) at 60GB/s. | This is close to the theoretical design limit for LPDDR4X. | A single core on Tiger Lake can only hit about 34GB/s, | and Skylake-SP only gets about 15GB/s. So yes, it is | close to 4x faster. | titzer wrote: | Thanks for clarifying. But this isn't any fundamental | difference IMO. There isn't any functional limitation in | an Intel core that means it cannot saturate the memory | bandwidth from a single core, unless I am missing | something. | jeffbee wrote: | I agree, it's not fundamental. It is, in particular, not | that other popular myth, that it's "because ARM". It's | only that 1 core on an Intel chip can have N-many | outstanding loads and 1 core of an M1 can have M>N | outstanding loads. | titzer wrote: | Frankly, I find Lemire does oversimplified, poorly | controlled, back-of-the-envelope microbenchmarking all the time | that provides little to no insight other than establishing a | general trend. It's sophomoric and a poor demonstration of | how to do well-controlled benchmarking that might yield useful, | repeatable, and transferable results. | foldr wrote: | >or that it's faster because the memory is 2mm closer to the | CPU (not that either) | | Not to disagree with your overall point, but 2mm is a long way | when dealing with high frequency signals. You can't just | eyeball this and infer that it makes no difference to | performance or power consumption. | jeffbee wrote: | If it works, it works. There will be no observable | performance difference for DDR4 SDRAM implementations with | the same timing parameters, regardless of the trace length. | There are systems out there with 15cm of traces between the | memory controller pins and the DRAM chips. The only thing you | can say against them is they might consume more power driving | that trace. But you wouldn't say they are meaningfully | slower. | foldr wrote: | You can't just eyeball the PCB layout for a GHz frequency | circuit and say "yeah that would definitely work just the | same if you moved that component 2mm in this direction". | It's certainly possible to use longer trace lengths, but | that may come with tradeoffs. | | >The only thing you can say against them is they might | consume more power driving that trace | | Power consumption is really important in a laptop, and | Apple clearly care deeply about minimising it. | | For all we know for sure, moving the memory closer to the | CPU may have been part of what's enabled Apple to run | higher frequency memory with acceptable (to them) power | draw.
| sliken wrote: | The most impressive thing I've seen is that when accessed in a | TLB friendly fashion that the latency is around 30ns. | | Anandtech has a graph showing this, specifically the R per RV | prange graph. I've verified this personally with a small | microbenchmark I wrote. I've not seen anything else close to | this memory latency. | reasonabl_human wrote: | Mind sharing the micro benchmark you wrote? I'm curious to | know how that would work | tandr wrote: | Sorry, what would AMD's or Intel's "latest and greatest" | numbers for the same be? | sliken wrote: | Here's the M1: https://www.anandtech.com/show/16252/mac- | mini-apple-m1-teste... | | Scroll down to the latency vs size map and look at the R | per RV prange. That gets you 30ns or so. | | Similar for AMD's latest/greatest the Ryzen 9 5950X: | https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep- | di... | | The same R per RV prange is in the 60ns range. | ed25519FUUU wrote: | In other words, it's better architecture. If anything this | makes it seem more impressive to me. | amelius wrote: | No, it's the same architecture but with different parameters. | | It's like the difference between the situation where every | car uses 4 cylinders, and then Apple comes along and makes a | car with 5 cylinders. | kllrnohj wrote: | Your analogy was so close! It's Apple comes along and makes | an 8 cylinder engine. Since, you know, the other CPUs are | 4-wide decode and Apple's M1 is 8-wide decode :) | PragmaticPulp wrote: | I don't understand this competition to attribute the M1's speed | to _one_ specific change, while downplaying all of the others. | | M1 is fast because they optimized everything across the board. | The speed is the cumulative result of many optimizations, from | the on-die memory to the memory clock speed to the | architecture. | adam_arthur wrote: | It's fast because they optimized everything across the board, | and also paid for exclusive access to TSMC 5nm process. 
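A latency microbenchmark in the style sliken describes is usually a pointer chase: build a random single-cycle permutation, then walk it, so each load's address depends on the previous load and the time per step approximates load-to-use latency. A sketch (not sliken's actual code; TLB-friendly variants additionally confine the randomness to a page range, which is what the "R per RV prange" curve measures):

```c
#include <stdint.h>
#include <stdlib.h>

/* Build a random cyclic permutation with Sattolo's algorithm:
 * next[i] is the index visited after i, forming one cycle
 * through all n slots. */
void build_chain(uint64_t *next, size_t n, unsigned seed) {
    srand(seed);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j < i guarantees one cycle */
        uint64_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
}

/* Walk the chain; every load depends on the previous one, so no
 * memory-level parallelism is possible and elapsed time divided
 * by steps approximates memory latency. */
uint64_t chase(const uint64_t *next, size_t steps) {
    uint64_t idx = 0;
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];
    return idx; /* returned so the compiler can't drop the loop */
}
```

Timing `chase` over an array much larger than the last-level cache gives DRAM latency; restricting the chain to one page range at a time is what yields the ~30ns figure quoted above.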
| pdpi wrote: | This seems to be a recurring theme with the M1, and one that, | in a sense, actually baffles me even more than the alternative. | There is no "magic" at play here, it's just lots and lots of | raw muscle. They just seem to have a freakishly successful | strategy for choosing what aspects of the processor to throw | that muscle at. | | Why is that strategy simultaneously remarkably efficient and | remarkably high-performance? What enabled/led them to make | those choices where others haven't? | mhh__ wrote: | I think it's worth saying that because AMD have only just | really hit their stride, Intel were under almost zero | pressure to improve, which has really hurt them, especially | with the process. | | X86 definitely carries a constant-factor overhead, but if Intel put | their designs on 5nm they'd look pretty good too - Jim Keller | (when he was still there) hinted that their offerings a year | or so out are significantly bigger, to the point of | him judging it worth mentioning, so I wouldn't write | them off. | adam_arthur wrote: | Certainly Apple's processors are far ahead, but they're a | full process generation (5nm) ahead of their competitors. | They paid their way to that exclusive right through TSMC. | | I'm sure they'll still come out ahead in benchmarks, but the | numbers will be much closer once AMD moves to 5nm. You | absolutely cannot fairly compare chips from different fab | generations. | | I don't see many comments hammering this point home enough... | it's not like the performance gap is through engineering | efforts that are leagues ahead. Certainly some can be | attributed to that, and Apple has the resources to poach any | talent necessary. | GeekyBear wrote: | A node shrink gives you a choice of cutting power, | improving performance, or some mix of the two. | | Apple appears to have taken the power reduction when they | moved to TSMC 5nm.
| | >The one explanation and theory I have is that Apple might | have finally pulled back on their excessive peak power draw | at the maximum performance states of the CPUs and GPUs, and | thus peak performance wouldn't have seen such a large jump | this generation, but favour more sustainable thermal | figures. | | Apple's A12 and A13 chips were large performance upgrades | both on the side of the CPU and GPU, however one criticism | I had made of the company's designs is that they both | increased the power draw beyond what was usually | sustainable in a mobile thermal envelope. This meant that | while the designs had amazing peak performance figures, the | chips were unable to sustain them for prolonged periods | beyond 2-3 minutes. Keeping that in mind, the devices | throttled to performance levels that were still ahead of | the competition, leaving Apple in a leadership position in | terms of efficiency. | | https://www.anandtech.com/show/16088/apple- | announces-5nm-a14... | wmf wrote: | From a customer's perspective it's not my problem. Everyone | had the opportunity to bid on that fab capacity and they | decided not to. | adam_arthur wrote: | Yeah, totally agreed. But if you read these comments, | they seem to be in total amazement about the performance | gap and not acknowledging how much of an advantage being | a fab generation ahead is. | | Customers don't care, but discussion of the merits of the | chip should be more nuanced about this. | | It also implies that the gap won't exist for very long, | as AMD will move onto 5nm soon | tandr wrote: | > It also implies that the gap won't exist for very long, | as AMD will move onto 5nm soon | | ... yes, if there is any capacity left. Capacity for the | new process is a limited resource after all. | coldtea wrote: | > _Why is that strategy simultaneously remarkably efficient | and remarkably high-performance? 
What enabled/led them to | make those choices where others haven't?_ | | The things people complain about: | | (a) keeping a walled garden, | | (b) moving fast and taking the platform in new directions all | at once | | (c) controlling the whole stack | | Which means they're not beholden to compatibility with third | party frameworks and big players, or with their own past, and | thus can rely on their APIs, third party devs etc, to cater | to their changes to the architecture. | | And they're not chained to the whims of the CPU vendor (as | the OS vendor) or the OS vendor (as the CPU vendor) either, | as they serve the role of both. | | And of course they benchmarked and profiled the hell out of | actual systems. | jeffbee wrote: | Neither A nor C makes any sense, and neither is supported by | evidence. There is no aspect of the mac or macOS that can | be realistically described as a "walled garden". It comes | with a compiler toolchain and ... well, some docs. It | natively runs software compiled for a foreign architecture. | You can do whatever you want with it. It's pretty open. | | A "walled garden" is when there is a single source of | software. | anfilt wrote: | I will be honest: as long as Apple keeps these walled garden | shenanigans going, I am not buying any of their hardware. | fartcannon wrote: | They could still do all this shit without the walled | garden. To me, it suggests they aren't willing to compete. | They're anti-competitive. | marrvelous wrote: | With the walled garden, Apple can set enforceable | timelines for the software ecosystem to adapt to | architectural changes. | | Remember the transition to arm64? Apple forced everything | on the App Store to ship universal binaries. | | Without the App Store walled garden, software isn't | required to keep up to date with architectural changes. | Instead, keeping current is only a requirement to being | featured on the App Store (which would just be a single | way to install software, not the only method). | danaris wrote: | Well, and on the Mac, it's _not_ the only method. The | walled garden here has big open gates. | | That said, _all_ software on the Mac, post-Catalina, has | to be 64-bit, whether it's distributed through the Mac | App Store or not, because the 32-bit system libraries are | no longer included at all. | baybal2 wrote: | > There is no "magic" at play here, it's just lots and lots | of raw muscle. They just seem to have a freakishly successful | strategy for choosing what aspects of the processor to throw | that muscle at. | | There is no freakishly successful strategy at play there as | well. It's just that all previous attempts at "fast ARM" chips were | rather half-hearted "add a pipeline step there, add an extra | register there, increase datapath width there," and not | squeezing it to the limit. | barkingcat wrote: | The answer is that they have raw hard numbers from the | hundreds of millions of iPads/iPhones sold each year, and can | use the metrics from those devices to optimize the next | generation of devices. | | These improvements didn't come from nowhere. They came from | iterations of iOS hardware. | TYPE_FASTER wrote: | Apple has been iterating on their proprietary mobile | ARM-based processors since 2010, and has gotten really good at | it. I would imagine that producing billions of consumer | devices with these chips has helped give them a lot of | experience in a shortened time frame. | | I also wonder if having the hardware and software both worked | on in-house is an advantage. I mean, if you're developing | power management software for a mobile OS, and you're using a | 3rd-party vendor, then you read the documentation, and work | with the vendor if you have questions.
If it's all internal, | you call them, and could make suggestions on future processor | design too based on OS usage statistics and metrics. | jandrese wrote: | It seems like Apple listened when people talked about how all | modern processors bottleneck on memory access and decided to | focus heavily on getting those numbers better. | | Of course this leads to the question that if everyone in the | industry knew this was the issue why weren't Intel and AMD | pushing harder on it? They already both moved the memory | controller onboard so they had the opportunity to | aggressively optimize it like Apple has done, but instead we | have year after year where the memory lags behind the | processor in speed improvements, to the point where it is | ridiculous how many clock cycles a main memory access takes | on a modern x86 chip. | acdha wrote: | > What enabled/led them to make those choices where others | haven't? | | Others have to some extent -- AMD is certainly not out of the | game -- so I'd treat this more as the question of how they've | been able to go more aggressively down that path. One of the | really obvious answers is that they control the whole stack | -- not just the hardware and OS but also the compilers and | high-level frameworks used in many demanding contexts. | | If you're Intel or Qualcomm, you have a wider range of things | to support _and_ less revenue per device to support it, and | you are likely to have to coordinate improvements with other | companies who may have different priorities. Apple can | profile things which their users do and direct attention to | the right team. A company like Intel might profile something | and see that they can make some changes to the CPU but the | biggest gains would require work by a system vendor, a | compiler improvement, Windows/Linux kernel change, etc. -- | they contribute a large amount of code to many open source | projects but even that takes time to ship and be used. 
| SurfingInVR wrote: | Something I've seen no one else mention: Apple's low-spec | tier is $1000, not $70. | [deleted] | dv_dt wrote: | No fighting the sales department on where to put the market | segmentation bottlenecks? | gameswithgo wrote: | I see two main things behind it: | | 1. they are the only ones who have 5nm chips, because they | paid a lot to TSMC for that right. 2. they gave up on | expandable memory, which lets them solder it right next to | the CPU, which likely makes it easier to ship with really | high clocks, and/or they just spent the money it takes to get | binned lpddr4 at that speed. | | So a good CPU design, just like AMD and Intel have, but one | generation ahead on node size, and fast RAM. It's not special | low-latency RAM or anything, just clocked higher than maybe | any other production machine, though enthusiasts sometimes | clock theirs higher on desktops! | epistasis wrote: | > So a good CPU design, just like AMD and Intel have | | The design seems to be very different, in that it's far far | wider, and supposedly has a much better branch predictor. | | > fast RAM | | Is that a property of the RAM clock, or a function of a | better memory controller? The RAM certainly doesn't appear | to have any better latency. | gameswithgo wrote: | Right, latency isn't (much) affected by a higher clock | rate. Getting RAM to run fast requires both good RAM | chips and a good controller/motherboard. | | And yes, obviously Apple's bespoke ARM CPU is quite a bit | different than Zen3 Ryzen's x86 CPU, but I'm not sure it | is net-better. When Zen4 hits at 5nm I expect it will | perform on par with or better than the M1, but we won't know | till it happens! | mtgx wrote: | In other words: money. Throwing money at the (right) | problems made them better than others. | | "But doesn't Intel have a lot of money, too?" 
| | Sure, but Intel has also been running around like a | headless chicken this past decade (pretty much literally, | since Otellini left) combined with them getting very | complacent because they had "no real competition." | gavin_gee wrote: | didn't they also make some interesting hires a few years ago | like Anand from Anandtech and some other silicon vets that | likely helped them design the M1 approach? | jeffbee wrote: | I don't have any inside-Apple perspective, but my guess is | that having a tight feedback cycle between the profiles of their | own software and the abilities of their own hardware has | helped them greatly. | | The reason I think so is that when I was at Google, it was 7 years | between when we told Intel what could be helpful, and when | they shipped hardware with the feature. Also, when AMD first | shipped the EPYC "Naples" it was crippled by some key uarch | weaknesses that anyone could have pointed out if they had | been able to simulate realistic large programs, instead of | tiny and irrelevant SPEC benchmarks. If Apple is able to | simulate or measure their own key workloads and get the | improvements in silicon in a year or two they have a gigantic | advantage over anyone else. | martamorena2 wrote: | That's bizarre. As if CPU vendors were unable to run | "realistic" workloads. If they truly aren't, that's because | they are unwilling, and then they are designing for failure | and Apple can just eat their lunch. | dcolkitt wrote: | Interesting point. This would suggest pretty sizable | synergies from the oft-rumored Microsoft acquisition of | Intel. | f6v wrote: | > Microsoft acquisition of Intel | | Could that possibly be approved by governments? | gigatexal wrote: | Nope. Not a lawyer but I doubt it at all. 
| ChuckNorris89 wrote: | Microsoft doesn't need to acquire Intel, they need to do | what Apple did and acquire a stellar ARM design house | that will build a chip with x86 translation, tailored to | accelerate the typical workloads on Windows machines and | sell those chips to the likes of Dell and Lenovo and tell | developers _"ARM Windows is the future, x86 Windows will | be sunset in 5 years and no longer supported by us; start | porting your apps ASAP, and in the meantime, try our x86 | emulator on our ARM silicon, it works great."_ | | Microsoft proved with the Xbox and Surface series that they | can make good hardware if they want; now they need to | move to chip design. | ralfd wrote: | Apple has at most 10% of the computer market and is just | one player among many. I am skeptical Microsoft with | their 90% dominance would or should be allowed this much | power over the industry. | megablast wrote: | You'll get people guessing, since Apple itself puts out so | little information. | gameswithgo wrote: | What other laptop ships with LPDDR4X clocked at 4267? I agree | though that being closer to the CPU isn't having any | appreciable effect on latency, but being soldered close to the | CPU probably does make it easier for them to hit that high | clock rate. | wmf wrote: | Tiger Lake laptops such as the XPS 13. | jeffbee wrote: | As WMF mentions, Tiger Lake laptops like my Razer Book have | the same memory. It is not appreciably closer to the CPU in | the Apple design. In Intel's Tiger Lake reference designs the | memory is also in two chips that are mounted right next to | the CPU. | danaris wrote: | And (genuine question) how do the Tiger Lake laptops | compare with the M1 MacBooks thus far? | skavi wrote: | AnandTech has decent benchmarks for both Tiger Lake [0] | and M1 [1]. | | [0]: https://www.anandtech.com/show/16084/intel-tiger- | lake-review... | | [1]: https://www.anandtech.com/show/16252/mac-mini- | apple-m1-teste... 
| jeffbee wrote: | The outcome seems to depend greatly on the physical | design of the laptops. The elsewhere-mentioned Dell XPS | 13 has a particularly poor cooling design, which is why I | chose the Razer Book instead. Despite being marketed in a | very silly way to gamers only, it seems to have competent | mechanical design. | SAI_Peregrinus wrote: | Gamers are likely to run their systems with demanding | workloads, for hours, with a color-coded performance | counter (FPS stat). They'll notice if it throttles. | They're particularly demanding customers, and there's | quite a bit of competition for their money. | ksec wrote: | >HN memes about M1 memory will die | | It is not only HN. It is practically the whole Internet. Go | around the Top 20 hardware and Apple website forums and you see | the same thing, also vastly amplified by a few KOLs on Twitter. | | I don't remember ever having seen anything quite like it in tech | circles. People were happily running around spreading | misinformation. | Bootvis wrote: | What is a KOL? | tyingq wrote: | "Key Opinion Leader". I think it's the new word for | "Influencer". | ksec wrote: | I am pretty sure KOL predates Influencer in modern | internet usage. Before that they were simply known as | Internet Celebrities. Maybe it is rarely used now. So | apologies for not explaining the acronym. | secondcoming wrote: | First I've heard of it! | walterbell wrote: | Who introduced the term KOL and bestows the K title? | jeffbee wrote: | Yeah, I know. There was some kid on Twitter who was trying to | tell me that it was the solder in an x86 machine (he actually | said "a Microsoft computer") that made them slower. Apple, | without the solder, was much faster. | | According to this person's bio they had an undergraduate | education in computer science ¯\_(ツ)_/¯ | s800 wrote: | What's the precision of these ns-level measurements? | mhh__ wrote: | The answer to that is usually very context dependent, and | depends on what you're measuring. 
As long as you use a histogram first and | don't blindly calculate (say) the mean, it should be obvious. | | Two examples (slightly bigger than this, but the same principles apply): | | If you benchmark std::vector insertion, you'll see a flat | graph with n tall spikes at ratios of its reallocation amount | apart, and it scales very very well. The measurements are | clean. | | If, however, you do the same for a linked list you get a | linearly increasing graph _but_ it's absolutely all over the | place because it doesn't play nice with the memory hierarchy. | The std dev for a given value of n might be a hundred times | worse than the vector's. | CyberRabbi wrote: | clock_gettime(CLOCK_REALTIME) on macOS provides nanosecond- | level precision. | geocar wrote: | I seem to recall OS X didn't use to have clock_gettime, so | it's news to me that it even exists -- I might have been away | from OS X too long. | | Is there any performance difference between that and | mach_absolute_time()? | lilyball wrote: | It was added some years ago, and I believe | mach_absolute_time is actually now implemented in terms of | (the implementation of) clock_gettime. The documentation on | mach_absolute_time now even says you should use | clock_gettime_nsec_np(CLOCK_UPTIME_RAW) instead. | | macOS also has clock constants for a monotonic clock that | increases while sleeping (unlike CLOCK_UPTIME_RAW and | mach_absolute_time). 
| saagarjha wrote:
| Not yet, at least :)
|     _mach_absolute_time:
|     00000000000012ec  pushq   %rbp
|     00000000000012ed  movq    %rsp, %rbp
|     00000000000012f0  movabsq $0x7fffffe00050, %rsi  ## imm = 0x7FFFFFE00050
|     00000000000012fa  movl    0x18(%rsi), %r8d
|     00000000000012fe  testl   %r8d, %r8d
|     0000000000001301  je      0x12fa
|     0000000000001303  lfence
|     0000000000001306  rdtsc
|     0000000000001308  lfence
|     000000000000130b  shlq    $0x20, %rdx
|     000000000000130f  orq     %rdx, %rax
|     0000000000001312  movl    0xc(%rsi), %ecx
|     0000000000001315  andl    $0x1f, %ecx
|     0000000000001318  subq    (%rsi), %rax
|     000000000000131b  shlq    %cl, %rax
|     000000000000131e  movl    0x8(%rsi), %ecx
|     0000000000001321  mulq    %rcx
|     0000000000001324  shrdq   $0x20, %rdx, %rax
|     0000000000001329  addq    0x10(%rsi), %rax
|     000000000000132d  cmpl    0x18(%rsi), %r8d
|     0000000000001331  jne     0x12fa
|     0000000000001333  popq    %rbp
|     0000000000001334  retq
| Skunkleton wrote: | That may be the result of inlining clock_gettime, though | that would imply a pretty different implementation from | the one I am familiar with. | | AFAIR on x86 a locked rdtsc is ~20 cycles. So to answer | the gp question, it has a precision of around a few | nanoseconds. Accuracy is a different question, i.e. | compare numbers from the same die, but be a little more | suspicious across dies. | | No clue how this is implemented on the M1, or if the M1 | has the same modern tsc guarantees that x86 has grown | over the last few generations of chips. | saagarjha wrote: | Yeah, clock_gettime is somewhat more complicated than | this. If anything, it might have an inlined | mach_absolute_time in it... | vlovich123 wrote: | I was part of the team that really pushed the kernel team | to add support for a monotonic clock that counts while | sleeping (this had been a persistent ask before, just not | prioritized). We got it in for iOS 8 or 9. 
The dance you | otherwise have to do is not only complicated in userspace | on macOS, it's expensive & full of footguns due to race | conditions (& requires changing the clock basis for your | entire app, if I recall correctly). | saagarjha wrote: | It's new in macOS Sierra. I believe mach_absolute_time is | slightly faster, but not by much; both just read the commpage | these days to save on a syscall. | kergonath wrote: | It's a good introduction, but it's a bit disappointing that it | ends that way. I'd love to read more about what's behind the | figure and more technical info about how it might work. ___________________________________________________________________ (page generated 2021-01-06 23:00 UTC)