[HN Gopher] How to get 1.5 TFlops of FP32 performance on a singl...
       ___________________________________________________________________
        
       How to get 1.5 TFlops of FP32 performance on a single M1 CPU core
        
       Author : signa11
       Score  : 281 points
       Date   : 2023-01-05 13:22 UTC (9 hours ago)
        
 (HTM) web link (jott.live)
 (TXT) w3m dump (jott.live)
        
       | Tepix wrote:
       | So, this is about Apple's undocumented AMX instructions, issued
       | from CPU, executed on a special accelerator execution unit.
       | 
       | Is there one such unit per CPU core?
        
         | MuffinFlavored wrote:
         | > So, this is about Apple's undocumented AMX instructions,
         | issued from CPU, executed on a special accelerator execution
         | unit.
         | 
         | CPU instruction -> AMX instruction -> AMX result -> CPU?
         | 
         | How are these kinds of things usually kept in sync/in a
         | manageable state? Like does the CPU block until the AMX
         | returns?
        
         | my123 wrote:
         | M1 has one AMX unit per cluster AFAIK. This however can and
         | does change between different chips.
        
           | danieldk wrote:
            | Yes, there is one per core cluster. The title is a bit
            | misleading, because it suggests that going to two or three
            | cores would scale linearly, whereas it won't be much faster.
           | See here for sgemm benchmarks for everything from the M1 to
           | M1 Ultra and 1 to 16 threads:
           | 
           | https://github.com/danieldk/gemm-benchmark#1-to-16-threads
        
         | adrian_b wrote:
         | No.
         | 
         | So the title is misleading, even if it is true that you get
         | this performance with a program that uses a single CPU core.
        
       | mochomocha wrote:
        | I think the author downplays the significance of his work because
        | it only applies to "small neural networks". There are a lot of
        | use-cases that can benefit from this type of optimization.
       | Discovering how to use an undocumented fast accelerator available
       | on millions of devices is very valuable.
        
         | MuffinFlavored wrote:
         | Not up to date on a lot of "AI"/"ML" things, why isn't this
         | significant for medium/large neural networks as well?
        
           | lostmsu wrote:
            | RTX 3090 theoretical matmul is 142 TFlops, i.e. about 100x
            | this.
        
             | bee_rider wrote:
             | The 1.5 here is for a single core, though. So if we assume
             | that the performance core on an M1 is around 7.5 watts (I'm
             | not actually sure, seems like a reasonable upper bound
             | though if a whole M1 mini is around 39 watts), we'd be
             | looking at around 750 watts to match. Which seems like a
             | surprisingly non-crazy amount of power given these are 32
             | bit flops, unlike the 16 in the RTX 3090, and they come
             | from a CPU.
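              | 
              | A quick back-of-envelope in C with those assumed numbers
              | (142 TFLOPS target, 1.5 TFLOPS and ~7.5 W per core):
              | 
              |       #include <stdio.h>
              |       int main(void) {
              |           double cores = 142.0 / 1.5;      /* ~95 cores */
              |           printf("%.0f W\n", cores * 7.5); /* ~710 W */
              |           return 0;
              |       }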
        
               | lostmsu wrote:
               | This code runs on AMX co-processor. From the article:
               | 
               | > An important distinction is that the AMX:CPU ratio is
               | not 1:1; not every core has its own AMX co-processor.
               | 
               | My understanding is there's only 1 of those per regular
               | M1 CPU, maybe 4 on the largest one (Ultra).
        
             | johndough wrote:
             | The RTX 3090 has 35.58 TFlops FP32 performance, or 285.48
             | FP16 according to https://en.wikipedia.org/wiki/List_of_Nvi
             | dia_graphics_proces...
             | 
             | EDIT: I fell for NVIDIA's marketing. The dense FP16
              | performance is only half of 285.48, i.e. about 142. Thanks to
             | adgjlsfhk1 for the correction.
        
               | adgjlsfhk1 wrote:
               | That 285 is listed as (2:1 sparse) which means it's only
               | valid for matrices where 2 out of every 4 numbers are
               | zero. For dense matrices it's half that.
        
               | bee_rider wrote:
               | Are 2:1 sparse matrices a common thing? It seems weird,
               | like clearly that's not sparse enough to want to use,
               | like, sparse matrix "CSR" style storage or something,
               | haha. I would just treat it as dense I guess.
        
               | adgjlsfhk1 wrote:
               | They aren't. As far as I can tell, Nvidia does this to be
               | able to double the number of TFlops they put on their
               | website. (this might be a little unfair, the real reason
               | is that in ML it _might_ be possible to train a NN such
                | that your matrices have this structure, but I haven't
               | seen anyone other than Nvidia use it)
        
               | bee_rider wrote:
               | I'm trying to think of cases where it might accidentally
               | come up, and all I can think of is something like "oops I
               | used complex but my values are actually real."
        
               | [deleted]
        
               | dotnet00 wrote:
               | There has been some work in that direction but it hasn't
               | really caught on as fast as NVIDIA may have expected it
               | to.
        
               | lostmsu wrote:
               | Yeah, still waiting for this feature to be available in
               | PyTorch natively.
        
         | my123 wrote:
          | Apple did prefer to expose it through their own
          | Accelerate.framework API, however...
        
           | capableweb wrote:
            | Of course they do, Apple likes to remain as much in control as
            | possible. If it suddenly becomes more efficient/faster to run
            | ML/AI stuff on Asahi Linux on Mac hardware than with macOS,
            | I'm sure they'd be embarrassed enough to take some sort of
            | action. And I'm pretty sure that action will be towards the
            | side of "closing things down" rather than "opening stuff up",
            | as is tradition.
        
             | my123 wrote:
             | Wrong answer.
             | 
             | AMX is an unstable ISA that changes between product
             | generations. That's why it's not publicly documented.
             | 
              | Arm SME is the standardisation of the concept, but it is
              | not on the market yet.
             | 
             | https://community.arm.com/arm-community-
             | blogs/b/architecture...
        
           | svantana wrote:
           | Has it been verified that they actually use these
           | instructions in Accelerate.framework? I just benchmarked this
           | on my 2019 intel i9 mbp, and got the following speeds for
            | 128x128 matrices, 32 repeats:
            | 
            |       cblas_sgemm: 36 GFLOP/s
            |       vDSP_mmul:   41 GFLOP/s
           | 
           | That's a pretty big deal if these functions are >30x faster
           | on the M1...!
           | 
           | edit: that seems to be verified in the tlkh.dev blog post
           | above. Interestingly, I ran the same code on my bargain-
           | basement 2020 iphone SE, and got 259GFLOP/s! These apple
           | devices are pretty mindblowing.
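            | 
            | A minimal sketch of that kind of measurement against
            | Accelerate (assumed setup, not necessarily the exact code
            | used for the numbers above); build with something like
            | clang -O2 bench.c -framework Accelerate:
            | 
            |       #include <Accelerate/Accelerate.h>
            |       #include <stdio.h>
            |       #include <time.h>
            | 
            |       enum { N = 128, REPS = 32 };
            | 
            |       static double now(void) {
            |           struct timespec ts;
            |           clock_gettime(CLOCK_MONOTONIC, &ts);
            |           return ts.tv_sec + ts.tv_nsec * 1e-9;
            |       }
            | 
            |       int main(void) {
            |           static float a[N * N], b[N * N], c[N * N];
            |           for (int i = 0; i < N * N; i++) {
            |               a[i] = 1.0f;
            |               b[i] = 2.0f;
            |           }
            |           /* 2*N^3 flops per matrix multiply */
            |           double flops = 2.0 * N * N * N * REPS;
            | 
            |           double t0 = now();
            |           for (int r = 0; r < REPS; r++)
            |               cblas_sgemm(CblasRowMajor, CblasNoTrans,
            |                           CblasNoTrans, N, N, N, 1.0f,
            |                           a, N, b, N, 0.0f, c, N);
            |           printf("cblas_sgemm: %.1f GFLOP/s\n",
            |                  flops / (now() - t0) / 1e9);
            | 
            |           double t1 = now();
            |           for (int r = 0; r < REPS; r++)
            |               vDSP_mmul(a, 1, b, 1, c, 1, N, N, N);
            |           printf("vDSP_mmul:   %.1f GFLOP/s\n",
            |                  flops / (now() - t1) / 1e9);
            |           return 0;
            |       }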
        
             | danieldk wrote:
             | _Has it been verified that they actually use these
             | instructions in Accelerate.framework?_
             | 
             | Yes. Aside from benchmarks, you can easily verify this by
             | profiling an application with Instruments and then
             | inspecting the disassembly.
             | 
             | However, it should be said that AMX does not scale linearly
             | with the number of cores, but with the number of core
             | clusters. So, on the M1 if you use Accelerate in two
             | threads (rather than one), performance will barely improve,
             | because the first thread can keep the AMX unit busy enough.
             | 
              | The M1 Pro and M1 Max, however, have two performance core
              | clusters with AMX units in them, so matrix multiplication
              | is roughly twice as fast as on the M1. Similarly, the M1
              | Ultra has four performance core clusters, so matrix
              | multiplication performance is roughly twice that of the M1
              | Pro/Max and four times that of the M1.
             | 
             | Benchmarks:
             | 
             | https://github.com/danieldk/gemm-benchmark#1-to-16-threads
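              | 
              | A rough way to see the scaling behaviour yourself (a
              | sketch, not the gemm-benchmark code above): run the same
              | sgemm workload on one and then two threads and compare
              | wall time. On a plain M1 the two-thread run should take
              | close to twice as long, since both threads feed the same
              | per-cluster AMX unit. You may also want to set
              | VECLIB_MAXIMUM_THREADS=1 so Accelerate itself doesn't
              | spread the work.
              | 
              |       #include <Accelerate/Accelerate.h>
              |       #include <pthread.h>
              |       #include <stdio.h>
              |       #include <stdlib.h>
              |       #include <time.h>
              | 
              |       #define N 1024
              |       #define REPS 200
              | 
              |       static void *worker(void *arg) {
              |           float *a = arg, *b = a + N * N, *c = b + N * N;
              |           for (int r = 0; r < REPS; r++)
              |               cblas_sgemm(CblasRowMajor, CblasNoTrans,
              |                           CblasNoTrans, N, N, N, 1.0f,
              |                           a, N, b, N, 0.0f, c, N);
              |           return NULL;
              |       }
              | 
              |       static double run(int nthreads) {
              |           pthread_t tid[2];
              |           float *buf[2];
              |           struct timespec t0, t1;
              |           for (int t = 0; t < nthreads; t++)
              |               buf[t] = calloc(3 * (size_t)N * N,
              |                               sizeof(float));
              |           clock_gettime(CLOCK_MONOTONIC, &t0);
              |           for (int t = 0; t < nthreads; t++)
              |               pthread_create(&tid[t], NULL, worker, buf[t]);
              |           for (int t = 0; t < nthreads; t++)
              |               pthread_join(tid[t], NULL);
              |           clock_gettime(CLOCK_MONOTONIC, &t1);
              |           for (int t = 0; t < nthreads; t++)
              |               free(buf[t]);
              |           return (t1.tv_sec - t0.tv_sec) +
              |                  (t1.tv_nsec - t0.tv_nsec) * 1e-9;
              |       }
              | 
              |       int main(void) {
              |           /* each thread does the same amount of work, so
              |              with independent AMX units the two times
              |              would be roughly equal */
              |           printf("1 thread:  %.2f s\n", run(1));
              |           printf("2 threads: %.2f s\n", run(2));
              |           return 0;
              |       }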
        
         | mark_l_watson wrote:
         | Apple has done a wonderful job making CoreML smoothly
          | integrated with iOS, watchOS, iPadOS, and macOS development.
        
           | ur-whale wrote:
           | > Apple has done a wonderful job making CoreML
           | 
           | Apple has done a wonderful job of further locking their user
           | into the golden cage they call a platform.
        
             | bee_rider wrote:
             | The worst thing is, their users don't even seem to be
             | totally happy with the state of affairs! It's like they
             | don't even realize their preferences are wrong. :(
        
               | bee_rider wrote:
               | This was intended to be obvious sarcasm, but I somehow
               | accidentally added "don't" which... really just makes it
               | confusing. Oops, haha.
        
             | gcr wrote:
             | I think you're right and you're wrong, it's a bit more
             | complicated.
             | 
             | ML is one of the few applications that benefit from
             | platform-specific optimizations, so if you need every ounce
             | of performance, you have your choice of which walled garden
             | you want to tether your application to. The "lock-in" comes
             | from the specific capabilities of your special-purpose
             | hardware, and for serious applications, you're already
             | thinking hard about whether to design your entire
             | implementation around Apple, NVidia, Google/TPU, or even
             | Android devices. For big models, platform-specific needs
             | influence every aspect of model design, including
             | data/model sharding, quantization, training loops...
             | 
             | For non-scientific applications, it's usual practice to
             | train your model in platform-agnostic ways using PyTorch or
             | Tensorflow or whatever and _then_ deploy it to devices in
              | platform-specific ways, whether that's XLA, CoreML, Edge
             | TPU, Android NNAPI, TensorflowJS, or hell, custom-written
             | GLSL shaders or whatever.
             | 
             | We're just starting to see cross-platform frameworks that
             | abstract model inference: TFLite, PyTorch Mobile, ONNX. To
             | their credit, CoreML can act as a backend for any of these,
             | so you don't even need to worry about your platform.
        
             | gjsman-1000 wrote:
             | Every platform is a golden cage in some respect. Ask any
             | business who is stuck on ancient Win32 and even DOS
             | applications, source code long gone. (Looking at you my
             | local McDonalds, Menards, Tractor Supply)...
        
           | brookst wrote:
           | I get the value of the common APIs, but as a developer how do
           | you deal with the wide range of performance in different form
           | factors and product generations? Is there some way to
           | gracefully adapt the same models to a specific device's
           | capabilities?
        
             | londons_explore wrote:
             | There are a bunch of easy ways to scale neural nets.
             | Quantization and distillation being the main approaches (or
             | some combination of the two). Both typically require more
             | training time, but not much more human-effort.
             | 
             | You can normally expect to get way more than half the
             | 'outcome' from a neural net with half the
             | ram/compute/time/power budget. So neural nets scale 'down'
             | pretty well.
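              | 
              | To make the quantization half concrete, a toy sketch in C
              | of symmetric int8 quantization (real toolchains do this
              | per-tensor or per-channel and handle the edge cases):
              | 
              |       #include <math.h>
              |       #include <stdint.h>
              | 
              |       /* Store weights as int8 plus one float scale;
              |          dequantize on use as w[i] ~= q[i] * scale. */
              |       static float quantize_int8(const float *w,
              |                                  int8_t *q, int n) {
              |           float max_abs = 0.0f;
              |           for (int i = 0; i < n; i++)
              |               max_abs = fmaxf(max_abs, fabsf(w[i]));
              |           float scale = max_abs > 0.0f
              |                             ? max_abs / 127.0f : 1.0f;
              |           for (int i = 0; i < n; i++)
              |               q[i] = (int8_t)lrintf(w[i] / scale);
              |           return scale;
              |       }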
        
       | londons_explore wrote:
       | For comparison...
       | 
       | A single Google TPUv4 'pod' (entire row of datacenter racks)
       | gives 1,126,400 TFlops.
       | 
        | That's why your pet ML projects will always be behind those done
       | at big companies.
        
         | mhuffman wrote:
         | I have always been under the impression that there will
         | eventually be a way to distribute ML projects across many
          | personal computers (like Folding@home or SETI@home) that
         | could give even Google a run for their money! A few hundred
         | million personal computers is a lot of processing!
        
       | Thaxll wrote:
        | Why would Apple hide such an optimization from public APIs?
        
         | esskay wrote:
          | It's only really used for their internal applications and the
          | OS-level stuff, so I assume they want to prevent performance
          | issues from it having to deal with 3rd party stuff.
        
         | frogblast wrote:
         | It is available via public APIs, but the hardware instructions
         | themselves are not documented. This lets the instructions
          | change in future CPUs, vs having the ISA be set in stone
         | forever.
         | 
         | Example: AMX predated the standard ARM matrix multiply
         | instructions. Perhaps Apple will add the ARM versions someday
          | and could then remove AMX without breaking compatibility. Or maybe
         | there will be a non-additive AMXv2.
        
       | bob1029 wrote:
       | This is a fairly ridiculous amount of performance, all things
       | considered.
       | 
       | It always seemed to me like SIMD/AVX/etc would eventually come
       | for the GPU's lunch money... How many more product generations of
       | "SIMD on steroids" before this is practically true?
       | 
       | The latency factor is the biggest thing for me. The GPU is a
       | turtle compared to CPU-bound techniques. I can see emerging
       | applications for this in real-time/streaming where every
       | millisecond counts.
        
         | zozbot234 wrote:
         | The GPU is more like a slow U-Haul truck, whereas the CPU is a
         | super fast race car. Both have merit in their own domain. And
         | GPU training is pretty solidly in the "slow and steady" camp.
        
           | pletnes wrote:
           | Training in production, yes. Developing locally is still a
           | thing for many reasons. More importantly, inference is more
           | <<sports car>> - you want the app to stay interactive!
        
         | fnordpiglet wrote:
          | A typical goal is 60 Hz, which is about 16.7k microseconds per
          | frame. My cursory
         | research says as of 10 years ago a typical write/receive
         | latency for an nvidia card, i7, pcie2.0 is 20 microseconds.
         | That gives you a large budget despite the fact SIMD on chip is
         | measured in cycles not microseconds. Inside the GPU you have a
         | huge amount of space and resources to do highly specialized
          | operations in vast concurrency, i.e., bandwidth for compute is
          | huge and specialized. I don't see how CPUs or SoCs will solve
         | this without vastly increasing die sizes and heat and power
         | consumption to be close to that of a GPU with all its cooling
         | requirements and heavy power needs.
         | 
         | That said I think the "good enough" metric is already there and
         | unless you're doing hardware ray tracing or extreme details at
         | high resolutions you won't need or care about a GPU any more.
         | 
         | Latency though isn't the issue. The times involved for human
         | perception are long and not getting shorter.
        
           | fragmede wrote:
           | Things have been "good enough" since 2012. But then VSCode
           | and bigger webpages came along and suddenly a Core2Duo just
           | doesn't cut it anymore. ML models need somewhere to run,
           | locally, and both Apple and Google have dedicated hardware on
           | smartphones for that. Support for bigger and bigger models
           | (read GPU performance) in smaller and smaller packages is
           | just the latest iteration of progress.
        
             | fnordpiglet wrote:
             | Yes I agree. Except I think real time ray tracing really is
             | that much better and shifts the goal posts again.
        
         | jasonwatkinspdx wrote:
         | One interesting data point here is the Fugaku supercomputer is
         | based around ARM's scalable vector stuff (basically Cray style
         | variable length vectors vs short vector SIMD like AVX) and no
         | gpu. Using HBM is a key enabler here.
         | 
         | I'm not sure GPUs will be displaced, looking at the
          | difficulties Larrabee had on the driver side, but I do think
         | we'll see more flexible alternatives becoming popular.
        
         | kllrnohj wrote:
         | You'd need a fairly drastic shift in the memory architecture of
         | CPUs for that. Not something unheard of, such as Intel's new
         | Xeon Max beast with HBM 2e on the CPU module. But it's
         | definitely not an issue of just throwing some big SIMD blocks
         | onto the die & calling it a day. That is, after all, basically
          | what AVX-512 is. And while it has its place, it's also not
         | eating anyone's lunch money.
         | 
         | And also, as weird as it is, 1.5TFlops isn't actually that
         | ridiculous. We had that performance 14 years ago at 150w with
         | desktop GPUs. 14 years to reduce from 150w to what, 5w?, is
         | cool but also honestly pretty par for the course is it not?
         | Especially for a fixed-function block?
        
           | sliken wrote:
           | "You'd need a fairly drastic shift in the memory architecture
           | of CPU". You mean like selling laptops (at 400GB/sec) and
            | desktops (at 800GB/sec) with much improved memory systems?
           | 
           | I don't want to give up SO-DIMMs for a few mm thinner laptop,
           | but going from the intel/amd standard 70GB/sec to 400GB/sec
           | is a pretty big incentive.
        
           | r00fus wrote:
            | Aside from Apple's processors, is 1.5 TFlops at 5 W possible
            | with other archs?
        
           | roxgib wrote:
            | Apple Silicon chips share memory between the CPU and GPU;
            | would that play into any calculation of the relative
            | benefits? Presumably the GPU isn't getting the full benefit
            | of a GPU-optimised memory setup, so the difference would be
            | smaller?
        
         | touisteur wrote:
         | The GPU people are also reaching for simd and fixed matmul hw
         | to increase perf. Tensor Cores (int, fp16, tf32 and even fp64
         | on A100) and the new DPX instructions. RT cores are a different
         | kind of horse but still specialized for BVH traversal and ray-
         | triangle intersection.
        
           | theLiminator wrote:
           | We're reaching a point where CPUs are increasingly getting
           | more specialized, and GPUs are becoming increasingly
           | generalized. Going at improvements from both sides of the
           | sandwich.
        
         | thechao wrote:
         | That's how we felt when we were writing the software rasterizer
          | for Larrabee! The issue is that this 1.5TFLOP probably costs way
          | more power than the M1 GPU's ~2.5TFLOP. The second issue is
          | that a SW rasterizer is going to spend ~50% of its budget
          | emulating fixed function. So now you're using way more power,
          | for 1/4 the perf (best case). Also, you can't run any other apps,
         | and you're probably going to have bandwidth issues to the
         | display controller.
         | 
         | GPUs are an optimization to try to use the excess Moore's law
         | we have to get to the ghost of Dennard's law.
        
           | bob1029 wrote:
           | I think a power/latency/perf tradeoff could be agreeable for
           | certain applications. GPUs in the cloud are not exactly
           | cheap. Many gaming experiences do not require nanite-level
           | graphics.
           | 
           | Building something that can reliably output reasonable-
           | quality 3d graphics without relying on specific GPU
           | technologies will give you a much broader realm to operate
           | with.
           | 
           | I believe something along this path is the solution for
            | streaming gaming. I perceive the failure of Stadia et al.
           | as being a consequence of trying to bolt streaming onto
           | existing GPU-based, local gaming solutions. Build something
           | from scratch with streaming/latency as a #1 priority, and you
           | can dramatically expand the operational radius of each
           | datacenter (e.g. ~100km per millisecond saved).
        
             | dotnet00 wrote:
             | I feel like that's a somewhat out-of-touch interpretation,
             | as Stadia failed largely because of Google's terrible
             | reputation and the (completely valid) concerns from gamers
             | about companies intending to turn even single player games
             | into fragmented streaming platforms where the content is
             | entirely dependent on the whims of the company (a fitting
             | example being Google doing its thing and killing Stadia).
             | They had no shortage of GPUs.
             | 
             | NVIDIA's streaming service is doing relatively fine in
             | comparison. They simply share a GPU between several users
             | for anything that isn't demanding enough. They also get
             | around some of the concerns about gaming being turned into
             | another streaming style fragmented mess by not actually
             | selling the games. You simply log into your account on
             | Steam/GOG/whatever and play the games you already own as
             | you might on a local PC.
             | 
             | Additionally, "building something that can reliably output
             | reasonable-quality 3d graphics without relying on specific
             | GPU technologies" doesn't make much sense to me. If it's an
             | accelerator designed to handle relatively modern 3d
             | graphics, due to the programmability of a modern graphics
             | pipeline it's effectively just a GPU. There aren't any
             | underlying technologies that are required to be used as
             | long as they can produce a similar output (mobile GPUs tend
             | to have a different approach to how they implement the
             | graphics pipeline compared to desktop GPUs for instance).
        
             | jvanderbot wrote:
             | Light is 300km/ms in a vacuum. Is it that much slower
             | through switched fiber?
        
               | Someone wrote:
                | Signal speed in fiber is about 2/3 of that in vacuum
                | (https://en.wikipedia.org/wiki/Optical_fiber#Refractive_index),
                | but fiber won't be a straight line between sender and
                | receiver, light doesn't move in a straight line inside
                | the fiber, and the _switched_ part adds delays.
               | 
               | https://www.pingdom.com/blog/theoretical-vs-real-world-
               | speed...: _"you should probably double the "ideal"
               | response times shown above for a more realistic target to
               | aim at"_
               | 
                | So yes, 1/3 of light speed in vacuum seems a decent
               | heuristic.
        
               | rkangel wrote:
               | Speed of light in glass is about 2/3 of the speed of
               | light in a vacuum (refractive index of glass is around
               | 1.5).
        
               | MobiusHorizons wrote:
               | Round trip latency matters here, which would get you down
               | to 150km without any slowdown through fiber.
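                | 
                | Putting this subthread's (approximate) numbers together
                | in one place:
                | 
                |       #include <stdio.h>
                |       int main(void) {
                |           double vacuum = 300.0;             /* km per ms */
                |           double fiber = vacuum * 2.0 / 3.0; /* ~200      */
                |           /* usable radius per ms of round-trip budget */
                |           printf("%.0f km/ms\n", fiber / 2.0); /* ~100    */
                |           return 0;
                |       }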
        
             | thechao wrote:
             | Well; except for the fact that the 1.5TFLOP quoted in the
             | article is because of the AMX part. The _actually useful_
             | throughput of the big-core is probably more like 35GFLOP
             | _peak_. This compares to the 1-2TFLOP throughput of the
             | GPU. The CPU is easily going to be 50-100x slower than the
             | GPU.
             | 
             | If you're talking full-screen Angry Birds with, say, a 2x
             | average compositing, you're going to be fine on the CPU;
              | but, energy- and jitter-wise you'll still be happier with
             | the GPU, overall.
        
       | superkuh wrote:
       | It's a fast console, there's no doubt of that. Kind of like the
       | Playstation 3 when it came out. Fast, not much software support
       | without lots of special considerations, non-upgradable hardware,
       | limited peripheral support. All in all, a fast CPU embedded in a
       | marginal console-like "computer". People out there who were
       | tricked into buying the M1 8GB ram version can confirm.
        
         | 2fast4you wrote:
         | Tricked how? I've got an M2 8GB and loving it
        
           | smoldesu wrote:
           | Even with swap, 8gb is pretty paltry on a memory-hungry
           | system like MacOS, let alone a system that shares GPU and
           | system memory. 16gb is the minimum for me even though I
           | really only edit/compile code, and even then it can be pretty
           | easy to max out your system memory after a couple docker
           | containers...
           | 
            | It might not be a 'trick' per se, but anyone who intends to
           | use a Mac for work should consider upgraded memory (IMO).
        
             | 2fast4you wrote:
             | I will agree with your last point. If I had bought this
              | machine for doing serious development I would've gone for
              | 16gb. Saying that, I've been pleasantly surprised with its
              | power. I've been playing with Metal, throwing together a
              | native version of ShaderToy, and it hasn't felt underpowered
              | once. Even when running the iPad emulator.
             | 
             | I did feel a little duped when I learned that some M1/M2
             | machines can only support one external monitor. Now I have
             | to replace my two monitors with a widescreen.
        
               | smoldesu wrote:
               | IMO, the 'problem' is that MacOS will use 4-5gb OOB, and
               | using an Electron app with a browser open will easily
               | push that into swap space. For most daily drivers, even
               | light users, they'll be happy to have upgraded memory.
        
               | 2fast4you wrote:
               | Right now with just safari and a few background things
               | I'm hovering at 6gb in use, so you're not wrong about how
               | much memory is being used. Regardless I don't think it's
               | a problem for light users. A light user imo would be just
               | browsing and email. 8GB will give you plenty of headroom
               | in that case.
               | 
               | I'm going to keep an eye on ram usage for the next few
               | days. I'm curious what it will look like on a more full
               | workload because if things have been swapping out, I
               | haven't noticed.
        
         | robertoandred wrote:
         | I love how Apple hater rhetoric hasn't changed in 30 years.
        
       | kolbusa wrote:
       | Nitpick... This paragraph is somewhat confusing. I think it is
       | worded incorrectly:
       | 
        |  _> Let's simplify the problem and implicitly transpose the
       | matrix multiplication. Both A and B (our inputs) will have K (our
       | reduction dimension) as the leading dimension. This doesn't
       | really matter much in practice, but it simplifies our code a
       | lot._
       | 
        | The code is
        | 
        |       C[n * 16 + m] += A[k * 16 + m] * B[k * 16 + n];
        | 
        | Which means that actually *m* is the leading dimension of A with
        | stride 16, and for B it is *n* with stride 16.
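        | 
        | For orientation, a plain-C reference version with that indexing
        | (just an illustration of the data layout, not the article's AMX
        | kernel): both inputs are 16x16 with k striding by 16.
        | 
        |       void matmul_16x16_kt(const float *A, const float *B,
        |                            float *C) {
        |           for (int k = 0; k < 16; k++)
        |               for (int n = 0; n < 16; n++)
        |                   for (int m = 0; m < 16; m++)
        |                       C[n * 16 + m] +=
        |                           A[k * 16 + m] * B[k * 16 + n];
        |       }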
        
       | nimish wrote:
        | How is FP64? Nvidia crippled the 4090 to just 1.3T FP64 flops,
        | so if a Mac mini with an M1 could match that it'd be a solid win.
        
         | bsdnoob wrote:
         | You know you can see your upvoted stories right?
        
       | [deleted]
        
       | varunkmohan wrote:
        | Posts like these are always awesome for seeing how much we can
       | push consumer hardware.
       | 
       | It's hard not to really appreciate some of the devices we have
       | today. For instance, an RTX 4090 is capable of 660 TFlops of FP8
        | (MSRP 1600). Would not be surprised if we soon have laptops that
        | can do petaflops of computation!
        
       | kristianp wrote:
        | Anyone have a comparison with Intel's Deep Learning Boost or
        | VNNI, which is available on AVX-512 processors such as
       | https://ark.intel.com/content/www/us/en/ark/products/213805/...
        
       | gcr wrote:
       | It's amazing to me that there are four separate pieces of
       | hardware in M1 devices that can do matrix multiplies.
       | 
       | In addition to running on the CPU, M1 Max devices have three
       | separate kinds of hardware-accelerated `gemm`: the GPU, the ANE
       | (Apple Neural Engine), and this special matrix coprocessor.
       | Here's a fairly detailed post that benchmarks each:
       | 
       | https://tlkh.dev/benchmarking-the-apple-m1-max
       | 
       | And here's a great post about the justification for having so
       | much special-purpose hardware:
       | 
       | https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...
       | 
       | As for the matrix coprocessor, Apple's built-in BLAS
       | implementation (Accelerate.framework) uses this chip. You can
       | link Numpy against this to benefit in your Python programs, for
       | example. Here are some old instructions:
       | https://gist.github.com/MarkDana/a9481b8134cf38a556cf23e1e81...
       | 
       | All this represents yet another cycle on the Wheel of
       | Reincarnation... (http://catb.org/jargon/html/W/wheel-of-
       | reincarnation.html)
        
         | amelius wrote:
         | Isn't this wheel of reincarnation simply a result of a shifting
         | bottleneck? A computation can be CPU-bound or memory-bound, and
         | this can change over hardware generations.
        
           | gcr wrote:
           | Makes sense... We're also seeing energy efficiency and model
           | size and latency becoming significant constraints these days,
           | and the more unique constraints an application has, perhaps
           | the more beneficial it is to have many different
           | implementations with different tradeoffs.
        
             | ElectricalUnion wrote:
             | > energy efficiency (...) many different implementations
             | 
             | Yep, thermal throttling is a thing, and sometimes all you
              | need is either useless silicon padding or some specialized
              | (and most of the time dark) silicon to both make the chip
              | feasible to cool and prevent it from melting.
        
               | roxgib wrote:
               | I suspect Apple was more worried about battery use in
               | this case.
        
           | throw10920 wrote:
           | It is, but the fact that the bottleneck has shifted multiple
           | times (as opposed to just this one recent time) is nonobvious
           | (to someone unfamiliar with computing history) and worthy of
           | pointing out.
        
         | Dylan16807 wrote:
         | > All this represents yet another cycle on the Wheel of
         | Reincarnation...
         | 
         | Isn't this adding new cores directly onto the main chip? That
         | doesn't sound like it fits to me.
         | 
         | And at this point GPUs have been straddling both sides of the
         | divide for decades, depending on the particular device form
         | factor and the necessary power.
         | 
         | The only thing I would actually say has gone through a _cycle_
         | lately is the crypto accelerator for mac SSDs.
        
           | TimTheTinker wrote:
           | > Isn't this adding new cores directly onto the main chip?
           | That doesn't sound like it fits to me.
           | 
            | These are _coprocessors_, which are a very different thing
           | from just another CPU core. For one, they use a different
           | architecture (instruction set, registers/memory, etc.).
           | 
           | The "wheel of reincarnation" refers to features/capabilities
           | on coprocessors eventually being folded into the main CPU.
           | While CPUs have adopted insights from GPU implementations,
           | GPU functionality has never been fully folded into CPUs
           | (software rasterizers don't count).
        
         | lalaithion wrote:
         | There's also the media encoder hardware accelerator, which
         | isn't quite `gemm`, but certainly contains hardware that
         | performs `mm`s.
        
         | MrBuddyCasino wrote:
         | Since there is no summary, these are the benchmark findings:
          | 
          |       AMX co-processor  2   TFLOPS FP32
          |       GPU               8   TFLOPS FP32
          |       Neural Engine     5.5 TFLOPS FP16
        
           | Firadeoclus wrote:
           | Note that AMX can achieve roughly double the FLOPS with FP16,
           | and 8 TFLOPS for the GPU is only about 77% of peak. You can
            | do better than that; especially using FP16, 90+% of peak is
            | possible (which is >9.4 TFLOPS).
        
           | londons_explore wrote:
           | So why would you choose to use the Neural Engine rather than
           | the GPU?
           | 
           | Just power efficiency?
        
             | potatolicious wrote:
             | That and if you want to use the GPU at the same time.
        
           | londons_explore wrote:
           | Is there any easy way to use all of these at the same time?
           | Ie. some library you can ask to do a big matrix multiply and
           | it will loadbalance between the bits of hardware?
           | 
           | Or do you have to manually split the computation between
           | them?
        
             | thewebcount wrote:
             | I'm by no means an expert in any of this. I mainly work on
             | video processing using the GPU. That said, I would think if
             | any library would do load balancing between them, it would
             | likely be the Accelerate.framework that ships with the
             | system.
             | 
             | However, I do have some experience with having the same
             | code run on the GPU and the CPU. In my work, we have tried
             | breaking images (usually frames of video) into various
             | sized chunks and processing on both the CPU and GPU at the
             | same time. Our conclusion is that the overhead of using
             | both outweighs any benefit you'd get. The GPU is so much
             | faster than the CPU, there's no point in involving the CPU
             | at all. These experiments were done several years ago, so
             | perhaps the landscape has changed since then, but that was
             | what we found.
        
               | jasonwatkinspdx wrote:
               | You might find David Wright's presentations about Unreal
               | 5 interesting:
               | 
               | https://highperformancegraphics.org/slides22/Journey_to_N
               | ani...
               | 
               | https://advances.realtimerendering.com/s2022/index.html#L
               | ume...
               | 
               | They're great presentations with a lot of depth in the
               | notes. I think videos are around somewhere if you prefer
               | that.
               | 
               | Two specifics I'd mention:
               | 
               | It seems a lot of games now use feedback between frames
               | as a way to tolerate the latency of moving data between
               | CPU and GPU. Eg the CPU will use GPU crunched data from
               | the previous frame as a source for CPU crunching that
               | optimizes what data gets passed to the GPU next.
               | 
               | The other is that fixed functionality is moving into
               | shaders. Unreal 5 uses a mix of hardware rasterization
               | and software rasterization in a shader (and path tracing
               | now as well). There the tradeoff between the two is
               | triangle size in pixels.
        
               | thewebcount wrote:
               | Oh wow! Thanks! That looks really cool.
        
               | jasonwatkinspdx wrote:
               | They're great. I dunno if you find 3d interesting vs
               | video, but the section in that nanite presentation where
               | he goes through how he arrived at the LoD clustering
               | design is some of the smartest stuff I've ever seen any
               | developer say, ever. Like John Carmack probably saw this
               | and went "dang, wish I'd thought of that" levels of
               | smart.
        
       | muricula wrote:
       | Some folks may be interested in the Armv9 Scalable Matrix
       | Extensions which appear to do something very very similar.
       | https://community.arm.com/arm-community-blogs/b/architecture...
        
       | FL33TW00D wrote:
       | I love all the posts by Bram. Please keep writing them!
        
       | NelsonMinar wrote:
       | Does Apple use the AMX in their own code? Is anything like the
       | AMX present in their mobile CPUs?
        
       | danieldk wrote:
       | The AMX units are really nice, especially because you can use
       | them simply with standard _sgemm_ through the Accelerate
       | framework. However, in most applications where latency is not an
        | issue, you'll probably want to use Metal Performance Shaders
        | instead: not only are they much faster for most applications,
        | they can also be more energy efficient.
       | 
       | For instance, we did benchmarks of spaCy (natural language
       | processing) transformer models across various Apple Silicon SoCs
       | and MPS was 1.9x (M1) to 5.5x faster (M1 Ultra) while providing
       | far more performance per Watt. E.g. using MPS on an M2 MacBook
       | Air used 4W less energy while being 2.7x faster than AMX.
       | 
       | Full benchmarks are at the end of this post:
       | 
       | https://explosion.ai/blog/metal-performance-shaders
        
       | atonse wrote:
       | I remember driving to college nearly 20 years ago and one of the
       | headlines on the radio (NPR morning show) was that the DOE had
       | unveiled the world's fastest supercomputer at the White House
       | that day. It was going to do a whopping 6 Teraflops, and they
       | explained what that meant. And I remember thinking about all the
       | possibilities with that kind of compute.
       | 
       | I understand that this 1.5 TFlops may not be an exact comparison
       | (or maybe it's the same), but if it's even within an order of
       | magnitude, it is beyond mind-blowing, and we've just crossed over
        | into exaflops at the supercomputer level.
        
       ___________________________________________________________________
       (page generated 2023-01-05 23:00 UTC)