[HN Gopher] How to get 1.5 TFlops of FP32 performance on a singl...
___________________________________________________________________
 
  How to get 1.5 TFlops of FP32 performance on a single M1 CPU core
 
  Author : signa11
  Score  : 281 points
  Date   : 2023-01-05 13:22 UTC (9 hours ago)
 
  (HTM) web link (jott.live)
  (TXT) w3m dump (jott.live)
 
  | Tepix wrote:
  | So, this is about Apple's undocumented AMX instructions, issued
  | from the CPU and executed on a special accelerator execution
  | unit.
  |
  | Is there one such unit per CPU core?

  | MuffinFlavored wrote:
  | > So, this is about Apple's undocumented AMX instructions,
  | > issued from the CPU and executed on a special accelerator
  | > execution unit.
  |
  | CPU instruction -> AMX instruction -> AMX result -> CPU?
  |
  | How are these kinds of things usually kept in sync / in a
  | manageable state? Does the CPU block until the AMX returns?

  | my123 wrote:
  | M1 has one AMX unit per cluster AFAIK. This however can and does
  | change between different chips.

  | danieldk wrote:
  | Yes, there is one per core cluster. The title is a bit
  | misleading, because it suggests that going to two or three cores
  | will scale linearly, but it won't be much faster. See here for
  | sgemm benchmarks for everything from the M1 to the M1 Ultra and
  | 1 to 16 threads:
  |
  | https://github.com/danieldk/gemm-benchmark#1-to-16-threads

  | adrian_b wrote:
  | No.
  |
  | So the title is misleading, even if it is true that you get this
  | performance with a program that uses a single CPU core.

  | mochomocha wrote:
  | I think the author downplays the significance of his work
  | because it only applies to "small neural networks". There are a
  | lot of use-cases that can benefit from this type of
  | optimization. Discovering how to use an undocumented fast
  | accelerator available on millions of devices is very valuable.

  | MuffinFlavored wrote:
  | Not up to date on a lot of "AI"/"ML" things - why isn't this
  | significant for medium/large neural networks as well?

  | lostmsu wrote:
  | RTX 3090 theoretical matmul is 142 TFlops, i.e. about 100x of
  | this.

  | bee_rider wrote:
  | The 1.5 here is for a single core, though. If we assume that a
  | performance core on an M1 is around 7.5 watts (I'm not actually
  | sure; it seems like a reasonable upper bound, though, given that
  | a whole M1 mini is around 39 watts), we'd be looking at around
  | 750 watts to match. That seems like a surprisingly non-crazy
  | amount of power, given that these are 32-bit flops, unlike the
  | 16-bit ones in the RTX 3090, and that they come from a CPU.

  | lostmsu wrote:
  | This code runs on the AMX co-processor. From the article:
  |
  | > An important distinction is that the AMX:CPU ratio is not 1:1;
  | > not every core has its own AMX co-processor.
  |
  | My understanding is that there's only 1 of those per regular M1
  | CPU, and maybe 4 on the largest one (Ultra).

  | johndough wrote:
  | The RTX 3090 has 35.58 TFlops of FP32 performance, or 285.48
  | FP16, according to
  | https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...
  |
  | EDIT: I fell for NVIDIA's marketing. The dense FP16 performance
  | is only half of 285.48, which is about 142. Thanks to adgjlsfhk1
  | for the correction.

  | adgjlsfhk1 wrote:
  | That 285 is listed as (2:1 sparse), which means it's only valid
  | for matrices where 2 out of every 4 numbers are zero. For dense
  | matrices it's half that.

  | bee_rider wrote:
  | Are 2:1 sparse matrices a common thing? It seems weird - clearly
  | that's not sparse enough to want to use sparse-matrix "CSR"
  | style storage or something, haha. I would just treat it as
  | dense, I guess.

  | adgjlsfhk1 wrote:
  | They aren't. As far as I can tell, Nvidia does this to be able
  | to double the number of TFlops they put on their website. (This
  | might be a little unfair; the real reason is that in ML it
  | _might_ be possible to train a NN such that your matrices have
  | this structure, but I haven't seen anyone other than Nvidia use
  | it.)
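  To make the "2:1 sparse" / 2:4 pattern above concrete, here is a
  minimal NumPy sketch. The helper prune_2_of_4 is hypothetical and
  written only for illustration (it is not NVIDIA's pruning API); it
  zeroes the two smallest-magnitude entries in every group of four
  weights, which is the structure the sparse Tensor Cores exploit:
 
      import numpy as np
 
      def prune_2_of_4(w):
          # keep the 2 largest-magnitude entries in each group of 4
          out = w.reshape(-1, 4).copy()
          smallest = np.argsort(np.abs(out), axis=1)[:, :2]
          np.put_along_axis(out, smallest, 0.0, axis=1)
          return out.reshape(w.shape)
 
      w = np.random.randn(8, 8).astype(np.float32)
      wp = prune_2_of_4(w)
      # every group of 4 now has at most 2 nonzero entries
      assert ((wp.reshape(-1, 4) != 0).sum(axis=1) <= 2).all()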
  | bee_rider wrote:
  | I'm trying to think of cases where it might accidentally come
  | up, and all I can think of is something like "oops, I used
  | complex but my values are actually real."

  | [deleted]

  | dotnet00 wrote:
  | There has been some work in that direction, but it hasn't really
  | caught on as fast as NVIDIA may have expected it to.

  | lostmsu wrote:
  | Yeah, still waiting for this feature to be available in PyTorch
  | natively.

  | my123 wrote:
  | Apple did prefer to expose it through their own
  | Accelerate.framework API, however...

  | capableweb wrote:
  | Of course they do; Apple likes to remain as much in control as
  | possible. If it suddenly becomes more efficient/faster to run
  | ML/AI stuff on Asahi Linux on Mac hardware than with macOS, I'm
  | sure they'd be embarrassed enough to take some sort of action.
  | And I'm pretty sure that action would be towards the side of
  | "closing things down" rather than "opening stuff up", as is
  | tradition.

  | my123 wrote:
  | Wrong answer.
  |
  | AMX is an unstable ISA that changes between product generations.
  | That's why it's not publicly documented.
  |
  | Arm SME is the standardisation of the concept, but it is not in
  | market yet.
  |
  | https://community.arm.com/arm-community-blogs/b/architecture...

  | svantana wrote:
  | Has it been verified that they actually use these instructions
  | in Accelerate.framework? I just benchmarked this on my 2019
  | Intel i9 MBP and got the following speeds for 128x128 matrices,
  | 32 repeats:
  |
  |     cblas_sgemm: 36 GFLOP/s
  |     vDSP_mmul:   41 GFLOP/s
  |
  | That's a pretty big deal if these functions are >30x faster on
  | the M1...!
  |
  | edit: that seems to be verified in the tlkh.dev blog post above.
  | Interestingly, I ran the same code on my bargain-basement 2020
  | iPhone SE and got 259 GFLOP/s! These Apple devices are pretty
  | mind-blowing.

  | danieldk wrote:
  | _Has it been verified that they actually use these instructions
  | in Accelerate.framework?_
  |
  | Yes. Aside from benchmarks, you can easily verify this by
  | profiling an application with Instruments and then inspecting
  | the disassembly.
  |
  | However, it should be said that AMX does not scale linearly with
  | the number of cores, but with the number of core clusters. So,
  | on the M1, if you use Accelerate in two threads (rather than
  | one), performance will barely improve, because the first thread
  | can already keep the AMX unit busy.
  |
  | However, e.g. the M1 Pro and M1 Max have two performance core
  | clusters with AMX units in them, so matrix multiplication is
  | roughly twice as fast as on the M1. Similarly, the M1 Ultra has
  | four performance core clusters, so its matrix multiplication
  | performance is roughly twice that of the M1 Pro/Max and four
  | times that of the M1.
  |
  | Benchmarks:
  |
  | https://github.com/danieldk/gemm-benchmark#1-to-16-threads

  | mark_l_watson wrote:
  | Apple has done a wonderful job making CoreML smoothly integrated
  | with iOS, watchOS, iPadOS, and macOS development.

  | ur-whale wrote:
  | > Apple has done a wonderful job making CoreML
  |
  | Apple has done a wonderful job of further locking their users
  | into the golden cage they call a platform.

  | bee_rider wrote:
  | The worst thing is, their users don't even seem to be totally
  | happy with the state of affairs! It's like they don't even
  | realize their preferences are wrong. :(
  | bee_rider wrote:
  | This was intended to be obvious sarcasm, but I somehow
  | accidentally added "don't", which... really just makes it
  | confusing. Oops, haha.

  | gcr wrote:
  | I think you're right and you're wrong; it's a bit more
  | complicated.
  |
  | ML is one of the few applications that benefit from
  | platform-specific optimizations, so if you need every ounce of
  | performance, you have your choice of which walled garden to
  | tether your application to. The "lock-in" comes from the
  | specific capabilities of your special-purpose hardware, and for
  | serious applications you're already thinking hard about whether
  | to design your entire implementation around Apple, NVidia,
  | Google/TPU, or even Android devices. For big models,
  | platform-specific needs influence every aspect of model design,
  | including data/model sharding, quantization, training loops...
  |
  | For non-scientific applications, the usual practice is to train
  | your model in platform-agnostic ways using PyTorch or TensorFlow
  | or whatever and _then_ deploy it to devices in platform-specific
  | ways, whether that's XLA, CoreML, Edge TPU, Android NNAPI,
  | TensorflowJS, or, hell, custom-written GLSL shaders or whatever.
  |
  | We're just starting to see cross-platform frameworks that
  | abstract model inference: TFLite, PyTorch Mobile, ONNX. To their
  | credit, CoreML can act as a backend for any of these, so you
  | don't even need to worry about your platform.

  | gjsman-1000 wrote:
  | Every platform is a golden cage in some respect. Ask any
  | business that is stuck on ancient Win32 or even DOS
  | applications, source code long gone. (Looking at you, my local
  | McDonalds, Menards, Tractor Supply...)

  | brookst wrote:
  | I get the value of the common APIs, but as a developer how do
  | you deal with the wide range of performance across different
  | form factors and product generations? Is there some way to
  | gracefully adapt the same models to a specific device's
  | capabilities?

  | londons_explore wrote:
  | There are a bunch of easy ways to scale neural nets,
  | quantization and distillation being the main approaches (or some
  | combination of the two). Both typically require more training
  | time, but not much more human effort.
  |
  | You can normally expect to get way more than half the 'outcome'
  | from a neural net with half the RAM/compute/time/power budget,
  | so neural nets scale 'down' pretty well.

  | londons_explore wrote:
  | For comparison...
  |
  | A single Google TPUv4 'pod' (an entire row of datacenter racks)
  | gives 1,126,400 TFlops.
  |
  | That's why your pet ML projects will always be behind those done
  | at big companies.

  | mhuffman wrote:
  | I have always been under the impression that there will
  | eventually be a way to distribute ML projects across many
  | personal computers (like Folding@home or SETI@home) that could
  | give even Google a run for its money! A few hundred million
  | personal computers is a lot of processing!

  | Thaxll wrote:
  | Why would Apple hide such an optimization from public APIs?

  | esskay wrote:
  | It's only really used for their internal applications and the
  | OS-level stuff, so I assume they want to prevent performance
  | issues from it having to deal with 3rd-party stuff.

  | frogblast wrote:
  | It is available via public APIs, but the hardware instructions
  | themselves are not documented. This lets the instructions change
  | in future CPUs, vs having the ISA be baked in stone forever.
  |
  | Example: AMX predated the standard ARM matrix multiply
  | instructions. Perhaps Apple will add the ARM versions someday
  | and can then remove AMX without breaking compatibility. Or maybe
  | there will be a non-additive AMXv2.
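  For readers who want to poke at that public-API path themselves,
  here is a rough throughput measurement from Python, in the spirit
  of svantana's numbers above. It assumes NumPy is linked against
  Accelerate, which is not a given (check numpy.show_config(); many
  builds ship OpenBLAS instead, in which case AMX is never
  involved):
 
      import time
      import numpy as np
 
      n = 1024
      a = np.random.rand(n, n).astype(np.float32)
      b = np.random.rand(n, n).astype(np.float32)
      a @ b                       # warm-up
 
      reps = 50
      t0 = time.perf_counter()
      for _ in range(reps):
          a @ b
      dt = time.perf_counter() - t0
      flops = 2 * n ** 3 * reps   # one multiply + one add per step
      print(f"{flops / dt / 1e9:.1f} GFLOP/s")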
  | bob1029 wrote:
  | This is a fairly ridiculous amount of performance, all things
  | considered.
  |
  | It always seemed to me like SIMD/AVX/etc. would eventually come
  | for the GPU's lunch money... How many more product generations
  | of "SIMD on steroids" before this is practically true?
  |
  | The latency factor is the biggest thing for me. The GPU is a
  | turtle compared to CPU-bound techniques. I can see emerging
  | applications for this in real-time/streaming, where every
  | millisecond counts.

  | zozbot234 wrote:
  | The GPU is more like a slow U-Haul truck, whereas the CPU is a
  | super-fast race car. Both have merit in their own domain, and
  | GPU training is pretty solidly in the "slow and steady" camp.

  | pletnes wrote:
  | Training in production, yes. Developing locally is still a thing
  | for many reasons. More importantly, inference is more <<sports
  | car>> - you want the app to stay interactive!

  | fnordpiglet wrote:
  | A typical goal is 60 Hz, which is about 17k microseconds per
  | frame. My cursory research says that, as of 10 years ago, a
  | typical write/receive latency for an NVIDIA card (i7, PCIe 2.0)
  | was 20 microseconds. That gives you a large budget, despite the
  | fact that SIMD on-chip is measured in cycles, not microseconds.
  | Inside the GPU you have a huge amount of space and resources for
  | highly specialized operations at vast concurrency, i.e., the
  | bandwidth for compute is huge and specialized. I don't see how
  | CPUs or SoCs will solve this without vastly increasing die
  | sizes, heat, and power consumption to be close to those of a
  | GPU, with all its cooling requirements and heavy power needs.
  |
  | That said, I think the "good enough" metric is already there,
  | and unless you're doing hardware ray tracing or extreme detail
  | at high resolutions, you won't need or care about a GPU any
  | more.
  |
  | Latency, though, isn't the issue. The times involved for human
  | perception are long and not getting shorter.

  | fragmede wrote:
  | Things have been "good enough" since 2012. But then VSCode and
  | bigger webpages came along, and suddenly a Core2Duo just doesn't
  | cut it anymore. ML models need somewhere to run locally, and
  | both Apple and Google have dedicated hardware on smartphones for
  | that. Support for bigger and bigger models (read: GPU
  | performance) in smaller and smaller packages is just the latest
  | iteration of progress.

  | fnordpiglet wrote:
  | Yes, I agree - except I think real-time ray tracing really is
  | that much better and shifts the goal posts again.

  | jasonwatkinspdx wrote:
  | One interesting data point here is the Fugaku supercomputer,
  | which is based around ARM's scalable vector stuff (basically
  | Cray-style variable-length vectors vs short-vector SIMD like
  | AVX) and no GPU. Using HBM is a key enabler here.
  |
  | I'm not sure GPUs will be displaced, looking at the difficulties
  | Larrabee had on the driver side, but I do think we'll see more
  | flexible alternatives becoming popular.

  | kllrnohj wrote:
  | You'd need a fairly drastic shift in the memory architecture of
  | CPUs for that. Not something unheard of - see Intel's new Xeon
  | Max beast with HBM2e on the CPU module - but it's definitely not
  | an issue of just throwing some big SIMD blocks onto the die and
  | calling it a day. That is, after all, basically what AVX-512 is,
  | and while it has its place, it's also not eating anyone's lunch
  | money.
  |
  | And also, as weird as it is, 1.5 TFlops isn't actually that
  | ridiculous. We had that performance 14 years ago at 150 W with
  | desktop GPUs. Fourteen years to go from 150 W to, what, 5 W? is
  | cool, but honestly also pretty par for the course, is it not?
  | Especially for a fixed-function block?
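  As a back-of-envelope check on where a number like 1.5 TFlops can
  come from: assuming one AMX unit retires a 16x16 grid of FP32
  fused multiply-adds per cycle at roughly the M1's performance-core
  clock (both figures are assumptions for illustration, not Apple
  specifications), the peak works out to:
 
      fmas_per_cycle = 16 * 16  # assumed 16x16 FP32 FMA grid/cycle
      flops_per_fma = 2         # an FMA is a multiply plus an add
      clock_hz = 3.2e9          # approximate M1 P-core clock
      peak = fmas_per_cycle * flops_per_fma * clock_hz
      print(f"{peak / 1e12:.2f} TFLOPS")  # ~1.64; 1.5 is ~90% of it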
  | sliken wrote:
  | "You'd need a fairly drastic shift in the memory architecture of
  | CPUs." You mean like selling laptops (at 400 GB/sec) and
  | desktops (at 800 GB/sec) with much-improved memory systems?
  |
  | I don't want to give up SO-DIMMs for a few mm thinner laptop,
  | but going from the Intel/AMD standard of 70 GB/sec to 400 GB/sec
  | is a pretty big incentive.

  | r00fus wrote:
  | Aside from Apple's processors, is 1.5 TFlops in 5 W possible
  | with other archs?

  | roxgib wrote:
  | Apple Silicon chips share memory between the CPU and GPU - would
  | that play into any calculation of the relative benefits?
  | Presumably the GPU isn't getting the full benefit of a
  | GPU-optimised memory setup, so the difference would be smaller?

  | touisteur wrote:
  | The GPU people are also reaching for SIMD and fixed matmul
  | hardware to increase perf: Tensor Cores (int, fp16, tf32, and
  | even fp64 on the A100) and the new DPX instructions. RT cores
  | are a different kind of horse, but still specialized, for BVH
  | traversal and ray-triangle intersection.

  | theLiminator wrote:
  | We're reaching a point where CPUs are getting increasingly
  | specialized while GPUs are becoming increasingly generalized -
  | improvements coming from both sides of the sandwich.

  | thechao wrote:
  | That's how we felt when we were writing the software rasterizer
  | for Larrabee! The issue is that that 1.5 TFLOP probably costs
  | way more power than the M1 GPU's ~2.5 TFLOP. The second issue is
  | that a SW rasterizer is going to spend ~50% of its budget
  | emulating fixed function. So now you're at way more power for
  | 1/4 the perf (best case). Also, you can't run any other apps,
  | and you're probably going to have bandwidth issues to the
  | display controller.
  |
  | GPUs are an optimization that tries to use the excess of Moore's
  | law to get at the ghost of Dennard's law.

  | bob1029 wrote:
  | I think a power/latency/perf tradeoff could be agreeable for
  | certain applications. GPUs in the cloud are not exactly cheap,
  | and many gaming experiences do not require nanite-level
  | graphics.
  |
  | Building something that can reliably output reasonable-quality
  | 3D graphics without relying on specific GPU technologies gives
  | you a much broader realm to operate in.
  |
  | I believe something along this path is the solution for
  | streaming gaming. I perceive the failure of Stadia et al. as a
  | consequence of trying to bolt streaming onto existing GPU-based,
  | local gaming solutions. Build something from scratch with
  | streaming/latency as the #1 priority, and you can dramatically
  | expand the operational radius of each datacenter (e.g. ~100 km
  | per millisecond saved).

  | dotnet00 wrote:
  | I feel like that's a somewhat out-of-touch interpretation.
  | Stadia failed largely because of Google's terrible reputation
  | and the (completely valid) concerns from gamers about companies
  | intending to turn even single-player games into fragmented
  | streaming platforms where the content is entirely dependent on
  | the whims of the company (a fitting example being Google doing
  | its thing and killing Stadia). They had no shortage of GPUs.
  |
  | NVIDIA's streaming service is doing relatively fine in
  | comparison. They simply share a GPU between several users for
  | anything that isn't demanding enough. They also get around some
  | of the concerns about gaming turning into another
  | streaming-style fragmented mess by not actually selling the
  | games: you simply log into your account on Steam/GOG/whatever
  | and play the games you already own, as you would on a local PC.
  |
  | Additionally, "building something that can reliably output
  | reasonable-quality 3D graphics without relying on specific GPU
  | technologies" doesn't make much sense to me. If it's an
  | accelerator designed to handle relatively modern 3D graphics,
  | then, due to the programmability of a modern graphics pipeline,
  | it's effectively just a GPU. There aren't any underlying
  | technologies that are required to be used, as long as the result
  | is similar output (mobile GPUs tend to implement the graphics
  | pipeline differently from desktop GPUs, for instance).
  | jvanderbot wrote:
  | Light travels 300 km/ms in a vacuum. Is it that much slower
  | through switched fiber?

  | Someone wrote:
  | Signal speed in fiber is about 2/3 of that in vacuum
  | (https://en.wikipedia.org/wiki/Optical_fiber#Refractive_index),
  | but fiber won't run in a straight line between sender and
  | receiver, light doesn't move in a straight line inside the
  | fiber, and the _switched_ adds delays.
  |
  | https://www.pingdom.com/blog/theoretical-vs-real-world-speed...:
  | _"you should probably double the "ideal" response times shown
  | above for a more realistic target to aim at"_
  |
  | So yes, 1/3 of light speed in vacuum seems a decent heuristic.

  | rkangel wrote:
  | The speed of light in glass is about 2/3 of the speed of light
  | in a vacuum (the refractive index of glass is around 1.5).

  | MobiusHorizons wrote:
  | Round-trip latency matters here, which would get you down to
  | 150 km even without any slowdown through fiber.

  | thechao wrote:
  | Well, except for the fact that the 1.5 TFLOP quoted in the
  | article comes from the AMX part. The _actually useful_
  | throughput of the big core is probably more like 35 GFLOP
  | _peak_. This compares to the 1-2 TFLOP throughput of the GPU.
  | The CPU is easily going to be 50-100x slower than the GPU.
  |
  | If you're talking full-screen Angry Birds with, say, 2x average
  | compositing, you're going to be fine on the CPU; but energy- and
  | jitter-wise you'll still be happier with the GPU overall.

  | superkuh wrote:
  | It's a fast console, there's no doubt of that. Kind of like the
  | PlayStation 3 when it came out: fast, not much software support
  | without lots of special considerations, non-upgradable hardware,
  | limited peripheral support. All in all, a fast CPU embedded in a
  | marginal console-like "computer". People out there who were
  | tricked into buying the M1 8 GB RAM version can confirm.

  | 2fast4you wrote:
  | Tricked how? I've got an M2 8GB and am loving it.

  | smoldesu wrote:
  | Even with swap, 8 GB is pretty paltry on a memory-hungry system
  | like macOS, let alone a system that shares GPU and system
  | memory. 16 GB is the minimum for me, even though I really only
  | edit/compile code, and even then it can be pretty easy to max
  | out your system memory after a couple of Docker containers...
  |
  | It might not be a 'trick' per se, but anyone who intends to use
  | a Mac for work should consider upgraded memory (IMO).

  | 2fast4you wrote:
  | I will agree with your last point. If I had bought this machine
  | for doing serious development, I would've gone for 16 GB. That
  | said, I've been pleasantly surprised by its power. I've been
  | playing with Metal, throwing together a native version of
  | ShaderToy, and it hasn't felt underpowered once - even when
  | running the iPad emulator.
  |
  | I did feel a little duped when I learned that some M1/M2
  | machines can only support one external monitor. Now I have to
  | replace my two monitors with a widescreen.
  | smoldesu wrote:
  | IMO, the 'problem' is that macOS will use 4-5 GB out of the box,
  | and using an Electron app with a browser open will easily push
  | that into swap space. Most daily drivers, even light users, will
  | be happy to have upgraded memory.

  | 2fast4you wrote:
  | Right now, with just Safari and a few background things, I'm
  | hovering at 6 GB in use, so you're not wrong about how much
  | memory is being used. Regardless, I don't think it's a problem
  | for light users. A light user, IMO, would be just browsing and
  | email; 8 GB will give you plenty of headroom in that case.
  |
  | I'm going to keep an eye on RAM usage for the next few days. I'm
  | curious what it will look like under a fuller workload, because
  | if things have been swapping out, I haven't noticed.

  | robertoandred wrote:
  | I love how Apple-hater rhetoric hasn't changed in 30 years.

  | kolbusa wrote:
  | Nitpick... This paragraph is somewhat confusing; I think it is
  | worded incorrectly:
  |
  | _> Let's simplify the problem and implicitly transpose the
  | matrix multiplication. Both A and B (our inputs) will have K
  | (our reduction dimension) as the leading dimension. This doesn't
  | really matter much in practice, but it simplifies our code a
  | lot._
  |
  | The code is
  |
  |     C[n * 16 + m] += A[k * 16 + m] * B[k * 16 + n];
  |
  | which means that actually *m* is the leading dimension of A with
  | stride 16, and for B it is *n* with stride 16.

  | nimish wrote:
  | How is FP64? Nvidia crippled the 4090 to just 1.3 TFlops of
  | FP64, so if a Mac mini with an M1 could match that, it'd be a
  | solid win.

  | bsdnoob wrote:
  | You know you can see your upvoted stories, right?

  | [deleted]

  | varunkmohan wrote:
  | Posts like these are always awesome for seeing how far we can
  | push consumer hardware.
  |
  | It's hard not to appreciate some of the devices we have today.
  | For instance, an RTX 4090 is capable of 660 TFlops of FP8 (MSRP
  | $1600). I would not be surprised if we soon have laptops that
  | can do petaflops of computation!

  | kristianp wrote:
  | Anyone have a comparison with Intel's Deep Learning Boost
  | (VNNI), which is available on AVX-512 processors such as
  | https://ark.intel.com/content/www/us/en/ark/products/213805/...?

  | gcr wrote:
  | It's amazing to me that there are four separate pieces of
  | hardware in M1 devices that can do matrix multiplies.
  |
  | In addition to running on the CPU, M1 Max devices have three
  | separate kinds of hardware-accelerated `gemm`: the GPU, the ANE
  | (Apple Neural Engine), and this special matrix coprocessor.
  | Here's a fairly detailed post that benchmarks each:
  |
  | https://tlkh.dev/benchmarking-the-apple-m1-max
  |
  | And here's a great post about the justification for having so
  | much special-purpose hardware:
  |
  | https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...
  |
  | As for the matrix coprocessor, Apple's built-in BLAS
  | implementation (Accelerate.framework) uses this chip. You can
  | link NumPy against it to benefit in your Python programs, for
  | example. Here are some old instructions:
  | https://gist.github.com/MarkDana/a9481b8134cf38a556cf23e1e81...
  |
  | All this represents yet another cycle on the Wheel of
  | Reincarnation
  | (http://catb.org/jargon/html/W/wheel-of-reincarnation.html).
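  A quick way to check whether your own NumPy takes the Accelerate
  route gcr describes (an Accelerate-linked build mentions
  accelerate or vecLib in the BLAS section of the output; many
  wheels are built against OpenBLAS instead):
 
      import numpy as np
 
      np.show_config()  # inspect the BLAS/LAPACK entries printed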
  | amelius wrote:
  | Isn't this wheel of reincarnation simply the result of a
  | shifting bottleneck? A computation can be CPU-bound or
  | memory-bound, and this can change over hardware generations.

  | gcr wrote:
  | Makes sense... We're also seeing energy efficiency, model size,
  | and latency become significant constraints these days, and the
  | more unique constraints an application has, perhaps the more
  | beneficial it is to have many different implementations with
  | different tradeoffs.

  | ElectricalUnion wrote:
  | > energy efficiency (...) many different implementations
  |
  | Yep, thermal throttling is a thing, and sometimes all you need
  | is either useless silicon padding or some specialized,
  | most-of-the-time-dark silicon, both to make the chip feasible to
  | cool and to prevent it from melting.

  | roxgib wrote:
  | I suspect Apple was more worried about battery use in this case.

  | throw10920 wrote:
  | It is, but the fact that the bottleneck has shifted multiple
  | times (as opposed to just this one recent time) is non-obvious
  | (to someone unfamiliar with computing history) and worth
  | pointing out.

  | Dylan16807 wrote:
  | > All this represents yet another cycle on the Wheel of
  | > Reincarnation...
  |
  | Isn't this adding new cores directly onto the main chip? That
  | doesn't sound like it fits to me.
  |
  | And at this point GPUs have been straddling both sides of the
  | divide for decades, depending on the particular device form
  | factor and the necessary power.
  |
  | The only thing I would actually say has gone through a _cycle_
  | lately is the crypto accelerator for Mac SSDs.

  | TimTheTinker wrote:
  | > Isn't this adding new cores directly onto the main chip? That
  | > doesn't sound like it fits to me.
  |
  | These are _coprocessors_, which are a very different thing from
  | just another CPU core. For one, they use a different
  | architecture (instruction set, registers/memory, etc.).
  |
  | The "wheel of reincarnation" refers to features/capabilities on
  | coprocessors eventually being folded into the main CPU. While
  | CPUs have adopted insights from GPU implementations, GPU
  | functionality has never been fully folded into CPUs (software
  | rasterizers don't count).

  | lalaithion wrote:
  | There's also the media-encoder hardware accelerator, which isn't
  | quite `gemm`, but certainly contains hardware that performs
  | `mm`s.

  | MrBuddyCasino wrote:
  | Since there is no summary, these are the benchmark findings:
  |
  |     AMX co-processor    2 TFLOPS FP32
  |     GPU                 8 TFLOPS FP32
  |     Neural Engine     5.5 TFLOPS FP16

  | Firadeoclus wrote:
  | Note that AMX can achieve roughly double the FLOPS with FP16,
  | and 8 TFLOPS for the GPU is only about 77% of peak. You can do
  | better than that; especially with FP16, 90+% is possible (which
  | is >9.4 TFLOPS).

  | londons_explore wrote:
  | So why would you choose to use the Neural Engine rather than the
  | GPU? Just power efficiency?

  | potatolicious wrote:
  | That, and if you want to use the GPU at the same time.

  | londons_explore wrote:
  | Is there any easy way to use all of these at the same time?
  | I.e., some library you can ask to do a big matrix multiply that
  | will load-balance across the bits of hardware?
  |
  | Or do you have to manually split the computation between them?

  | thewebcount wrote:
  | I'm by no means an expert in any of this; I mainly work on video
  | processing using the GPU. That said, I would think that if any
  | library were to do load balancing between them, it would likely
  | be the Accelerate.framework that ships with the system.
  |
  | However, I do have some experience with having the same code run
  | on the GPU and the CPU. In my work, we have tried breaking
  | images (usually frames of video) into various-sized chunks and
  | processing them on both the CPU and GPU at the same time. Our
  | conclusion was that the overhead of using both outweighs any
  | benefit you'd get. The GPU is so much faster than the CPU that
  | there's no point in involving the CPU at all. These experiments
  | were done several years ago, so perhaps the landscape has
  | changed since then, but that was what we found.
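  For the "manually split it yourself" option, the usual shape of
  the code is a data-parallel partition like the sketch below. Both
  halves run on CPU threads through NumPy purely for illustration;
  in a real heterogeneous setup, one slice would be dispatched to
  the GPU instead. As thewebcount's experiments suggest, the extra
  synchronization and memory traffic can easily outweigh the gain:
 
      import numpy as np
      from concurrent.futures import ThreadPoolExecutor
 
      def matmul_split(a, b, n_parts=2):
          # partition the rows of A, hand each slice to a worker,
          # then stitch the partial results back together
          slices = np.array_split(a, n_parts, axis=0)
          with ThreadPoolExecutor(max_workers=n_parts) as pool:
              parts = list(pool.map(lambda s: s @ b, slices))
          return np.vstack(parts)
 
      a = np.random.rand(512, 256).astype(np.float32)
      b = np.random.rand(256, 128).astype(np.float32)
      assert np.allclose(matmul_split(a, b), a @ b, atol=1e-3)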
  | jasonwatkinspdx wrote:
  | You might find David Wright's presentations about Unreal 5
  | interesting:
  |
  | https://highperformancegraphics.org/slides22/Journey_to_Nani...
  |
  | https://advances.realtimerendering.com/s2022/index.html#Lume...
  |
  | They're great presentations with a lot of depth in the notes. I
  | think videos are around somewhere if you prefer that.
  |
  | Two specifics I'd mention:
  |
  | It seems a lot of games now use feedback between frames as a way
  | to tolerate the latency of moving data between CPU and GPU. E.g.
  | the CPU will use GPU-crunched data from the previous frame as a
  | source for CPU crunching that optimizes what data gets passed to
  | the GPU next.
  |
  | The other is that fixed functionality is moving into shaders.
  | Unreal 5 uses a mix of hardware rasterization and software
  | rasterization in a shader (and path tracing now as well). There,
  | the tradeoff between the two is triangle size in pixels.

  | thewebcount wrote:
  | Oh wow! Thanks! That looks really cool.

  | jasonwatkinspdx wrote:
  | They're great. I dunno if you find 3D as interesting as video,
  | but the section in that Nanite presentation where he goes
  | through how he arrived at the LoD clustering design is some of
  | the smartest stuff I've ever seen any developer say, ever. Like,
  | John Carmack probably saw this and went "dang, wish I'd thought
  | of that" levels of smart.

  | muricula wrote:
  | Some folks may be interested in the Armv9 Scalable Matrix
  | Extensions, which appear to do something very, very similar:
  | https://community.arm.com/arm-community-blogs/b/architecture...

  | FL33TW00D wrote:
  | I love all the posts by Bram. Please keep writing them!

  | NelsonMinar wrote:
  | Does Apple use the AMX in their own code? Is anything like the
  | AMX present in their mobile CPUs?

  | danieldk wrote:
  | The AMX units are really nice, especially because you can use
  | them simply through standard _sgemm_ in the Accelerate
  | framework. However, in most applications where latency is not an
  | issue, you'll probably want to use Metal Performance Shaders
  | instead: not only are they much faster for most applications,
  | they can also be more energy efficient.
  |
  | For instance, we benchmarked spaCy (natural language processing)
  | transformer models across various Apple Silicon SoCs, and MPS
  | was 1.9x (M1) to 5.5x (M1 Ultra) faster while providing far more
  | performance per watt. E.g., using MPS on an M2 MacBook Air used
  | 4 W less energy while being 2.7x faster than AMX.
  |
  | Full benchmarks are at the end of this post:
  |
  | https://explosion.ai/blog/metal-performance-shaders
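  A minimal sketch of the MPS route danieldk compares against,
  using PyTorch's Metal backend (this assumes a PyTorch build with
  MPS support; the CPU fallback keeps it runnable elsewhere):
 
      import torch
 
      device = torch.device(
          "mps" if torch.backends.mps.is_available() else "cpu")
      a = torch.rand(1024, 1024, device=device)
      b = torch.rand(1024, 1024, device=device)
      c = a @ b  # runs on the Apple GPU when the device is "mps"
      print(c.device)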
  | atonse wrote:
  | I remember driving to college nearly 20 years ago, and one of
  | the headlines on the radio (an NPR morning show) was that the
  | DOE had unveiled the world's fastest supercomputer at the White
  | House that day. It was going to do a whopping 6 teraflops, and
  | they explained what that meant. And I remember thinking about
  | all the possibilities with that kind of compute.
  |
  | I understand that this 1.5 TFlops may not be an exact comparison
  | (or maybe it's the same), but if it's even within an order of
  | magnitude, it is beyond mind-blowing - and we've just crossed
  | over into exaflops at the supercomputer level.
___________________________________________________________________
  (page generated 2023-01-05 23:00 UTC)