[HN Gopher] AI's compute fragmentation: what matrix multiplicati...
___________________________________________________________________
 
 AI's compute fragmentation: what matrix multiplication teaches us
 
 Author : tzhenghao
 Score  : 46 points
 Date   : 2023-03-23 18:34 UTC (4 hours ago)
 
 (HTM) web link (www.modular.com)
 (TXT) w3m dump (www.modular.com)
 
| BenoitP wrote:
| There's hope in intermediate representations, in OpenXLA:
|
| https://opensource.googleblog.com/2023/03/openxla-is-ready-t...
|
| > OpenXLA is an open source ML compiler ecosystem co-developed
| by AI/ML industry leaders including Alibaba, Amazon Web
| Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging
| Face, Intel, Meta, and NVIDIA. It enables developers to compile
| and optimize models from all leading ML frameworks for
| efficient training and serving on a wide variety of hardware.
 
  | junrushao1994 wrote:
  | One thing I really love about XLA is GSPMD, which effectively
  | enables scalable distributed training in practice. However, I
  | am quite curious how it relates to matrix multiplication,
  | given that XLA focuses more on graph-level optimization and
  | basically offloads matmuls to other libraries like Triton and
  | cuBLAS.
 
| photochemsyn wrote:
| > "Think about it: how can a small number of specialized
| experts, who hand write and tune assembly code, possibly scale
| their work to all the different configurations while also
| incorporating their work into all the AI frameworks?! It's
| simply an impossible task."
|
| Naively, I wonder if this is the kind of problem that AI itself
| can solve, which is a rather singularity-approaching concept.
| Maybe there's too much logic involved and not enough training
| data on different configurations for that to work? A bit
| spooky, however, the thought of self-bootstrapping AI.
 
  | dimatura wrote:
  | There has been work on using AI for this at various levels:
  | at the neural architecture level (finding architectures with
  | high throughput / low latency on a given piece of hardware),
  | at the algorithm level (finding faster matrix multiplication
  | routines), and at the hardware level (IIRC Google stated that
  | the latest TPUs were partially designed with AI).
 
  | bigbillheck wrote:
  | This is the kind of problem AI has been solving for 25 years
  | and more: https://www.fftw.org
 
| bigbillheck wrote:
| Surely one solution is for each AI framework to understand its
| operating environment and choose the best implementation at
| run-time, much like they currently do.
 
| kickingvegas wrote:
| Off topic, but related:
| https://mastodon.social/@mcc/110024854706734967
 
  | adamnemecek wrote:
  | No, they compute spectra.
 
| junrushao1994 wrote:
| My take: optimizing matrix multiplication is not hard on modern
| architectures if you have the right abstraction. The code
| itself may be fragmented across different programming models,
| true, but the underlying techniques are not hard for a 2nd/3rd-
| year undergrad to understand. There are only a few important
| ones on GPU: loop tiling, pipelining, shared memory swizzling,
| and memory coalescing. A properly designed compiler can allow
| developers to optimize matmuls within 100 lines of code.
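For reference, this is what those techniques look like in their
textbook form - a minimal shared-memory tiled SGEMM sketch in
CUDA, not code from the article or the comment above. TILE and
all names are illustrative; the matrices are assumed square and
row-major, and swizzling, pipelining, and tensor cores are
deliberately left out:

    // C = A * B for N x N row-major float matrices.
    #include <cuda_runtime.h>

    #define TILE 32

    __global__ void sgemm_tiled(const float* A, const float* B,
                                float* C, int N) {
        // Stage TILE x TILE blocks of A and B in shared memory:
        // this is the "loop tiling" step.
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N; t += TILE) {
            // Adjacent threads (threadIdx.x) load adjacent
            // addresses: this is the "memory coalescing" step.
            As[threadIdx.y][threadIdx.x] =
                (row < N && t + threadIdx.x < N)
                    ? A[row * N + t + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (t + threadIdx.y < N && col < N)
                    ? B[(t + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();

            // Each thread accumulates one output element from
            // the tiles currently resident on chip.
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < N && col < N)
            C[row * N + col] = acc;
    }

Launched with dim3 block(TILE, TILE) and a grid of ceil(N/TILE) x
ceil(N/TILE) blocks, each element of A and B is read from global
memory N/TILE times instead of N times, which is the entire point
of the tiling.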
  | touisteur wrote:
  | Looking at the effort plunked into things like cutlass, which
  | still doesn't reach cuBLAS perf (which very few can beat - in
  | the places where cuBLAS shines, which is not that many!), and
  | even in cuDNN, where they're still eking out single-digit
  | improvements regularly, I'd say this is probably harder than
  | that. At least if you're reaching for >50% use of the 37
  | TFLOPS of an A40. If you're fine throwing more GPUs at the
  | problem, sure.
  |
  | Edit: I mean, when you still see papers every year with large
  | improvements in perf, and things like 'we used tensor cores
  | and managed to get back fp32 accuracy with 3 rounds of the
  | things' - what? - I can attest it takes more than 2 weeks to
  | get that kind of result. And it's just getting started on
  | tensor cores! And when someone on the nvidia forums says
  | 'nah, probably no improvement from using tensor cores for
  | FFT' and you get a link to a paper with a significant
  | improvement in perf using tensor cores, I say we're just
  | starting.
 
    | junrushao1994 wrote:
    | This is definitely a great point! In the context of AI
    | workloads, where the critical matmuls basically have
    | regular, large shapes, are there many cases where
    | cutlass/Triton are worse than cuBLAS and we need to throw
    | more GPUs at the problem?
 
      | touisteur wrote:
      | cuBLAS is very often too heavy (too much overhead, memory
      | movement to fit the API, not optimized for small batches
      | of small matrices), and you can get huge improvements by
      | chaining cudnn/cutlass/autotuned kernels. Especially if
      | you're still on GDDR6, every data movement is a killer,
      | so if you can put it all together and never go back to
      | global memory, you get amazing improvements. And this is
      | without tensor cores. Programming them by hand is a pain,
      | so here enters cutlass...
 
        | junrushao1994 wrote:
        | Yeah, cuBLAS is definitely not perfect in many cases
        | :-((
        |
        | Speaking of the GEMM fusion you mentioned, flash
        | attention is basically GEMM fusion with online softmax,
        | right? This is something I believe is really cool and
        | can be made really easy with a proper abstraction. Say,
        | you could move a chunk of computation under a certain
        | loop and instruct the compiler to optimize data
        | movement or cache intermediate tiles somewhere on chip.
 
          | touisteur wrote:
          | There's something of this in cutlass with prologues
          | and epilogues, and in the 'backend mode' of cudnn,
          | but overall, breaking out of the 'cuBLAS takes your
          | whole device and tries to saturate it for this one
          | matmul' model is going to require a whole lot of
          | abstraction work.
          |
          | Cutlass is supposed to be the first step, and to
          | anyone who struggles to understand WTF you're doing
          | when using it: you are not alone. I've seen literally
          | amazing, room-silencing stuff done with it, but heavy
          | template stuff is really not my thing.
 
    | junrushao1994 wrote:
    | > we used tensor cores and managed to get back fp32
    | accuracy with 3 rounds of the things
    |
    | Hey, are you referring to 3xTF32
    | (https://github.com/NVIDIA/cutlass/tree/master/examples/28_am...)?
    | IMO this is a perfect example where proper abstraction
    | could save engineers a non-trivial amount of time - imagine
    | a compiler stack that allows 3xTF32 as a normal dtype, with
    | subsequent analysis compatible with this special dtype :-)
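For context on the 3xTF32 trick under discussion: TF32 keeps
fp32's 8-bit exponent but only 10 of its 23 mantissa bits, so
each fp32 operand can be split into a TF32-representable "big"
part plus a "small" residual, and a near-full-precision product
can be rebuilt from three cheap ones. A host-side sketch of the
idea, with an illustrative bit-masked tf32() helper standing in
for the hardware's rounding and tensor-core MMAs:

    // Emulate one fp32 multiply with three lower-precision
    // products - the idea behind cutlass's 3xTF32 example.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Truncate to TF32 precision: keep the sign, the 8 exponent
    // bits, and the top 10 mantissa bits; zero the low 13.
    static float tf32(float x) {
        uint32_t u;
        std::memcpy(&u, &x, sizeof u);
        u &= 0xFFFFE000u;
        std::memcpy(&x, &u, sizeof u);
        return x;
    }

    int main() {
        float a = 1.234567f, b = 7.654321f;

        // Split: a == a_big + a_small, with a_big exact in TF32.
        float a_big = tf32(a), a_small = tf32(a - a_big);
        float b_big = tf32(b), b_small = tf32(b - b_big);

        float one_mul   = a_big * b_big;    // plain 1xTF32
        float three_mul = a_big * b_big     // 3xTF32: drop only
                        + a_big * b_small   // the tiny
                        + a_small * b_big;  // small*small term

        printf("fp32   : %.9f\n", a * b);
        printf("1xTF32 : %.9f\n", one_mul);
        printf("3xTF32 : %.9f\n", three_mul);
    }

The dropped small*small term is on the order of 2^-22 relative to
the product, versus roughly 2^-11 of truncation error for a
single TF32 multiply, so the three-term sum lands close to full
fp32 accuracy - presumably the "get back fp32 accuracy with 3
rounds" result quoted above.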
what the "abstraction" people | don't understand (because they only deal in abstractions) is | that achieving performance involves literally the antithesis of | abstraction - you need to understand your hardware down to the | gate level (sometimes). | | > loop tiling, pipelining, shared memory swizzle, memory | coalescing | | have you ever applied any of these? the only way you could | apply these as a generic (without consideration of your | particular hardware) algo is using a tuner; this is of course | widely the route taken but that's not an "understanding" of | anything except guess and check. | brucethemoose2 wrote: | Yeah well tell all that to Nvidia, who very much likes the | fragmentation and wants to keep things that way. | dekhn wrote: | they are the one vendor who had the insight ~20 years ago to | invest long-term in GPUs and have continuously made impressive | products while supporting a cross-platform developer base. For | this, I reward them with my $$$ (both work and home). | misnome wrote: | And they developed this fragmentation by... building good | tools, good documentation, and comprehensively supporting them | for 15 years in a way that makes people feel safe building on | top of them. | | It's not fragmentation, they built a moat. | touisteur wrote: | And with their actual understanding of the hardware | limitations of GPUs (memory bandwidth) and the parallel work | on things like cutlass (if there was ever an unportable thing | :-), the coming *Dx libraries (the explosion of | cuBLAS/Solver/FFT to allow kernel fusion and new in-kernel | linear algebra shenanigans) the slow but steady introduction | of sparsity everywhere, I can't see how anyone can but play | catch-up. | spookie wrote: | It's not like other vendors have made meaningful efforts in | alternatives. AMD still hasn't released RDNA3 support for ROCm, | their open compute platform. Hell, I don't even think RDNA2 has | proper support as of now. | | There's also the issue of poor documentation and learning | material in the wild. | turmeric_root wrote: | yeah when getting DL up and running on AMD requires using a | datacentre card then it's no wonder CUDA is more popular. AMD | is enabling ROCm for commercial GPUs now but it's still a | pain to get it up and running, because of the inertia that | CUDA has. | version_five wrote: | A cool mission ___________________________________________________________________ (page generated 2023-03-23 23:00 UTC)