[HN Gopher] AI's compute fragmentation: what matrix multiplication teaches us
       ___________________________________________________________________
        
       AI's compute fragmentation: what matrix multiplication teaches us
        
       Author : tzhenghao
       Score  : 46 points
       Date   : 2023-03-23 18:34 UTC (4 hours ago)
        
 (HTM) web link (www.modular.com)
 (TXT) w3m dump (www.modular.com)
        
       | BenoitP wrote:
       | There's hope in intermediate representations, in OpenXLA:
       | 
       | https://opensource.googleblog.com/2023/03/openxla-is-ready-t...
       | 
       | > OpenXLA is an open source ML compiler ecosystem co-developed by
       | AI/ML industry leaders including Alibaba, Amazon Web Services,
       | AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face,
       | Intel, Meta, and NVIDIA. It enables developers to compile and
       | optimize models from all leading ML frameworks for efficient
       | training and serving on a wide variety of hardware
        
         | junrushao1994 wrote:
          | One thing I really love about XLA is GSPMD, which effectively
          | enables scalable distributed training in practice. However, I'm
          | curious how it relates to matrix multiplication, given that XLA
          | focuses more on graph-level optimization and basically offloads
          | matmuls to other libraries like Triton and cuBLAS.
        
       | photochemsyn wrote:
       | > "Think about it: how can a small number of specialized experts,
       | who hand write and tune assembly code, possibly scale their work
       | to all the different configurations while also incorporating
       | their work into all the AI frameworks?! It's simply an impossible
       | task."
       | 
        | Naively, I wonder if this is the kind of problem that AI itself
        | can solve, which is a rather singularity-approaching concept.
        | Maybe there's too much logic involved and not enough training
        | data on different configurations for that to work? Still, the
        | thought of self-bootstrapping AI is a bit spooky.
        
         | dimatura wrote:
          | There has been work on using AI for this at various levels: at
          | the neural architecture level (finding architectures with high
          | throughput and low latency on a given piece of hardware), at
          | the algorithm level (finding faster matrix multiplication
          | routines), and at the hardware level (IIRC Google stated the
          | latest version of its TPUs was partially designed with AI).
        
         | bigbillheck wrote:
         | This is the kind of problem AI's been solving for 25 years and
         | more: https://www.fftw.org
        
       | bigbillheck wrote:
        | Surely one solution is for each AI framework to understand the
        | operating environment itself and choose the best implementation
        | at run time, much like they currently do.
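        | 
        | As a minimal sketch of what that run-time choice can look like on
        | the CUDA side (the two implementations here are hypothetical
        | stand-ins; real frameworks keep whole tables of kernels keyed by
        | architecture and problem shape):
        | 
        |   #include <cstdio>
        |   #include <cuda_runtime.h>
        | 
        |   // Hypothetical implementations; a real framework would pick
        |   // from many, specialized per architecture and shape.
        |   static void sgemm_generic()     { std::puts("generic SIMT kernel"); }
        |   static void sgemm_tensor_core() { std::puts("tensor-core kernel"); }
        | 
        |   int main() {
        |       cudaDeviceProp prop;
        |       cudaGetDeviceProperties(&prop, /*device=*/0);
        |       // Dispatch on the detected hardware: TF32 tensor cores
        |       // arrive with the Ampere generation (sm_80 and up).
        |       if (prop.major >= 8) sgemm_tensor_core();
        |       else                 sgemm_generic();
        |       return 0;
        |   }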
        
       | kickingvegas wrote:
       | Off topic, but related.
       | https://mastodon.social/@mcc/110024854706734967
        
       | adamnemecek wrote:
       | No, they compute spectra.
        
       | junrushao1994 wrote:
        | My take: optimizing matrix multiplication is not hard on modern
        | architectures if you have the right abstraction. The code itself
        | is fragmented across different programming models, true, but the
        | underlying techniques are not hard for a 2nd/3rd-year undergrad
        | to understand. There are only a few important ones on GPU: loop
        | tiling, pipelining, shared-memory swizzling, and memory
        | coalescing. A properly designed compiler can let developers
        | optimize matmuls in under 100 lines of code.
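        | 
        | As a rough illustration of how little code that is (a hand-
        | written sketch, not the compiler-generated version this comment
        | has in mind; assumes square row-major matrices with N divisible
        | by the tile width):
        | 
        |   #define TILE 32
        | 
        |   // C = A * B. Each block computes one TILE x TILE tile of C,
        |   // staging tiles of A and B through shared memory so each
        |   // global element is reused TILE times, and adjacent threads
        |   // load adjacent addresses (coalesced access).
        |   __global__ void sgemm_tiled(const float* A, const float* B,
        |                               float* C, int N) {
        |       __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        |       int tx = threadIdx.x, ty = threadIdx.y;
        |       int row = blockIdx.y * TILE + ty;
        |       int col = blockIdx.x * TILE + tx;
        |       float acc = 0.0f;
        |       for (int t = 0; t < N / TILE; ++t) {
        |           As[ty][tx] = A[row * N + t * TILE + tx];
        |           Bs[ty][tx] = B[(t * TILE + ty) * N + col];
        |           __syncthreads();
        |           for (int k = 0; k < TILE; ++k)
        |               acc += As[ty][k] * Bs[k][tx];
        |           __syncthreads();
        |       }
        |       C[row * N + col] = acc;
        |   }
        | 
        |   // launch: dim3 grid(N / TILE, N / TILE), block(TILE, TILE);
        |   // sgemm_tiled<<<grid, block>>>(dA, dB, dC, N);
        | 
        | This covers the tiling and coalescing parts; software pipelining
        | and shared-memory swizzling are further refinements on top.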
        
         | touisteur wrote:
          | Looking at the effort poured into things like cutlass, which
          | still doesn't reach cuBLAS perf (which very few can beat - in
          | the places where cuBLAS shines, which is... not that many...),
          | and even cuDNN still eking out single-digit improvements
          | regularly, I'd say this is probably harder than that. At least
          | if you're reaching for >50% use of the 37 TFLOPS of an A40. If
          | you're fine throwing more GPUs at the problem, sure.
         | 
          | Edit: I mean, when you still see papers every year with large
          | improvements in perf, and things like 'we used tensor cores and
          | managed to get back fp32 accuracy with 3 rounds of the things'
          | - what? - I can attest it doesn't take 2 weeks to get this kind
          | of result. And it's just getting started on tensor cores! And
          | when on the nvidia forums someone says 'nah, probably no
          | improvement from using tensor cores for FFT' and you get a link
          | to a paper with a significant improvement in perf using tensor
          | cores, I say we're just getting started.
        
           | junrushao1994 wrote:
            | This is definitely a great point! In the context of AI
            | workloads, where the critical matmuls are basically regular,
            | large shapes, are there many cases where cutlass/Triton are
            | worse than cuBLAS to the point that we need to throw more
            | GPUs at it?
        
             | touisteur wrote:
              | cuBLAS is very often too heavy (too much overhead, memory
              | movement to fit the API, not optimized for small batches of
              | small matrices) and you can get huge improvements by
              | chaining cuDNN/cutlass/autotuned kernels. Especially if
              | you're still on GDDR6, every data movement is a killer, so
              | if you can put it all together and never go back to global
              | memory, you get amazing improvements. And this is without
              | tensor cores. Programming them by hand is a pain, so here
              | enters cutlass...
        
               | junrushao1994 wrote:
               | Yeah cuBLAS is definitely not perfect in many cases :-((
               | 
                | Speaking of the GEMM fusion you mentioned, flash
                | attention is basically GEMM fusion with online softmax,
                | right? This is something I believe is really cool and can
                | be made really easy with a proper abstraction. Say, you
                | may move a chunk of computation under a certain loop and
                | instruct the compiler to optimize data movement or cache
                | intermediate tiles somewhere on chip.
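                | 
                | A plain-C++ sketch of just the online-softmax piece (not
                | flash attention itself, only the running-rescale trick it
                | relies on):
                | 
                |   #include <cmath>
                | 
                |   // One-pass softmax: keep a running max m and a
                |   // running sum s of exp(x - m); when a larger max
                |   // shows up, rescale s. This is what lets flash
                |   // attention consume K/V tiles one at a time without
                |   // materializing the full score matrix.
                |   void online_softmax(const float* x, float* out,
                |                       int n) {
                |       float m = -INFINITY, s = 0.0f;
                |       for (int i = 0; i < n; ++i) {
                |           float m_new = (x[i] > m) ? x[i] : m;
                |           s = s * std::exp(m - m_new)
                |             + std::exp(x[i] - m_new);
                |           m = m_new;
                |       }
                |       for (int i = 0; i < n; ++i)
                |           out[i] = std::exp(x[i] - m) / s;
                |   }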
        
               | touisteur wrote:
                | There's something of this in cutlass with prologues and
                | epilogues, and in the 'backend mode' of cudnn, but
                | overall, breaking the 'cuBLAS takes your whole device and
                | tries to saturate it for this one matmul' model is going
                | to require a whole lot of abstraction work.
                | 
                | Cutlass is supposed to be the first step, and to anyone
                | who struggles to understand WTF you're doing when using
                | it: you are not alone. I've seen literally amazing, room-
                | silencing stuff done with it, but heavy template stuff is
                | really not my thing.
        
           | junrushao1994 wrote:
           | > we used tensor cores and managed to get back fp32 accuracy
           | with 3 rounds of the things
           | 
            | Hey, are you referring to 3xTF32
            | (https://github.com/NVIDIA/cutlass/tree/master/examples/28_am...)?
            | IMO this is a perfect example where a proper abstraction
            | could save engineers a non-trivial amount of time - imagine a
            | compiler stack which allows 3xTF32 as a normal dtype, with
            | subsequent analyses compatible with this special dtype :-)
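            | 
            | For readers who haven't seen it, a small sketch of the idea
            | (TF32 rounding is emulated here on the CPU by masking
            | mantissa bits; the real thing happens inside the tensor
            | core):
            | 
            |   #include <cstdint>
            |   #include <cstdio>
            |   #include <cstring>
            | 
            |   // Emulate dropping an fp32 value to TF32 precision
            |   // (10 mantissa bits) by zeroing the low 13 bits.
            |   static float to_tf32(float x) {
            |       std::uint32_t u;
            |       std::memcpy(&u, &x, 4);
            |       u &= 0xFFFFE000u;
            |       std::memcpy(&x, &u, 4);
            |       return x;
            |   }
            | 
            |   int main() {
            |       float a = 1.000123456f, b = 0.999876543f;
            |       // Split each operand into a "big" TF32 part and a
            |       // TF32 remainder, then take 3 of the 4 cross
            |       // products (a_small * b_small is negligible).
            |       float a_big = to_tf32(a), a_small = to_tf32(a - a_big);
            |       float b_big = to_tf32(b), b_small = to_tf32(b - b_big);
            |       float x1 = a_big * b_big;                  // 1xTF32
            |       float x3 = a_big * b_big + a_big * b_small
            |                + a_small * b_big;                // 3xTF32
            |       std::printf("fp32 %.9f\n1x   %.9f\n3x   %.9f\n",
            |                   a * b, x1, x3);
            |   }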
        
         | mathisfun123 wrote:
         | > A properly designed compiler can allow developers to optimize
         | matmuls within 100 lines of code.
         | 
          | man, this is such a funny closing comment - what exactly do you
          | think is involved in designing a compiler that enables devs to
          | optimize matmuls, if not 1000s of person-hours/years/etc of
          | very "fine-grained" perf research? what the "abstraction"
          | people don't understand (because they only deal in
          | abstractions) is that achieving performance involves literally
          | the antithesis of abstraction - you need to understand your
          | hardware down to the gate level (sometimes).
         | 
         | > loop tiling, pipelining, shared memory swizzle, memory
         | coalescing
         | 
          | have you ever applied any of these? the only way you could
          | apply them generically (without consideration of your
          | particular hardware) is with a tuner; that is of course the
          | route widely taken, but it's not an "understanding" of anything
          | except guess and check.
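          | 
          | To be concrete about the guess-and-check route: a toy tuner is
          | just a kernel parameterized over its tile size, timed per
          | candidate on the actual device, keeping the winner (no warm-up
          | run or correctness check here; the kernel is the same textbook
          | tiled one discussed upthread):
          | 
          |   #include <cstdio>
          |   #include <cuda_runtime.h>
          | 
          |   template <int TILE>
          |   __global__ void sgemm(const float* A, const float* B,
          |                         float* C, int N) {
          |       __shared__ float As[TILE][TILE], Bs[TILE][TILE];
          |       int tx = threadIdx.x, ty = threadIdx.y;
          |       int row = blockIdx.y * TILE + ty;
          |       int col = blockIdx.x * TILE + tx;
          |       float acc = 0.0f;
          |       for (int t = 0; t < N / TILE; ++t) {
          |           As[ty][tx] = A[row * N + t * TILE + tx];
          |           Bs[ty][tx] = B[(t * TILE + ty) * N + col];
          |           __syncthreads();
          |           for (int k = 0; k < TILE; ++k)
          |               acc += As[ty][k] * Bs[k][tx];
          |           __syncthreads();
          |       }
          |       C[row * N + col] = acc;
          |   }
          | 
          |   // Time one candidate tile size with CUDA events.
          |   template <int TILE>
          |   float time_ms(const float* A, const float* B, float* C,
          |                 int N) {
          |       dim3 grid(N / TILE, N / TILE), block(TILE, TILE);
          |       cudaEvent_t t0, t1;
          |       cudaEventCreate(&t0); cudaEventCreate(&t1);
          |       cudaEventRecord(t0);
          |       sgemm<TILE><<<grid, block>>>(A, B, C, N);
          |       cudaEventRecord(t1);
          |       cudaEventSynchronize(t1);
          |       float ms = 0.0f;
          |       cudaEventElapsedTime(&ms, t0, t1);
          |       return ms;
          |   }
          | 
          |   int main() {
          |       const int N = 1024;
          |       float *A, *B, *C;
          |       cudaMalloc(&A, N * N * 4);
          |       cudaMalloc(&B, N * N * 4);
          |       cudaMalloc(&C, N * N * 4);
          |       std::printf("tile  8: %.3f ms\n", time_ms<8>(A, B, C, N));
          |       std::printf("tile 16: %.3f ms\n", time_ms<16>(A, B, C, N));
          |       std::printf("tile 32: %.3f ms\n", time_ms<32>(A, B, C, N));
          |   }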
        
       | brucethemoose2 wrote:
       | Yeah well tell all that to Nvidia, who very much likes the
       | fragmentation and wants to keep things that way.
        
         | dekhn wrote:
         | they are the one vendor who had the insight ~20 years ago to
         | invest long-term in GPUs and have continuously made impressive
         | products while supporting a cross-platform developer base. For
         | this, I reward them with my $$$ (both work and home).
        
         | misnome wrote:
         | And they developed this fragmentation by... building good
         | tools, good documentation, and comprehensively supporting them
         | for 15 years in a way that makes people feel safe building on
         | top of them.
         | 
         | It's not fragmentation, they built a moat.
        
           | touisteur wrote:
            | And with their actual understanding of the hardware
            | limitations of GPUs (memory bandwidth), the parallel work on
            | things like cutlass (if there was ever an unportable thing
            | :-), the coming *Dx libraries (the explosion of
            | cuBLAS/Solver/FFT to allow kernel fusion and new in-kernel
            | linear algebra shenanigans), and the slow but steady
            | introduction of sparsity everywhere, I can't see how anyone
            | can do anything but play catch-up.
        
         | spookie wrote:
         | It's not like other vendors have made meaningful efforts in
         | alternatives. AMD still hasn't released RDNA3 support for ROCm,
         | their open compute platform. Hell, I don't even think RDNA2 has
         | proper support as of now.
         | 
         | There's also the issue of poor documentation and learning
         | material in the wild.
        
           | turmeric_root wrote:
            | yeah, when getting DL up and running on AMD requires a
            | datacentre card, it's no wonder CUDA is more popular. AMD is
            | enabling ROCm on consumer GPUs now, but it's still a pain to
            | get up and running, because of the inertia that CUDA has.
        
       | version_five wrote:
       | A cool mission
        
       ___________________________________________________________________
       (page generated 2023-03-23 23:00 UTC)