[HN Gopher] Render a neural network into CUDA/HIP code
       ___________________________________________________________________
        
       Render a neural network into CUDA/HIP code
        
       Author : fzliu
       Score  : 122 points
       Date   : 2023-06-02 17:14 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Havoc wrote:
        | Interesting that AMD GPUs seem to be first-class citizens here.
        | Consumer-class gear is much cheaper per unit of VRAM, by the
        | looks of it.
        
       | brucethemoose2 wrote:
       | Also here are some other interesting projects in the ML
       | compilation space:
       | 
       | - Apache TVM (mlc-llm is a good demo)
       | 
       | - Hidet (a torch.compile backend)
       | 
       | - Alibaba BladeDISC
       | 
       | - Nvidia TensorRT (a classic, but much less of a nightmare to
       | install now)
       | 
       | - Torch MLIR (SHARK has some demos/implementations)
        
         | jahewson wrote:
         | And of course, Chris Lattner's Modular AI
         | https://www.modular.com/
        
       | homarp wrote:
       | CUDA: NVIDIA GPU 'framework'
       | 
       | HIP: AMD GPU 'framework'
       | 
        | This takes neural networks defined in Python and converts them
        | to C++ code calling CUDA / HIP for maximum inference speed.
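        | 
        | Roughly, instead of the framework dispatching one op at a time
        | at runtime, the generated C++ calls kernels directly, often
        | fusing several ops into one launch. A minimal, illustrative
        | sketch of the kind of kernel such a generator might emit (not
        | AITemplate's actual output; batch size 1, a linear layer with
        | the ReLU fused in):
        | 
        |   #include <cuda_runtime.h>
        |   #include <cstdio>
        | 
        |   // One launch replaces two framework ops (matmul + ReLU).
        |   __global__ void linear_relu(const float* x, const float* w,
        |                               const float* b, float* y,
        |                               int in_dim, int out_dim) {
        |     int o = blockIdx.x * blockDim.x + threadIdx.x;
        |     if (o >= out_dim) return;
        |     float acc = b[o];
        |     for (int i = 0; i < in_dim; ++i)
        |       acc += x[i] * w[o * in_dim + i];
        |     y[o] = acc > 0.f ? acc : 0.f;  // ReLU fused in
        |   }
        | 
        |   int main() {
        |     const int in_dim = 4, out_dim = 2;
        |     float *x, *w, *b, *y;
        |     cudaMallocManaged(&x, in_dim * sizeof(float));
        |     cudaMallocManaged(&w, in_dim * out_dim * sizeof(float));
        |     cudaMallocManaged(&b, out_dim * sizeof(float));
        |     cudaMallocManaged(&y, out_dim * sizeof(float));
        |     for (int i = 0; i < in_dim; ++i) x[i] = 1.f;
        |     for (int i = 0; i < in_dim * out_dim; ++i) w[i] = 0.5f;
        |     b[0] = 1.f; b[1] = -10.f;  // second output hits the ReLU
        |     linear_relu<<<1, 32>>>(x, w, b, y, in_dim, out_dim);
        |     cudaDeviceSynchronize();
        |     printf("%f %f\n", y[0], y[1]);  // 3.0 0.0
        |     return 0;
        |   }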
        
       | iaw wrote:
        | Anyone seen details on whether this can handle splitting a model
       | across GPUs?
        
       | sosodev wrote:
       | The latency improvements are impressive but the ability to run
       | models beyond their typical memory limitations is way cooler.
        
       | hintymad wrote:
       | I like this humility: "AITemplate is co-created by Meta
       | engineers: Bing Xu, Ying Zhang, Hao Lu, Yang Chen, and Terry
       | Chen, with major contributions coming from more talented
       | engineers."
        
       | samstave wrote:
       | ELI5 what this means?
       | 
       | I am losing my bibliography, etymology and vocabulary with every
       | single AI advancement article.
       | 
       | Where learn AI vocab, please?
       | 
       | -
       | 
        | I need an FN AI teacher to just give me daily updates on AI and
        | verbiage, models, etc...
        | 
        | Hey AI - if you're so smart, build a podcast that teaches me
        | about yourself and how to be a better meat parent who made
        | you.///\\\\\\\
        
         | bagels wrote:
         | What is an FN AI?
        
           | samstave wrote:
           | A 'fUCKIn' ai'
        
           | dragonwriter wrote:
           | An autonomous Belgian firearm.
        
         | skirmish wrote:
         | Starting with a trained PyTorch model, it builds optimized C++
          | binaries for running inference (not training) on Nvidia and
          | AMD GPUs. Various optimizations are mentioned, so presumably
          | models run faster than they would via regular PyTorch.
        
           | stevenwliao wrote:
           | How much faster is it?
        
             | pumanoir wrote:
             | Depends on the model and GPU. Here is an example of almost
              | 2x on a 3060 for Stable Diffusion:
             | https://www.youtube.com/watch?v=_6BsUijOWoM
        
           | prsutherland wrote:
           | I'm curious why that is called "rendering" rather than
           | "compiling". Is the code boiler plate and just a change in
           | the NN's representation?
        
         | iaw wrote:
          | Very much not an expert here, but what I understand is that
          | most deep learning frameworks (PyTorch, TensorFlow, etc.) have
          | some overhead associated with them just being on the graphics
          | card. This takes PyTorch code and removes the overhead by
          | translating the network into a "native" language for the card
          | (CUDA for NVIDIA).
          | 
          | What I'm not sure about is what "HIP" is in this context.
          | 
          | The way I'm reading this, it's the difference between running
          | code in an interpreter vs. on the bare metal (for the GPU).
        
           | [deleted]
        
           | entropicdrifter wrote:
           | HIP is AMD's open-source re-implementation of CUDA libraries.
        
             | viewtransform wrote:
             | HIP is AMD's contribution to the open source community to
             | overcome Nvidia's CUDA software moat
             | 
              | You write code in the HIP C++ dialect and run it on either
              | NVIDIA or AMD platforms. This way you get cross-platform
              | code and are not stuck with Nvidia.
              | 
              | Use the HIPify tool to automatically convert existing
              | sources from CUDA to HIP.
              | 
              | It's been around for many years, but the fact that so many
              | people still don't know about it speaks to the sad state
              | of AMD's communication.
             | 
             | https://docs.amd.com/bundle/HIP-Programming-
             | Guide-v5.3/page/...
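              | 
              | To give a feel for it, here is a small self-contained CUDA
              | program; the comments note roughly what a HIPify pass
              | renames things to (a sketch for illustration; the kernel
              | body itself is untouched):
              | 
              |   // CUDA source; hipify mostly just renames the
              |   // runtime API (HIP names in the comments).
              |   #include <cuda_runtime.h>  // -> hip/hip_runtime.h
              |   #include <cstdio>
              | 
              |   __global__ void scale(float* v, float s, int n) {
              |     int i = blockIdx.x * blockDim.x + threadIdx.x;
              |     if (i < n) v[i] *= s;  // unchanged by hipify
              |   }
              | 
              |   int main() {
              |     const int n = 256;
              |     float host[n];
              |     for (int i = 0; i < n; ++i) host[i] = 1.f;
              |     float* dev;
              |     cudaMalloc(&dev, n * sizeof(float));  // hipMalloc
              |     cudaMemcpy(dev, host, n * sizeof(float),
              |                cudaMemcpyHostToDevice);   // hipMemcpy
              |     // hipify may rewrite <<<>>> as hipLaunchKernelGGL
              |     scale<<<1, n>>>(dev, 2.f, n);
              |     cudaMemcpy(host, dev, n * sizeof(float),
              |                cudaMemcpyDeviceToHost);
              |     cudaFree(dev);                        // hipFree
              |     printf("%f\n", host[0]);              // 2.0
              |     return 0;
              |   }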
        
               | my123 wrote:
               | HIP is a pathetic CUDA API clone. Gratuitous renames
                | don't do it any good and are more representative of NIH
                | than anything else (sadly).
               | 
               | They should have shipped a proper header set instead of
               | hipify.
        
         | femto113 wrote:
         | It doesn't really help understand what they are, but for
         | completeness CUDA is an acronym for "Compute Unified Device
         | Architecture" while HIP is "Heterogeneous-compute Interface for
         | Portability"
        
       | born-jre wrote:
        | At first glance I thought maybe it's like tinygrad, but it looks
        | like it has many more ops than tinygrad, though most map to
        | underlying hardware-provided ops?
        | 
        | I wonder how well tinygrad's approach will work out; op fusion
        | sounds easy, just walk a graph, pattern-match it, and lower to
        | hardware-provided ops?
        | 
        | Anyway, if anyone wants to understand the philosophy behind
        | tinygrad, this file is a great start:
       | https://github.com/geohot/tinygrad/blob/master/docs/abstract...
        
       | bguberfain wrote:
        | It reminds me of Theano.
        
       | antinucleon wrote:
        | AITemplate's original designer is here. We quit Meta in January
        | and started HippoML (https://hippoml.com/). We just disclosed our
        | new engine's performance on LLMs: https://blog.hippoml.com/large-
        | language-model-inference-from... On an Apple M2 Max, our new
        | engine's encode/decode is 13.8x/2.4x faster than llama.cpp.
        
         | huevosabio wrote:
          | Any idea how Hippo, AITemplate, and TVM compare in performance?
        
           | antinucleon wrote:
            | Hippo is faster than AITemplate and supports more generative
            | models. We haven't compared against TVM, but in absolute
            | tokens/s on an M2 Max, Hippo is able to run LLaMA decoding at
            | the performance of datacenter-level GPUs (running other
            | software).
        
             | huevosabio wrote:
             | Thanks, I've added myself to the waitlist. Please let us
             | know when this can be tried!
        
         | ralfd wrote:
         | What is your planned business model here?
        
           | antinucleon wrote:
           | We will disclose more details very soon.
        
         | brucethemoose2 wrote:
         | Very interesting.
         | 
         | Is 8bit/4bit support in the works? Will it work with
         | bitsandbytes out of the box? Speedy inference is great, but in
         | practice many users are running the biggest ~4-bit LLM that
         | will fit into their RAM/VRAM pool these days. This is why
          | llama.cpp is so good: it's (AFAIK) the only implementation that
          | will split a 4-bit quantized model so easily.
        
           | antinucleon wrote:
            | Yes. We support anything from 1-bit to 16-bit out of the box
            | for various models.
        
         | sroussey wrote:
         | Would it work with instructor-xl or similar which is designed
         | for embeddings and retrieval? On device for privacy is key.
        
           | antinucleon wrote:
           | Yes
        
         | mhh__ wrote:
          | Really doesn't surprise me that much. Llama.cpp seems like an
          | OK first pass, but I assume there is loads of time left on the
          | table in terms of graph optimizations and optimizing properly
          | for the memory hierarchy.
        
           | brucethemoose2 wrote:
            | It also doesn't use Apple GPUs at all. It's 100% CPU
            | inference, with some CUDA/OpenCL (but no Metal and no zero-
            | copy) offload at the moment.
        
             | antinucleon wrote:
              | It is actually non-trivial to get the GPU to run fast,
              | especially on an SoC with a strong CPU like the M2.
        
               | hutzlibu wrote:
                | GPU programming in general is definitely not trivial, as
                | I can confirm while struggling to learn WebGPU right now.
                | 
                | But it really depends on the problem: simple math
                | operations on lots of data are usually indeed trivially
                | faster, like AI mostly is with math on matrices.
                | 
                | For example, I just implemented a simple 2D raycast
                | solver in WGSL, and as a first project it is totally not
                | optimized. But even on my old laptop with a crappy
                | integrated GPU and a (relatively) fast CPU, I can now do
                | 10,000 raycasts per frame easily, while the CPU (wasm!)
                | struggles with 500.
                | 
                | The raw power of the GPU is really awesome, but every
                | step is hard and debugging is a mess, which is why only
                | a handful of people seem to be doing it. But now would
                | probably be a good time to get into it, as I think GPU
                | compute has just started and will get big.
        
               | paulmd wrote:
               | I've been out of the space for a long time, and it's
               | possible you know these already, but these are a couple
               | weird tricks that can help:
               | 
               | * Radix sort is your friend. Fun fact, O(n log n) is not
               | the fastest a sort can run, it's the fastest a
               | _comparison-based_ sort can run. Radix sort runs in O(N)
               | time, and in fact parallelizes extremely well on GPUs.
               | Extremely. They are great at it. And there are in-place
               | radix sorts too, just a bit slower (same asymptotic
               | performance tho).
               | 
               | * "Insert an element into this collection" style steps
               | can be replaced by a sort and a prefix-sum operation. If
               | you know the offset of the first element with key K, and
               | you know the offset of the first element with key J, you
               | know the offset and size of that "collection" for K
               | within a flat array ("size(K) = offset(J) - offset(K)").
               | Both of these run super fast in parallel and if you can
               | tweak your problem around to be some kind of sorting
               | operation that usually produces good speedups like this.
               | Easiest way to get a speedup from everything I've heard.
               | 
                | * Recomputing is often much faster than storing
                | intermediate results. "Procedural generation" is
                | interesting because you can re-compute the generation
                | step on demand. Random123 is also very nice compared to
                | a (cuRAND) Mersenne Twister etc.: why are you, a believer
                | in the cryptographic maxim that hash(key, msg) is
                | uncorrelated to hash(key, msg+1), still storing RNG
                | state? Being able to play back arbitrary parts of a
                | datastream at will is incredibly powerful; you can fast-
                | forward and rewind through the data previously used to
                | interact with an item, as long as you know the epoch of
                | the interaction you want for a particular key. And
                | because computation is cheaper than memory and memory
                | bandwidth, it's practically free in program-time terms
                | to just do some math. This is a form of data compression
                | and performance enhancement.
               | 
               | * Generally you must understand the idea of coalescing
               | and divergence and keep those lanes executing. And it is
               | highly preferable to use sorts and scans and butterfly
               | operations (reduction, etc) even within a warp, because
               | traditional "mutex/atomic" paradigms don't work well with
               | 100k threads. But this is just the programming idioms of
               | this particular platform, I am sure LISP is similar too
               | in terms of "oh that's how you do that" once you're
               | accustomed.
               | 
               | * Texture maps aren't just for graphics, they are a black
               | box that lets the GPU perform 2D and 3D coalescing and
               | some interpolation.
               | 
                | * Intelligent use of constant memory is another one,
               | probably as is the use of CPU memory. If a value will be
               | seldom accessed, you can probably stuff it into host
               | memory and just accept the slowdown. Or you can store
               | only epochs on the GPU and recompute intermediate values
               | as needed. Try to ensure that all threads in a warp will
               | do it too (sorting vs recomputing).
               | 
               | * Raytracing is of course impervious to all of this (so
               | don't worry too much that you can't magically hammer a
               | speedup out of it, nobody really can). You can accelerate
               | the raycasting and intersection testing (and AMD and
               | NVIDIA and Intel all do this differently) but as a
               | general matter rays are completely random and
               | uncoalesced. Ray sorting/shader execution reordering is
               | something that needs hardware assistance, and Intel and
                | NVIDIA both have hardware along these lines. Intel's
                | idea of making a facility for async future/promise
                | dispatch for sparse tasks (and then sorting the shaders
                | to get good coalescing/etc.) is really neat, and they've
               | said it's going to come to GPGPU.
               | https://youtu.be/SA1yvWs3lHU?t=289
               | 
               | * You can, however, use your rays more efficiently. And
               | that's an area of active focus for everyone. And I think
               | more efficient use of TAAU samples is probably where
               | raster is going too.
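                | 
                | To make the sort-plus-offsets idiom above concrete, here
                | is a minimal sketch using Thrust (which ships with the
                | CUDA toolkit); the keys and values are made up for
                | illustration:
                | 
                |   #include <thrust/device_vector.h>
                |   #include <thrust/sort.h>
                |   #include <thrust/binary_search.h>
                |   #include <thrust/sequence.h>
                |   #include <cstdio>
                | 
                |   int main() {
                |     const int n = 8, num_keys = 3;
                |     int hk[n] = {2, 0, 1, 2, 0, 1, 1, 0};  // bucket ids
                |     int hv[n] = {10, 11, 12, 13, 14, 15, 16, 17};
                |     thrust::device_vector<int> keys(hk, hk + n);
                |     thrust::device_vector<int> vals(hv, hv + n);
                | 
                |     // Sort values by key: each "collection" becomes a
                |     // contiguous run inside one flat array.
                |     thrust::sort_by_key(keys.begin(), keys.end(),
                |                         vals.begin());
                | 
                |     // offset(K) = index of the first element with key
                |     // K, found here with a vectorized binary search (an
                |     // exclusive scan of per-key counts works too).
                |     thrust::device_vector<int> q(num_keys);
                |     thrust::device_vector<int> off(num_keys);
                |     thrust::sequence(q.begin(), q.end());  // 0, 1, 2
                |     thrust::lower_bound(keys.begin(), keys.end(),
                |                         q.begin(), q.end(),
                |                         off.begin());
                | 
                |     // size(K) = offset(K+1) - offset(K)
                |     for (int k = 0; k < num_keys; ++k)
                |       printf("bucket %d starts at %d\n",
                |              k, (int)off[k]);
                |     return 0;
                |   }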
        
             | boywitharupee wrote:
             | zero-copy with mmap was added to llama.cpp, but the way it
             | was implemented sparked controversy.
        
               | fathyb wrote:
               | I think GP meant zero-copy communication with the GPU,
                | e.g. through `newBufferWithBytesNoCopy` [0], which is
                | only possible with unified memory architectures, e.g.
               | integrated GPUs.
               | 
               | The mmap change was just about mapping the model files in
               | memory instead of copying them, which has less overhead.
               | 
               | [0]: https://developer.apple.com/documentation/metal/mtld
               | evice/14...
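                | 
                | For a rough CUDA-side analogue of both ideas (a hedged
                | sketch, not what llama.cpp actually does): mmap the
                | model file, then register the mapping so the GPU can
                | address it without an explicit copy. On unified-memory
                | systems this is genuinely zero-copy; on a discrete GPU
                | the reads still cross the bus.
                | 
                |   #include <cuda_runtime.h>
                |   #include <sys/mman.h>
                |   #include <sys/stat.h>
                |   #include <fcntl.h>
                |   #include <unistd.h>
                |   #include <cstdio>
                | 
                |   int main(int argc, char** argv) {
                |     if (argc < 2) return 1;
                |     int fd = open(argv[1], O_RDONLY);
                |     struct stat st;
                |     fstat(fd, &st);
                | 
                |     // Map the file instead of read()ing it; pages are
                |     // faulted in on demand, nothing is copied up front.
                |     void* host = mmap(nullptr, st.st_size, PROT_READ,
                |                       MAP_PRIVATE, fd, 0);
                | 
                |     // Pin and expose the mapping to the GPU. The read-
                |     // only flag (CUDA 11.1+) is needed for PROT_READ
                |     // mappings.
                |     cudaHostRegister(host, st.st_size,
                |                      cudaHostRegisterMapped |
                |                      cudaHostRegisterReadOnly);
                |     void* dev = nullptr;
                |     cudaHostGetDevicePointer(&dev, host, 0);
                |     printf("host %p -> device %p\n", host, dev);
                |     // ... launch kernels that read weights via dev ...
                | 
                |     cudaHostUnregister(host);
                |     munmap(host, st.st_size);
                |     close(fd);
                |     return 0;
                |   }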
        
         | yeison wrote:
          | Did Facebook invest in this? Is that why it's under
          | facebookincubator?
        
           | antinucleon wrote:
            | We developed AITemplate mainly for Meta's focus at that
            | time, e.g. ads/ranking needs. HippoML is a startup we are
            | building for generative AI. HippoML is not using AITemplate.
        
         | yeison wrote:
         | Will this have some similarities to what Mojo is trying to
         | solve?
        
           | antinucleon wrote:
            | Mojo is trying to create a new language to solve the
            | problem, and it is specialized for CPUs. We are using a more
            | pragmatic way to solve the GPU AI computation problem.
        
             | jph00 wrote:
             | Mojo is not at all specialised for CPU. It sits on top of
             | MLIR and excellent support for all major accelerators is
             | planned.
        
         | thewataccount wrote:
          | Do you know how its speed compares to exllama, specifically
          | with an Nvidia GPU, by chance?
        
           | antinucleon wrote:
           | We haven't compared yet.
        
       | cypress66 wrote:
       | I don't see any comparisons with torch.compile. Kind of unfair to
       | compare it to eager mode.
        
       | brucethemoose2 wrote:
        | I just ran a 512x512 Stable Diffusion benchmark with this
        | yesterday:
        | 
        | PyTorch eager mode with some optimizations: ~6 it/s
        | 
        | PyTorch Inductor (torch.compile with dynamic=True): ~7 it/s
        | 
        | AITemplate: ~9 it/s
        | 
        | All of them support changing settings and such, albeit with some
        | work-in-progress bugs/caveats.
       | 
       | That is 512x512 on a 2060, so I would expect the gains to be
       | bigger on newer GPUs with more overhead to take advantage of.
        
         | maxilevi wrote:
         | Did you try TensorRT?
        
           | brucethemoose2 wrote:
           | Not yet. TRT diffusion has been an _enormous_ pain in the
           | past, so I have kinda avoided it, but Nvidia just recently
            | contributed an img2img pipeline in HF diffusers.
        
       ___________________________________________________________________
       (page generated 2023-06-02 23:00 UTC)