[HN Gopher] Render a neural network into CUDA/HIP code
___________________________________________________________________

Render a neural network into CUDA/HIP code

Author : fzliu
Score  : 122 points
Date   : 2023-06-02 17:14 UTC (5 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| Havoc wrote:
| Interesting that AMD GPUs seem to be first-class citizens here.
| Consumer-class gear is much cheaper per unit of VRAM by the looks
| of it.
|
| brucethemoose2 wrote:
| Also, here are some other interesting projects in the ML
| compilation space:
|
| - Apache TVM (mlc-llm is a good demo)
|
| - Hidet (a torch.compile backend)
|
| - Alibaba BladeDISC
|
| - Nvidia TensorRT (a classic, but much less of a nightmare to
|   install now)
|
| - Torch MLIR (SHARK has some demos/implementations)
|
| jahewson wrote:
| And of course, Chris Lattner's Modular AI:
| https://www.modular.com/
|
| homarp wrote:
| CUDA: NVIDIA GPU 'framework'
|
| HIP: AMD GPU 'framework'
|
| This takes neural networks defined in Python and converts them to
| C++ code calling CUDA / HIP for maximum inference speed.
|
| iaw wrote:
| Anyone seen details on whether this can handle splitting a model
| across GPUs?
|
| sosodev wrote:
| The latency improvements are impressive, but the ability to run
| models beyond their typical memory limitations is way cooler.
|
| hintymad wrote:
| I like this humility: "AITemplate is co-created by Meta
| engineers: Bing Xu, Ying Zhang, Hao Lu, Yang Chen, and Terry
| Chen, with major contributions coming from more talented
| engineers."
|
| samstave wrote:
| ELI5 what this means?
|
| I am losing my bibliography, etymology and vocabulary with every
| single AI advancement article.
|
| Where do I learn AI vocab, please?
|
| I need an FN AI teacher to just give me daily updates on AI and
| verbiage, models, etc...
|
| Hey AI - if you're so smart, build a podcast that teaches me
| about yourself and how to be a better meat parent who made you.
|
| bagels wrote:
| What is an FN AI?
|
| samstave wrote:
| A 'fUCKIn' ai'
|
| dragonwriter wrote:
| An autonomous Belgian firearm.
|
| skirmish wrote:
| Starting with a trained PyTorch model, it builds optimized C++
| binaries for running inference (not training) on NVIDIA and AMD
| GPUs. Various optimizations are mentioned, so presumably models
| run faster than they would via regular PyTorch.
|
| stevenwliao wrote:
| How much faster is it?
|
| pumanoir wrote:
| Depends on the model and GPU. Here is an example of almost 2x on
| a 3060 for Stable Diffusion:
| https://www.youtube.com/watch?v=_6BsUijOWoM
|
| prsutherland wrote:
| I'm curious why this is called "rendering" rather than
| "compiling". Is the code boilerplate and just a change in the
| NN's representation?
|
| iaw wrote:
| Very much not an expert here, but what I understand is that most
| deep learning frameworks (PyTorch, TensorFlow, etc.) have some
| overhead associated with them just being on the graphics card.
| This takes PyTorch code and removes the overhead by translating
| the network into a "native" language for the card (CUDA for
| NVIDIA).
|
| What I'm not sure about is what "HIP" is in this context.
|
| The way I'm reading this is that it's the difference between
| running code in an interpreter vs. on the bare metal (for the
| GPU).
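To make the "render a Python-defined network into native code" idea
concrete, here is a minimal sketch of the AITemplate-style workflow.
The module paths and calls (aitemplate.frontend, compile_model,
detect_target, run_with_tensors) follow the project's published
examples as best I recall them; treat the exact names and signatures
as assumptions rather than a verified API, and note that a real model
would also need its trained weights bound before running.

    # Hedged sketch: API names assumed from AITemplate's examples.
    import torch
    from aitemplate.compiler import compile_model   # assumed import path
    from aitemplate.frontend import nn, Tensor      # assumed import path
    from aitemplate.testing import detect_target    # assumed import path

    # 1. Describe the graph with AITemplate's own frontend ops (not torch ops).
    class TinyMLP(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.fc = nn.Linear(dim, dim)

        def forward(self, x):
            return self.fc(x)

    # 2. Build a symbolic input and trace the graph.
    x = Tensor(shape=[8, 512], name="x", is_input=True)
    y = TinyMLP()(x)
    y._attrs["name"] = "y"            # assumed way to mark the output
    y._attrs["is_output"] = True

    # 3. "Render": codegen C++/CUDA (or HIP on ROCm), compile to a .so,
    #    then run inference on ordinary torch tensors.
    target = detect_target()          # picks the CUDA or ROCm toolchain
    module = compile_model(y, target, "./tmp", "tiny_mlp")
    inp = torch.randn(8, 512, device="cuda", dtype=torch.float16)
    out = torch.empty(8, 512, device="cuda", dtype=torch.float16)
    module.run_with_tensors({"x": inp}, {"y": out})   # weights omitted here

The point of step 3 is what the thread is discussing: the generated
binary calls CUDA/HIP kernels directly, so inference no longer pays
per-op framework overhead in Python.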
| [deleted]
|
| entropicdrifter wrote:
| HIP is AMD's open-source re-implementation of CUDA libraries.
|
| viewtransform wrote:
| HIP is AMD's contribution to the open-source community to
| overcome Nvidia's CUDA software moat.
|
| You write in the HIP C++ language and run it on either NVIDIA or
| AMD platforms. This way you get cross-platform code and are not
| stuck with Nvidia.
|
| Use the HIPify tool to automatically convert existing sources
| from CUDA to HIP.
|
| It's been around for many years - but the fact that so many
| people still don't know about it speaks to the sad state of AMD
| communication.
|
| https://docs.amd.com/bundle/HIP-Programming-Guide-v5.3/page/...
|
| my123 wrote:
| HIP is a pathetic CUDA API clone. Gratuitous renames don't do it
| any good and are more representative of NIH than anything else
| (sadly).
|
| They should have shipped a proper header set instead of hipify.
|
| femto113 wrote:
| It doesn't really help you understand what they are, but for
| completeness: CUDA is an acronym for "Compute Unified Device
| Architecture", while HIP is "Heterogeneous-compute Interface for
| Portability".
|
| born-jre wrote:
| At first glance I thought maybe it's like tinygrad, but it looks
| like it has more ops than tinygrad - though most map to
| underlying hardware-provided ops?
|
| I wonder how well tinygrad's approach will work out. Op fusion
| sounds easy: just walk a graph, pattern-match it, and lower it to
| hardware-provided ops?
|
| Anyway, if anyone wants to understand the philosophy behind
| tinygrad, this file is a great start:
| https://github.com/geohot/tinygrad/blob/master/docs/abstract...
|
| bguberfain wrote:
| It reminds me of Theano.
|
| antinucleon wrote:
| AITemplate's original designer here. We quit Meta in January and
| started HippoML (https://hippoml.com/). We just disclosed our new
| engine's performance on LLMs: https://blog.hippoml.com/large-
| language-model-inference-from... On an Apple M2 Max our new
| engine's encode/decode is 13.8x/2.4x faster than llama.cpp.
|
| huevosabio wrote:
| Any idea how Hippo, AITemplate and TVM compare in performance?
|
| antinucleon wrote:
| Hippo is faster than AITemplate and supports more generative
| models. We haven't compared against TVM, but in absolute tokens/s
| on the M2 Max, Hippo is able to run LLaMA decoding at the
| performance of datacenter-level GPUs (running other software).
|
| huevosabio wrote:
| Thanks, I've added myself to the waitlist. Please let us know
| when this can be tried!
|
| ralfd wrote:
| What is your planned business model here?
|
| antinucleon wrote:
| We will disclose more details very soon.
|
| brucethemoose2 wrote:
| Very interesting.
|
| Is 8-bit/4-bit support in the works? Will it work with
| bitsandbytes out of the box? Speedy inference is great, but in
| practice many users are running the biggest ~4-bit LLM that will
| fit into their RAM/VRAM pool these days. This is why llama.cpp is
| so good; it's (AFAIK) the only implementation that will split a
| 4-bit quantized model so easily.
|
| antinucleon wrote:
| Yes. We support 1-bit through 16-bit out of the box, for a
| variety of models.
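The "biggest ~4-bit LLM that will fit" point above comes down to
plain memory arithmetic. A rough illustrative sketch, counting
weights only (activations, the KV cache, and per-group scale/zero
overhead add more on top):

    # Back-of-envelope memory needed just to hold the weights of
    # LLaMA-sized models at different quantization widths.
    def approx_weight_gib(n_params: float, bits_per_weight: float) -> float:
        return n_params * bits_per_weight / 8 / 2**30

    for n_billion in (7, 13, 33, 65):
        line = ", ".join(
            f"{bits:>2}-bit: {approx_weight_gib(n_billion * 1e9, bits):5.1f} GiB"
            for bits in (16, 8, 4)
        )
        print(f"{n_billion:>3}B model -> {line}")

    # e.g. 13B is ~24 GiB at fp16 but ~6 GiB at 4-bit, roughly the difference
    # between needing a datacenter GPU and fitting in a consumer card's VRAM.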
| sroussey wrote:
| Would it work with instructor-xl or similar, which is designed
| for embeddings and retrieval? On-device for privacy is key.
|
| antinucleon wrote:
| Yes.
|
| mhh__ wrote:
| Really doesn't surprise me that much. Llama.cpp seems like an OK
| first pass, but I assume there is loads of time left on the table
| in terms of graph optimizations and optimizing for the memory
| hierarchy properly.
|
| brucethemoose2 wrote:
| It also doesn't use Apple GPUs at all. It's 100% CPU inference,
| with some CUDA/OpenCL (but no Metal and no zero-copy) offload at
| the moment.
|
| antinucleon wrote:
| It is actually non-trivial to get the GPU to run fast, especially
| on an SoC with a strong CPU like the M2.
|
| hutzlibu wrote:
| GPU programming in general is definitely not trivial, as I can
| confirm while struggling to learn WebGPU right now.
|
| But it really depends on the problem; simple math operations on
| lots of data are usually indeed trivially faster - like AI mostly
| is, with math on matrices.
|
| For example, I just implemented a simple 2D raycast solver in
| WGSL, and as a first project it is totally not optimized - but
| even on my old laptop with a crappy integrated GPU but a
| (relatively) fast CPU, I can now do 10,000 raycasts per frame
| easily, while the CPU (wasm!) struggles with 500.
|
| The raw power of the GPU is really awesome. But every step is
| hard and debugging is a mess, which is why only a handful of
| people seem to be doing it. But now would probably be a good time
| to get into it, as I think GPU compute has just started and will
| get big.
|
| paulmd wrote:
| I've been out of the space for a long time, and it's possible you
| know these already, but these are a couple of weird tricks that
| can help:
|
| * Radix sort is your friend. Fun fact: O(n log n) is not the
| fastest a sort can run, it's the fastest a _comparison-based_
| sort can run. Radix sort runs in O(n) time, and in fact
| parallelizes extremely well on GPUs. Extremely. They are great at
| it. And there are in-place radix sorts too, just a bit slower
| (same asymptotic performance though).
|
| * "Insert an element into this collection" style steps can be
| replaced by a sort and a prefix-sum operation. If you know the
| offset of the first element with key K, and you know the offset
| of the first element with key J, you know the offset and size of
| that "collection" for K within a flat array ("size(K) = offset(J)
| - offset(K)"). Both of these run super fast in parallel, and if
| you can tweak your problem around to be some kind of sorting
| operation, that usually produces good speedups like this. Easiest
| way to get a speedup, from everything I've heard. (There is a
| small sketch of this trick right after this comment.)
|
| * Recomputing is often much faster than storing intermediate
| results. "Procedural generation" is interesting because you can
| re-compute the generation step on demand. Random123 is also very
| nice compared to a (cuRAND) Mersenne Twister etc. - why are you,
| a believer in the cryptographic maxim that hash(key, msg) is
| uncorrelated to hash(key, msg+1), still storing RNG state? Being
| able to play back arbitrary parts of a datastream at will is
| incredibly powerful; you can fast-forward and rewind through the
| data previously used to interact with an item, as long as you
| know the epoch of the interaction you want for a particular key.
| And because computation is cheaper than memory and memory
| bandwidth, it's really practically free in program-time terms to
| just do some math. This is a form of data compression and
| performance enhancement.
|
| * Generally you must understand the idea of coalescing and
| divergence and keep those lanes executing. And it is highly
| preferable to use sorts and scans and butterfly operations
| (reductions, etc.) even within a warp, because traditional
| "mutex/atomic" paradigms don't work well with 100k threads. But
| this is just the programming idiom of this particular platform; I
| am sure LISP is similar too in terms of "oh, that's how you do
| that" once you're accustomed.
|
| * Texture maps aren't just for graphics; they are a black box
| that lets the GPU perform 2D and 3D coalescing and some
| interpolation.
|
| * Intelligent use of constants memory is another one, probably as
| is the use of CPU memory. If a value will be seldom accessed, you
| can probably stuff it into host memory and just accept the
| slowdown. Or you can store only epochs on the GPU and recompute
| intermediate values as needed. Try to ensure that all threads in
| a warp will do it too (sorting vs. recomputing).
|
| * Raytracing is of course impervious to all of this (so don't
| worry too much that you can't magically hammer a speedup out of
| it - nobody really can). You can accelerate the raycasting and
| intersection testing (and AMD, NVIDIA and Intel all do this
| differently), but as a general matter rays are completely random
| and uncoalesced. Ray sorting / shader execution reordering is
| something that needs hardware assistance, and Intel and NVIDIA
| both have hardware along these lines. Intel's idea of making a
| facility for async future/promise dispatch for sparse tasks (and
| then sorting the shaders to get good coalescing etc.) is really
| neat, and they've said it's going to come to GPGPU.
| https://youtu.be/SA1yvWs3lHU?t=289
|
| * You can, however, use your rays more efficiently. And that's an
| area of active focus for everyone. And I think more efficient use
| of TAAU samples is probably where raster is going too.
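The sort-plus-prefix-sum trick described above is easy to show on the
CPU with NumPy; a GPU version would use device-side radix sort and
scan primitives instead, but the idea is identical. This is an
illustrative sketch only, with made-up keys and values:

    import numpy as np

    # Build "per-key collections" without any insertions: sort items by key,
    # then an exclusive prefix sum over the per-key counts gives each key's
    # offset into one flat array.
    rng = np.random.default_rng(0)
    n_items, n_keys = 12, 4
    keys = rng.integers(0, n_keys, size=n_items)   # which collection each item belongs to
    values = rng.random(n_items)                   # per-item payload

    order = np.argsort(keys, kind="stable")        # on a GPU: radix sort by key
    sorted_vals = values[order]                    # items are now grouped contiguously

    counts = np.bincount(keys, minlength=n_keys)
    offsets = np.concatenate(([0], np.cumsum(counts)))  # exclusive prefix sum

    # size(K) = offset(K+1) - offset(K): slice out collection K, no pointers needed.
    for k in range(n_keys):
        group = sorted_vals[offsets[k]:offsets[k + 1]]
        print(f"key {k}: {group.size} items -> {np.round(group, 2)}")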
| boywitharupee wrote:
| Zero-copy with mmap was added to llama.cpp, but the way it was
| implemented sparked controversy.
|
| fathyb wrote:
| I think GP meant zero-copy communication with the GPU, e.g.
| through `newBufferWithBytesNoCopy` [0], which is only possible
| with unified memory architectures, e.g. integrated GPUs.
|
| The mmap change was just about mapping the model files into
| memory instead of copying them, which has less overhead.
|
| [0]: https://developer.apple.com/documentation/metal/mtldevice/14...
|
| yeison wrote:
| Did Facebook invest in this? Is that why it's under
| facebookincubator?
|
| antinucleon wrote:
| We developed AITemplate mainly for Meta's needs at the time,
| e.g. Ads/Ranking. HippoML is a startup we are building for
| generative AI. HippoML is not using AITemplate.
|
| yeison wrote:
| Will this have some similarities to what Mojo is trying to solve?
|
| antinucleon wrote:
| Mojo is trying to create a new language to solve the problem, and
| it is specialized for CPU. We are using a more pragmatic way to
| solve the GPU AI computation problem.
|
| jph00 wrote:
| Mojo is not at all specialised for CPU. It sits on top of MLIR,
| and excellent support for all major accelerators is planned.
|
| thewataccount wrote:
| Do you know how its speed compares to exllama, specifically with
| an Nvidia GPU, by chance?
|
| antinucleon wrote:
| We haven't compared yet.
|
| cypress66 wrote:
| I don't see any comparisons with torch.compile. Kind of unfair to
| compare it to eager mode.
|
| brucethemoose2 wrote:
| I just ran a 512x512 Stable Diffusion benchmark with this
| yesterday:
|
| PyTorch eager mode with some optimizations: ~6 it/s
|
| PyTorch Inductor (torch.compile with dynamic=True): ~7 it/s
|
| AITemplate: ~9 it/s
|
| All of them support changing settings and such, albeit with some
| work-in-progress bugs/caveats.
|
| That is 512x512 on a 2060, so I would expect the gains to be
| bigger on newer GPUs with more overhead to take advantage of.
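For anyone who hasn't used it, the torch.compile/Inductor path in
that comparison is a one-line wrapper around an existing PyTorch
model. A minimal sketch with a placeholder toy model (the benchmark
above would wrap the Stable Diffusion pipeline instead):

    import torch

    # Placeholder model standing in for the real network being benchmarked.
    model = torch.nn.Sequential(
        torch.nn.Linear(512, 2048),
        torch.nn.GELU(),
        torch.nn.Linear(2048, 512),
    ).eval()
    if torch.cuda.is_available():
        model = model.half().cuda()

    # dynamic=True asks Inductor to generate shape-polymorphic kernels
    # (useful when resolution or batch size changes) instead of
    # recompiling for every new input shape.
    compiled = torch.compile(model, dynamic=True)

    p = next(model.parameters())
    x = torch.randn(8, 512, device=p.device, dtype=p.dtype)
    with torch.inference_mode():
        y_eager = model(x)
        y_compiled = compiled(x)   # first call triggers compilation; later calls are fast
        torch.testing.assert_close(y_eager, y_compiled, rtol=1e-2, atol=1e-2)

AITemplate goes a step further along the same spectrum: rather than
compiling kernels inside the PyTorch runtime, it emits a standalone
C++ binary calling CUDA/HIP directly, as described earlier in the
thread.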
| maxilevi wrote:
| Did you try TensorRT?
|
| brucethemoose2 wrote:
| Not yet. TRT diffusion has been an _enormous_ pain in the past,
| so I have kinda avoided it, but Nvidia just recently contributed
| an img2img pipeline in HF diffusers.
___________________________________________________________________
(page generated 2023-06-02 23:00 UTC)