[HN Gopher] Accelerating Generative AI with PyTorch II: GPT, Fast
___________________________________________________________________

Accelerating Generative AI with PyTorch II: GPT, Fast

Author : polyrand
Score  : 159 points
Date   : 2023-11-30 18:35 UTC (4 hours ago)

(HTM) web link (pytorch.org)
(TXT) w3m dump (pytorch.org)

| AmazingTurtle wrote:
| 240 tok/s is crazy
| chillee wrote:
| Hey, author of the blog post here. It's mentioned in the blog post, but one of the intentions of this repo is that it's more of a "tutorial" than it is a library/framework. My hope is that people will copy-paste and modify it for their own needs :)
|
| Code can also be found here: https://github.com/pytorch-labs/gpt-fast
|
| And a Twitter thread summary here: https://twitter.com/cHHillee/status/1730293330213531844
| buildbot wrote:
| Great work and a really useful resource! Comprehensive guides on improving PyTorch performance are pretty hard to come by, and I learned a couple of new tricks from this!
| ilaksh wrote:
| What GPU was used when testing this?
|
| Is this faster than HuggingFace's Text Generation Inference container?
| chillee wrote:
| We used an A100-80GB GPU. We didn't compare explicitly to HuggingFace TGI, but I think you should be able to compare the tokens/s achieved.
|
| One note is that this release is optimized for _latency_, while I think HF TGI might be more optimized for _throughput_.
| smith7018 wrote:
| Great work! Do you know if it's possible to port this over to PyTorch's Apple Silicon/MPS support?
| Dowwie wrote:
| What kind of workstation would you build/buy for local GPT development with a budget of $3000? Is remote dev a viable alternative to local workstations?
| woodson wrote:
| I'd go with a remote dev solution. Training/finetuning of large models requires far more resources anyway, so the GPUs in the local machine would be unused most of the time.
| leobg wrote:
| Not OP, but I asked myself that same question two years ago. Then I looked at the energy prices in Germany and knew I had no chance against cloud GPUs. Maybe you live in a country with lower energy prices, like Bermuda (or any other country on earth), in which case this may not be as important to you. A side benefit of going cloud is that you can pick and choose the right GPU for whatever project you're working on, and you're really just paying while you're running them. Also, no hardware or CUDA drivers to divert your attention.
| ftufek wrote:
| A local workstation is much cheaper in the long run.
|
| Even ignoring that, most of the development is running experiments. You're going to be hesitant to run lots of experiments if they each cost money, whereas when you pay upfront for the hardware, you have an incentive to fully utilize it with lots of experiments.
|
| I'd go with an RTX 4090 and deal with the memory limitation through software tricks. It's an underrated card that's as performant as cards that are an order of magnitude pricier. It's a great way to get started with that budget.
| Philpax wrote:
| Depending on what you're doing, 2x used 3090s are the same price and offer you more VRAM. That's what I'm planning on doing, in any case - being able to run 70B LLMs entirely on the GPU is more useful than being able to run 34B faster.
| biddit wrote:
| Agreed. I recently completed a new build with two 3090 GPUs and really appreciate being able to run 70B models.
| Dowwie wrote:
| Which CPU did you go with?
| biddit wrote:
| i7-14700K
|
| Z790 chipset w/ mobo that supports x8/x8 bifurcation
|
| 96 GB DDR5 @ 5600 MHz
| icelancer wrote:
| Yeah, multiple 3090s are the best budget way to go for sure. Also older server boards with tons of PCIe lanes, if you can swing rack-mounted hardware and have some technical skills.
| biddit wrote:
| I agree with you, but right now RTX 4090 cards are pushing $2000, which doesn't leave much budget left. I'd suggest picking up a used 3090 from eBay; they're currently around $800. This will still give 24 GB of VRAM, like the 4090.
| icelancer wrote:
| Strong endorse here. I pick up used RTX 3090s from Facebook Marketplace and eBay at $800 maximum. Can usually find them locally for $700-750, and typically can test them too, which is nice (though I've had no issues yet).
| dharmab wrote:
| I'm using an AMD 6900XT with ROCm and it's fast enough to be usable, for a fraction of the price of a 3090 or 4090.
| icelancer wrote:
| I would do remote dev using vast.ai and other cheap cloud computing resources to ensure you want to do this and have utility for it, then build your own. 3090s are typically the most budget-friendly, and if you have any IT chops (and tolerance for noise), then server rack-mounted hardware, PSUs, and riser cables tend to be the most efficient, with tons of PCIe lanes (which is a hidden issue people have with consumer-grade gaming PCs as they scale).
| modeless wrote:
| I got a 13900K + 4090 workstation for ~$3500. But I hear what people are doing is getting 2x (or more) 3090s instead, because they are cheap used, and having more VRAM and VRAM bandwidth is the important thing at the moment, even if it is split between cards.
|
| I'm happy with my 4090 though. Dealing with splitting between GPUs sounds like a chore, and I also like the gaming abilities of the 4090.
| wolftickets wrote:
| Just wanted to share: the charts and GIFs are exceptionally well done. Informative, concise, and easy to read.
| chillee wrote:
| Thanks! I've also written a couple of other things in a similar vein that you might like at https://horace.io/writing.html (particularly https://horace.io/brrr_intro.html) and also some of the things I've tweeted: https://twitter.com/cHHillee/highlights
| xmichael909 wrote:
| Holy hotdogs, this looks amazing. So ahh, I'll jump right to it - where can I run this online without having to do a bunch of work setting it up? I have several Python projects that could take advantage of this! (;
| andy99 wrote:
| This is a great article. Regarding
|
| > While these projects are performant, they often come with tradeoffs in ease of use, such as requiring model conversion to specific formats or building and shipping new dependencies.
|
| I think it should be acknowledged that (at least IMO) PyTorch model formats are not very portable, and this is a big part of the problem. It would be nice to see the industry move towards a better format (GGUF?) that can easily be ported between frameworks and not leave you stuck using torch to load it. Likewise, PyTorch is a massive dependency to include with a project, especially for simple inference, so while other projects have new dependencies, they can often be a lot lighter than for a PyTorch model, again particularly for inference code.
| chillee wrote:
| Yeah, for sure. I think for deployment purposes, these model conversions are often necessary (such as if you don't want to use Python).
|
| However, I do think these model conversions are often a significant pain for users.
|
| So, in some sense, the goal here is to show that the performance component and the "convert your model for deployment" component can be disentangled.
|
| We also have work on allowing you to "export" an AOT-compiled version of your model with torch.compile, and that should allow you to deploy your models to run in other settings.
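As a rough illustration of the kind of torch.compile call the performance work builds on (a minimal sketch only: TinyDecoder is a hypothetical stand-in model, not the gpt-fast code, and the export/AOT path mentioned above is a separate step not shown here):

    import torch
    import torch.nn as nn

    class TinyDecoder(nn.Module):
        """Hypothetical stand-in for a decoder-only Transformer."""
        def __init__(self, vocab=32000, dim=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.lm_head = nn.Linear(dim, vocab, bias=False)

        def forward(self, tokens):
            # A real model would run attention/MLP blocks here.
            return self.lm_head(self.embed(tokens))

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyDecoder().to(device)

    # torch.compile traces the forward pass and generates fused kernels.
    # mode="reduce-overhead" targets small-batch, latency-bound decoding
    # (on CUDA it uses CUDA graphs to cut per-step launch overhead).
    compiled = torch.compile(model, mode="reduce-overhead", fullgraph=True)

    tokens = torch.randint(0, 32000, (1, 8), device=device)
    logits = compiled(tokens)   # first call compiles; later calls reuse the kernels
    print(logits.shape)         # torch.Size([1, 8, 32000])

The first call through the compiled module pays the compilation cost; subsequent decoding steps reuse the generated kernels, which is why this mode suits latency-bound, batch-size-1 generation.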
| andy99 wrote:
| Thanks for the reply. "Show that the performance component and the 'convert your model for deployment' component can be disentangled" makes sense.
|
| Also, I liked the part of the article about torch.compile producing faster matrix-vector multiplication than cuBLAS. I've seen the same thing on CPU: it's way faster to just write and manually optimize a loop over a bunch of dot products than it is to use BLAS routines, because of how simple the "matmul" actually is. I don't know how widely known that is.
| dnnssl2 wrote:
| What are some of the better use cases of fast inference? From my experience using ChatGPT, I don't need it to generate faster than I can read, but waiting for code generation is painful because I'm waiting for the whole code block to format correctly and be available to copy or execute (in the case of Code Interpreter). Does anything else fall under this pattern?
| wedn3sday wrote:
| One obvious use case is that it makes per-token generation much cheaper.
| dnnssl2 wrote:
| That's not so much a use case, but I get what you're saying. It's nice that you can find optimizations that shift down the Pareto frontier across the cost and latency dimensions. The hard tradeoffs are for cases like inference batching, where it's cheaper and higher throughput but slower for the end consumer.
|
| What's a good use case for an order-of-magnitude decrease in price per token? Web-scale "analysis" or cleaning of unstructured data?
| jasonjmcghee wrote:
| Programmatic and multi-step use cases. If you need chain-of-thought or similar, tool use, etc. Generating data.
|
| Most use cases outside of classic chat.
|
| For example, I made an on-demand educational video project, and the slowest part was by far the content generation. RAG, TTS, image generation, text rendering, and video processing were all a drop in the bucket in comparison.
|
| It would be an even wider gap now, since TTS is super-realtime and image generation can be single-step.
| rfw300 wrote:
| The main thing is that chat is just one application of LLMs. Other applications are much more latency-sensitive. Imagine, for instance, an LLM-powered realtime grammar checker in an editor.
| ClarityJones wrote:
| Perhaps this is naive, but in my mind it can be useful for learning.
|
| - Hook an LLM up to VMs
|
| - Ask for code that [counts to 10]
|
| - Run the code on a VM
|
| - Ask a different LLM to evaluate the results
|
| - Repeat for sufficient volume
|
| - Train
|
| The faster it can generate results, the faster those results can be tested against the real world, e.g. a VM, users on X, or other models with known accuracies.
| dnnssl2 wrote:
| If you were to serve this from a datacenter, is the client-to-server network roundtrip the slowest part of inference? Curious whether it would be faster to run this on cloud GPUs with better hardware but farther away, or locally on worse hardware.
| chillee wrote:
| Surprisingly, no. And part of this is that text generation is _really_ expensive. Unlike traditional ML inference (like with resnets), you don't just pass your data through your model once. You need to pass it over and over again (once for each token you generate).
|
| So, in practice, a full "text completion request" can often take on the order of seconds, which dwarfs the client <-> server roundtrip.
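To make the "once per token" point concrete, here is a minimal greedy decoding loop (a toy sketch: TinyDecoder is a hypothetical stand-in for a real Transformer, and real implementations such as the one in the post add a KV cache so each step only feeds in the newest token, but the per-step pass through the weights remains):

    import torch
    import torch.nn as nn

    class TinyDecoder(nn.Module):
        """Hypothetical toy model; the loop structure is the point here."""
        def __init__(self, vocab=32000, dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.lm_head = nn.Linear(dim, vocab, bias=False)

        def forward(self, tokens):
            return self.lm_head(self.embed(tokens))

    @torch.no_grad()
    def generate(model, prompt, max_new_tokens=16):
        tokens = prompt
        for _ in range(max_new_tokens):
            # One full forward pass per generated token: the repeated trips
            # through the model's weights are what make generation expensive.
            logits = model(tokens)
            next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick
            tokens = torch.cat([tokens, next_token], dim=-1)
        return tokens

    model = TinyDecoder()
    prompt = torch.randint(0, 32000, (1, 4))
    print(generate(model, prompt).shape)  # torch.Size([1, 20])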
| dnnssl2 wrote:
| Is this still the case for sliding-window attention/streaming LLMs, where you have a fixed-length attention window rather than infinitely passing in new tokens for quadratic scaling? You even get better performance due to purposely downsampling non-meaningful attention sink tokens.
| chillee wrote:
| I cover it a bit in the blog post, but unless you have a _really_ long context length (like 32k+), your primary computational cost doesn't come from attention but rather from loading your weights from VRAM into registers.
|
| I mean, practically speaking, completions from, say, ChatGPT or Claude take seconds to finish :)
| dnnssl2 wrote:
| How does one select a good candidate for the draft model in speculative decoding? I imagine that there's some better intuition than just selecting the next parameter count down (i.e. 70B -> 13B, 13B -> 7B).
|
| Also, how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?
| chillee wrote:
| This is indeed a bit of a dark art. Essentially, you want a balance between "is significantly faster than the base model" and "generates similar stuff to the base model".
|
| Anecdotally, folks often seem to use, say, a 70B base model with a 7B draft. But I think there's a lot of room for experimentation and improvement here.
|
| You could, say, take a 70B model and maybe just chop off the last 90% of the layers and then fine-tune. Or perhaps you could use a model that's trained to generate 8 tokens at once. Or perhaps you could just use a statistical "n-gram" predictor.
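For readers unfamiliar with the mechanics, here is a rough sketch of the draft-and-verify structure being discussed. It uses simple greedy matching to decide how many draft tokens to accept; the full method instead does rejection sampling over the two models' probabilities so the output distribution matches the target model exactly. TinyLM and speculative_step are hypothetical stand-ins, not gpt-fast code:

    import torch
    import torch.nn as nn

    class TinyLM(nn.Module):
        """Hypothetical stand-in; in practice the target might be a 70B model
        and the draft a much smaller (or truncated, or n-gram) model."""
        def __init__(self, vocab=32000, dim=128, seed=0):
            super().__init__()
            torch.manual_seed(seed)
            self.embed = nn.Embedding(vocab, dim)
            self.lm_head = nn.Linear(dim, vocab, bias=False)

        def forward(self, tokens):
            return self.lm_head(self.embed(tokens))

    @torch.no_grad()
    def speculative_step(target, draft, tokens, k=4):
        """Draft proposes k tokens; target checks them in one forward pass;
        keep the longest agreeing prefix plus one token from the target."""
        prompt_len = tokens.shape[1]

        # 1. Draft proposes k tokens autoregressively (cheap per step).
        draft_tokens = tokens
        for _ in range(k):
            nxt = draft(draft_tokens)[:, -1].argmax(-1, keepdim=True)
            draft_tokens = torch.cat([draft_tokens, nxt], dim=-1)
        proposed = draft_tokens[:, prompt_len:]                         # (1, k)

        # 2. Target scores prompt + proposals in ONE forward pass (expensive).
        target_logits = target(draft_tokens)
        target_preds = target_logits[:, prompt_len - 1:-1].argmax(-1)  # (1, k)

        # 3. Accept the longest prefix where draft and target agree.
        n_accept = int((proposed == target_preds).int()[0].cumprod(0).sum())
        accepted = proposed[:, :n_accept]

        # The target's own prediction after the accepted prefix comes for free.
        bonus = target_logits[:, prompt_len - 1 + n_accept].argmax(-1, keepdim=True)
        return torch.cat([tokens, accepted, bonus], dim=-1)

    target, draft = TinyLM(dim=512, seed=0), TinyLM(dim=64, seed=1)
    tokens = torch.randint(0, 32000, (1, 4))
    tokens = speculative_step(target, draft, tokens)
    print(tokens.shape)  # prompt + accepted draft tokens + 1

The payoff is that one expensive forward pass of the target model can validate several cheap draft tokens at once, which is why the draft mainly needs to be fast and to agree with the target often enough.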
| brucethemoose2 wrote:
| This is similar to exllamav2, and exllamav2's quantization is also excellent.
| claytonjy wrote:
| One of the notable tricks the various LLM serving frameworks provide is special approaches to batching, e.g. continuous, persistent, or in-flight batching, depending on the inference framework. At some level they each allow you to start a new generation while in the middle of one or more previous generations.
|
| Is that possible with "just" PyTorch? Could it be added to gpt-fast?
| chillee wrote:
| Yeah, it's certainly possible, but it's not the focus of this implementation, which is more latency-focused (so BS=1).
___________________________________________________________________
(page generated 2023-11-30 23:00 UTC)