[HN Gopher] Accelerating Generative AI with PyTorch II: GPT, Fast
       ___________________________________________________________________
        
       Accelerating Generative AI with PyTorch II: GPT, Fast
        
       Author : polyrand
       Score  : 159 points
       Date   : 2023-11-30 18:35 UTC (4 hours ago)
        
 (HTM) web link (pytorch.org)
 (TXT) w3m dump (pytorch.org)
        
       | AmazingTurtle wrote:
       | 240tok/s is crazy
        
       | chillee wrote:
       | Hey, author of the blog post here. It's mentioned in the blog
       | post, but one of the intentions of this repo is that it's more of
       | a "tutorial" than it is a library/framework. My hope is that
       | people will copy-paste and modify it for their own needs :)
       | 
        | Code can also be found here:
        | https://github.com/pytorch-labs/gpt-fast
       | 
       | And a twitter thread summary here:
       | https://twitter.com/cHHillee/status/1730293330213531844
        
         | buildbot wrote:
         | Great work and a really useful resource! Comprehensive guides
         | on improving PyTorch performance are pretty hard to come by,
         | and I learned a couple new tricks from this!
        
         | ilaksh wrote:
         | What GPU was used when testing this?
         | 
         | Is this faster than HuggingFace's Text Generation inference
         | container?
        
           | chillee wrote:
           | We used an A100-80GB GPU. We didn't compare explicitly to
           | Huggingface TGI but I think you should be able to compare the
           | tokens/s achieved.
           | 
            | One note is that this release is optimized for _latency_,
           | while I think HF TGI might be more optimized for
           | _throughput_.
        
         | smith7018 wrote:
         | Great work! Do you know if it's possible to port this over to
         | pytorch's Apple Silicon/MPS support?
        
         | Dowwie wrote:
         | What kind of workstation would you build/buy for local GPT
         | development with a budget of $3000? Is remote dev a viable
         | alternative to local workstations?
        
           | woodson wrote:
           | I'd go with a remote dev solution. Training/finetuning of
           | large models requires much more resources anyway, so the GPUs
           | in the local machine would be unused most of the time.
        
           | leobg wrote:
           | Not OP, but I asked myself that same question two years ago.
           | Then I looked at the energy prices in Germany and knew I had
           | no chance against cloud GPUs. Maybe you live in a country
           | with lower energy prices, like Bermuda (or any other country
           | on earth), in which case this may not be as important to you.
            | A side benefit of going cloud is that you can pick and
            | choose the right GPU for whatever project you're working on,
            | and you only pay while you're running them. Also, there's no
            | hardware or CUDA drivers to divert your attention.
        
           | ftufek wrote:
           | Local workstation is much cheaper in the long run.
           | 
           | Even ignoring that, most of the development is running
           | experiments. You're gonna be hesitant to run lots of
           | experiments if they each cost money whereas when you pay
           | upfront for the hardware, you're gonna have the incentive to
           | fully utilize it with lots of experiments.
           | 
            | I'd go with an RTX 4090 and deal with the memory limitation
            | through software tricks. It's an underrated card that's as
            | performant as cards that are an order of magnitude pricier.
            | It's a great way to get started with that budget.
        
             | Philpax wrote:
             | Depending on what you're doing, 2x used 3090s are the same
             | price and offer you more VRAM. That's what I'm planning on
             | doing, in any case - being able to run 70B LLMs entirely on
             | the GPU is more useful than being able to run 34B faster.
        
               | biddit wrote:
               | Agreed. I recently completed a new build with two 3090
               | GPUs and really appreciate being able to run 70b models.
        
               | Dowwie wrote:
               | which cpu did you go with?
        
               | biddit wrote:
               | i7-14700k
               | 
               | z790 chipset w/ mobo that supports x8/x8 bifurcation
               | 
                | 96GB DDR5 @ 5600MHz
        
               | icelancer wrote:
               | Yeah multiple 3090s is the best budget way to go for
               | sure. Also older server boards with tons of PCIe lanes if
               | you can swing rack mounted hardware and have some
               | technical skills.
        
             | biddit wrote:
             | I agree with you but right now RTX 4090 cards are pushing
             | $2000, which doesn't leave much budget left. I'd suggest
             | picking up a used 3090 card from eBay, which are currently
             | around $800. This will still give 24gb of VRAM like the
             | 4090.
        
               | icelancer wrote:
               | Strong endorse here. I pick up used RTX 3090s from
               | Facebook Marketplace and eBay at $800 maximum. Can
               | usually find them locally for $700-750, and typically can
               | test them too, which is fine (though I've had no issues
               | yet).
        
           | dharmab wrote:
            | I'm using an AMD 6900XT with ROCm and it's fast enough to be
           | usable, for a fraction of the price of a 3090 or 4090.
        
           | icelancer wrote:
           | I would do remote dev using vast.ai and other cheap cloud
           | computing resources to ensure you want to do this and have
           | utility for it, then build your own. 3090s are typically the
           | most budget friendly, and if you have any IT chops (and
           | tolerance for noise), then server rack-mounted hardware,
           | PSUs, and riser cables tend to be the most efficient with
           | tons of PCIe lanes (which is a hidden issue people have with
           | consumer-grade gaming PCs as they scale).
        
           | modeless wrote:
           | I got a 13900k + 4090 workstation for ~$3500. But I hear what
           | people are doing is getting 2x (or more) 3090s instead,
           | because they are cheap used, and having more VRAM and VRAM
           | bandwidth is the important thing at the moment, even if it is
           | split between cards.
           | 
           | I'm happy with my 4090 though. Dealing with splitting between
           | GPUs sounds like a chore and also I like the gaming abilities
           | of the 4090.
        
         | wolftickets wrote:
         | Just wanted to share, the charts and gifs are exceptionally
         | well done. Informative, concise, and easy to read.
        
           | chillee wrote:
            | Thanks! I've also written a couple of other things in a
           | similar vein you might like at https://horace.io/writing.html
           | (particularly https://horace.io/brrr_intro.html) and also
           | some of the things I've tweeted:
           | https://twitter.com/cHHillee/highlights
        
       | xmichael909 wrote:
        | Holy hotdogs, this looks amazing. So ahh. I'll jump right to it -
       | where can I run this online without having to do a bunch of work
       | setting it up? I have several python projects that could take
       | advantage of this! (;
        
       | andy99 wrote:
       | This is a great article. Regarding
       | 
       | > While these projects are performant, they often come with
       | tradeoffs in ease of use, such as requiring model conversion to
       | specific formats or building and shipping new dependencies.
       | 
       | I think it should be acknowledged that (at least IMO) pytorch
       | model formats are not very portable and this is a big part of the
       | problem. It would be nice to see industry move towards a better
       | format (gguf?) that can easily be ported between frameworks and
       | not leave you stuck using torch to load it. Likewise, pytorch is
       | a massive dependency to include with a project, especially for
       | simple inference, so while other projects have new dependencies,
       | they can often be a lot lighter than for a pytorch model, again
       | particularly for inference code.
        
         | chillee wrote:
         | Yeah, for sure. I think for deployment purposes, many times
         | these model conversions are necessary (such as if you don't
         | want to use Python).
         | 
         | However, I do think these model conversions are often a
         | significant pain for users.
         | 
         | So, in some sense, the goal here is to show that the
         | performance component and the "convert your model for
         | deployment" component can be disentangled.
         | 
         | We also have work on allowing you to "export" an AOT-compiled
         | version of your model with torch.compile, and that should allow
         | you to deploy your models to run in other settings.
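          | 
          | To illustrate the "no conversion needed" point, using the
          | compiler is roughly a one-line wrapper around the existing
          | eager code. A rough sketch (function names like
          | `decode_one_token` and `prefill` are illustrative, not the
          | exact gpt-fast code):
          | 
          |     import torch
          | 
          |     # "reduce-overhead" turns on CUDA graphs to cut CPU launch
          |     # overhead; fullgraph=True asserts there are no graph breaks.
          |     decode_one_token = torch.compile(
          |         decode_one_token, mode="reduce-overhead", fullgraph=True
          |     )
          | 
          |     # The prefill over the prompt can be compiled separately with
          |     # dynamic shapes, since prompt lengths vary.
          |     prefill = torch.compile(prefill, dynamic=True, fullgraph=True)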
        
           | andy99 wrote:
           | Thanks for the reply. "show that the performance component
           | and the "convert your model for deployment" component can be
           | disentangled" makes sense.
           | 
           | Also, I liked the part of the article about torch.compile
            | producing faster matrix-vector multiplication than cuBLAS.
           | I've seen the same thing on CPU, that it's way faster to just
           | write and manually optimize a loop over a bunch of dot
           | products than it is to use BLAS routines because of how
           | simple the "matmul" actually is. I don't know how widely
           | known that is.
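            | 
            | For anyone who wants to poke at this on GPU, a minimal
            | sketch of the comparison (not the exact gpt-fast setup;
            | the flag and shapes are illustrative):
            | 
            |     import torch
            |     import torch._inductor.config
            | 
            |     # Tuning flag the post mentions for small-batch matmuls.
            |     torch._inductor.config.coordinate_descent_tuning = True
            | 
            |     N = 4096
            |     w = torch.randn(N, N, device="cuda", dtype=torch.bfloat16)
            |     x = torch.randn(N, 1, device="cuda", dtype=torch.bfloat16)
            | 
            |     def matvec(w, x):
            |         return w @ x          # eager: dispatches to cuBLAS
            | 
            |     # The compiled version lets inductor generate its own
            |     # kernel, which the post reports can beat cuBLAS here.
            |     compiled_matvec = torch.compile(matvec)
            |     compiled_matvec(w, x)     # first call triggers compilation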
        
       | dnnssl2 wrote:
       | What are some of the better use cases of fast inference? From my
       | experience using ChatGPT, I don't need it to generate faster than
       | I can read, but waiting for code generation is painful because
       | I'm waiting for the whole code block to format correctly, be
       | available to copy or execute (in the case of code interpreter).
       | Anything else fall under this pattern?
        
         | wedn3sday wrote:
         | One obvious use case is that it makes per-token generation much
         | cheaper.
        
           | dnnssl2 wrote:
           | That's not so much a use case, but I get what you're saying.
           | It's nice that you can find optimizations to shift down the
           | pareto frontier of across the cost and latency dimension. The
           | hard tradeoffs are for cases like inference batching where
           | it's cheaper and higher throughput but slower for the end
           | consumer.
           | 
           | What's a good use case for an order of magnitude decrease in
           | price per token? Web scale "analysis" or cleaning of
           | unstructured data?
        
         | jasonjmcghee wrote:
         | Programmatic and multi-step use cases. If you need chain-of-
         | thought or similar, tool use, etc. Generating data.
         | 
         | Most use cases outside of classic chat.
         | 
         | For example, I made an on-demand educational video project, and
         | the slowest part was by far the content generation. RAG, TTS,
         | Image generation, text rendering, and video processing were all
         | a drop in the bucket, in comparison.
         | 
          | It would be an even wider gap now, since TTS is faster than
          | realtime and image generation can be single-step.
        
         | rfw300 wrote:
         | The main thing is chat is just one application of LLMs. Other
         | applications are much more latency sensitive. Imagine, for
         | instance, an LLM-powered realtime grammar checker in an editor.
        
         | ClarityJones wrote:
         | Perhaps this is naive, but in my mind it can be useful for
         | learning.
         | 
         | - Hook LLM to VMs
         | 
         | - Ask for code that [counts to 10]
         | 
         | - Run code on VM
         | 
         | - Ask different LLM to Evaluate Results.
         | 
         | - Repeat for sufficient volume.
         | 
         | - Train.
         | 
         | The faster it can generate results the faster those results can
         | be tested against the real world, e.g. a VM, users on X, other
         | models with known accuracies.
        
       | dnnssl2 wrote:
       | If you were to serve this on a datacenter server, is the client
       | to server roundtrip networking the slowest part of the inference?
        | Curious if it would be faster to run this on cloud GPUs (better
        | hardware, but farther away) or locally on worse hardware.
        
         | chillee wrote:
         | Surprisingly, no. And part of this is that text generation is
          | _really_ expensive. Unlike traditional ML inference (like with
          | ResNets), you don't just pass your data through your model
         | once. You need to pass it over and over again (once for each
         | token you generate).
         | 
         | So, in practice, a full "text completion request" can often
         | take on the order of seconds, which dwarfs the client <->
         | server roundtrip.
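          | 
          | In other words, a greedy generation loop looks roughly like
          | this (a sketch, not the gpt-fast code; `model` is assumed to
          | return logits of shape [batch, seq, vocab]):
          | 
          |     import torch
          | 
          |     @torch.no_grad()
          |     def generate(model, tokens, max_new_tokens):
          |         for _ in range(max_new_tokens):
          |             logits = model(tokens)  # one full forward pass
          |             nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
          |             tokens = torch.cat([tokens, nxt], dim=-1)
          |         return tokens
          | 
          | So a 200-token completion means ~200 forward passes through
          | the model (a KV cache saves recomputing attention over the
          | prefix, but the weights still have to be read every step).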
        
           | dnnssl2 wrote:
           | Is this still the case for sliding window attention/streaming
           | LLMs, where you have a fixed length attention window rather
           | than infinitely passing in new tokens for quadratic scaling?
           | You even get better performance due to purposely downsampling
           | non-meaningful attention sink tokens.
        
             | chillee wrote:
             | I cover it a bit in the blog post, but unless you have a
             | _really_ long context length (like 32k+), your primary
              | computational cost doesn't come from attention but rather
             | from loading your weights from VRAM into registers.
             | 
             | I mean, practically speaking, completions from say, ChatGPT
             | or Claude take seconds to finish :)
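              | 
              | A rough back-of-envelope for why (approximate numbers): a
              | 7B-parameter model in fp16/bf16 is ~14GB of weights, and
              | every generated token has to read all of them, so memory
              | bandwidth sets a floor on per-token latency:
              | 
              |     params = 7e9
              |     bytes_per_param = 2      # fp16/bf16
              |     bandwidth = 2e12         # ~2 TB/s on an A100-80GB
              | 
              |     # Lower bound on seconds per token when weight loading
              |     # dominates (ignores KV cache and activations).
              |     t = params * bytes_per_param / bandwidth   # ~0.007 s
              |     print(1 / t)             # ~143 tok/s best case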
        
       | dnnssl2 wrote:
       | How does one select a good candidate for the draft model in
       | speculative decoding? I imagine that there's some better
       | intuition than just selecting the next parameter count down (i.e
       | 70B -> 13B, 13B -> 7B).
       | 
       | Also how does that interact with MoE models? Do you have a mini
       | version of the MoE, with smaller experts?
        
         | chillee wrote:
         | This is indeed a bit of a dark art. Essentially, you want a
         | balance between "is significantly faster than base model" and
         | "generates similar stuff to the base model".
         | 
          | Anecdotally, folks often seem to use, say, a 70B base model
          | with a 7B draft model. But I think there's a lot of room for
          | experimentation and improvement here.
         | 
         | You could... say, take a 70B model and maybe just chop off the
         | last 90% of layers and then fine-tune. Or perhaps you could use
          | a model that's trained to generate 8 tokens at once. Or perhaps
          | you could just use a statistical "n-gram" predictor.
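          | 
          | To make the mechanics concrete, a stripped-down greedy
          | version of speculative decoding (a sketch assuming batch size
          | 1 and hypothetical `draft_model`/`base_model` callables that
          | return logits) might look like this:
          | 
          |     import torch
          | 
          |     @torch.no_grad()
          |     def speculative_step(base_model, draft_model, tokens, k=8):
          |         T = tokens.shape[1]
          | 
          |         # 1. The cheap draft model proposes k tokens, one at a time.
          |         draft = tokens
          |         for _ in range(k):
          |             nxt = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
          |             draft = torch.cat([draft, nxt], dim=-1)
          | 
          |         # 2. The expensive base model scores all k proposals in a
          |         #    single forward pass.
          |         base_preds = base_model(draft[:, :-1])[:, T - 1:].argmax(-1)
          | 
          |         # 3. Accept proposals up to the first disagreement, then
          |         #    append the base model's own token there (when some
          |         #    proposal was rejected).
          |         proposed = draft[:, T:]
          |         ok = (proposed == base_preds).cumprod(dim=-1)
          |         n = int(ok.sum())
          |         return torch.cat(
          |             [tokens, proposed[:, :n], base_preds[:, n:n + 1]], dim=-1
          |         )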
        
       | brucethemoose2 wrote:
       | This is similar to exllamav2, and exllamav2's quantization is
       | also excellent.
        
       | claytonjy wrote:
       | One of the notable tricks the various LLM serving frameworks
        | provide is a special approach to batching, e.g. continuous,
        | persistent, or in-flight batching, depending on the inference
       | framework. At some level they each allow you to start a new
       | generation while in the middle of one or more previous
       | generations.
       | 
       | Is that possible with "just" pytorch? Could it be added to gpt-
       | fast?
        
         | chillee wrote:
         | Yeah it's certainly possible, but it's not the focus of this
         | implementation, which is more latency focused (so BS=1).
        
       ___________________________________________________________________
       (page generated 2023-11-30 23:00 UTC)