[HN Gopher] vLLM: 24x faster LLM serving than HuggingFace Transf...
       ___________________________________________________________________
        
       vLLM: 24x faster LLM serving than HuggingFace Transformers
        
       Author : wskwon
       Score  : 132 points
       Date   : 2023-06-20 19:17 UTC (3 hours ago)
        
 (HTM) web link (vllm.ai)
 (TXT) w3m dump (vllm.ai)
        
       | two_in_one wrote:
        | I wonder if this sort of memory management could be added to
        | PyTorch transformers as an under-the-hood optimization.
        
       | thewataccount wrote:
       | This is really cool to see.
       | 
       | > Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.
       | 
       | > Dynamic: Its size depends on the sequence length, which is
       | highly variable and unpredictable. As a result, efficiently
       | managing the KV cache presents a significant challenge. We find
       | that existing systems waste 60% - 80% of memory due to
       | fragmentation and over-reservation.
       | 
        | This mentions throughput improvements, which is great, and it
        | mentions memory savings. I'm a bit confused how 80% of the
        | memory could be wasted by the KV cache when the vast majority
        | of the memory is usually holding the model weights themselves?
        | 
        | How much memory does this effectively save for, say, a 30B
        | 4-bit model?
        
         | zhisbug wrote:
          | This really depends on what GPUs you use. If your GPUs have a
          | very small amount of memory, vLLM will help more.
         | 
         | vLLM addresses the memory bottleneck for saving KV caches and
         | hence increases the throughput.
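          | 
          | A rough back-of-envelope sketch of why the KV cache gets so
          | big (assuming fp16 and the standard LLaMA-13B shape; the
          | numbers here are my own, not from the post):
          | 
          |     # KV cache bytes per token for LLaMA-13B, assuming fp16
          |     # (2 bytes/value), 40 layers, hidden size 5120, and both
          |     # K and V stored per layer
          |     n_layers, hidden, bytes_per_val = 40, 5120, 2
          |     per_token = 2 * n_layers * hidden * bytes_per_val
          |     print(per_token / 2**20)         # ~0.78 MiB per token
          |     print(per_token * 2048 / 2**30)  # ~1.6 GiB at 2048 tok
          | 
          | So a single full-length sequence can already take ~1.6 GiB on
          | top of the ~26 GB of fp16 weights, and existing systems
          | reserve that worst-case block per request up front, which is
          | where the fragmentation and over-reservation waste comes from.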
        
       | Solvency wrote:
        | Semi-related question: this page is full of little charts and
        | diagrams. There are thousands of similar project and experiment
        | sites with their own charts and diagrams, but there always seem
        | to be subtle-to-large differences that indicate they were made
        | with totally different libraries.
        | 
        | Are there just thousands of homebrewed, non-standard chart and
        | diagram builders out there? How does one even begin to pick a
        | standard to whip out quickies like these? Google SEO makes it
        | virtually impossible to get to the substance.
        
         | daedbe wrote:
         | I often see charts produced using matplotlib or plotly - often
         | you can tell based on the colour schemes used. For example, the
         | bar chart at the bottom of this paper looks like it was made
         | with plotly. I think the reason for such variance in the style
         | of charts is largely due to the flexibility frameworks such as
         | matplotlib provide: you can control basically every aspect of a
         | chart and use any number of predefined or custom stylesheets to
         | change the look and feel.
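          | 
          | As a tiny illustration of the stylesheet point (a generic
          | matplotlib sketch with made-up data, not the actual vLLM
          | charts):
          | 
          |     import matplotlib.pyplot as plt
          | 
          |     # "ggplot" is one of the stylesheets bundled with
          |     # matplotlib; swapping it out restyles the whole figure
          |     plt.style.use("ggplot")
          | 
          |     fig, ax = plt.subplots()
          |     # dummy data, just to show the restyling
          |     ax.bar(["system A", "system B"], [1.0, 3.0])
          |     ax.set_ylabel("relative throughput")
          |     fig.savefig("chart.png")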
        
         | wskwon wrote:
          | We used matplotlib for the performance charts, and a free
          | website to convert Google Slides into the animated GIFs.
        
         | kristjansson wrote:
         | The color scheme on these implies Google Drawing, but I don't
         | know how they made them into animations - maybe just manually?
        
           | mattnewton wrote:
           | Google slides I think.
        
       | marcopicentini wrote:
        | Is a hosted demo available?
        | 
        | What are the use cases for which open source models are
        | equivalent to GPT-3.5?
        
         | wskwon wrote:
         | You can think of LMSYS Vicuna: https://chat.lmsys.org as our
         | hosted demo, as it actually uses vLLM as the backend.
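          | 
          | If you want to try it locally, the offline Python API looks
          | roughly like this (the tiny OPT model is just the example
          | from the quickstart; swap in any supported model):
          | 
          |     from vllm import LLM, SamplingParams
          | 
          |     prompts = ["Hello, my name is",
          |                "The capital of France is"]
          |     params = SamplingParams(temperature=0.8, top_p=0.95)
          | 
          |     # load the model and run batched generation with
          |     # PagedAttention under the hood
          |     llm = LLM(model="facebook/opt-125m")
          |     outputs = llm.generate(prompts, params)
          |     for out in outputs:
          |         print(out.prompt, out.outputs[0].text)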
        
       | bioemerl wrote:
        | I'm spoiled by 4-bit and unfortunately it doesn't appear to be
        | supported here, so this isn't of much use to me, but it's awesome
       | to see people working on the inference speed side of things
       | regardless.
        
         | george_123 wrote:
          | This approach to managing the KV cache can work with 4-bit.
          | Imagine the speedup of PagedAttention with quantization...
        
           | zhisbug wrote:
            | Yep, it is agnostic to 4-bit. You can deploy a 4-bit model
            | and still use vLLM + PagedAttention to double or even triple
            | your serving throughput.
        
             | ynniv wrote:
             | If this were submitted as a new comment it would be at the
             | top of the page.
        
       | brucethemoose2 wrote:
       | Reading between the lines, it sounds like some of the speedup
       | comes from VRAM savings on an otherwise close to full GPU?
       | 
        | This is definitely cool and needed, but it might not be so
        | dramatic when running a 3-5 bit quant on a less full GPU.
        
       | scv119 wrote:
        | Pretty cool stuff, and the results are amazing. Hoping we will
        | see virtual memory get standardized in PyTorch or CUDA.
        
       | gwph wrote:
       | Ion Stoica's lab continues to be a powerhouse of innovation.
       | Previous successes of Stoica and his students include (but are
       | certainly not limited to) Apache Spark, Ray, Apache Mesos and
       | Alluxio.
        
       | kossTKR wrote:
        | Does this mean that GPT-4/65B-level performance is closer to
        | running on, say, an M1/M2 with only 24+ gigabytes of RAM?
        
         | wskwon wrote:
          | Not really. vLLM optimizes the throughput of your LLM, but does
          | not reduce the minimum amount of resources required to run your
          | model.
        
       | jokoon wrote:
       | Now do the same for image classifiers. I tried a few of them,
       | they're just horribly slow.
       | 
        | This is pretty outrageous considering the first robust image
        | classifiers appeared around 2007.
        
       | wskwon wrote:
       | vLLM has been adopted by LMSYS for serving Vicuna and Chatbot
       | Arena.
        
       ___________________________________________________________________
       (page generated 2023-06-20 23:00 UTC)