[HN Gopher] vLLM: 24x faster LLM serving than HuggingFace Transf...
___________________________________________________________________

  vLLM: 24x faster LLM serving than HuggingFace Transformers

  Author : wskwon
  Score  : 132 points
  Date   : 2023-06-20 19:17 UTC (3 hours ago)

(HTM) web link (vllm.ai)
(TXT) w3m dump (vllm.ai)

| two_in_one wrote:
| I wonder if this sort of memory management can be made for
| PyTorch transformers as an under-the-hood optimization.

| thewataccount wrote:
| This is really cool to see.
|
| > Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.
|
| > Dynamic: Its size depends on the sequence length, which is
| highly variable and unpredictable. As a result, efficiently
| managing the KV cache presents a significant challenge. We find
| that existing systems waste 60% - 80% of memory due to
| fragmentation and over-reservation.
|
| This mentions improvements for throughput, which is great, and it
| mentions memory savings. I'm a bit confused how 80% of the memory
| could be wasted by the KV cache when the vast majority of the
| memory is usually holding the model itself?
|
| How much memory savings does this translate to, effectively, for
| say a 30B 4-bit model?

| zhisbug wrote:
| This really depends on what GPUs you use. If your GPUs have a
| small amount of memory, vLLM will help more.
|
| vLLM addresses the memory bottleneck for saving KV caches and
| hence increases the throughput.
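To make the quoted 1.7 GB figure and the fragmentation waste concrete, here is a back-of-envelope sketch. It assumes LLaMA-13B's published shape (40 transformer layers, hidden size 5120) with fp16 keys and values at the model's 2048-token context length; these numbers are not taken from the post itself, but they reproduce its per-sequence estimate.

    # Rough KV-cache sizing for LLaMA-13B (assumed: 40 layers, hidden size
    # 5120, fp16 keys/values, 2048-token context window).
    n_layers, hidden, dtype_bytes, seq_len = 40, 5120, 2, 2048

    # Every token in a sequence stores one key and one value vector per layer.
    bytes_per_token = 2 * n_layers * hidden * dtype_bytes   # ~0.8 MB per token
    cache_per_seq = bytes_per_token * seq_len                # ~1.7 GB per sequence

    print(f"KV cache per token    : {bytes_per_token / 1e6:.2f} MB")
    print(f"KV cache per sequence : {cache_per_seq / 1e9:.2f} GB")

    # The waste the post describes: a system that pre-reserves the full
    # 2048-token slot but only ever fills, say, 400 tokens leaves ~80% of the
    # reservation idle -- the over-reservation PagedAttention avoids by handing
    # out cache space in small blocks on demand.
    used = 400
    print(f"slot utilization at {used} tokens: {used / seq_len:.0%}")

So the model weights still dominate a single request, but once many sequences are batched, the per-sequence cache plus the padding around it quickly becomes the limiting factor, which is where the throughput gains come from.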
| Solvency wrote:
| Semi-related question: this page is full of little charts and
| diagrams. There are thousands of similar
| projects/sites/experiment sites with their own charts and
| diagrams. But it seems like there are always subtle-to-large
| differences in them that indicate they're made with totally
| different libraries.
|
| Are there just thousands of homebrewed, non-standard chart and
| diagram builders out there? How does one even begin to pick a
| standard to whip out quickies like these? Google SEO makes it
| virtually impossible to get to substance.

| daedbe wrote:
| I often see charts produced using matplotlib or plotly - often
| you can tell based on the colour schemes used. For example, the
| bar chart at the bottom of this paper looks like it was made
| with plotly. I think the reason for such variance in the style
| of charts is largely due to the flexibility frameworks such as
| matplotlib provide: you can control basically every aspect of a
| chart and use any number of predefined or custom stylesheets to
| change the look and feel.

| wskwon wrote:
| We used matplotlib for the performance charts, and used a free
| website to convert Google Slides to the animated GIFs.

| kristjansson wrote:
| The color scheme on these implies Google Drawing, but I don't
| know how they made them into animations - maybe just manually?

| mattnewton wrote:
| Google Slides, I think.

| marcopicentini wrote:
| Is there a hosted demo available?
|
| What are the use cases for which open source models are
| equivalent to GPT-3.5?

| wskwon wrote:
| You can think of LMSYS Vicuna: https://chat.lmsys.org as our
| hosted demo, as it actually uses vLLM as the backend.

| bioemerl wrote:
| I'm spoiled by 4-bit, and unfortunately it doesn't appear to be
| supported here, so this isn't of much use to me, but it's awesome
| to see people working on the inference speed side of things
| regardless.

| george_123 wrote:
| This approach to managing the KV cache can work with 4-bit.
| Imagine the speedup of PagedAttention with quantization...

| zhisbug wrote:
| Yep, it is agnostic to 4-bit. You can deploy a 4-bit model and
| still use vLLM + PagedAttention to double or even triple your
| serving throughput.

| ynniv wrote:
| If this were submitted as a new comment it would be at the
| top of the page.

| brucethemoose2 wrote:
| Reading between the lines, it sounds like some of the speedup
| comes from VRAM savings on an otherwise close-to-full GPU?
|
| This is definitely cool and needed, but it might not be so
| dramatic running a 3-5 bit quant on a less full GPU.

| scv119 wrote:
| Pretty cool stuff, and the results are amazing. Hoping we will
| see virtual memory get standardized in PyTorch or CUDA.

| gwph wrote:
| Ion Stoica's lab continues to be a powerhouse of innovation.
| Previous successes of Stoica and his students include (but are
| certainly not limited to) Apache Spark, Ray, Apache Mesos and
| Alluxio.

| kossTKR wrote:
| Does this mean that GPT-4/65B-level performance is closer to
| running on, say, an M1/M2 with only 24+ gigabytes of RAM?

| wskwon wrote:
| Not really. vLLM optimizes the throughput of your LLM, but does
| not reduce the minimum amount of resources required to run your
| model.

| jokoon wrote:
| Now do the same for image classifiers. I tried a few of them;
| they're just horribly slow.
|
| This is pretty outrageous considering the first robust image
| classifiers appeared around 2007.

| wskwon wrote:
| vLLM has been adopted by LMSYS for serving Vicuna and Chatbot
| Arena.
___________________________________________________________________
(page generated 2023-06-20 23:00 UTC)
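For anyone who wants to try the serving path discussed in this thread (vLLM as the backend behind the Vicuna demo, optionally with a quantized model), here is a minimal sketch of vLLM's offline generation API as of this release. The checkpoint name is only a placeholder, and constructor arguments may differ between versions; check the project docs for the current interface.

    # Minimal vLLM usage sketch (placeholder model name; verify the flags
    # against the version you install with `pip install vllm`).
    from vllm import LLM, SamplingParams

    prompts = [
        "Explain PagedAttention in one sentence.",
        "Why does KV-cache fragmentation hurt serving throughput?",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    llm = LLM(model="lmsys/vicuna-13b-v1.3")   # placeholder checkpoint
    outputs = llm.generate(prompts, sampling_params)

    for out in outputs:
        print(out.prompt)
        print(out.outputs[0].text)

Note that this is throughput-oriented batch generation; as wskwon points out above, it does not lower the memory floor needed to load the model in the first place.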