[HN Gopher] High-Speed Large Language Model Serving on PCs with ...
___________________________________________________________________
High-Speed Large Language Model Serving on PCs with Consumer-Grade
GPUs

Author : dataminer
Score  : 253 points
Date   : 2023-12-20 13:46 UTC (9 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| brucethemoose2 wrote:
| This is super cool.
|
| For all the love llama.cpp gets, its method of dGPU offloading
| (prompt processing on the GPU, then just splitting the model
| down the middle) is relatively simple. But it's interesting that
| there even _is_ so much "activation sparsity" to take advantage
| of. The traditional thinking in ML is that memory access is very
| random.
|
| Hopefully the "cold" neurons eventually get offloaded to the IGP
| instead?
|
| Also, it's curious that they are considering a Metal kernel. I
| thought the performance advantage came from the hybrid memory
| pool... It seems like that would only help old AMD Macs, unless
| I am missing something?

| sroussey wrote:
| The only thing I can think of on the question of Apple Silicon
| and Metal is that they could still split the cold neurons out to
| the CPU (via Accelerate) and keep the hot ones on the GPU, and
| utilize both. The speedup is likely smaller if there is already
| no copying of data between GPU and CPU thanks to unified memory.
| Still, it would be great if you could use even more of the
| chip's capabilities simultaneously. To avoid thermal throttling
| they should use the efficiency cores only (I think this is what
| Game Mode does).

| coder543 wrote:
| "Power*" made me think of Microsoft, so I was almost expecting
| this to be Windows-specific. (PowerShell, PowerPoint, Power BI,
| Power Apps, Power Automate... I'm probably forgetting some.)

| HPsquared wrote:
| PowerToys is probably the original (going back to PowerToys for
| Windows 95).
|
| Edit: https://socket3.wordpress.com/2016/10/22/using-windows-95-po...

| coder543 wrote:
| PowerPoint existed in the late 80s, I think, although from what
| I understand Microsoft acquired it.

| latchkey wrote:
| https://en.wikipedia.org/wiki/PowerPC

| EwanG wrote:
| The important stuff from the readme (if you're not looking to
| tinker with it directly):
|
| We have tested PowerInfer on the following platforms:
|
| - x86-64 CPU (with AVX2 instructions) on Linux
| - x86-64 CPU and NVIDIA GPU on Linux
| - Apple M chips on macOS (as we do not optimize for Mac, the
|   performance improvement is not significant now)
|
| And new features coming soon:
|
| - Mistral-7B model
| - Metal backend for sparse inference on macOS

| rahimnathwani wrote:
| Also worth mentioning: the downloadable llama2 models, and the
| convert.py file.

| 127 wrote:
| Running uncensored Mixtral on this would be really nice. More
| than 3-bit quantization for a 4090.

| eurekin wrote:
| Downvoters, care to comment? Uncensored LLM versions typically
| perform better (at least on benchmarks) than their "lobotomized"
| or aligned counterparts.

| infotainment wrote:
| Probably because the parent comment didn't contain much of
| substance. "Oh, I'd love to see this with [insert my favorite
| model here]" doesn't really add a lot to the discussion.
|
| For example, the parent commenter could have talked about the
| specific attributes of that model that make it superior. I
| personally am aware that Mixtral is one of the best-performing
| models right now, but is everyone else? Also, does Mixtral need
| to be uncensored? I've used vanilla Mistral for
| some...interesting...prompts and had no issues with it
| moralizing at me.
| BriggyDwiggs42 wrote:
| Lol

| mirekrusin wrote:
| Dual GPUs should be considered a normal consumer-grade setup;
| hopefully they'll add support for that soon. At 4 bits it's
| enough, with plenty of space left for context.
|
| This whole thing is a fork of llama.cpp; also hoping it'll all
| go upstream sooner or later.

| legel wrote:
| Yeah, so they demo a bigger model on an RTX 4090 with 24 GB
| VRAM. Granted, an implementation of sparse activations with
| Mixture of Experts could be non-trivial, but I think it's a
| brilliant move that could potentially allow for, e.g., CPU-only
| processing and/or much cheaper GPU processing... Mixtral
| technically already has neural-network-controlled sparse
| activations, but like the Inception meme says: we must go
| deeper...

| ekianjo wrote:
| How much of a speed increase do we get on CPU-only
| configurations? Has anyone tested it in such cases?

| ComputerGuru wrote:
| This architecture is specifically aimed at optimizing GPU use.

| NavinF wrote:
| CPU-only is impractical for most use cases, and this will only
| become more true over time as models become larger. The mediocre
| perf/$ and perf/watt make it not worth the effort.

| jupp0r wrote:
| From my understanding, in this implementation some knowledge
| about the model itself is needed to determine which parts to
| place in system memory and which in GPU memory. Can this ideally
| be computed automatically, or will future models have some sort
| of interface for placement algorithms like this to help automate
| it? If the algorithm needs to be adapted for each model
| architecture, it's going to be a lot of work to maintain this
| project.

| loudmax wrote:
| That sounds about right. They provide a script to combine their
| "Predictor" weights with the original models, but I don't see
| anything obvious on the front page of the GitHub repo about how
| to create those weights.
|
| A 10x speed improvement is really impressive. If this kind of
| improvement is reproducible across other models, then presumably
| identifying hot and cold neurons for inference optimization
| should become a normal part of the model development process.

| thelastparadise wrote:
| Like JVM "hot spots," or JIT optimization.

| jupp0r wrote:
| Or profile-guided optimization.

| phh wrote:
| It took me a while to understand what their "hot" and "cold"
| neurons meant, since in most of the ML I do there is no such
| notion. And their paper doesn't directly define it (or I missed
| it).
|
| After some thought, with ReLU it does make sense: half of the
| function is constant, so you can say a neuron is "cold" if its
| ReLU-ed output is often 0. So I checked whether ReLU is common
| in LLMs; the original llama doesn't use it. But after
| (re-)reading the GitHub page, this actually only works on ReLU
| models. It turns out there is a group of people "fine-tuning" (I
| would rather call it re-training, since you start by breaking
| the model?) models to use ReLU to allow for that sparsity:
| https://huggingface.co/SparseLLM
|
| So this is sadly not applicable to just any model you can find
| on the internet, but it sounds like great progress anyway.
| Possibly this might shift the trade-offs back toward bigger
| models with "less ideal" activations. Also, I'm curious what the
| legal impact would be (since the USA and EU refer to a model's
| FLOPs/number of parameters... how do you compute that with
| sparsity? Do you average?)
|
| I think a possible avenue for future research in this area is
| keeping the original activation (like llama keeping SwiGLU) but
| using quantization to define "hot" and "cold" neurons via
| saturation regions. (For example, declaring that this activation
| function, below -1 at 8 bits, is equivalent to -infinity, and
| thus this is a cold neuron.)
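To make the hot/cold idea concrete, here is a minimal sketch of how
ReLU sparsity can be measured offline. It is a toy numpy example,
not PowerInfer's actual predictor; the layer shapes and the 0.5
threshold are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy FFN layer: 512 inputs -> 2048 neurons (columns of W), ReLU.
    W = rng.normal(size=(512, 2048))
    b = rng.normal(size=2048) - 1.0   # negative bias -> many zeros

    # Push a batch of calibration inputs through the layer.
    X = rng.normal(size=(10_000, 512))
    H = np.maximum(X @ W + b, 0.0)    # ReLU zeroes everything below 0

    # A neuron is "hot" if it fires (output > 0) on many inputs.
    fire_rate = (H > 0).mean(axis=0)  # firing fraction per neuron
    hot = fire_rate > 0.5             # illustrative threshold
    print(f"hot neurons: {hot.sum()} / {hot.size}")

The serving idea is then to keep the weights of hot neurons resident
in VRAM, leave cold ones in system RAM, and guess at inference time
which cold neurons will fire for the current token, skipping the
rest.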
| brucethemoose2 wrote:
| That is a huge caveat to leave out of a readme, especially one
| that claims llama compatibility.

| acqq wrote:
| Indeed:
|
| https://huggingface.co/SparseLLM/ReluFalcon-40B
|
| "We utilize PowerInfer for inference"

| boredumb wrote:
| > Also, I'm curious what the legal impact would be (since the
| > USA and EU refer to a model's FLOPs/number of parameters...
| > how do you compute that with sparsity? Do you average?)
|
| How/when did these types of regulations come about? This feels
| like an insane thing to have to keep in mind while developing.

| radicalbyte wrote:
| The EU messed up with the GDPR - they should have implemented it
| at least a decade earlier, ignored the lobbying that gave us the
| cookie banner, and instead imposed an outright ban on tracking
| for all but a tiny number of purposes. Such a ban would have had
| a negligible financial impact on the tech industry but huge
| privacy rewards.
|
| They're trying to get in early on AI so as not to make the same
| mistake again. Which might result in them making the opposite
| mistake.

| quocanh wrote:
| A tiny, negligible impact on the industry (except cutting
| advertising revenue in half, but who cares. What do ads pay for
| anyway?)

| phh wrote:
| > How/when did these types of regulations come about?
|
| I can't say much about the US. As I see it, the EU pretty much
| copied the US on that part. There was nothing related to
| computation in the EU's AI Act drafts until a few months ago; it
| was purely about "what kind of data processing are you allowed
| to do?"

| alchemist1e9 wrote:
| Politely, what the hell are you talking about? Who is telling
| anyone what they can or cannot compute?

| iamjackg wrote:
| US:
|
| https://www.whitehouse.gov/briefing-room/presidential-action...
|
| "Until such technical conditions are defined, the Secretary
| shall require compliance with these reporting requirements for:
| (i) any model that was trained using a quantity of computing
| power greater than 1026 integer or floating-point operations, or
| using primarily biological sequence data and using a quantity of
| computing power greater than 1023 integer or floating-point
| operations[...]"
|
| EU:
|
| https://thefuturesociety.org/wp-content/uploads/2023/12/EU-A...

| geon wrote:
| > 1026
|
| > 1023
|
| Should be 10^26 and 10^23.

| alchemist1e9 wrote:
| Probably I did this wrong, but I'm getting an approximation of
| 300K H100s to complete that in a month. At least they chose
| something fairly large, it seems. Not sure how LoRA or other
| incremental training is handled.
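As a rough sanity check on that estimate (assuming ~1e15 dense
FLOP/s peak per H100 and ~40% utilization; both figures are ballpark
assumptions, not from the order):

    threshold = 1e26       # FLOPs threshold from the Executive Order
    h100_flops = 1e15      # ~1 PFLOP/s per H100, assumed
    mfu = 0.4              # assumed utilization

    gpu_seconds = threshold / (h100_flops * mfu)  # 2.5e11
    gpu_months = gpu_seconds / (86_400 * 30)
    print(f"{gpu_months:,.0f} H100-months")       # ~96,000

So on the order of 10^5 H100s running for a month, the same ballpark
as the estimate above.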
| two_in_one wrote:
| Passing it through ChatGPT-4 (actually nothing specific, mostly
| empty words):
|
| Summary:
|
| The Executive Order focuses on the safe, secure, and trustworthy
| development and use of Artificial Intelligence (AI). It outlines
| a government-wide approach to manage AI responsibly, addressing
| potential societal harms like fraud, bias, and security risks.
| The order establishes guiding principles and policies for AI
| development, emphasizing safety, innovation, workers' rights,
| equity, consumer protection, privacy, government use of AI, and
| global leadership. It includes detailed definitions and actions
| for government agencies to ensure AI is developed and used
| ethically and effectively.
|
| About power/size:
|
| The Executive Order does not specifically mention the size of AI
| models or compute power in terms of FLOPs (Floating Point
| Operations per Second). It focuses more broadly on the
| principles and policies for responsible AI development and use,
| without delving into technical specifics like model size or
| compute requirements.
|
| About what developers have to do after this order:
|
| New developers of AI models, after this Executive Order, are
| encouraged to align their AI development and use with the
| outlined principles and policies. These focus on ensuring AI is
| safe, secure, trustworthy, and ethically developed, while
| addressing societal harms such as bias and privacy concerns.
| Developers should consider how their AI impacts equity,
| innovation, consumer protection, and workers' rights, and adhere
| to guidelines for responsible government use of AI.

| cyanydeez wrote:
| Anyone with a functional government.

| ComputerGuru wrote:
| It's not too much faster than exllamav2 with flash attention,
| no?

| modeless wrote:
| Everyone compares against llama.cpp because it's easy mode.
| Llama.cpp is slow! Everyone should know this. They should
| compare against exllamav2 or other optimized implementations.

| nulld3v wrote:
| ExLlama is GPU-only, right? This speedup is for GPU + CPU split
| use cases.

| modeless wrote:
| Oh, I see: they are running a 40B model unquantized, whereas
| exllamav2 would have to use 4-bit quantization to fit. Given the
| quality of 4-bit quantization these days and the speed boost it
| provides, I question the utility of running unquantized for
| serving purposes.
|
| I see they have a 4-bit benchmark lower down the page. That's
| where they ought to compare against exllamav2.

| sroussey wrote:
| What do you recommend that is faster and that I can package into
| an app for distribution?

| modeless wrote:
| I have packaged exllamav2 (plus a lot of other stuff) into an
| app for distribution here:
| https://apps.microsoft.com/detail/9NC624PBFGB7
|
| I used pyinstaller. It was difficult, because Python makes these
| things difficult. But it works. It does require an Nvidia GPU.
| MLC-LLM is another option that might be easier to package and is
| potentially able to run on AMD.

| sroussey wrote:
| Oh yeah, I want to work on AMD/Intel/NVIDIA and macOS, even
| iOS/Android.
|
| I've been following MLC-LLM as well. Right now I am just using
| JS/WASM from Huggingface, but later I will want something more
| performant.

| modeless wrote:
| Yeah, if you want maximum performance on multiple platforms
| you'll probably have to package multiple frameworks. Llama.cpp
| might be a decently fast option on Apple Silicon; I'm not sure
| of the state of the art there.

| avereveard wrote:
| Yeah, but exllama doesn't do grammars, so I'm stuck with
| llama.cpp.
|
| Also, apparently exllama has a few side effects on coherence:
| https://www.reddit.com/r/LocalLLaMA/comments/17w57eu/llm_for...

| superkuh wrote:
| In this case they're comparing against llama.cpp because the
| code is literally a modification of llama.cpp. I'm not talking
| about using the ggml lib for matrix calculations; it's literally
| using the llama.cpp main.cpp and other normal llama.cpp code.
| It's a fork. It is _directly_ comparable.
|
| https://github.com/ggerganov/llama.cpp/pull/4543 [Review] Merge
| PowerInfer with llama.cpp mainline #4543
|
| https://github.com/ggerganov/llama.cpp/discussions/4534#disc...
| "The x11 speedup is kind of cherrypicked because the llama.cpp
| GPU code for Falcon 40b is just not well-optimized."

| nextaccountic wrote:
| > Hybrid CPU/GPU Utilization: Seamlessly integrates
| memory/computation capabilities of CPU and GPU for a balanced
| workload and faster processing.
|
| Does this mean that it runs on both the CPU and GPU at the same
| time, and is faster than a CPU-only or a GPU-only implementation
| on the same device?
|
| edit: when running on integrated GPUs, can this benefit from the
| improved communication between CPU and GPU?

| rahimnathwani wrote:
| GPU-only will be faster if you have enough VRAM.
|
| But if you want to run a model that requires more VRAM than you
| have, the current approach is to use llama.cpp and specify
| n_gpu_layers. That works, but it is slower than GPU-only.
|
| OP claims to be 10x as fast as llama.cpp in the case where you
| can't fit the whole model in VRAM.
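For reference, this is what the current partial-offload approach
looks like with the llama-cpp-python bindings (the model filename
and layer count here are illustrative; as a rough size check, a 40B
model is ~80 GB of weights at fp16 but only ~20-24 GB at ~4-bit
quantization, which is why a 24 GB 4090 fits it only when
quantized):

    from llama_cpp import Llama

    # Offload the first n_gpu_layers transformer layers to VRAM;
    # the remaining layers run on the CPU.
    llm = Llama(
        model_path="falcon-40b.Q4_K_M.gguf",  # illustrative file
        n_gpu_layers=30,   # raise until VRAM is nearly full
        n_ctx=2048,
    )

    out = llm("Hot/cold neuron splitting means", max_tokens=32)
    print(out["choices"][0]["text"])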
| causality0 wrote:
| All the "consumer-grade GPUs" terminology makes it seem like you
| could run it on a variety of cards, but like _so many_ of these
| posts, is this a 4090 exclusive?

| superkuh wrote:
| This will be really cool once there's the ability to generate
| the sparse-predictor files for arbitrary models, rather than
| just the four they've done it with. Looking through the page and
| code, it doesn't seem like the tools for that step are included.
| Guess I'll wait on this one a bit. Hopefully these features will
| be merged back into llama.cpp as options eventually, since this
| is based on the normal llama.cpp code (i.e., not just using the
| ggml matrix lib).
___________________________________________________________________
(page generated 2023-12-20 23:00 UTC)