[HN Gopher] High-Speed Large Language Model Serving on PCs with ...
       ___________________________________________________________________
        
       High-Speed Large Language Model Serving on PCs with Consumer-Grade
       GPUs
        
       Author : dataminer
       Score  : 253 points
       Date   : 2023-12-20 13:46 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | brucethemoose2 wrote:
       | This is super cool.
       | 
       | For all the love llama.cpp gets, its method of dGPU offloading
       | (prompt processing on GPU and then just splitting the model down
        | the middle) is relatively simple. But it's interesting that there
        | even _is_ so much "activation sparsity" to take advantage of.
       | The traditional thinking in ML is that memory access is very
       | random.
       | 
       | Hopefully the "cold" neurons eventually get offloaded to the IGP
       | instead?
       | 
        | Also, it's curious that they are considering a Metal kernel. I
        | thought the performance advantage came from the hybrid memory
        | pool... seems like that would only help older Intel Macs with
        | AMD dGPUs, unless I am missing something?
        
         | sroussey wrote:
          | The only thing I could think of on the question of Apple
          | Silicon and Metal is that they could still split the cold
          | neurons out to the CPU (via Accelerate) and keep the hot ones
          | on the GPU, using both at once. The speedup is likely smaller,
          | since unified memory already avoids copying data between the
          | GPU and CPU. Still, it would be great if you could use even
          | more of the capabilities of the chip simultaneously. To avoid
          | thermal throttling they should use the efficiency cores only
          | (I think this is what game mode does).
        
       | coder543 wrote:
       | "Power*" made me think of Microsoft, so I was almost expecting
       | this to be Windows-specific. (PowerShell, PowerPoint, Power BI,
       | Power Apps, Power Automate... I'm probably forgetting some.)
        
         | HPsquared wrote:
         | PowerToys are probably the original (going back to PowerToys
         | for Windows 95)
         | 
         | Edit: https://socket3.wordpress.com/2016/10/22/using-
         | windows-95-po...
        
           | coder543 wrote:
            | PowerPoint existed in the late 80s, I think, although from
            | what I understand Microsoft acquired it.
        
         | latchkey wrote:
         | https://en.wikipedia.org/wiki/PowerPC
        
       | EwanG wrote:
       | The important stuff from the readme (if you're not looking to
       | tinker with it directly):
       | 
       | We have tested PowerInfer on the following platforms:
       | 
       | x86-64 CPU (with AVX2 instructions) on Linux
       | 
       | x86-64 CPU and NVIDIA GPU on Linux
       | 
       | Apple M Chips on macOS (As we do not optimize for Mac, the
       | performance improvement is not significant now.)
       | 
       | And new features coming soon:
       | 
       | Mistral-7B model
       | 
       | Metal backend for sparse inference on macOS
        
         | rahimnathwani wrote:
         | Also worth mentioning the downloadable llama2 models, and the
         | convert.py file.
        
       | 127 wrote:
        | Running uncensored Mixtral on this would be really nice, at more
        | than 3-bit quantization on a 4090.
        
         | eurekin wrote:
          | Downvoters care to comment? Uncensored LLM versions typically
          | perform better (at least on benchmarks) than their
          | "lobotomized" or aligned counterparts.
        
           | infotainment wrote:
           | Probably because the parent comment didn't contain much of
           | substance. "Oh, I'd love to see this with [insert my favorite
           | model here]" doesn't really add a lot to the discussion.
           | 
           | For example, the parent commenter could have talked about the
           | specific attributes of that model that make it superior. I
           | personally am aware that Mixtral is one of the best
           | performing models right now, but is everyone else? Also, does
           | Mixtral need to be uncensored? I've used vanilla Mistral for
           | some...interesting...prompts and had no issues with it
           | moralizing at me.
        
             | BriggyDwiggs42 wrote:
             | Lol
        
         | mirekrusin wrote:
          | Dual GPUs should be considered a normal, consumer-grade setup;
          | hopefully they'll add support for that soon. At 4 bits it's
          | enough, with plenty of space left over for context.
          | 
          | This whole thing is a fork of llama.cpp; I'm also hoping it'll
          | all go upstream sooner or later.
        
         | legel wrote:
          | Yeah, so they demo a bigger model on an RTX 4090 with 24 GB
          | VRAM. Granted, an implementation of sparse activations on top
          | of a Mixture of Experts could be non-trivial, but I think it's
          | a brilliant move that could potentially allow for, e.g.,
          | CPU-only processing and/or much cheaper GPU processing...
          | Mixtral technically already has neural-network-controlled
          | sparse activations, but like the Inception meme says: we must
          | go deeper...
        
       | ekianjo wrote:
        | How much of a speed increase do we get on CPU-only
        | configurations? Has anyone tested it in such cases?
        
         | ComputerGuru wrote:
         | This architecture is specifically aimed at optimizing GPU use.
        
         | NavinF wrote:
          | CPU-only is impractical for most use cases, and this will only
          | become more true over time as models get larger. The mediocre
          | perf/$ and perf/watt make it not worth the effort.
        
       | jupp0r wrote:
        | From my understanding, in this implementation some knowledge
        | about the model itself is needed to determine which parts to
        | place in system memory vs. which parts to place in GPU memory.
        | Can this ideally be computed automatically, or will future
        | models have some sort of interface for placement algorithms like
        | this to help automate it? If the algorithm needs to be adapted
        | for each model architecture, it's going to be a lot of work to
        | maintain this project.
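        | 
        | To make that concrete, here's a minimal sketch (made-up names,
        | not PowerInfer's actual algorithm) of the kind of placement pass
        | I'm imagining: given per-neuron activation frequencies from some
        | profiling run, greedily keep the hottest neurons in VRAM and
        | spill the rest to system memory.
        | 
        |     # Hypothetical GPU/CPU placement by activation frequency.
        |     def place_neurons(act_freq, bytes_per_neuron, vram_budget):
        |         gpu, cpu, used = [], [], 0
        |         # Hottest neurons first, so VRAM goes to the most-used
        |         # weights; whatever doesn't fit stays in system memory.
        |         by_heat = sorted(act_freq, key=act_freq.get, reverse=True)
        |         for nid in by_heat:
        |             if used + bytes_per_neuron <= vram_budget:
        |                 gpu.append(nid)
        |                 used += bytes_per_neuron
        |             else:
        |                 cpu.append(nid)
        |         return gpu, cpu
        | 
        |     # Example: 10k neurons, 8 KiB of weights each, 32 MiB spare.
        |     freq = {i: (i % 100) / 100.0 for i in range(10_000)}
        |     gpu_ids, cpu_ids = place_neurons(freq, 8 * 1024, 32 * 1024**2)
        |     print(len(gpu_ids), "on GPU,", len(cpu_ids), "on CPU")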
        
         | loudmax wrote:
          | That sounds about right. They provide a script to combine
          | their "Predictor" weights with the original models, but I
          | don't see anything obvious on the front page of the GitHub
          | repo about how to create those weights.
          | 
          | A 10x speed improvement is really impressive. If this kind of
          | improvement is reproducible across other models, then
          | presumably identifying hot and cold neurons for inference
          | optimization should become a normal part of the model
          | development process.
        
           | thelastparadise wrote:
           | Like JVM "hot spots," or JIT optimization.
        
             | jupp0r wrote:
             | Or profile guided optimization.
        
       | phh wrote:
        | It took me a while to understand what their "hot" and "cold"
        | neurons meant, since in most ML I do there is no such notion.
        | And their paper doesn't directly define it (or I missed it).
        | 
        | After some thought, it does make sense for ReLU, because half of
        | the function is constant, so you can say a neuron is "cold" if
        | its ReLU-ed output is often 0. So I checked whether ReLU was
        | common in LLMs: the original LLaMA doesn't use ReLU. But after
        | (re-)reading the GitHub page, this actually only works on ReLU
        | models. It turns out there is a group of people "fine-tuning" (I
        | would rather call that re-training, since you start by breaking
        | the model?) models to use ReLU to allow for that sparsity:
        | https://huggingface.co/SparseLLM
       | 
        | So this is sadly not applicable to just any model you can find
        | on the internet, but it sounds like great progress anyway.
        | Possibly this might shift the trade-offs back toward bigger
        | models with "less ideal" activations. Also, I'm curious what the
        | legal impact would be (since the USA and EU refer to a model's
        | FLOPs/number of parameters... How do you compute it with
        | sparsity? Do you average?)
       | 
        | I think a possible avenue for future research in that area is
        | keeping the original activation (like LLaMA keeping SwiGLU), but
        | using quantization to define "hot" and "cold" neurons via
        | saturation regions. (For example, saying that an activation
        | below -1.0 in 8-bit terms is equivalent to -infinity, and thus
        | the neuron is cold.)
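        | 
        | A minimal sketch of what I mean by "often 0", assuming you can
        | record the post-ReLU activations of one layer over a calibration
        | run (the data and threshold here are made up, just to
        | illustrate):
        | 
        |     import numpy as np
        | 
        |     # acts: (n_tokens, n_neurons) post-ReLU activations captured
        |     # during a calibration run; a neuron is "cold" if it is zero
        |     # for almost every token.
        |     def classify_neurons(acts, cold_threshold=0.9):
        |         zero_frac = (acts <= 0.0).mean(axis=0)
        |         cold = np.where(zero_frac >= cold_threshold)[0]
        |         hot = np.where(zero_frac < cold_threshold)[0]
        |         return hot, cold
        | 
        |     # Fake calibration data: per-neuron biases make some neurons
        |     # fire often and others almost never.
        |     rng = np.random.default_rng(0)
        |     biases = rng.uniform(-3.0, 1.0, size=512)
        |     acts = np.maximum(rng.normal(biases, 1.0, (4096, 512)), 0.0)
        |     hot, cold = classify_neurons(acts)
        |     print(len(hot), "hot /", len(cold), "cold neurons")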
        
         | brucethemoose2 wrote:
         | That is a huge caveat to leave out of a readme, especially one
         | that claims llama compatibility.
        
         | acqq wrote:
         | Indeed
         | 
         | https://huggingface.co/SparseLLM/ReluFalcon-40B
         | 
         | "We utilize PowerInfer for inference"
        
         | boredumb wrote:
          | > Also, I'm curious what the legal impact would be (since the
          | USA and EU refer to a model's FLOPs/number of parameters...
          | How do you compute it with sparsity? Do you average?)
         | 
         | How/when did these types of regulations come about? This feels
         | like an insane thing to have to keep in mind while developing.
        
           | radicalbyte wrote:
            | The EU messed up with the GDPR - they should have
            | implemented it at least a decade earlier, and ignored the
            | lobbying that gave us the cookie banner instead of an
            | outright ban on tracking for all but a tiny number of
            | purposes. Such a ban would have had a negligible financial
            | impact on the tech industry but huge privacy rewards.
            | 
            | They're trying to get in early on AI so as not to make the
            | same mistake again. Which might result in them making the
            | opposite mistake.
        
             | quocanh wrote:
              | A negligible impact on the industry (except cutting
              | advertising revenue in half, but who cares? What do ads
              | pay for, anyway?)
        
           | phh wrote:
           | > How/when did these types of regulations come about?
           | 
            | I can't say much about the US. As I see it, the EU pretty
            | much copied the US on that part. There was nothing related
            | to computation in the EU's AI Act drafts until a few months
            | ago; it was purely "what kind of data processing are you
            | allowed to do?"
        
             | alchemist1e9 wrote:
             | Politely, what the hell are you talking about? Who is
             | telling anyone what they can or cannot compute?
        
               | iamjackg wrote:
               | US:
               | 
               | https://www.whitehouse.gov/briefing-room/presidential-
               | action...
               | 
               | "Until such technical conditions are defined, the
               | Secretary shall require compliance with these reporting
               | requirements for:                         (i)   any model
               | that was trained using a quantity of computing power
               | greater than 1026 integer or floating-point operations,
               | or using primarily biological sequence data and using a
               | quantity of computing power greater than 1023 integer or
               | floating-point operations[...]"
               | 
               | EU:
               | 
               | https://thefuturesociety.org/wp-
               | content/uploads/2023/12/EU-A...
        
               | geon wrote:
               | > 1026
               | 
               | > 1023
               | 
               | Should be 10^26 and 10^23.
        
               | alchemist1e9 wrote:
                | Probably I did this wrong, but I'm getting roughly 300K
                | H100s to complete that in a month. At least they chose
                | something fairly large, it seems. Not sure how LoRA or
                | other incremental training is handled.
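                | 
                | Back-of-envelope, for anyone who wants to check the
                | assumptions (the answer scales directly with whatever
                | per-GPU throughput and utilization you assume):
                | 
                |     # 10^26 FLOP threshold vs. one month of H100s,
                |     # assuming ~1e15 FLOP/s dense BF16 per card at
                |     # 100% utilization (both optimistic).
                |     threshold = 1e26
                |     per_gpu_month = 1e15 * 30 * 24 * 3600
                |     print(f"~{threshold / per_gpu_month:,.0f} H100s")
                |     # ~38,580 at those numbers; lower real-world
                |     # utilization pushes the count up accordingly.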
        
               | two_in_one wrote:
                | Passing it through ChatGPT-4 (actually nothing specific,
                | mostly empty words):
               | 
               | Summary:
               | 
               | The Executive Order focuses on the safe, secure, and
               | trustworthy development and use of Artificial
               | Intelligence (AI). It outlines a government-wide approach
               | to manage AI responsibly, addressing potential societal
               | harms like fraud, bias, and security risks. The order
               | establishes guiding principles and policies for AI
               | development, emphasizing safety, innovation, workers'
               | rights, equity, consumer protection, privacy, government
               | use of AI, and global leadership. It includes detailed
               | definitions and actions for government agencies to ensure
               | AI is developed and used ethically and effectively.
               | 
               | about power/size:
               | 
               | The Executive Order does not specifically mention the
               | size of AI models or compute power in terms of FLOPs
               | (Floating Point Operations per Second). It focuses more
               | broadly on the principles and policies for responsible AI
               | development and use, without delving into technical
               | specifics like model size or compute requirements.
               | 
               | about what developers have to do after this order:
               | 
               | New developers of AI models, after this Executive Order,
               | are encouraged to align their AI development and use with
               | the outlined principles and policies. These focus on
               | ensuring AI is safe, secure, trustworthy, and ethically
               | developed, while addressing societal harms such as bias
               | and privacy concerns. Developers should consider how
               | their AI impacts equity, innovation, consumer protection,
               | and workers' rights, and adhere to guidelines for
               | responsible government use of AI.
        
               | cyanydeez wrote:
                | Anyone with a functional government.
        
       | ComputerGuru wrote:
       | It's not too much faster than exllama2 with flash attention, no?
        
       | modeless wrote:
       | Everyone compares against llama.cpp because it's easy mode.
       | Llama.cpp is slow! Everyone should know this. They should compare
       | against exllamav2 or other optimized implementations.
        
         | nulld3v wrote:
         | ExLlama is GPU only right? This speedup is for GPU + CPU split
         | use cases.
        
           | modeless wrote:
           | Oh I see, they are running a 40B model unquantized, whereas
           | exllamav2 would have to use 4-bit quantization to fit. Given
           | the quality of 4-bit quantization these days and the speed
           | boost it provides I question the utility of running
           | unquantized for serving purposes.
           | 
            | I see they have a 4-bit benchmark lower down on the page.
            | That's where they ought to compare against exllamav2.
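            | 
            | The weights-only arithmetic (rough numbers, ignoring KV
            | cache and activation overhead):
            | 
            |     # Approximate weight footprint of a 40B-parameter model.
            |     params = 40e9
            |     gib = 1024**3
            |     for name, bytes_per_param in [("fp16", 2), ("int8", 1),
            |                                   ("4-bit", 0.5)]:
            |         size = params * bytes_per_param / gib
            |         print(f"{name}: {size:.0f} GiB")
            |     # fp16 ~75 GiB, int8 ~37 GiB, 4-bit ~19 GiB: only the
            |     # 4-bit version fits in a 4090's 24 GB on its own.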
        
         | sroussey wrote:
         | What do you recommend that is faster that I can package into an
         | app for distribution?
        
           | modeless wrote:
           | I have packaged exllamav2 (plus a lot of other stuff) into an
           | app for distribution here:
           | https://apps.microsoft.com/detail/9NC624PBFGB7
           | 
           | I used pyinstaller. It was difficult because Python makes
           | these things difficult. But it works. It does require an
           | Nvidia GPU. MLC-LLM is another option that might be easier to
           | package and potentially able to run on AMD.
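            | 
            | FWIW the basic invocation is the easy part; the pain is in
            | hidden imports and bundling the CUDA libraries. Roughly this
            | (entry point and module names are placeholders):
            | 
            |     # Programmatic PyInstaller build, heavily simplified.
            |     import PyInstaller.__main__
            | 
            |     PyInstaller.__main__.run([
            |         "serve.py",            # your entry point
            |         "--onefile",
            |         "--name", "my-llm-app",
            |         # modules PyInstaller's analysis tends to miss
            |         "--hidden-import", "exllamav2",
            |     ])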
        
             | sroussey wrote:
              | Oh yeah, I want it to work on AMD/Intel/NVIDIA and macOS,
              | even iOS/Android.
             | 
             | I've been following MLC-LLM as well. Right now I am just
             | using JS/WASM from Huggingface, but later I will want
             | something more performant.
        
               | modeless wrote:
               | Yeah if you want maximum performance on multiple
               | platforms you'll probably have to package multiple
               | frameworks. Llama.cpp might be a decently fast option on
               | Apple Silicon, I'm not sure of the state of the art
               | there.
        
         | avereveard wrote:
         | Yeah but exllama doesn't do grammars so I'm stuck with
         | llama.cpp
         | 
         | Also apparently exllama has a few side effects in coherence
         | https://www.reddit.com/r/LocalLLaMA/comments/17w57eu/llm_for...
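          | 
          | For context, "grammars" here means GBNF-constrained sampling.
          | A minimal sketch via the llama-cpp-python bindings (the model
          | path is a placeholder):
          | 
          |     from llama_cpp import Llama, LlamaGrammar
          | 
          |     # Force the model to answer with exactly "yes" or "no".
          |     grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')
          | 
          |     llm = Llama(model_path="models/model.Q4_K_M.gguf")
          |     out = llm("Is the sky blue? Answer:", grammar=grammar,
          |               max_tokens=4)
          |     print(out["choices"][0]["text"])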
        
         | superkuh wrote:
         | In this case they're comparing against llama.cpp because the
         | code is literally a modification of llama.cpp. I'm not talking
          | about just using the ggml lib for matrix calculations; it's
         | literally using the llama.cpp main.cpp and other normal
         | llama.cpp code. It's a fork. It is _directly_ comparable.
         | 
         | https://github.com/ggerganov/llama.cpp/pull/4543 [Review] Merge
         | PowerInfer with llama.cpp mainline #4543
         | 
         | https://github.com/ggerganov/llama.cpp/discussions/4534#disc...
         | "The x11 speedup is kind of cherrypicked because the llama.cpp
         | GPU code for Falcon 40b is just not well-optimized."
        
       | nextaccountic wrote:
       | > Hybrid CPU/GPU Utilization: Seamlessly integrates
       | memory/computation capabilities of CPU and GPU for a balanced
       | workload and faster processing.
       | 
        | Does this mean that it runs on both the CPU and GPU at the same
        | time, and is faster than a CPU-only or a GPU-only implementation
        | on the same device?
       | 
       | edit: when running on integrated GPUs, can this benefit from the
       | improved communication between CPU and GPU?
        
         | rahimnathwani wrote:
         | GPU-only will be faster if you have enough VRAM.
         | 
         | But if you want to run a model that requires more VRAM than you
         | have, the current approach is to use llama.cpp and specify
         | n_gpu_layers. That works, but is slower than GPU-only.
         | 
         | OP claims to be 10x as fast as llama.cpp in the case when you
         | can't fit the whole model in VRAM.
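          | 
          | For example, with the llama-cpp-python bindings (model path
          | and layer count are placeholders; the equivalent llama.cpp CLI
          | flag is -ngl):
          | 
          |     from llama_cpp import Llama
          | 
          |     # Offload as many layers as fit in VRAM; the rest run on
          |     # the CPU. n_gpu_layers=-1 would offload every layer.
          |     llm = Llama(
          |         model_path="models/model.Q4_K_M.gguf",
          |         n_gpu_layers=30,
          |         n_ctx=2048,
          |     )
          |     out = llm("The capital of France is", max_tokens=8)
          |     print(out["choices"][0]["text"])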
        
       | causality0 wrote:
       | All the "consumer grade GPUs" terminology makes it seem like you
       | could run it on a variety of models, but like _so many_ of these
       | posts, is this a 4090 exclusive?
        
       | superkuh wrote:
       | This will be really cool once there's the ability to generate the
       | sparse predictor files for arbitrary models rather than just the
       | 4 they've done it with. Looking through the page and code it
       | doesn't seem like the tools to do that step are included. Guess
       | I'll wait on this one a bit. Hopefully these features will be
       | merged back into llama.cpp as options eventually since this is
        | based on the normal llama.cpp code (i.e., not just using the
        | ggml matrix lib).
        
       ___________________________________________________________________
       (page generated 2023-12-20 23:00 UTC)