[HN Gopher] Run Llama 13B with a 6GB graphics card ___________________________________________________________________ Run Llama 13B with a 6GB graphics card Author : rain1 Score : 411 points Date : 2023-05-14 12:35 UTC (10 hours ago) (HTM) web link (gist.github.com) (TXT) w3m dump (gist.github.com) | tikkun wrote: | See also: | | https://www.reddit.com/r/LocalLLaMA/comments/13fnyah/you_guy... | | https://chat.lmsys.org/?arena (Click 'leaderboard') | alg_fun wrote: | wouldn't it be faster to use RAM as swap for VRAM? | naillo wrote: | This is cool but are people actually getting stuff done with | these models? I'm enthusiastic about their potential too but | after playing with it for a day I'm at a loss for what to use it | for at this point | Joeri wrote: | My experience is that even the "toy" 7B models can do pretty | amazing things, but they're very sensitive to prompting in the | right way to force the model down the right pathways. My | assumption is that the larger models have more redundancy and | therefore can be prompted less precisely while still going down | the right pathway. | | For example, the WizardLM model (5_1 quantized, as 4_x | quantization makes the 7B models perform terribly): | | Prompt: _I went to the market and bought 10 apples. I gave 2 to | my neighbor and 2 to my sister. Then I bought 5 more and ate 1. | How many apples do I have?_ | | Response: Let me check that for you! You had 9 | apples left after giving away 4, so you still have 9 - 4 = 5 | apples left. | | Prompt: _I went to the market and bought 10 apples. I gave 2 to | my neighbor and 2 to my sister. Then I bought 5 more and ate 1. | How many apples do I have? Write out any calculations in full. | List the steps used to come to the final result._ | | Response: Sure, let's work through this problem | together! Here are the steps we can use: Step 1: Start | with the initial number of apples you bought (10). Step | 2: Subtract the amount given to your neighbor and sister from | the total (2 + 2 = 4). This gives us a new total of 6 apples. | Step 3: Add the additional 5 apples you bought later on to the | remaining 6 apples, giving us a final total of 11 apples. | Step 4: Finally, subtract the apple you ate from the total (1), | leaving us with 10 apples in total. So the answer is 10 | apples. | m3kw9 wrote: | For the most part, if you actually need stuff done, you can afford $20 | for OpenAI's proven models. | | These local projects are great because maybe eventually they | will have an equivalent model that can be run on cheap parts | Beltiras wrote: | The knowledge cutoff date is brutal for some applications. | m3kw9 wrote: | I wonder how ingesting more and more data will affect the | size of parameters, it's gonna continually get bigger? | rolisz wrote: | I don't think that the current models are at "knowledge | capacity". So far all evidence points to training on more | data with the same size model giving better results. | cubefox wrote: | Both increasing the amount of parameters and the amount | of training tokens improves results (more precisely: | lowers training loss), and costs computing power. For | optimally improving loss per training computing power, | model size and training tokens should be increased | equally. That's the Chinchilla scaling law. (Though low | loss is not always the same as good results, the data | quality also matters.)
| | Further reading: https://dynomight.net/scaling/ | snovv_crash wrote: | An interesting corollary of this is that if you want to | reduce the model size you can compensate by training for | longer to achieve the same accuracy. Depending on your | training:inference ratio this may be more optimal | globally to reduce your total compute costs or even just | reduce your frontend latency. | cubefox wrote: | Yeah, though I have not seen a formula which takes the | number of expected inference runs into account for | calculating the optimal data/parameter balance. | thelittleone wrote: | Knowledge cut off and sending potentially sensitive IP to a | centralised and untrusted third party. This would likely | limit the addressable market. | jjoonathan wrote: | Also: they are lobotomized. If you want to do security | research, Sorry Dave. If you want to nsfw writing, Sorry | Dave. If you want to open the pod bay doors, Sorry Dave, | I'm afraid I can't do that. | yieldcrv wrote: | I tried to help a sex worker with ChatGPT and all it did | was preach about avoiding sex work, further marginalizing | her in its virtue signaling. I don't consider her | marginalized, and "help" was just information about her | job and techniques and venues. ChatGPT would have | consumed this textual information too. | | But yeah offline fine tuned models won't have this | problem. | | Kind of cool to see how the SWERF representation in tech | is going to speedrun SWERF irrelevancy. | jhartwig wrote: | You tried to help a sex worker with chatGPT? Care to | explain more on this use case lol? Curious minds want to | know. | yieldcrv wrote: | Venues to work, other people's thoughts on working there, | management, nuances about local licenses, stuff that | anybody with any job would do | jstarfish wrote: | People are already setting up fake personas/OnlyFans | accounts using chatbots and SD images. | | We have a high-value specialist currently chatting up a | few of them at work. His wife doesn't know. He doesn't | know _we_ know. The photos are fake but he's too horny | to notice. | | Time to dust off the "there are no women on the internet" | meme... | baobabKoodaa wrote: | > People are already setting up fake personas/OnlyFans | accounts using chatbots and SD images. | | Citation needed. | yieldcrv wrote: | just the latest most high profile example making the | rounds yesterday | | https://www.nbcnews.com/news/amp/rcna84180 | jstarfish wrote: | I'm only offering an early anecdote that catfishing is | adapting to the times. You don't have to believe me. | Don't trust anyone with weird nipples that hides their | hands in every photo. | | People have been [claiming to] do this for years: | https://www.blackhatworld.com/seo/monetizing-traffic-from-so... | | Give it 1-2 years and you can hear about it from Krebs. | yieldcrv wrote: | Informative for some but this wasn't an interaction over | the internet, just out and about | cubefox wrote: | Microsoft Azure still has the GPT-3.5 foundation model, | code-davinci-002. It is not fine-tuned for instruction | following, safety, or censorship. | | I'm not sure though whether Microsoft analyzes the | input/output with another model to detect and prevent | certain content. | iforgotpassword wrote: | I haven't tried the fine-tuned variants yet, but when I played | around with it shortly after the leak, it tended to quickly | derail into nonsense when you let it complete sentences or | paragraphs, especially when using other languages than English.
| When I tried to get it to produce Python code, most of the time | it wasn't even syntactically correct. | Taek wrote: | I've successfully gotten at-home models | (https://huggingface.co/NousResearch/GPT4-x-Vicuna-13b-fp16) to | go through my messages and pull out key todo items. For | example, reminding me to message my father about travel plans. | | Is it comparable to GPT-4? No, it's not remotely close. But | it's getting closer every week, and it very recently crossed | the threshold of 'it can do stuff I would never send to a cloud | service' (namely, reading all of my messages and pulling out | action items). | jhbadger wrote: | Depends on what "getting stuff done" means. I find 13B models | (running on my M1 Mac) useful for playing AI Dungeon-like games | -- where you describe the environment and your character and | you can play an RPG. | fredguth wrote: | GitHub Copilot is (or once was) a 13b model, according to Nat | Friedman in the scale.ai interview. | (https://youtu.be/lnufceCxwG0) | Zetobal wrote: | We run some llamas to analyze user content. | rain1 wrote: | It's just for fun! | | These local models aren't as good as Bard or GPT-4. | happycube wrote: | There are two major advantages though - you can retrain them, | and they don't have the guardrails that the commercial models | have. | gre wrote: | I tried to prompt vicuna to tell me a joke about gay people | and it refused. Some of the guardrails are still in there. | azeirah wrote: | It's because vicuna is fine-tuned on chatGPT answers. | LLaMa will not do this, but LLaMa-based models fine tuned | with chatGPT answers will. | occz wrote: | Did you use the censored or the uncensored variant? | gre wrote: | It's just a random one from huggingface. I will look for | the uncensored one later. Thanks, I think. | occz wrote: | You're welcome. I can't vouch for them though, as I | haven't tried them, I've merely heard about them. | instance wrote: | I tested on a serious use case and quality was subpar. For real | use cases I had to either host the most powerful model you can | get (e.g. LLaMA-65B or so) on a cloud machine, which again | costs too much (you'll be paying like 500-1000 USD per month), | or just go straight for GPT-3.5 on OpenAI. The latter | makes the most sense economically. | inferense wrote: | what real use case did you use it for? | instance wrote: | For instance, I used it in conjunction with llama-index for | knowledge management. Created an index for a whole | confluence/jira of a mid-sized company, got good results | with GPT, but for LLaMA of this size that use case was too | much. | sroussey wrote: | Did you try instructor-xl? It ranks highest on | huggingface. | dzhiurgis wrote: | I'd argue 1k per month for a mid-sized company is nothing, | but I can understand where you are coming from. | throwaway1777 wrote: | Making demos to raise investment probably | raffraffraff wrote: | What about turning the cloud vm off except when you're | actually using it? | unglaublich wrote: | A "serious use case" means it needs to be available around | the clock. | ineedasername wrote: | I can run the Wizard 30B ggml model in CPU mode using a Ryzen | 5700 and 16GB of _system_ RAM, not GPU VRAM. I'm using | oobabooga as the front end. | | It's slow, but if I ask it to write a Haiku it's slow on the | order of "go brew some coffee and come back in 10 minutes" and | does it very well. Running it overnight on something like | "summarize an analysis of topic X" it does a reasonable job.
| | It can produce answers to questions only slightly less well | than ChatGPT (3.5). The Wizard 13B model runs much faster, | maybe 2-3 tokens per second. | | It is free, private, and runs on a midrange laptop. | | A little more than a month ago that wasn't possible, not with | my level of knowledge of the tooling involved at least; now it | requires little more than running an executable and minor | troubleshooting of python dependencies (on another machine it | "just worked") | | So: Don't think of these posts as "doing it just because you | can and it's fun to tinker" | | Vast strides are being made pretty much daily in both quality | and efficiency, raising their utility while lowering the cost | of usage, doing both to a very significant degree. | theaiquestion wrote: | > It's slow, but if I ask it to write a Haiku it's slow on | the order of "go brew some coffee and come back in 10 | minutes" and does it very well. Running it overnight on | something like "summarize an analysis of topic X it does a | reasonable job. | | I'm sorry but that's unusably slow, even GPT-4 can take a | retry or a prompt to fix certain types of issues. My | experience is the open options require a lot more | attempts/manual prompt tuning. | | I can't think of a single workload where that is usable. That | said, once consumer GPUs are involved it does become usable | postalrat wrote: | I doubt you've ever worked with people if you think that's | unusably slow | bcrosby95 wrote: | The computer doesn't ask for annoying things like a | paycheck or benefits either. | mejutoco wrote: | Money upfront and a small salary in the form of | electricity bills. | sp332 wrote: | What prompt do you use to get haikus? | BaculumMeumEst wrote: | Wow you can run a 30B model on 16gb ram? Is it hitting swap? | sp332 wrote: | Most people are running these at 4 bits per parameter for | speed and RAM reasons. That means the model would take just | about all of the RAM. But instead of swap (writing data to | disk and then reading it again later), I would expect a | good implementation to only run into cache eviction | (deleting data from RAM and then reading it back from disk | later), which should be a lot faster and cause less wear | and tear on SSDs. | mcbuilder wrote: | These models can run FP16, with LLM quantization going down | to Int8 and beyond. | BaculumMeumEst wrote: | i'm just starting to get into deep learning so i look | forward to understanding that sentence | MobiusHorizons wrote: | FP16 and Int8 are about how many bits are being used for | floating point and integer numbers. FP16 is 16bit | floating point. The more bits the better the precision, | but the more ram it takes. Normally programmers use 32 or | 64bit floats so 16bit floats have significantly reduced | precision, but take up half the space of fp32 which is | the smallest floating point format for most CPUs. | Similarly, 8 bit integers have only 256 total | possibilities and go from -128 to 127. | mike00632 wrote: | The resources required are directly related to | the memory size devoted to each weight. If the weights | are stored as 32-bit floating points then each weight is | 32 bits which adds up when we are talking about billions | of weights. But if the weights are first converted to | 16-bit floating point numbers (precise to fewer decimal | places) then fewer resources are needed to store and | compute the numbers.
Research has shown that simply | chopping off some of the precision of the weights still | yields good AI performance in many cases. | | Note too that the numbers are standardized, e.g. floats | are defined by the IEEE 754 standard. Numbers in this format | have specialized hardware to do math with them, so when | considering which number format to use it's difficult to | get outside of the established ones (float32, float16, | int8). | sp332 wrote: | Training uses gradient descent, so you want to have good | precision during that process. But once you have the | overall structure of the network, | https://arxiv.org/abs/2210.17323 (GPTQ) showed that you | can cut down the precision quite a bit without losing a | lot of accuracy. It seems you can cut down further for | larger models. For the 13B Llama-based ones, going below | 5 bits per parameter is noticeably worse, but for 30B | models you can do 4 bits. | | The same group did another paper | https://arxiv.org/abs/2301.00774 which shows that in | addition to reducing the precision of each parameter, you | can also prune out a bunch of parameters entirely. It's | harder to apply this optimization because models are | usually loaded into RAM densely, but I hope someone | figures out how to do it for popular models. | [deleted] | redox99 wrote: | People are extensively using these models (more specifically | the finetuned, uncensored ones) for role playing. | irzzy wrote: | [dead] | elorant wrote: | I've set up and use Vicuna-13b for text classification, | summarization and topic modelling. Works like a charm. | | It's also good for math lessons. | BOOSTERHIDROGEN wrote: | Would like to know how you set this up. A post would be | awesome. | elorant wrote: | There are various posts online on how to set it up, either | for Linux or Windows. There was an older post here on how | to install opt-65b on a mac studio ultra, and smaller | models on mac pros. There was also a post if I remember | correctly about running vicuna-7b on an iPhone. | | Here are a few examples: | | https://morioh.com/p/55296932dd8b | | https://www.youtube.com/watch?v=iQ3Lhy-eD1s | | https://news.ycombinator.com/item?id=35430432 | | Side note. You need bonkers hardware to run it efficiently. | I'm currently using a 16-core cpu, 128G RAM, a PCIe 4.0 | NVMe and an RTX 3090. There are ways to run it on less | powerful hardware, like 8 cores, 64GB RAM, a simple SSD and an | RTX 3080 or 3070, but I happen to have a large corpus of data | to process so I went all in. | csdvrx wrote: | I think the previous comment is more interested in your | experience with your large data: what are you doing with | it? | | I have similar hardware at home, so I wonder how reliably | you can process simple queries using domain knowledge + | logic which work on mlc-llm, something like "if you can | chose the word food, or the word laptop, or the word | deodorant, which one do you chose for describing "macbook | air"? answer precisely with just the word you chose" | | If it works, can you upload the weights somewhere? IIRC, | vicuna is open source. | elorant wrote: | There's an online demo of Vicuna-13b where you can test | its efficiency: | | https://chat.lmsys.org/ | techload wrote: | After two prompts I was astounded by the inaccuracies | present in the answers. And they were pretty easy | questions. | csdvrx wrote: | Yes, but can you replicate that functionality using | llama.cpp? | | If so, what did you run with main?
| | I haven't been able to get an answer, while for the | question above, I can get _'I chose the word "laptop"'_ | with mlc-llm | elorant wrote: | For the tasks I need it the efficiency is similar to the | online model. Only slower. I don't care for | conversational functionality. | chaxor wrote: | If these problems are all very similar in structure, then | you may not need an LLM. Simple GloVe or W2V may suffice | with a dot product. Then you can plow through a few | terabytes by the time the LLM goes through a fraction of | that. | jstarfish wrote: | Maybe others' experiences are different, but I find smaller | models to work just as well for "reductive" tasks. | | Dolly sucks for generating long-form content (not very | creative) but if I need a summary or classification, it's | quicker and easier to spin up dolly-3b than vicuna-13b. | | I suspect OpenAI is routing prompts to select models based on | similar logic. | s_dev wrote: | [deleted] | capableweb wrote: | First link: https://github.com/ggerganov/llama.cpp | | Which in turn has the following as the first link: | https://arxiv.org/abs/2302.13971 | | Is it really quicker to ask here than just browse content for a | bit, skimming some text or even using Google for one minute? | djbusby wrote: | You gave an awesome answer in 2 minutes! Might be faster than | reading! | capableweb wrote: | If you cannot click two links in a browser in under two | minutes, I'm either sorry for you, or scared of you :) | s_dev wrote: | >Is it really quicker to ask here than just browse content | for a bit, skimming some text or even using Google for one | minute? | | I don't know if it's quicker but I trust human assessment a | lot more than any machine generated explanations. You're | right I could have asked ChatGPT or even Googled but a small | bit of context goes a long way and I'm clearly out of the | loop here -- it's possible others arriving on HN might | appreciate such an explanation or we're better off having | lots of people making duplicated efforts to understand what | they're looking at. | capableweb wrote: | Well, I'm saying if you just followed the links on the | submitted page, you'd reach the same conclusion but faster. | rain1 wrote: | llama is a text prediction model similar to GPT-2, and the | version of GPT-3 that has not been fine tuned yet. | | It is also possible to run fine tuned versions like vicuna with | this. I think. Those versions are more focused on answering | questions. | haunter wrote: | >I can't tell from the Gist alone | | Literally the second line: "llama is a text prediction model | similar to GPT-2, and the version of GPT-3 that has not been | fine tuned yet" | rain1 wrote: | I'm sorry! I added this improvement based on that person's | question! | s_dev wrote: | Sorry -- I missed that. I'll delete my comments -- obviously | I'm just an idiot asking dumb questions that have no value to | anybody. I thought I read through it. | rain1 wrote: | not at all, your question was really good so I added the | answer to it to my gist to help everyone else. Sorry for | the confusion I created by doing that! | avereveard wrote: | or like download oobabooga/text-generation-webui, any | prequantized variant, and be done. | rahimnathwani wrote: | On my system, using `-ngl 22` (running 22 layers on the GPU) cuts | wall clock time by ~60%.
| | My system: | | GPU: NVidia RTX 2070S (8GB VRAM) | | CPU: AMD Ryzen 5 3600 (16GB RAM) | | Here's the performance difference I see: |
| CPU only (./main -t 12)
| llama_print_timings:        load time = 15459.43 ms
| llama_print_timings:      sample time = 23.64 ms / 38 runs (0.62 ms per token)
| llama_print_timings: prompt eval time = 9338.10 ms / 356 tokens (26.23 ms per token)
| llama_print_timings:        eval time = 31700.73 ms / 37 runs (856.78 ms per token)
| llama_print_timings:       total time = 47192.68 ms
|
| GPU (./main -t 12 -ngl 22)
| llama_print_timings:        load time = 10285.15 ms
| llama_print_timings:      sample time = 21.60 ms / 35 runs (0.62 ms per token)
| llama_print_timings: prompt eval time = 3889.65 ms / 356 tokens (10.93 ms per token)
| llama_print_timings:        eval time = 8126.90 ms / 34 runs (239.03 ms per token)
| llama_print_timings:       total time = 18441.22 ms
| samstave wrote: | May you please ELI5 what is happening here... | | Imagine I am first ever hearing about this; what did you do? | rahimnathwani wrote: | 0. Have a PC with an NVidia GPU, running Ubuntu, with the | NVidia drivers and CUDA Toolkit already set up. | | 1. Download the weights for the model you want to use, e.g. | gpt4-x-vicuna-13B.ggml.q5_1.bin | | 2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to | compile it with support for CUBLAS (BLAS on GPU). | | 3. Run the resulting 'main' executable, with the -ngl option | set to 18, so that it tries to load 18 layers of the model | into the GPU's VRAM, instead of the system's RAM. | rain1 wrote: | > 1. Download the weights for the model you want to use, | e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin | | I think you need to quantize the model yourself from the | float/huggingface versions. My understanding is that the | quantization formats have changed recently, and old | quantized models no longer work. | rahimnathwani wrote: | That was true until 2 days ago :) | | The repo has now been updated with requantized models | that work with the latest version, so you don't need to | do that any more. | | https://huggingface.co/TheBloke/gpt4-x-vicuna-13B-GGML/commi... | rain1 wrote: | wonderful! thank you | guardiangod wrote: | I am testing it on an AWS instance and the speedup effect is | not as consistent as I hoped. The speedup varies between runs. | | Intel Xeon Platinum 8259CL CPU @ 2.50GHz, 128 GB RAM, Tesla T4 |
| ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin
| llama_print_timings:        load time = 3725.08 ms
| llama_print_timings:      sample time = 612.06 ms / 536 runs (1.14 ms per token)
| llama_print_timings: prompt eval time = 13876.81 ms / 259 tokens (53.58 ms per token)
| llama_print_timings:        eval time = 221647.40 ms / 534 runs (415.07 ms per token)
| llama_print_timings:       total time = 239423.46 ms
|
| ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin -ngl 30
| llama_print_timings:        load time = 7638.95 ms
| llama_print_timings:      sample time = 280.81 ms / 294 runs (0.96 ms per token)
| llama_print_timings: prompt eval time = 2197.82 ms / 2 tokens (1098.91 ms per token)
| llama_print_timings:        eval time = 112790.25 ms / 293 runs (384.95 ms per token)
| llama_print_timings:       total time = 120788.82 ms
| rahimnathwani wrote: | Thanks. BTW: | | - the model I used was gpt4-x-vicuna-13B.ggml.q5_1.bin | | - I used 'time' to measure the wall clock time of each | command. | | - My prompt was: Below is an instruction that | describes a task.
Write a response that appropriately | completes the request. ### Instruction: Write a | long blog post with 5 sections, about the pros and cons of | emphasising procedural fluency over conceptual understanding, | in high school math education. ### Response: | PaulWaldman wrote: | Any way to know the differences in power consumption? | Tuna-Fish wrote: | Probably significant savings. | cpill wrote: | Will this work with the leaked models or Alpaca? | eightysixfour wrote: | You will likely see a bit of a performance gain dropping your | threads to 6. I'm on a 3700x and get a regression when using 16 | threads instead of the real 8 cores. | rain1 wrote: | That is a crazy speedup!! | GordonS wrote: | Is it really? Going from CPU to GPU, I would have expected a | much better improvement. | rahimnathwani wrote: | You can think of it this way: if half the model is running | on the GPU, and the GPU is infinitely fast, then the total | calculation time would go down by 50%, compared with | everything running on the CPU. | ethbr0 wrote: | Ref Amdahl's Law: | https://en.m.wikipedia.org/wiki/Amdahl%27s_law | qwertox wrote: | I feel the same. | | For example some stats from Whisper [0] (audio transcription, | 30 seconds) show the following for the medium model (see | other models in the link): | | --- | | GPU medium fp32 Linear 1.7s | | CPU medium fp32 nn.Linear 60.7s | | CPU medium qint8 (quant) nn.Linear 23.1s | | --- | | So the same model runs 35.7 times faster on GPU, and | compared to an "optimized" model still 13.6. | | I was expecting around an order of magnitude of | improvement. | | Then again, I do not know if in the case of this article | the entire model was in the GPU, or just a fraction of it | (22 layers) and the remainder on CPU, which might explain | the result. Apparently that's the case, but I don't know | much about this stuff. | | [0] https://github.com/MiscellaneousStuff/openai-whisper-cpu | rahimnathwani wrote: | Your last paragraph is correct. Only about half the model | was running on the GPU. | anshumankmr wrote: | How long before it runs on a 4 gig card? | rain1 wrote: | You can offload only 10 layers or so if you want to run on a | 4GB card | bitL wrote: | How about reloading parts of the model as the inference | progresses instead of splitting it into GPU/CPU parts? Reloading | would be memory-limited to the largest intermediate tensor cut. | moffkalast wrote: | The Tensor Reloaded, starring Keanu Reeves | regularfry wrote: | That would turn what's currently an L3 cache miss or a GPU data | copy into a disk I/O stall. Not that it might not be possible | to pipeline things to make that less of a problem, but it | doesn't immediately strike me as a fantastic trade-off. | bitL wrote: | One can keep all tensors in the RAM, just push whatever | needed to GPU VRAM, basically limited by PCIe speed. Or some | intelligent strategy with read-ahead from SSD if one's RAM is | limited. There are even GPUs with their own SSDs. | sroussey wrote: | I wish this used the webgpu c++ library instead, then it could be | used on any GPU hardware. | marcopicentini wrote: | What do you use to host these models (like Vicuna, Dolly etc) on | your own server and expose them using HTTP REST API? Is there a | Heroku-like for LLM models? | | I am looking for open source models to do text summarization. | OpenAI is too expensive for my use case because I need to pass | lots of tokens.
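The replies below point to llama-cpp-python and FastChat for exactly this. As a rough illustration only, a minimal self-hosted summarization endpoint built on llama-cpp-python might look like the following sketch; the model filename, route name, and prompt template are placeholder assumptions, not anything taken from the thread.

      # Hypothetical sketch: serve a local GGML model over HTTP for summarization.
      # Assumes llama-cpp-python, fastapi and uvicorn are installed; the model path
      # below is a placeholder, not a file referenced in the thread.
      from fastapi import FastAPI
      from pydantic import BaseModel
      from llama_cpp import Llama

      llm = Llama(model_path="./vicuna-13b.ggml.q5_1.bin", n_ctx=2048)  # placeholder path
      app = FastAPI()

      class Doc(BaseModel):
          text: str

      @app.post("/summarize")
      def summarize(doc: Doc):
          # Simple generic prompt; instruction-tuned models may want their own template.
          prompt = f"Summarize the following text in a few sentences.\n\n{doc.text}\n\nSummary:"
          out = llm(prompt, max_tokens=256, temperature=0.2)
          return {"summary": out["choices"][0]["text"].strip()}

      # If saved as server.py, run with: uvicorn server:app --host 0.0.0.0 --port 8000

llama-cpp-python also ships an OpenAI-compatible server (python -m llama_cpp.server --model <path>), which can stand in where existing code already expects the OpenAI REST API.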
| rain1 wrote: | I haven't tried that but https://github.com/abetlen/llama-cpp-python and https://github.com/r2d4/openlm exist | speedgoose wrote: | These days I use FastChat: https://github.com/lm-sys/FastChat | | It's not based on llama.cpp but on huggingface transformers, and | can also run on CPU. | | It works well, can be distributed and very conveniently provides | the same REST API as OpenAI GPT. | itake wrote: | Do you know how well it performs compared to llama.cpp? | rain1 wrote: | my understanding is that the engine used (pytorch | transformers library) is still faster than llama.cpp with | 100% of layers running on the GPU. | itake wrote: | I only have an m1 | rain1 wrote: | I don't think the integrated GPU on that supports CUDA. | So you will need to use CPU mode only. | itake wrote: | Yep, but isn't there an integrated ML chip that makes it | faster than cpu? Or does llama.cpp not use that? | rain1 wrote: | unfortunately that chip is proprietary and undocumented, | it's very difficult for open source programs to make use | of. I think there is some reverse engineering work being | done but it's not complete. | qeternity wrote: | It's the Huggingface transformers library which is | implemented in pytorch. | | In terms of speed, yes running fp16 will indeed be faster | with vanilla gpu setup. However most people are running | 4bit quantized versions, and the GPU quantization | landscape has been a mess (GPTQ-for-llama project). | llama.cpp has taken a totally different approach, and it | looks like they are currently able to match native GPU | perf via cuBLAS with much less effort and brittleness. | inhumantsar wrote: | Weights and Biases is good for building/training models and | Lambda Labs is a cloud provider for AI workloads. Lambda will | only get you up to running the model though. You would still | need to overlay some job management on top of that. I've heard | Run.AI is good on that front but I haven't tried. | peatmoss wrote: | From skimming, it looks like this approach requires CUDA and thus | is Nvidia only. | | Anyone have a recommended guide for AMD / Intel GPUs? I gather | the 4 bit quantization is the special sauce for CUDA, but I'd | guess there'd be something comparable for not-CUDA? | rain1 wrote: | 4-bit quantization is to reduce the amount of VRAM required to | run the model. You can run it 100% on CPU if you don't have | CUDA. I'm not aware of any AMD equivalent yet. | amelius wrote: | Looks like there are several projects that implement the CUDA | interface for various other compute systems, e.g.: | | https://github.com/ROCm-Developer-Tools/HIPIFY/blob/master/R... | | https://github.com/hughperkins/coriander | | I have zero experience with these, though. | westurner wrote: | "Democratizing AI with PyTorch Foundation and ROCm(tm) | support for PyTorch" (2023) | https://pytorch.org/blog/democratizing-ai-with-pytorch/ : | | > _AMD, along with key PyTorch codebase developers | (including those at Meta AI), delivered a set of updates to | the ROCm(tm) open software ecosystem that brings stable | support for AMD Instinct(tm) accelerators as well as many | Radeon(tm) GPUs. This now gives PyTorch developers the | ability to build their next great AI solutions leveraging | AMD GPU accelerators & ROCm.
The support from PyTorch | community in identifying gaps, prioritizing key updates, | providing feedback for performance optimizing and | supporting our journey from "Beta" to "Stable" was | immensely helpful and we deeply appreciate the strong | collaboration between the two teams at AMD and PyTorch. The | move for ROCm support from "Beta" to "Stable" came in the | PyTorch 1.12 release (June 2022)_ | | > [...] _PyTorch ecosystem libraries like TorchText (Text | classification), TorchRec (libraries for recommender | systems - RecSys), TorchVision (Computer Vision), | TorchAudio (audio and signal processing) are fully | supported since ROCm 5.1 and upstreamed with PyTorch 1.12._ | | > _Key libraries provided with the ROCm software stack | including MIOpen (Convolution models), RCCL (ROCm | Collective Communications) and rocBLAS (BLAS for | transformers) were further optimized to offer new potential | efficiencies and higher performance._ | | https://news.ycombinator.com/item?id=34399633 : | | >> _AMD ROcm supports Pytorch, TensorFlow, MlOpen, rocBLAS | on NVIDIA and AMD GPUs: https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learni... _ | westurner wrote: | https://github.com/intel/intel-extension-for-pytorch : | | > _Intel(r) Extension for PyTorch extends PyTorch with | up-to-date features optimizations for an extra | performance boost on Intel hardware. Optimizations take | advantage of AVX-512 Vector Neural Network Instructions | (AVX512 VNNI) and Intel(r) Advanced Matrix Extensions | (Intel(r) AMX) on Intel CPUs as well as Intel Xe Matrix | Extensions (XMX) AI engines on Intel discrete GPUs. | Moreover, through PyTorch xpu device, Intel(r) Extension | for PyTorch provides easy GPU acceleration for Intel | discrete GPUs with PyTorch_ | | https://pytorch.org/blog/celebrate-pytorch-2.0/ (2023) : | | > _As part of the PyTorch 2.0 compilation stack, | TorchInductor CPU backend optimization brings notable | performance improvements via graph compilation over the | PyTorch eager mode._ | | > _The TorchInductor CPU backend is sped up by leveraging | the technologies from the Intel(r) Extension for PyTorch | for Conv /GEMM ops with post-op fusion and weight | prepacking, and PyTorch ATen CPU kernels for memory-bound | ops with explicit vectorization on top of OpenMP-based | thread parallelization_ | | DLRS Deep Learning Reference Stack: | https://intel.github.io/stacks/dlrs/index.html | rain1 wrote: | exciting! maybe we will see that land in llama.cpp | eventually, who knows! | juliangoldsmith wrote: | llama.cpp has CLBlast support now, though I haven't used | it yet. | [deleted] | hhh wrote: | Instructions are a bit rough. The Micromamba thing doesn't work, | doesn't say how to install it... you have to clone llama.cpp too | rain1 wrote: | Apologies for that. I've added some extra micromamba setup | commands that I should have included before! | | I've also added the git clone command, thank you for the | feedback | hhh wrote: | Appreciate it! This is much better! | ranger_danger wrote: | Why can't these models run on the GPU while also using CPU RAM | for the storage? That way people with performant-but-memory-starved | GPUs can still utilize the better performance of the GPU | calculation while also having enough RAM to store the model? I | know it is possible to provide system RAM-backed GPU objects. | syntaxing wrote: | This update is pretty exciting, I'm gonna try running a large | model (65B) with a 3090.
I have run a ton of local LLMs but the | hardest part is finding out the prompt structure. I wish there were | some sort of centralized database that explains it. | guardiangod wrote: | I got the alpaca 65B GGML model to run on my 64GB ram laptop. | No GPU required if you can tolerate the 1 token per 3 seconds | rate. | syntaxing wrote: | Supposedly the new update with GPU offloading will bring that | up to 10 tokens per second! 1 token per second is painfully | slow, that's about 30s for a sentence. | rain1 wrote: | Tell us how it goes! Try different numbers of layers if needed. | | A good place to dig for prompt structures may be the | 'text-generation-webui' commit log. For example | https://github.com/oobabooga/text-generation-webui/commit/33... | tarr11 wrote: | What is the state of the art on evaluating the accuracy of these | models? Is there some equivalent to an "end to end test"? | | It feels somewhat recursive since the input and output are | natural language and so you would need another LLM to evaluate | whether the model answered a prompt correctly. | tikkun wrote: | https://chat.lmsys.org/?arena (Click 'leaderboard') | klysm wrote: | It's going to be very difficult to come up with any rigorous | structure for automatically assessing the outputs of these | models. They're effectively built using human grading of the | answers | RockyMcNuts wrote: | hmmh, if we have the reinforcement learning part of | reinforcement learning with human feedback, isn't that a | model that takes a question/answer pair and rates the quality | of the answer? it's sort of grading itself, it's like a | training loss but it still tells us something? | sroussey wrote: | Llama cpp and others use perplexity: | | https://huggingface.co/docs/transformers/perplexity | ACV001 wrote: | The future is this - these models will be able to run on smaller | and smaller hardware eventually being able to run on your phone, | watch or embedded devices. The revolution is here and is | inevitable. Similar to how computers evolved. We are still lucky | that these models have no consciousness, still. Once they gain | consciousness, that will mark the appearance of a new species | (superior to us if anything). Also, luckily, they have no | physical bodies and cannot replicate, so far... | canadianfella wrote: | [dead] | qwertox wrote: | If I really want to do some playing around in this area, would it | be good to get an RTX 4000 SFF which has 20 GB of VRAM but is a | low-power card, which I want as it would be running 24/7 and | energy prices are pretty bad in Germany, or would it make more | sense to buy an Apple product with some M2 chip which apparently | is good for these tasks as it shares CPU and GPU memory? | holoduke wrote: | Why does AMD or Intel not release a medium performant GPU with | minimum 128gb of memory for a good consumer price? These models | require lots of memory to 'single' pass an operation. Throughput | could be a bit slower. A 1080 Nvidia with 256gb of memory would run | all these models fast right? Or am I forgetting something here. | hackernudes wrote: | I don't think there was a market for it before LLMs. Still | might not be (especially if they don't want to cannibalize data | center products). Also, they might have hardware constraints. I | wouldn't be that surprised if we see some high ram consumer | GPUs in the future, though. | | It won't work out unless it becomes common to run LLMs locally. | Kind of a chicken-and-egg problem so I hope they try it!
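As a rough aside on the memory figures quoted in this thread (a 6GB card in the title, a 64GB-RAM laptop for the 65B model, the wished-for 128GB cards): to a first approximation the weights alone take parameter count times bytes per weight, which is why quantization matters so much. The sketch below is a back-of-the-envelope estimate with illustrative precision choices, not a statement about any particular file format.

      # Toy estimate of weight storage only; ignores activations, KV cache and
      # framework overhead. Bits-per-weight values are illustrative.
      def weight_gib(params_billion: float, bits_per_weight: float) -> float:
          return params_billion * 1e9 * bits_per_weight / 8 / 2**30

      for params in (7, 13, 30, 65):
          print(f"{params}B: {weight_gib(params, 16):.1f} GiB at fp16, "
                f"{weight_gib(params, 4):.1f} GiB at 4-bit")
      # 13B at ~4 bits is roughly 6 GiB, which is why a 6GB card only works when
      # some layers stay in system RAM; 65B at ~4 bits is roughly 30 GiB, in line
      # with the 64GB-RAM laptop report above.

Actual GGML files come out somewhat larger than this estimate because quantization formats such as q5_1 also store per-block scale factors alongside the weights.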
| the8472 wrote: | > I don't think there was a market for it before LLMs. | | At $work CGI assets sometimes grow pretty big and throwing | more VRAM at the problem would be easier than optimizing the | scenes in the middle of the workflow. They _can_ be | optimized, but that often makes it less ergonomic to work | with them. | | Perhaps asset-streaming (nanite&co) will make this less of an | issue, but that's also fairly new. | | Do LLM implementations already stream the weights layer by | layer or in whichever order they're doing the evaluation or | is PCIe bandwidth too limited for that? | tpetry wrote: | But you are not the home user target audience. They want to | sell you the more expensive workstation or server models. | the8472 wrote: | Even an A6000 tops out at 48GB while you can attach | terabytes of RAM to server-class CPUs. | elabajaba wrote: | AMD had the Radeon pro SSG that let you attach 1TB of pcie3 | nvme SSDs directly to the GPU, but no one bought them and | afaik they were basically unobtainable on the consumer | market. | | Also asset streaming has been a thing for like 20 years now | in gaming, it's not really a new thing. Nanite's big thing | is that it gets you perfect LODs without having to pre- | create them and manually tweak them (eg. how far away does | the LOD transition happen, what's the lowest LOD before it | disappears, etc) | the8472 wrote: | Loading assets JIT for the next frame from NVMe hasn't | been a thing for 20 years though. Different kinds of | latency floors. | | What I was asking is whether LLM inference can be | structured in such a way that only a fraction of the | weight is needed at a time and then the next ones can be | loaded JIT as the processing pipeline advances. | [deleted] | layer8 wrote: | Releasing a new model takes time, and it's unclear how large | the consumer market would actually be. Maybe they're working on | it right now. | Kye wrote: | GDDR probably hasn't seen the same cost reduction benefits of | volume DDR has. | TaylorAlexander wrote: | One question I have is: can they use cheaper kinds of RAM and | still be perfectly usable for large ML models? They could put | 4GB of GDDR and 128GB of cheap RAM maybe? I do realize as | others are saying, this would be a new kind of card so they | will need time to develop it. But would this work? | andromeduck wrote: | Not without a redesigned memory controller or one off chip. | You'd probably just want the host's memory to be directly | accessible over PCIE or something faster like NVLINK. Such | solutions already exist just not in the consumer space. | duxup wrote: | >for a good consumer price | | Was there a consumer market for them until recently? | 0xcde4c3db wrote: | Probably because if they take that exact same GPU+VRAM | configuration and slap it on a rackmount-optimized board, they | can charge AI companies 5-10x the price for it. | jsheard wrote: | They don't even offer that much VRAM on cards aimed at those | price-insensitive customers, Nvidias current lineup maxes out | at 48GB for GDDR-based models or 80GB for HBM-based models. | Even if money is no object there's still practical | engineering limits on how much memory they can put on a card | without sacrificing bandwidth. | vegabook wrote: | this is where the new third player, Intel, can (if it can | tear itself away from identical behaviour in the | consumer/server CPU market) hopefully break the duopoly. Love | to see a 32 or 64GB card from Intel. 
Their software stack on | Linux is competent enough (unlike the dumpster fire that is | AMD's ROCm). | andromeduck wrote: | Because then memory would be 90% of the BOM. | dragonwriter wrote: | > Why does AMD or Intel not release a medium performant GPU | with minimum 128gb of memory for a good consumer price. | | They do. Well, not "medium performant", but for VRAM-bound | tasks they'd still be an improvement over CPUs if you could use | them -- iGPUs use main memory. | | What they don't have is support for them for popular GPGPU | frameworks (though there was a third party CUDA-for-Intel-iGPUs | a while ago.) | elabajaba wrote: | Because they can't do that for a "good consumer price". | | If you want more than ~48GB, you're looking at HBM which is | extremely expensive (HBM chips are very expensive, | packaging+interposer is extremely expensive, designing and | producing a new GPU is expensive). | | Normal GPUs are limited by both their bus width (wider bus = | more pins = harder to design, more expensive to produce, and | increases power consumption), and GDDR6(x) (which maxes out at | 2GB/chip currently), so on a 384bit bus (4090/7900xtx, don't | expect anyone to make a 512bit busses anymore) you need 12x2GB | (GDDR6 uses 32 pins per package) which gives you 24GB. You can | double the memory capacity to 48GB, but that requires putting | the chips on the back of the GPU which leads to a bunch of | cooling issues (and GDDR6 is expensive). | | Of course, even if they did all that they're selling expensive | GPUs to a small niche market and cannibalizing sales of their | own high end products (and even if AMD somehow managed to magic | up a 128GB gpu for $700 people still wouldn't buy it because so | much of the ML software is CUDA only). | eurekin wrote: | 3090 has a lot of vram chips on the back though | elabajaba wrote: | And because of it there were issues with the vram | overheating in memory intensive workloads, and on some GPUs | the vram even separated off the board. | | https://www.igorslab.de/en/looming-pads-and-too-hot- | gddrx6-m... | pbhjpbhj wrote: | There's a type of DMA for GPUs to access NVMe on the | motherboard, IIRC. Perhaps that is a better solution here? | | https://developer.nvidia.com/blog/gpudirect-storage/ | boppo1 wrote: | Isn't pci-e latency dramatically higher than onboard vram? | fooker wrote: | That's exactly what the next generation of 'accelerators' will | be like. | | Whether it will be co-located with a GPU for consumer hardware | remains to be seen. | | The thing to determine is how essential running LLMs locally is | for consumers. | | BigTech is pushing hard to make their clouds the only place to | run LLMs unfortunately, so unless there is a killer app that is | just better locally (like games were for GPUs), this might not | change. | boppo1 wrote: | > unless there is a killer app that is just better locally | | Therapy & relationship bots, like the movie 'Her'. It's ugly, | but it's coming. | fooker wrote: | There's no technical reason it has to be run locally. | | Massive privacy implications for sure, but people do | consume all sorts of adult material online. | | Games though, no one has been able to make it work as well | as local so far. | kevingadd wrote: | The margins on VRAM are pretty bad for them since they don't | manufacture it themselves. And every memory module they add | needs additional power delivery and memory controller muscle to | drive, so adding that memory is going to raise the cost of the | card significantly. 
Most games and consumer workloads won't use | all that extra memory. | | Keep in mind video cards don't use the same kind of RAM as | consumer CPUs do, they typically use GDDR or HBM. | Tuna-Fish wrote: | It would not be trivial to do. | | GDDR achieves higher speeds than normal DDR mainly by | specifying much tighter tolerances on the electrical interface, | and using a wider interface to the memory chips. This means that | using commodity GDDR (which is the only fast DRAM that will be | reasonably cheap), you have fairly strict limitations on the | maximum amount of RAM you can use with the same GPUs that are | manufactured for consumer use. (Typically, at most 4x | difference between the lowest-end reasonable configuration and | the highest-end one, 2x from higher density modules and 2x from | using clamshell memory configuration, although often you only | have one type of module for a new memory interface generation.) | | If the product requires either a new memory or GPU die | configuration, its cost will be very high. | | The only type of memory that can support very different VRAM | sizes for an efficiently utilized bus of the same size is HBM, | and so far that is limited to the very high end. | magicalhippo wrote: | Anandtech has an article on the GDDR6X variant[1] that NVIDIA | has in their 3000-series cards, where they use a more complex | encoding to transmit two bits per clock edge. | | I hadn't realized just how insane the bandwidth on the | higher-end cards is, the 3090 being just shy of 1 TB/s, | yes, one terabyte per second... | | For comparison a couple of DDR5 sticks[2] will just get you | north of 70GB/s... | | [1]: https://www.anandtech.com/show/15978/micron-spills-on-gddr6x... | | [2]: https://www.anandtech.com/show/17269/ddr5-demystified-feat-s... | q7xvh97o2pDhNrh wrote: | Do you happen to know where Apple's integrated approach falls | on this spectrum? | | I was actually wondering about this the other day. A fully | maxed out Mac Studio is about $6K, and it comes with a | "64-core GPU" and "128GB integrated memory" (whatever any of | that means). Would that be enough to run a decent Llama? | cudder wrote: | The Mac's "integrated memory" means it's shared between the | CPU and GPU. So the GPU can address all of that and you can | load giant (by current consumer GPU standards) models. I | have no idea how it actually performs though. | dclowd9901 wrote: | Has anyone tried running encryption algorithms through these | models? I wonder if it could be trained to decrypt. | Hendrikto wrote: | That would be very surprising, given that any widely used | cryptographic encryption algorithm has been EXTENSIVELY | cryptanalyzed. | | ML models are essentially trained to recognize patterns. | Encryption algorithms are explicitly designed to resist that | kind of analysis. LLMs are not magic. | dclowd9901 wrote: | All of what you said is true, for us. I know LLMs aren't | magic (lord knows I actually kind of understand the | principles of how they operate), but they have a much greater | computational and relational bandwidth than we've ever had | access to before. So I'm curious if that can break down what | otherwise appears to be complete obfuscation. Otherwise, | we're saying that encryption is somehow magic in a way that | LLMs cannot possibly be. | NegativeK wrote: | > Otherwise, we're saying that encryption is somehow magic | in a way that LLMs cannot possibly be. | | I don't see why that's an unreasonable claim.
I mean, | encryption isn't magic, but it is a drastically different | process. | dinobones wrote: | What is HN's fascination with these toy models that produce low | quality, completely unusable output? | | Is there a use case for them I'm missing? | | Additionally, don't they all have fairly restrictive licenses? | az226 wrote: | [flagged] | Zetobal wrote: | Maybe you forgot what the H in HN stands for... playful | curiosity. | tbalsam wrote: | I never thought I'd see the day when a 13B model was casually | referred to in a comments section as a "toy model". | andrewmcwatters wrote: | Start using it for tasks and you'll find limitations very | quickly. Even ChatGPT excels at some tasks and fails | miserably at others. | tbalsam wrote: | Oh, I've been using language models before a lot (or at | least some significant chunk) of HN knew the word LLM, I | think. | | I remember when going from 6B to 13B was crazy good. We've | just normalized our standards to the latest models in the | era. | | They do have their shortcomings but can be quite useful as | well, especially the LLama class ones. They're definitely | not GPT-4 or Claude+, for sure, for sure. | az226 wrote: | Compared to GPT2 it's on par. Compared to GPT3, 3.5, or 4, | it's a toy. GPT2 is 4 years old, and in terms of LLMs, that's | several life times ago. In 5-10 years, GPT3 will be viewed as | a toy. Note, "progress" going forward will likely not be as | fast as it has been. | tbalsam wrote: | GPT-2's largest model was 1.5B params, LLama-65B was | similar to the largest GPT3 in benchmark performance but | that model was expensive in the API, a number of people | would use the cheaper one(s) instead IIRC. | | So this is similar to a mid tier GPT3 class model. | | Basically, there's not much reason to Pooh-Pooh it. It may | not perform quite as well, but I find it to be useful for | the things it's useful for. | mozillas wrote: | I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air | (8GB RAM) with llama.cpp. | | Worked OK for me with the default context size; 2048, like you | see in most examples, was too slow for my taste. | koheripbal wrote: | Given the current price (mostly free) of public LLMs I'm not | sure what the use case for running one at home is yet. | | OpenAI's paid GPT4 has few restrictions and is still cheap. | | ... Not to mention GPT4 with the browsing feature is vastly | superior to any of the models you can run at home. | toxik wrote: | The point for me personally is the same as why I find it so | powerful to self host SMTP, IMAP, HTTP. It's in my hands, I | know where it all begins and ends. I answer to no one. | | For LLMs this means I am allowed their full potential. I can | generate smut, filth, illegal content of any kind for any | reason. It's for me to decide. It's empowering, it's the | hacker mindset. | sagarm wrote: | I think it's mostly useful if you want to do your own fine | tuning, or the data you are working with can't be sent to a | third party for contractual, legal, or paranoid reasons. | sroussey wrote: | I'm working on an app to index your life, and having it | local is a huge plus for the people I have using it. | 2devnull wrote: | Many would-be users can't send their data to OpenAI. | Think HIPAA and other laws restricting data sharing. | Federation or distribution of the models for local training | is the other solution to that problem. ___________________________________________________________________ (page generated 2023-05-14 23:00 UTC)