[HN Gopher] Guide to running Llama 2 locally
___________________________________________________________________

Guide to running Llama 2 locally

Author : bfirsh
Score  : 177 points
Date   : 2023-07-25 16:58 UTC (6 hours ago)

(HTM) web link (replicate.com)
(TXT) w3m dump (replicate.com)

  | guy98238710 wrote:
  | > curl -L "https://replicate.fyi/install-llama-cpp" | bash
  |
  | Seriously? Pipe a script from someone's website directly into bash?
  | gattilorenz wrote:
  | Yes. If you are worried, you can redirect it to a file and then sh it. It doesn't get much easier to inspect than that...
  | cjbprime wrote:
  | Either you trust the TLS session to their website to deliver you software you're going to run, or you don't.
  | madars wrote:
  | That's the recommended way to get Rust nightly too: https://rustup.rs/ But don't look there, there is memory safety somewhere!
  | raccolta wrote:
  | oh, this again.
  | handelaar wrote:
  | Idiot question: if I have access to sentence-by-sentence professionally-translated foreign-language-to-English text in gigantic quantities, and I fed the originals as prompts and the translations as completions...
  |
  | ... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
  | seanthemon wrote:
  | Indeed, it sounds like you have what's called fine-tuning data (given an input, here's the output). There's loads of info about fine-tuning both here on HN and on YouTube's Hugging Face channels.
  |
  | Note: if you have sufficient data, look into existing models on Hugging Face; you may find a smaller, faster and more open (licensing-wise) model that you can fine-tune to get the results you want. Llama is hot, but not a catch-all for all tasks (as no model should be).
  |
  | Happy inferring!
  | nl wrote:
  | If you have that much data you can build your own model that can be much smaller and faster.
  |
  | A simple version is covered in a beginner tutorial: https://pytorch.org/tutorials/beginner/translation_transform...
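Whichever route is taken, the usual first step is packaging the aligned sentence pairs as prompt/completion records, one JSON object per line. A minimal sketch (the file name and field names here are illustrative; each training framework documents its own expected format):

    import json

    # Illustrative aligned pairs; in practice these come from your own corpus.
    pairs = [
        ("Bonjour, comment allez-vous ?", "Hello, how are you?"),
        ("Le rapport sera publié demain.", "The report will be published tomorrow."),
    ]

    # Write one training example per line (JSONL): prompt in, completion out.
    with open("translation_pairs.jsonl", "w", encoding="utf-8") as f:
        for source, target in pairs:
            record = {
                "prompt": f"Translate to English: {source}\n",
                "completion": target,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")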
  | maxlin wrote:
  | I might be missing something. The article asks me to run a bash script on Windows.
  |
  | I assume this would still need to be run manually to access GPU resources etc., so can someone illuminate what is actually expected of a Windows user to make this run?
  |
  | I'm currently paying $15 a month in a personal translation/summarizer project's ChatGPT queries. I run whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super-affected by delay, so lower token rates might work too.
  | nomel wrote:
  | Windows has supported Linux tools for some time now, using WSL: https://learn.microsoft.com/en-us/windows/wsl/about
  |
  | No idea if it will work in this case, but it does with llama.cpp: https://github.com/ggerganov/llama.cpp/issues/103
  | maxlin wrote:
  | I know (should have included that in my earlier response, but editing would've felt weird), but I still assume one should run the result natively, so I'm asking if/where there's some jumping around required.
  |
  | Last time I tried running an LLM I tried WSL and native on 2 machines and just got lovecraftian-tier errors, so I'm waiting to see if I'm missing something obvious before going down that route again.
  | nomand wrote:
  | Is it possible for such a local install to retain conversation history, so that if for example you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?
  | simonw wrote:
  | My LLM command line tool can do that - it logs everything to a SQLite database and has an option to continue a conversation: https://llm.datasette.io
  | knodi123 wrote:
  | llama is just an input/output engine. It takes a big string as input, and gives a big string of output.
  |
  | Save your outputs if you want; you can copy/paste them into any editor. Or make a shell script that mirrors outputs to a file and use _that_ as your main interface. It's up to the user.
  | jmiskovic wrote:
  | There is no fully built solution, only bits and pieces. I noticed that llama outputs tend to degrade with the amount of text; the text becomes too repetitive and focused, and you have to raise the temperature to break the model out of loops.
  | nomand wrote:
  | Does what you're saying mean you can only ask questions and get answers in a single step, and that having a long discussion where refinement of output is arrived at through conversation isn't possible?
  | krisoft wrote:
  | My understanding is that at a high level you can look at this model as a black box which accepts a string and outputs a string.
  |
  | If you want it to "remember" things you do that by appending all the previous conversations together and supplying it in the input string.
  |
  | In an ideal world this would work perfectly. It would read through the whole conversation and would provide the right output you expect, exactly as if it would "remember" the conversation. In reality there are all kinds of issues which can crop up as the input grows longer and longer. One is that it takes more and more processing power and time for it to "read through" everything previously said. And there are things like what jmiskovic said, that the output quality can also degrade in perhaps unexpected ways.
  |
  | But that also doesn't mean that "refinement of output is arrived at through conversation isn't possible". It is not that black and white, just that you can run into trouble as the length of the discussion grows.
  |
  | I don't have direct experience with long conversations so I can't tell you how long is definitely too long, and how long is still safe. Plus there are probably some tricks one can do to work around these. Probably there are things one can do if one unpacks that "black box" understanding of the process. But even without that you could imagine a "consolidation" process where the AI is instructed to write short notes about a given length of conversation, and then those shorter notes would be copied into the next input instead of the full previous conversation. All of these are possible, but you won't have a turn-key solution for it just yet.
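A minimal sketch of the append-the-history approach described above. The generate() argument is a stand-in for whatever local backend is in use (llama.cpp bindings, an HTTP server, etc.), and the plain "User:/Assistant:" framing is illustrative rather than Llama 2's exact chat template:

    history = []  # (speaker, text) turns kept for the whole session

    def build_prompt(history, user_message):
        # Concatenate every earlier turn so the model can "remember" it.
        lines = [f"{speaker}: {text}" for speaker, text in history]
        lines.append(f"User: {user_message}")
        lines.append("Assistant:")
        return "\n".join(lines)

    def chat(user_message, generate):
        # generate(prompt) -> str is a placeholder for the actual model call.
        prompt = build_prompt(history, user_message)
        reply = generate(prompt)
        history.append(("User", user_message))
        history.append(("Assistant", reply))
        return reply

As the thread notes, this grows without bound; once the accumulated history plus the new prompt exceeds the context window, older turns have to be dropped or summarized.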
  | cjbprime wrote:
  | The limit here is the "context window" length of the model, measured in tokens, which will quickly become too short to contain all of your previous conversations, which will mean it has to answer questions without access to all of that text. And within a single conversation, it will mean that it starts forgetting the text from the start of the conversation, once the [conversation + new prompt] reaches the context length.
  |
  | The kind of hacks that work around this are to train the model on the past conversations, and then rely on similarity in tensor space to pull the right (lossy) data back out of the model (or a separate database) later, based on its similarity to your question, and include it (or a summary of it, since summaries are smaller) within the context window for your new conversation, combined with your prompt. This is what people are talking about when they use the term "embeddings".
  | nomand wrote:
  | My benchmark is having a pair programming session spanning days and dozens of queries with ChatGPT where we co-created a custom static site generator that works really well for my requirements. It was able to hold context for a while and not "forget" what code it provided me dozens of messages earlier, it was able to "remember" corrections and refactors that I gave it, and overall it was incredibly useful for working out things like recursion over folder hierarchies and building data trees. This and similar use cases, where memory is important, are when the model is used as a genuine assistant.
  | krisoft wrote:
  | Excellent! That sounds like a very useful personal benchmark then. You could test Llama v2 by copying in different lengths of snippets from that conversation and checking how useful you find its outputs.
  | RicoElectrico wrote:
  | curl -L "https://replicate.fyi/windows-install-llama-cpp"
  |
  | ... returns 404 Not Found
  | thisisit wrote:
  | The easiest way I found was to use GPT4All. Just download and install, grab a GGML version of Llama 2, and copy it to the models directory in the installation folder. Fire up GPT4All and run.
  | andreyk wrote:
  | This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama (Mac), MLC LLM (iOS/Android).
  |
  | Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I am sure there are other things that could be covered.
  | krisoft wrote:
  | > If you have a Linux machine with GPUs
  |
  | How much VRAM does one need to run inference with Llama 2 on a GPU, approximately?
  | lolinder wrote:
  | Depends on which model. I haven't bothered doing it on my 8GB because the only model that would fit is the 7B model quantized to 4 bits, and that model at that size is pretty bad for most things. I think you could have fun with 13B with 12GB VRAM. The full size model would require >35GB even quantized.
  | novaRom wrote:
  | 16GB is the minimum to run the 7B model with float16 weights, out of the box, with no further effort.
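A rough way to sanity-check these numbers: weight memory is roughly parameter count times bytes per weight, plus some headroom for the context and runtime buffers (the 20% overhead factor below is an assumption, not a measured value):

    def vram_estimate_gb(params_billion, bits_per_weight, overhead=1.2):
        # Weights alone: parameters * bytes-per-weight, then ~20% headroom
        # for the KV cache and runtime buffers.
        return params_billion * (bits_per_weight / 8) * overhead

    print(vram_estimate_gb(7, 16))  # ~16.8 GB: the "16GB minimum" for float16 7B
    print(vram_estimate_gb(7, 4))   # ~4.2 GB: why 4-bit 7B fits an 8GB card
    print(vram_estimate_gb(70, 4))  # ~42 GB: why the 70B needs >35GB even quantized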
  | Patrick_Devine wrote:
  | Ollama works on Windows and Linux as well, but doesn't (yet) have GPU support for those platforms. You have to compile it yourself (it's a simple `go build .`), but it should work fine (albeit slowly). The benefit is you can still pull the llama2 model really easily (with `ollama pull llama2`) and even use it with other runners.
  |
  | DISCLAIMER: I'm one of the developers behind Ollama.
  | mschuster91 wrote:
  | > DISCLAIMER: I'm one of the developers behind Ollama.
  |
  | I've got a feature suggestion - would it be possible to have the ollama CLI automatically start up the GUI/daemon if it's not running? There's only so much stuff one can keep in a MacBook Air's auto-start.
  | jmorgan wrote:
  | Good suggestion! This is definitely on the radar, so that running `ollama` will start the server when it's needed (instead of erroring!): https://github.com/jmorganca/ollama/issues/47
  | DennisP wrote:
  | I've been wondering, is the M2's neural engine usable for this?
  | robotnikman wrote:
  | Llama.cpp has been fun to experiment around with. I was surprised by how easy it was to set up, much easier than when I tried to set up a local LLM almost a year ago.
  | lolinder wrote:
  | Just a note that you have to have at least 12GB of VRAM for it to be worth even trying to use your GPU for LLaMA 2.
  |
  | The 7B model quantized to 4 bits can fit in 8GB VRAM with room for the context, but is pretty useless for getting good results in my experience. 13B is better but still nothing near as good as the 70B, which would require >35GB VRAM to use at 4-bit quantization.
  |
  | My solution for playing with this was just to upgrade my PC's RAM to 64GB. It's slower than the GPU, but it was way cheaper and I can run the 70B model easily.
  | dc443 wrote:
  | I have 2x 3090s. Do you know if it's feasible to use that 48GB total for running this?
  | eurekin wrote:
  | Yes, it runs totally fine. I ran it in the Oobabooga text-generation web UI. A nice thing about it is that it auto-downloads all the necessary GPU binaries on its own and creates an isolated conda env. I asked the same questions on the official 70B demo and got the same answers. I even got better answers with ooba, since the demo cuts text early.
  |
  | Oobabooga: https://github.com/oobabooga/text-generation-webui
  |
  | Model: TheBloke_Llama-2-70B-chat-GPTQ from https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
  |
  | ExLlama_HF loader, GPU split 20,22, context size 2048.
  |
  | On the Chat Settings tab, choose the Instruction template tab and pick Llama-v2 from the instruction template dropdown.
  |
  | Demo: https://huggingface.co/blog/llama2#demo
  | zakki wrote:
  | Are there any specific settings to make 2x 3090s work together?
  | NoMoreNicksLeft wrote:
  | Trying to figure out what hardware to convince my boss to spend on... if we were to get one of the A6000/48GB cards, will that see significant performance improvements over just a 4090/24GB? The primary limitation is VRAM, is it not?
  | cjbprime wrote:
  | You might consider getting a Mac Studio (with as much RAM as you can afford, up to 192GB) instead, since 192GB is more (unified) memory than you're going to easily get to with GPUs.
  | lolinder wrote:
  | VRAM is what gets you up to the larger model sizes, and 24GB isn't enough to load the full 70B even at 4 bits; you need at least 35 and some extra for the context. So it depends a lot on what you want to do -- fine-tuning will take even more, as I understand it.
  |
  | The card's speed will affect your performance, but I don't know enough about different graphics cards to tell you specifics.
  | ErneX wrote:
  | Apple Silicon Macs might not have great GPUs but they do have unified memory. I need to try this on mine; I have 96GB of RAM on my M2 Max.
  | krychu wrote:
  | Self-plug. Here's a fork of the original Llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:
  |
  | https://github.com/krychu/llama
  |
  | It runs with the original weights, and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
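The usual PyTorch pattern for that kind of CPU/MPS fallback looks roughly like this (a generic sketch, not code taken from that fork):

    import torch

    # Prefer Apple's Metal backend (MPS) when present, then CUDA, then CPU.
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    print(f"running inference on: {device}")
    # The loaded model and its input tensors are then moved over with .to(device).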
  | rootusrootus wrote:
  | For most people who just want to play around and are using macOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.
  | dividedbyzero wrote:
  | Does it make any sense to try this on a lower-end Mac (like an M2 Air)?
  | mchiang wrote:
  | Yeah! How much memory do you have?
  |
  | If by lower-end MacBook Air you mean one with 8GB of memory, try the smaller models (such as Orca Mini 3B). You can do this via LM Studio, Oobabooga/text-generation-webui, KoboldCPP, GPT4All, ctransformers, and more.
  |
  | I'm biased since I work on Ollama, and if you want to try it out:
  |
  | 1. Download https://ollama.ai/download
  |
  | 2. `ollama run orca`
  |
  | 3. Enter your input at the prompt
  |
  | Note Ollama is open source, and you can compile it yourself from https://github.com/jmorganca/ollama
  | bdavbdav wrote:
  | I'm deliberating on how much RAM to get in my new MBP. Is 32GB going to stand me in good stead?
  | mchiang wrote:
  | Local memory management will definitely get better in the future.
  |
  | For now: you should have at least 8GB of RAM to run the 3B models, 16GB to run the 7B models, and 32GB to run the 13B models.
  |
  | My personal recommendation is to get as much memory as you can if you want to work with local models [including VRAM if you are planning to execute on the GPU].
  | rootusrootus wrote:
  | 32GB should be fine. I went a little overboard and got a new MBP with the M2 Max and 96GB, but the hardware is really best suited at this point to a 30B model. I can and do play around with 65B models, but at that point you're making a fairly big tradeoff in generation speed for an incremental increase in quality.
  |
  | As a datapoint, I have a 30B model [0] loaded right now and it's using 23.44GB of RAM. Getting around 9 tokens/sec, which is very usable. I also have the 65B version of the same model [1] and it's good for around 3.6 tokens/second, but it uses 44GB of RAM. Not unusably slow, but more often than not I opt for the 30B because it's good enough and a lot faster.
  |
  | Haven't tried the Llama 2 70B yet.
  |
  | [0] https://huggingface.co/TheBloke/upstage-llama-30b-instruct-2...
  | [1] https://huggingface.co/TheBloke/Upstage-Llama1-65B-Instruct-...
  | swader999 wrote:
  | What's your use case for local, if you don't mind?
  | dividedbyzero wrote:
  | By lower-end I meant that the Airs are quite low-end in general (compared to Pro/Studio). I have the maxed-out 24GB, but 16GB may be more common among people who might use an Air for this kind of thing.
  | Der_Einzige wrote:
  | The correct answer, as always, is the oobabooga text-generation web UI, which supports all of the relevant backends: https://github.com/oobabooga/text-generation-webui
  | cypress66 wrote:
  | Yep. Use ooba. And people who like to RP often use ooba as a backend, and SillyTavern as a frontend.
  | Roark66 wrote:
  | Can it run ONNX transformer models? I found optimised ONNX models are at least twice the speed of vanilla PyTorch on the CPU.
  | TheAceOfHearts wrote:
  | How do you decide which model variant to use? There's a bunch of quant method variations of Llama-2-13B-chat-GGML [0]; how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.
  |
  | [0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
  | sva_ wrote:
  | If you just want to do inference/mess around with the model and have a 16GB GPU, then this[0] is enough to paste into a notebook. You need to have access to the HF models though.
  |
  | 0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
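The notebook-sized snippet being referred to is roughly of this shape (a sketch using the Transformers text-generation pipeline; it assumes access to the gated meta-llama weights has already been granted, and the exact arguments vary with available VRAM):

    import torch
    from transformers import AutoTokenizer, pipeline

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated: requires accepting Meta's license on HF

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    generator = pipeline(
        "text-generation",
        model=model_id,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,  # fp16 keeps the 7B model within a 16GB GPU
        device_map="auto",          # place layers on the available GPU(s) automatically
    )

    output = generator(
        "Explain what a context window is in one paragraph.",
        max_new_tokens=200,
    )
    print(output[0]["generated_text"])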
  | oaththrowaway wrote:
  | Off topic: is there a way to use one of the LLMs and have it ingest data from a SQLite database and ask it questions about it?
  | politelemon wrote:
  | Have a look at this too; it's just an integration, which LangChain can be good at: https://walkingtree.tech/natural-language-to-query-your-sql-...
  | seanthemon wrote:
  | You can, but as a crazy idea you can also ask ChatGPT to write SELECT queries using the functions parameter they added recently - you can also ask it to write JSONPath.
  |
  | As long as it understands the schema and the general idea of the data, it does a fairly good job. Just be careful not to do too much with one prompt; you can easily cause hallucinations.
  | simonw wrote:
  | I've experimented with that a bit.
  |
  | Currently the absolute best way to do that is to upload a SQLite database file to ChatGPT Code Interpreter.
  |
  | I'm hoping that someone will fine-tune an openly licensed model for this at some point that can give results as good as Code Interpreter does.
  | siquick wrote:
  | You can migrate that data to a vector database (e.g. Pinecone or pgvector) and then query it. I didn't write it, but this guide has a good overview of the concepts and some code. In your case you'd just replace the web crawler with database queries. All the libraries used also exist in Python.
  |
  | https://www.pinecone.io/learn/javascript-chatbot/
  | thisisit wrote:
  | You can, but you'll be trading the precise answers you get from querying for a chance of hallucinations.
  | politelemon wrote:
  | Llama.cpp can run on Android too.
  | synaesthesisx wrote:
  | This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I'm running Llama.cpp locally on my M2 Max (32 GB) with decent performance, but sticking to the 7B model for now.

___________________________________________________________________
(page generated 2023-07-25 23:00 UTC)