[HN Gopher] Guide to running Llama 2 locally
       ___________________________________________________________________
        
       Guide to running Llama 2 locally
        
       Author : bfirsh
       Score  : 177 points
       Date   : 2023-07-25 16:58 UTC (6 hours ago)
        
 (HTM) web link (replicate.com)
 (TXT) w3m dump (replicate.com)
        
       | guy98238710 wrote:
       | > curl -L "https://replicate.fyi/install-llama-cpp" | bash
       | 
        | Seriously? Pipe a script from someone's website directly into
        | bash?
        
         | gattilorenz wrote:
          | Yes. If you are worried, you can redirect it to a file and
          | then sh it. It doesn't get much easier to inspect than that...
        
         | cjbprime wrote:
         | Either you trust the TLS session to their website to deliver
         | you software you're going to run, or you don't.
        
         | madars wrote:
          | That's the recommended way to get Rust nightly too:
          | https://rustup.rs/ But don't look too closely; there's memory
          | safety in there somewhere!
        
           | raccolta wrote:
           | oh, this again.
        
       | handelaar wrote:
        | Idiot question: if I had access to gigantic quantities of
        | sentence-by-sentence, professionally translated
        | foreign-language-to-English text, and I fed the originals in as
        | prompts and the translations as completions...
        | 
        | ... would I be likely to get anything useful if I then fed it
        | new prompts in a similar style? Or would it just generate
        | gibberish?
        
         | seanthemon wrote:
         | Indeed, it sounds like you have what's called fine tuned data
         | (given an input, here's the output), there's loads of info both
         | here on HN about fine tuning and on youtube's huggingface
         | channels
         | 
         | Note if you have sufficient data, look into existing models on
         | huggingface, you may find a smaller, faster and more open
         | (licencing-wise) model that you can fine tune to get the
         | results you want - Llama is hot, but not a catch-all for all
         | tasks (as no model should be)
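          | 
          | As a rough sketch, most fine-tuning tooling expects those
          | pairs in a simple JSONL file, one {prompt, completion} object
          | per line (the exact field names vary by library, so treat
          | these as placeholders):
          | 
          |   import json
          |   
          |   pairs = [
          |       ("Bonjour le monde.", "Hello, world."),
          |       ("Comment ça va ?", "How are you?"),
          |   ]
          |   
          |   with open("train.jsonl", "w", encoding="utf-8") as f:
          |       for source, translation in pairs:
          |           # One object per line: source sentence as the
          |           # prompt, its translation as the completion.
          |           f.write(json.dumps(
          |               {"prompt": source, "completion": translation},
          |               ensure_ascii=False) + "\n")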
         | 
         | Happy inferring!
        
         | nl wrote:
         | If you have that much data you can build your own model that
         | can be much smaller and faster.
         | 
          | A simple version is covered in this beginner tutorial:
         | https://pytorch.org/tutorials/beginner/translation_transform...
        
       | maxlin wrote:
        | I might be missing something. The article asks me to run a bash
        | script on Windows.
        | 
        | I assume this would still need to be run manually to access GPU
        | resources etc., so can someone illuminate what is actually
        | expected of a Windows user to make this run?
        | 
        | I'm currently paying $15 a month for a personal
        | translation/summarizer project's ChatGPT queries. I run Whisper
        | (const.me's GPU fork) locally and would love to get the LLM part
        | local eventually too! The system generates 30k queries a month
        | but is not very delay-sensitive, so lower token rates might work
        | too.
        
         | nomel wrote:
         | Windows has supported linux tools for some time now, using WSL:
         | https://learn.microsoft.com/en-us/windows/wsl/about
         | 
          | No idea if it will work in this case, but it does with
          | llama.cpp: https://github.com/ggerganov/llama.cpp/issues/103
        
           | maxlin wrote:
            | I know (I should have included that in my earlier response,
            | but editing would've felt weird), but I still assume one
            | should run the result natively, so I'm asking if/where
            | there's some jumping around required.
            | 
            | Last time I tried running an LLM, I tried both WSL and
            | native on 2 machines and just got Lovecraftian-tier errors,
            | so I'm waiting to see if I'm missing something obvious
            | before going down that route again.
        
       | nomand wrote:
        | Is it possible for such a local install to retain conversation
        | history, so that if, for example, you're working on a project
        | and using it as your assistant across many days, you can
        | continue conversations and the model keeps track of what you
        | and it already know?
        
         | simonw wrote:
         | My LLM command line tool can do that - it logs everything to a
         | SQLite database and has an option to continue a conversation:
         | https://llm.datasette.io
        
         | knodi123 wrote:
         | llama is just an input/output engine. It takes a big string as
         | input, and gives a big string of output.
         | 
          | Save your outputs if you want; you can copy/paste them into
          | any editor. Or make a shell script that mirrors outputs to a
          | file and use _that_ as your main interface. It's up to the
          | user.
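          | 
          | If you want that in code rather than a shell script, a
          | minimal sketch with the llama-cpp-python bindings (the model
          | path is a placeholder) looks something like this:
          | 
          |   from llama_cpp import Llama
          |   
          |   llm = Llama(model_path="./llama-2-13b-chat.ggmlv3.q4_0.bin")
          |   
          |   prompt = "Explain what a context window is, briefly."
          |   out = llm(prompt, max_tokens=256)["choices"][0]["text"]
          |   
          |   # Mirror every exchange to a plain text file as a crude
          |   # conversation history.
          |   with open("chat.log", "a", encoding="utf-8") as log:
          |       log.write(prompt + "\n" + out + "\n---\n")
          |   
          |   print(out)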
        
         | jmiskovic wrote:
          | There is no fully built solution, only bits and pieces. I
          | noticed that llama outputs tend to degrade as the amount of
          | text grows: the text becomes too repetitive and focused, and
          | you have to raise the temperature to break the model out of
          | loops.
        
           | nomand wrote:
           | Does what you're saying mean you can only ask questions and
           | get answers in a single step, and that having a long
           | discussion where refinement of output is arrived at through
           | conversation isn't possible?
        
             | krisoft wrote:
             | My understanding is that at a high level you can look at
             | this model as a black box which accepts a string and
             | outputs a string.
             | 
             | If you want it to "remember" things you do that by
             | appending all the previous conversations together and
             | supply it in the input string.
             | 
              | In an ideal world this would work perfectly. It would
              | read through the whole conversation and provide the right
              | output you expect, exactly as if it "remembered" the
              | conversation. In reality there are all kinds of issues
              | which can crop up as the input grows longer and longer.
              | One is that it takes more and more processing power and
              | time for it to "read through" everything previously said.
              | And, as jmiskovic said, the output quality can also
              | degrade in perhaps unexpected ways.
             | 
              | But that also doesn't mean that "refinement of output is
              | arrived at through conversation isn't possible". It is
              | not that black and white, just that you can run into
              | trouble as the length of the discussion grows.
             | 
              | I don't have direct experience with long conversations,
              | so I can't tell you how long is definitely too long and
              | how long is still safe. There are probably tricks to work
              | around these issues, especially if one unpacks that
              | "black box" understanding of the process. But even
              | without that, you could imagine a "consolidation" process
              | where the AI is instructed to write short notes about a
              | given stretch of conversation, and those shorter notes
              | are copied into the next input instead of the full
              | previous conversation. All of this is possible, but you
              | won't have a turn-key solution for it just yet.
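              | 
              | In code, the "append everything" approach is just string
              | concatenation around whatever runs the model; here
              | generate() is a stand-in for a call into llama.cpp or any
              | other runner:
              | 
              |   history = []  # list of (user, assistant) turns
              |   
              |   def generate(prompt: str) -> str:
              |       # Placeholder for whatever actually runs the
              |       # model (llama.cpp bindings, an HTTP API, ...).
              |       raise NotImplementedError
              |   
              |   def chat(user_msg: str) -> str:
              |       # Re-send the whole conversation every time so
              |       # the model can "remember" it; this is the part
              |       # that grows with each turn.
              |       transcript = ""
              |       for u, a in history:
              |           transcript += f"User: {u}\nAssistant: {a}\n"
              |       transcript += f"User: {user_msg}\nAssistant:"
              |       reply = generate(transcript)
              |       history.append((user_msg, reply))
              |       return reply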
        
               | cjbprime wrote:
               | The limit here is the "context window" length of the
               | model, measured in tokens, which will quickly become too
               | short to contain all of your previous conversations,
               | which will mean it has to answer questions without access
               | to all of that text. And within a single conversation, it
               | will mean that it starts forgetting the text from the
               | start of the conversation, once the [conversation + new
               | prompt] reaches the context length.
               | 
               | The kind of hacks that work around this are to train the
               | model on the past conversations, and then rely on
               | similarity in tensor space to pull the right (lossy) data
               | back out of the model (or a separate database) later,
               | based on its similarity to your question, and include it
               | (or a summary of it, since summaries are smaller) within
               | the context window for your new conversation, combined
               | with your prompt. This is what people are talking about
               | when they use the term "embeddings".
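                | 
                | A toy sketch of the retrieval side; the library
                | (sentence-transformers), the model name, and the
                | stored notes here are just placeholders for whatever
                | you use to get embeddings:
                | 
                |   import numpy as np
                |   from sentence_transformers import SentenceTransformer
                |   
                |   embedder = SentenceTransformer("all-MiniLM-L6-v2")
                |   
                |   notes = [
                |       "We decided to render pages with Jinja2.",
                |       "The config file lives at site.toml.",
                |   ]
                |   note_vecs = embedder.encode(notes)
                |   
                |   def recall(question, k=1):
                |       # Cosine similarity between the question and
                |       # the stored notes; the best matches get pasted
                |       # into the next prompt's context.
                |       q = embedder.encode([question])[0]
                |       sims = note_vecs @ q / (
                |           np.linalg.norm(note_vecs, axis=1)
                |           * np.linalg.norm(q))
                |       return [notes[i] for i in np.argsort(-sims)[:k]]
                |   
                |   print(recall("Which template engine do we use?"))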
        
               | nomand wrote:
                | My benchmark is a pair programming session spanning
                | days and dozens of queries with ChatGPT, in which we
                | co-created a custom static site generator that works
                | really well for my requirements. It was able to hold
                | context for a while and not "forget" what code it
                | provided me dozens of messages earlier; it was able to
                | "remember" corrections and refactors that I gave it,
                | and overall it was incredibly useful for working out
                | things like recursion over folder hierarchies and
                | building data trees. This kind of use case, and similar
                | ones where memory is important, is what I mean by using
                | the model as a genuine assistant.
        
               | krisoft wrote:
                | Excellent! That sounds like a very useful personal
                | benchmark then. You could test Llama v2 by copying in
                | snippets of different lengths from that conversation
                | and checking how useful you find its outputs.
        
       | RicoElectrico wrote:
       | curl -L "https://replicate.fyi/windows-install-llama-cpp"
       | 
       | ... returns 404 Not Found
        
       | thisisit wrote:
        | The easiest way I found was to use GPT4All. Just download and
        | install it, grab a GGML version of Llama 2, and copy it to the
        | models directory in the installation folder. Fire up GPT4All
        | and run.
        
       | andreyk wrote:
       | This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama
       | (Mac), MLC LLM (iOS/Android)
       | 
        | Which is not really comprehensive... If you have a Linux
        | machine with GPUs, I'd just use Hugging Face's
        | text-generation-inference
        | (https://github.com/huggingface/text-generation-inference).
        | And I am sure there are other things that could be covered.
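        | 
        | For reference, once a text-generation-inference server is up,
        | querying it is a small HTTP call; the host, port, and
        | parameters below are just whatever you launched it with:
        | 
        |   import requests
        |   
        |   resp = requests.post(
        |       "http://localhost:8080/generate",  # assumed host/port
        |       json={
        |           "inputs": "What is llama.cpp?",
        |           "parameters": {"max_new_tokens": 128},
        |       },
        |       timeout=120,
        |   )
        |   print(resp.json()["generated_text"])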
        
         | krisoft wrote:
         | > If you have a linux machine with GPUs
         | 
          | Approximately how much VRAM does one need to run inference
          | with Llama 2 on a GPU?
        
           | lolinder wrote:
           | Depends on which model. I haven't bothered doing it on my 8GB
           | because the only model that would fit is the 7B model
           | quantized to 4 bits, and that model at that size is pretty
           | bad for most things. I think you could have fun with 13B with
           | 12GB VRAM. The full size model would require >35GB even
           | quantized.
        
           | novaRom wrote:
            | 16 GB is the minimum to run the 7B model with float16
            | weights out of the box, with no further effort.
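            | 
            | The back-of-the-envelope math is just parameter count
            | times bytes per weight, plus some headroom for the KV
            | cache and activations (the 2 GB overhead here is a rough
            | guess):
            | 
            |   def vram_gb(n_billion, bits, overhead_gb=2.0):
            |       # weights + rough allowance for context/activations
            |       return n_billion * bits / 8 + overhead_gb
            |   
            |   print(vram_gb(7, 16))   # 7B  @ float16 -> ~16 GB
            |   print(vram_gb(13, 4))   # 13B @ 4-bit   -> ~8.5 GB
            |   print(vram_gb(70, 4))   # 70B @ 4-bit   -> ~37 GB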
        
         | Patrick_Devine wrote:
          | Ollama works with Windows and Linux as well, but doesn't
          | (yet) have GPU support for those platforms. You have to
          | compile it yourself (it's a simple `go build .`), but it
          | should work fine (albeit slowly). The benefit is you can
          | still pull the llama2 model really easily (with `ollama pull
          | llama2`) and even use it with other runners.
         | 
         | DISCLAIMER: I'm one of the developers behind Ollama.
        
           | mschuster91 wrote:
           | > DISCLAIMER: I'm one of the developers behind Ollama.
           | 
            | I've got a feature suggestion - would it be possible to
            | have the ollama CLI automatically start up the GUI/daemon
            | if it's not running? There's only so much stuff one can
            | keep in a MacBook Air's auto-start.
        
             | jmorgan wrote:
             | Good suggestion! This is definitely on the radar, so that
             | running `ollama` will start the server when it's needed
             | (instead of erroring!):
             | https://github.com/jmorganca/ollama/issues/47
        
           | DennisP wrote:
           | I've been wondering, is the M2's neural engine usable for
           | this?
        
         | robotnikman wrote:
          | Llama.cpp has been fun to experiment with. I was surprised by
          | how easy it was to set up, much easier than when I tried to
          | set up a local LLM almost a year ago.
        
         | lolinder wrote:
         | Just a note that you have to have at least 12GB VRAM for it to
         | be worth even trying to use your GPU for LLaMA 2.
         | 
          | The 7B model quantized to 4 bits can fit in 8GB VRAM with
          | room for the context, but is pretty useless for getting good
          | results in my experience. 13B is better but still nowhere
          | near as good as the 70B, which would require >35GB of VRAM at
          | 4-bit quantization.
         | 
         | My solution for playing with this was just to upgrade my PC's
         | RAM to 64GB. It's slower than the GPU, but it was way cheaper
         | and I can run the 70B model easily.
        
           | dc443 wrote:
            | I have 2x 3090. Do you know if it's feasible to use that
            | 48GB total for running this?
        
             | eurekin wrote:
              | Yes, it runs totally fine. I ran it in Oobabooga's
              | text-generation-webui. The nice thing about it is that it
              | autodownloads all the necessary GPU binaries on its own
              | and creates an isolated conda env. I asked the same
              | questions on the official 70B demo and got the same
              | answers. I even got better answers with ooba, since the
              | demo cuts text off early.
              | 
              | Oobabooga: https://github.com/oobabooga/text-generation-webui
             | 
             | Model: TheBloke_Llama-2-70B-chat-GPTQ from
             | https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
             | 
             | ExLlama_HF loader gpu split 20,22, context size 2048
             | 
              | On the Chat Settings tab, choose the Instruction template
              | tab and pick Llama-v2 from the instruction template
              | dropdown.
             | Demo: https://huggingface.co/blog/llama2#demo
        
               | zakki wrote:
                | Are there any specific settings needed to make 2x 3090
                | work together?
        
           | NoMoreNicksLeft wrote:
            | Trying to figure out what hardware to convince my boss to
            | spend on... if we were to get one of the A6000/48GB cards,
            | would we see significant performance improvements over just
            | a 4090/24GB? The primary limitation is VRAM, is it not?
        
             | cjbprime wrote:
             | You might consider getting a Mac Studio (with as much RAM
             | as you can afford up to 192GB) instead, since 192GB is more
             | (unified) memory than you're going to easily get to with
             | GPUs.
        
             | lolinder wrote:
              | VRAM is what gets you up to the larger model sizes, and
              | 24GB isn't enough to load the full 70B even at 4 bits;
              | you need at least 35GB plus some extra for the context.
              | So it depends a lot on what you want to do--fine-tuning
              | will take even more, as I understand it.
             | 
             | The card's speed will affect your performance, but I don't
             | know enough about different graphics cards to tell you
             | specifics.
        
           | ErneX wrote:
            | Apple Silicon Macs might not have great GPUs, but they do
            | have unified memory. I need to try this on mine; I have
            | 96GB of RAM on my M2 Max.
        
       | krychu wrote:
       | Self-plug. Here's a fork of the original llama 2 code adapted to
       | run on the CPU or MPS (M1/M2 GPU) if available:
       | 
       | https://github.com/krychu/llama
       | 
        | It runs with the original weights and gets you to ~4 tokens/sec
        | on a MacBook Pro M1 with the 7B model.
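        | 
        | The CPU/MPS selection is the usual PyTorch dance, roughly:
        | 
        |   import torch
        |   
        |   device = (torch.device("mps")
        |             if torch.backends.mps.is_available()
        |             else torch.device("cpu"))
        |   
        |   # weights and activations then just get moved there
        |   x = torch.randn(1, 4096, device=device)
        |   print(device, x.device)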
        
       | rootusrootus wrote:
       | For most people who just want to play around and are using MacOS
       | or Windows, I'd just recommend lmstudio.ai. Nice interface, with
       | super easy searching and downloading of new models.
        
         | dividedbyzero wrote:
         | Does it make any sense to try this on a lower-end Mac (like a
         | M2 Air)?
        
           | mchiang wrote:
           | Yeah! How much memory do you have?
           | 
            | If by lower-end MacBook Air you mean one with 8GB of
            | memory, try the smaller models (such as Orca Mini 3B). You
            | can do this via LM Studio, Oobabooga/text-generation-webui,
            | KoboldCPP, GPT4All, ctransformers, and more.
           | 
           | I'm biased since I work on Ollama, and if you want to try it
           | out:
           | 
           | 1. Download https://ollama.ai/download
           | 
           | 2. `ollama run orca`
           | 
            | 3. Enter your prompt
           | 
            | Note that Ollama is open source, and you can also compile
            | it yourself from https://github.com/jmorganca/ollama
        
             | bdavbdav wrote:
              | I'm deliberating over how much RAM to get in my new MBP.
              | Is 32GB going to stand me in good stead?
        
               | mchiang wrote:
               | Local memory management will definitely get better in the
               | future.
               | 
               | For now:
               | 
               | You should have at least 8 GB of RAM to run the 3B
               | models, 16 GB to run the 7B models, and 32 GB to run the
               | 13B models.
               | 
                | My personal recommendation is to get as much memory as
                | you can if you want to work with local models
                | [including VRAM if you are planning to execute on a
                | GPU].
        
               | rootusrootus wrote:
                | 32GB should be fine. I went a little overboard and got
                | a new MBP with an M2 Max and 96GB, but the hardware is
                | really best suited at this point to a 30B model. I can
                | and do play around with 65B models, but at that point
                | you're making a fairly big tradeoff in generation speed
                | for an incremental increase in quality.
               | 
               | As a datapoint, I have a 30B model [0] loaded right now
               | and it's using 23.44GB of RAM. Getting around 9
               | tokens/sec, which is very usable. I also have the 65B
               | version of the same model [1] and it's good for around
               | 3.6 tokens/second, but it uses 44GB of RAM. Not unusably
               | slow, but more often than not I opt for the 30B because
               | it's good enough and a lot faster.
               | 
               | Haven't tried the llama2 70B yet.
               | 
                | [0] https://huggingface.co/TheBloke/upstage-llama-30b-instruct-2...
                | [1] https://huggingface.co/TheBloke/Upstage-Llama1-65B-Instruct-...
        
               | swader999 wrote:
               | What's your use case for local if you don't mind?
        
             | dividedbyzero wrote:
              | By lower-end I meant that the Airs are quite low-end in
              | general (compared to the Pro/Studio). I have the
              | maxed-out 24GB, but 16GB may be more common among people
              | who might use an Air for this kind of thing.
        
       | Der_Einzige wrote:
        | The correct answer, as always, is the oobabooga text-generation
        | web UI, which supports all of the relevant backends:
        | https://github.com/oobabooga/text-generation-webui
        
         | cypress66 wrote:
          | Yep. Use ooba. People who like to RP often use ooba as a
          | backend and SillyTavern as a frontend.
        
           | Roark66 wrote:
            | Can it run ONNX transformer models? I've found optimised
            | ONNX models are at least twice the speed of vanilla PyTorch
            | on the CPU.
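            | 
            | (For context, by that I mean running the exported model
            | with onnxruntime directly, roughly like this; the file name
            | and the tensor names depend entirely on how the model was
            | exported, so treat them as placeholders:)
            | 
            |   import numpy as np
            |   import onnxruntime as ort
            |   
            |   sess = ort.InferenceSession(
            |       "decoder_model.onnx",            # placeholder path
            |       providers=["CPUExecutionProvider"],
            |   )
            |   
            |   input_ids = np.array([[1, 15043, 3186]], dtype=np.int64)
            |   attention_mask = np.ones_like(input_ids)
            |   
            |   (logits,) = sess.run(
            |       ["logits"],                    # assumed output name
            |       {"input_ids": input_ids,       # assumed input names
            |        "attention_mask": attention_mask},
            |   )
            |   print(int(logits[0, -1].argmax()))  # next-token id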
        
       | TheAceOfHearts wrote:
        | How do you decide which model variant to use? There's a bunch
        | of quant-method variations of Llama-2-13B-chat-GGML [0]; how do
        | you know which one to pick? The "Explanation of the new k-quant
        | methods" section is a bit opaque.
       | 
       | [0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
        
       | sva_ wrote:
       | If you just want to do inference/mess around with the model and
       | have a 16GB GPU, then this[0] is enough to paste into a notebook.
       | You need to have access to the HF models though.
       | 
       | 0.
       | https://github.com/huggingface/blog/blob/main/llama2.md#usin...
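        | 
        | The gist of it, if you just want the shape of the code (the
        | model id is the gated Llama 2 chat repo, so you need approved
        | access and a logged-in HF token; details may differ from the
        | notebook):
        | 
        |   import torch
        |   from transformers import pipeline
        |   
        |   generate = pipeline(
        |       "text-generation",
        |       model="meta-llama/Llama-2-7b-chat-hf",  # gated model
        |       torch_dtype=torch.float16,
        |       device_map="auto",
        |   )
        |   
        |   out = generate("Explain quantization in one sentence.",
        |                  max_new_tokens=64)
        |   print(out[0]["generated_text"])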
        
       | oaththrowaway wrote:
        | Off topic: is there a way to use one of the LLMs to ingest data
        | from a SQLite database and then ask it questions about that
        | data?
        
         | politelemon wrote:
          | Have a look at this too; it's the kind of integration that
          | LangChain can be good at:
          | https://walkingtree.tech/natural-language-to-query-your-sql-...
        
         | seanthemon wrote:
          | You can, but as a crazy idea you can also ask ChatGPT to
          | write SELECT queries using the functions parameter they added
          | recently - you can also ask it to write JSONPath.
          | 
          | As long as it understands the schema and the general shape of
          | the data, it does a fairly good job. Just be careful not to
          | do too much in one prompt; you can easily cause
          | hallucinations.
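          | 
          | A rough sketch of that approach with the pre-1.0 openai
          | Python client (the database path and the question are
          | placeholders; the function schema is just one way to shape
          | it, and OPENAI_API_KEY is assumed to be set):
          | 
          |   import json, sqlite3, openai
          |   
          |   db = sqlite3.connect("mydata.db")
          |   schema = "\n".join(r[0] for r in db.execute(
          |       "SELECT sql FROM sqlite_master WHERE type='table'"))
          |   
          |   question = "How many orders were placed in 2022?"
          |   
          |   resp = openai.ChatCompletion.create(
          |       model="gpt-3.5-turbo-0613",
          |       messages=[{
          |           "role": "user",
          |           "content": f"Schema:\n{schema}\n\n{question}",
          |       }],
          |       functions=[{
          |           "name": "run_query",
          |           "description": "Run a read-only SQL query",
          |           "parameters": {
          |               "type": "object",
          |               "properties": {"sql": {"type": "string"}},
          |               "required": ["sql"],
          |           },
          |       }],
          |       function_call={"name": "run_query"},
          |   )
          |   
          |   # The model "calls" run_query; we execute its SQL locally.
          |   call = resp["choices"][0]["message"]["function_call"]
          |   sql = json.loads(call["arguments"])["sql"]
          |   print(db.execute(sql).fetchall())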
        
         | simonw wrote:
         | I've experimented with that a bit.
         | 
         | Currently the absolutely best way to do that is to upload a
         | SQLite database file to ChatGPT Code Interpreter.
         | 
         | I'm hoping that someone will fine-tune an openly licensed model
         | for this at some point that can give results as good as Code
         | Interpreter does.
        
         | siquick wrote:
          | You can migrate that data to a vector database (e.g. Pinecone
          | or pgvector) and then query it. I didn't write it, but this
          | guide has a good overview of the concepts and some code. In
          | your case you'd just replace the web crawler with database
          | queries. All the libraries used also exist in Python.
         | 
         | https://www.pinecone.io/learn/javascript-chatbot/
        
         | thisisit wrote:
          | You can, but you'll end up trading the precise answers you
          | get from querying for a chance of hallucinations.
        
       | politelemon wrote:
       | Llama.cpp can run on Android too.
        
       | synaesthesisx wrote:
       | This is usable, but hopefully folks manage to tweak it a bit
       | further for even higher tokens/s. I'm running Llama.cpp locally
       | on my M2 Max (32 GB) with decent performance but sticking to the
       | 7B model for now.
        
       ___________________________________________________________________
       (page generated 2023-07-25 23:00 UTC)