[HN Gopher] Run Llama 13B with a 6GB graphics card
       ___________________________________________________________________
        
       Run Llama 13B with a 6GB graphics card
        
       Author : rain1
       Score  : 411 points
       Date   : 2023-05-14 12:35 UTC (10 hours ago)
        
 (HTM) web link (gist.github.com)
 (TXT) w3m dump (gist.github.com)
        
       | tikkun wrote:
       | See also:
       | 
       | https://www.reddit.com/r/LocalLLaMA/comments/13fnyah/you_guy...
       | 
       | https://chat.lmsys.org/?arena (Click 'leaderboard')
        
       | alg_fun wrote:
        | wouldn't it be faster to use RAM as swap for VRAM?
        
       | naillo wrote:
        | This is cool, but are people actually getting stuff done with
        | these models? I'm enthusiastic about their potential too, but
        | after playing with it for a day I'm at a loss for what to use it
        | for at this point.
        
         | Joeri wrote:
         | My experience is that even the "toy" 7B models can do pretty
         | amazing things, but they're very sensitive to prompting in the
         | right way to force the model down the right pathways. My
         | assumption is that the larger models have more redundancy and
         | therefore can be prompted less precisely while still going down
         | the right pathway.
         | 
         | For example, the WizardLM model (5_1 quantized, as 4_x
         | quantization makes the 7B models perform terribly):
         | 
         | Prompt: _I went to the market and bought 10 apples. I gave 2 to
         | my neighbor and 2 to my sister. Then I bought 5 more and ate 1.
         | How many apples do I have?_
         | 
          | Response:
          | 
          |     Let me check that for you! You had 9 apples left after
          |     giving away 4, so you still have 9 - 4 = 5 apples left.
         | 
         | Prompt: _I went to the market and bought 10 apples. I gave 2 to
         | my neighbor and 2 to my sister. Then I bought 5 more and ate 1.
         | How many apples do I have? Write out any calculations in full.
         | List the steps used to come to the final result._
         | 
          | Response:
          | 
          |     Sure, let's work through this problem together! Here are
          |     the steps we can use:
          |     Step 1: Start with the initial number of apples you bought
          |     (10).
          |     Step 2: Subtract the amount given to your neighbor and
          |     sister from the total (2 + 2 = 4). This gives us a new
          |     total of 6 apples.
          |     Step 3: Add the additional 5 apples you bought later on to
          |     the remaining 6 apples, giving us a final total of 11
          |     apples.
          |     Step 4: Finally, subtract the apple you ate from the total
          |     (1), leaving us with 10 apples in total.
          |     So the answer is 10 apples.
        
         | m3kw9 wrote:
          | For the most part, if you actually need stuff done, you can
          | afford $20 for OpenAI's proven models.
          | 
          | These local projects are great because maybe eventually they
          | will have an equivalent model that can be run on cheap parts.
        
           | Beltiras wrote:
           | The knowledge cutoff date is brutal for some applications.
        
             | m3kw9 wrote:
              | I wonder how ingesting more and more data will affect the
              | number of parameters. Is it going to keep getting bigger?
        
               | rolisz wrote:
                | I don't think that the current models are at "knowledge
                | capacity". So far all evidence points to training the
                | same size model on more data giving better results.
        
               | cubefox wrote:
                | Increasing either the number of parameters or the number
                | of training tokens improves results (more precisely:
                | lowers training loss), and both cost computing power. To
                | optimally improve loss per unit of training compute,
                | model size and training tokens should be increased
                | equally. That's the Chinchilla scaling law. (Though low
                | loss is not always the same as good results; data
                | quality also matters.)
               | 
               | Further reading: https://dynomight.net/scaling/
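                | 
                | (For reference, the fitted Chinchilla loss has roughly
                | the form L(N, D) = E + A/N^0.34 + B/D^0.28, where N is
                | the parameter count and D the number of training tokens;
                | the compute-optimal recipe that falls out of it scales N
                | and D in roughly equal proportion, around 20 tokens per
                | parameter. Treat the exponents as approximate.)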
        
               | snovv_crash wrote:
                | An interesting corollary of this is that if you want to
                | reduce the model size you can compensate by training for
                | longer to achieve the same accuracy. Depending on your
                | training:inference ratio, this may be the better choice
                | globally, reducing your total compute costs or even just
                | your frontend latency.
        
               | cubefox wrote:
               | Yeah, though I have not seen a formula which takes the
               | number of expected inference runs into account for
               | calculating the optimal data/parameter balance.
        
           | thelittleone wrote:
           | Knowledge cut off and sending potentially sensitive IP to a
           | centralised and untrusted third party. This would likely
           | limit the addressable market.
        
             | jjoonathan wrote:
             | Also: they are lobotomized. If you want to do security
             | research, Sorry Dave. If you want to nsfw writing, Sorry
             | Dave. If you want to open the pod bay doors, Sorry Dave,
             | I'm afraid I can't do that.
        
               | yieldcrv wrote:
                | I tried to help a sex worker with ChatGPT and all it did
                | was preach about avoiding sex work, further marginalizing
                | her with its virtue signalling. I don't consider her
                | marginalized, and "help" was just information about her
                | job and techniques and venues. ChatGPT would have
                | consumed this textual information too.
                | 
                | But yeah, offline fine-tuned models won't have this
                | problem.
               | 
               | Kind of cool to see how the SWERF representation in tech
               | is going to speedrun SWERF irrelevancy.
        
               | jhartwig wrote:
               | You tried to help a sex worker with chatGPT? Care to
               | explain more on this use case lol? Curious minds want to
               | know.
        
               | yieldcrv wrote:
               | Venues to work, other people's thoughts working there,
               | management, nuances about local licenses, stuff that
               | anybody with any job would do
        
               | jstarfish wrote:
               | People are already setting up fake personas/OnlyFans
               | accounts using chatbots and SD images.
               | 
                | We have a high-value specialist currently chatting up a
                | few of them at work. His wife doesn't know. He doesn't
                | know _we_ know. The photos are fake but he's too horny
                | to notice.
               | 
               | Time to dust off the "there are no women on the internet"
               | meme...
        
               | baobabKoodaa wrote:
               | > People are already setting up fake personas/OnlyFans
               | accounts using chatbots and SD images.
               | 
               | Citation needed.
        
               | yieldcrv wrote:
               | just the latest most high profile example making the
               | rounds yesterday
               | 
               | https://www.nbcnews.com/news/amp/rcna84180
        
               | jstarfish wrote:
               | I'm only offering an early anecdote that catfishing is
               | adapting to the times. You don't have to believe me.
               | Don't trust anyone with weird nipples that hides their
               | hands in every photo.
               | 
               | People have been [claiming to] do this for years:
               | https://www.blackhatworld.com/seo/monetizing-traffic-
               | from-so...
               | 
               | Give it 1-2 years and you can hear about it from Krebs.
        
               | yieldcrv wrote:
               | Informative for some but this wasn't an interaction over
               | the internet, just out and about
        
               | cubefox wrote:
               | Microsoft Azure still has the GPT-3.5 foundation model,
               | code-davinci-002. It is not fine-tuned for instruction
               | following, safety, or censorship.
               | 
               | I'm not sure though whether Microsoft analyzes the
               | input/output with another model to detect and prevent
               | certain content.
        
         | iforgotpassword wrote:
         | I haven't tried the fine-tuned variants yet, but when I played
         | around with it shortly after the leak, it tended to quickly
         | derail into nonsense when you let it complete sentences or
         | paragraphs, especially when using other languages than English.
         | When I tried to get it to produce Python code, most of the time
         | it wasn't even syntactically correct.
        
         | Taek wrote:
         | I've successfully gotten at-home models
         | (https://huggingface.co/NousResearch/GPT4-x-Vicuna-13b-fp16) to
         | go through my messages and pull out key todo items. For
         | example, reminding me to message my father about travel plans.
         | 
         | Is it comparable to GPT-4? No, it's not remotely close. But
         | it's getting closer every week, and it very recently crossed
         | the threshold of 'it can do stuff I would never send to a cloud
         | service' (namely, reading all of my messages and pulling out
         | action items).
        
         | jhbadger wrote:
         | Depends on what "getting stuff done" means. I find 13B models
         | (running on my M1 Mac) useful for playing AI Dungeon-like games
         | -- where you describe the environment and your character and
         | you can play an RPG.
        
           | fredguth wrote:
           | GitHub Copilot is (or once was) a 13b model, according to Nat
           | Friedman in the scale.ai interview.
           | (https://youtu.be/lnufceCxwG0)
        
         | Zetobal wrote:
         | We run some llamas to analyze user content.
        
         | rain1 wrote:
         | It's just for fun!
         | 
         | These local models aren't as good as Bard or GPT-4.
        
           | happycube wrote:
            | There are two major advantages though - you can retrain
            | them, and they don't have the guardrails that the commercial
            | models have.
        
             | gre wrote:
             | I tried to prompt vicuna to tell me a joke about gay people
             | and it refused. Some of the guardrails are still in there.
        
               | azeirah wrote:
               | It's because vicuna is fine-tuned on chatGPT answers.
               | LLaMa will not do this, but LLaMa-based models fine tuned
               | with chatGPT answers will.
        
               | occz wrote:
               | Did you use the censored or the uncensored variant?
        
               | gre wrote:
               | It's just a random one from huggingface. I will look for
               | the uncensored one later. Thanks, I think.
        
               | occz wrote:
               | You're welcome. I can't vouch for them though, as I
               | haven't tried them, I've merely heard about them.
        
         | instance wrote:
          | I tested it on a serious use case and quality was subpar. For
          | real use cases I had to either host the most powerful model you
          | can get (e.g. LLaMA-65B or so) on a cloud machine, which costs
          | too much (you'll be paying like 500-1000 USD per month), or
          | just go straight for GPT-3.5 on OpenAI. The latter makes the
          | most sense economically.
        
           | inferense wrote:
           | what real use case did you use it for?
        
             | instance wrote:
              | For instance, I used it in conjunction with llama-index
              | for knowledge management. I created an index for the whole
              | Confluence/Jira of a mid-sized company and got good results
              | with GPT, but for a LLaMA of this size that use case was
              | too much.
        
               | sroussey wrote:
               | Did you try instructor-xl? It ranks highest on
               | huggingface.
        
               | dzhiurgis wrote:
               | I'd argue 1k per month for mid-sized company is nothing,
               | but I can understand where you are coming from.
        
             | throwaway1777 wrote:
             | Making demos to raise investment probably
        
           | raffraffraff wrote:
           | What about turning the cloud vm off except when you're
           | actually using it?
        
             | unglaublich wrote:
             | A "serious use case" means it needs to be available around
             | the clock.
        
         | ineedasername wrote:
         | I can run the Wizard 30B ggml model in CPU mode using a Ryzen
         | 5700 and 16GB of _system_ RAM, not GPU VRAM. I'm using
         | oobabooga as the front end.
         | 
          | It's slow, but if I ask it to write a haiku it's slow on the
          | order of "go brew some coffee and come back in 10 minutes," and
          | it does it very well. Running it overnight on something like
          | "summarize an analysis of topic X" it does a reasonable job.
         | 
         | It can produce answers to questions only slightly less well
         | than ChatGPT (3.5). The Wizard 13B model runs much faster,
         | maybe 2-3 tokens per second.
         | 
         | It is free, private, and runs on a midrange laptop.
         | 
          | A little more than a month ago that wasn't possible, not with
          | my level of knowledge of the tooling involved at least. Now it
          | requires little more than running an executable and minor
          | troubleshooting of Python dependencies (on another machine it
          | "just worked").
          | 
          | So: don't think of these posts as "doing it just because you
          | can and it's fun to tinker."
         | 
         | Vast strides are being made pretty much daily in both quality
         | and efficiency, raising their utility while lowering the cost
         | of usage, doing both to a very significant degree.
        
           | theaiquestion wrote:
           | > It's slow, but if I ask it to write a Haiku it's slow on
           | the order of "go brew some coffee and come back in 10
           | minutes" and does it very well. Running it overnight on
           | something like "summarize an analysis of topic X it does a
           | reasonable job.
           | 
            | I'm sorry, but that's unusably slow; even GPT-4 can take a
            | retry or a follow-up prompt to fix certain types of issues.
            | My experience is that the open options require a lot more
            | attempts and manual prompt tuning.
            | 
            | I can't think of a single workload where that is usable. That
            | said, once consumer GPUs are involved it does become usable.
        
             | postalrat wrote:
              | I doubt you've ever worked with people if you think that's
              | unusably slow.
        
               | bcrosby95 wrote:
               | The computer doesn't ask for annoying things like a
               | paycheck or benefits either.
        
               | mejutoco wrote:
               | Money upfront and a small salary in the form of
               | electricity bills.
        
           | sp332 wrote:
           | What prompt do you use to get haikus?
        
           | BaculumMeumEst wrote:
           | Wow you can run a 30B model on 16gb ram? Is it hitting swap?
        
             | sp332 wrote:
             | Most people are running these at 4 bits per parameter for
             | speed and RAM reasons. That means the model would take just
             | about all of the RAM. But instead of swap (writing data to
             | disk and then reading it again later), I would expect a
             | good implementation to only run into cache eviction
             | (deleting data from RAM and then reading it back from disk
             | later), which should be a lot faster and cause less wear
             | and tear on SSDs.
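              | 
              | (Rough arithmetic: 30 billion parameters at 4 bits each is
              | about 30e9 * 0.5 bytes = ~15 GB of weights, and real
              | quantized files are a bit larger because of per-block scale
              | factors. That is right at the edge of 16 GB of system RAM,
              | which is why the cache-eviction behaviour described above
              | matters so much here.)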
        
             | mcbuilder wrote:
             | These models can run FP16, with LLM quantization going down
             | to Int8 and beyond.
        
               | BaculumMeumEst wrote:
               | i'm just starting to get into deep learning so i look
               | forward to understanding that sentence
        
               | MobiusHorizons wrote:
                | FP16 and Int8 refer to how many bits are used for
                | floating point and integer numbers. FP16 is 16-bit
                | floating point. The more bits, the better the precision,
                | but the more RAM it takes. Normally programmers use 32-
                | or 64-bit floats, so 16-bit floats have significantly
                | reduced precision, but take up half the space of FP32,
                | which is the smallest floating point format for most
                | CPUs. Similarly, 8-bit integers have only 256 total
                | possibilities and go from -128 to 127.
        
               | mike00632 wrote:
                | The resources required are directly related to the
                | memory devoted to each weight. If the weights are stored
                | as 32-bit floating point numbers then each weight is 32
                | bits, which adds up when we are talking about billions
                | of weights. But if the weights are first converted to
                | 16-bit floating point numbers (precise to fewer decimal
                | places) then fewer resources are needed to store and
                | compute the numbers. Research has shown that simply
                | chopping off some of the precision of the weights still
                | yields good AI performance in many cases.
                | 
                | Note too that the numbers are standardized, e.g. floats
                | are defined by the IEEE 754 standard. Numbers in this
                | format have specialized hardware to do math with them,
                | so when considering which number format to use it's
                | difficult to get outside of the established ones
                | (float32, float16, int8).
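                | 
                | (A worked example for a 13B-parameter model: 13e9 weights
                | at float32 is about 52 GB, at float16 about 26 GB, at
                | int8 about 13 GB, and at 4-bit quantization roughly
                | 6.5 GB plus a little overhead for scale factors, which is
                | what makes partial offloading onto a 6 GB card plausible.)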
        
               | sp332 wrote:
               | Training uses gradient descent, so you want to have good
               | precision during that process. But once you have the
               | overall structure of the network,
               | https://arxiv.org/abs/2210.17323 (GPTQ) showed that you
               | can cut down the precision quite a bit without losing a
               | lot of accuracy. It seems you can cut down further for
               | larger models. For the 13B Llama-based ones, going below
               | 5 bits per parameter is noticeably worse, but for 30B
               | models you can do 4 bits.
               | 
               | The same group did another paper
               | https://arxiv.org/abs/2301.00774 which shows that in
               | addition to reducing the precision of each parameter, you
               | can also prune out a bunch of parameters entirely. It's
               | harder to apply this optimization because models are
               | usually loaded into RAM densely, but I hope someone
               | figures out how to do it for popular models.
        
             | [deleted]
        
         | redox99 wrote:
         | People are extensively using these models (more specifically
         | the finetuned, uncensored ones) for role playing.
        
           | irzzy wrote:
           | [dead]
        
         | elorant wrote:
          | I've set up and use Vicuna-13b for text classification,
          | summarization and topic modelling. Works like a charm.
         | 
         | It's also good for math lessons.
        
           | BOOSTERHIDROGEN wrote:
            | Would like to know how you set this up. A post would be
            | awesome.
        
             | elorant wrote:
             | There are various posts online on how to set it up, either
             | for Linux or Windows. There was an older post here on how
             | to install opt-65b on a mac studio ultra, and smaller
             | models on mac pros. There was also a post if I remember
             | correctly about running vicuna-7b on an iPhone.
             | 
             | Here are a few examples:
             | 
             | https://morioh.com/p/55296932dd8b
             | 
             | https://www.youtube.com/watch?v=iQ3Lhy-eD1s
             | 
             | https://news.ycombinator.com/item?id=35430432
             | 
              | Side note: you need bonkers hardware to run it efficiently.
              | I'm currently using a 16-core CPU, 128GB RAM, a PCIe 4.0
              | NVMe and an RTX 3090. There are ways to run it on less
              | powerful hardware, like 8 cores, 64GB RAM, a simple SSD and
              | an RTX 3080 or 3070, but I happen to have a large corpus of
              | data to process so I went all in.
        
               | csdvrx wrote:
               | I think the previous comment is more interested in your
               | experience with your large data: what are you doing with
               | it?
               | 
                | I have similar hardware at home, so I wonder how reliably
                | you can process simple queries using domain knowledge +
                | logic, which work on mlc-llm, something like "if you can
                | choose the word food, or the word laptop, or the word
                | deodorant, which one do you choose for describing
                | 'macbook air'? answer precisely with just the word you
                | chose"
               | 
               | If it works, can you upload the weights somewhere? IIRC,
               | vicuna is open source.
        
               | elorant wrote:
               | There's an online demo of Vicuna-13b where you can test
               | its efficiency:
               | 
               | https://chat.lmsys.org/
        
               | techload wrote:
                | After two prompts I was astounded by the inaccuracies
                | present in the answers. And they were pretty easy
                | questions.
        
               | csdvrx wrote:
               | Yes, but can you replicate that functionality using
               | llama.cpp?
               | 
               | If so, what did you run with main?
               | 
               | I haven't been able to get an answer, while for the
               | question above, I can get _' I chose the word "laptop"'_
               | with mlc-llm
        
               | elorant wrote:
                | For the tasks I need it for, the results are similar to
                | the online model, only slower. I don't care about
                | conversational functionality.
        
               | chaxor wrote:
                | If these problems are all very similar in structure, then
                | you may not need an LLM. Simple GloVe or W2V vectors may
                | suffice with a dot product. Then you can plow through a
                | few terabytes by the time the LLM gets through a fraction
                | of that.
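                | 
                | A minimal sketch of that idea, assuming pretrained word
                | vectors are already loaded into a plain dict (the
                | load_vectors() helper and file name here are
                | hypothetical):
                | 
                |     import numpy as np
                | 
                |     # hypothetical helper: returns {word: np.ndarray}
                |     vectors = load_vectors("glove.6B.100d.txt")
                | 
                |     def embed(phrase):
                |         # average the vectors of the words we know
                |         words = [w for w in phrase.lower().split()
                |                  if w in vectors]
                |         return np.mean([vectors[w] for w in words],
                |                        axis=0)
                | 
                |     def cosine(a, b):
                |         return float(np.dot(a, b) /
                |                      (np.linalg.norm(a) *
                |                       np.linalg.norm(b)))
                | 
                |     candidates = ["food", "laptop", "deodorant"]
                |     query = embed("macbook air")
                |     best = max(candidates,
                |                key=lambda w: cosine(query, vectors[w]))
                |     print(best)  # expected to come out as "laptop"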
        
           | jstarfish wrote:
           | Maybe others' experiences are different, but I find smaller
           | models to work just as well for "reductive" tasks.
           | 
           | Dolly sucks for generating long-form content (not very
           | creative) but if I need a summary or classification, it's
           | quicker and easier to spin up dolly-3b than vicuna-13b.
           | 
           | I suspect OpenAI is routing prompts to select models based on
           | similar logic.
        
       | s_dev wrote:
       | [deleted]
        
         | capableweb wrote:
         | First link: https://github.com/ggerganov/llama.cpp
         | 
         | Which in turn has the following as the first link:
         | https://arxiv.org/abs/2302.13971
         | 
         | Is it really quicker to ask here than just browse content for a
         | bit, skimming some text or even using Google for one minute?
        
           | djbusby wrote:
           | You gave an awesome answer in 2 minutes! Might be faster than
           | reading!
        
             | capableweb wrote:
             | If you cannot click two links in a browser under two
             | minutes, I'm either sorry for you, or scared of you :)
        
           | s_dev wrote:
           | >Is it really quicker to ask here than just browse content
           | for a bit, skimming some text or even using Google for one
           | minute?
           | 
            | I don't know if it's quicker, but I trust human assessment a
            | lot more than any machine-generated explanations. You're
            | right, I could have asked ChatGPT or even Googled, but a
            | small bit of context goes a long way and I'm clearly out of
            | the loop here -- it's possible others arriving on HN might
            | appreciate such an explanation, or maybe we're better off
            | having lots of people make duplicated efforts to understand
            | what they're looking at.
        
             | capableweb wrote:
             | Well, I'm saying if you just followed the links on the
             | submitted page, you'd reach the same conclusion but faster.
        
         | rain1 wrote:
          | llama is a text prediction model similar to GPT-2 and to the
          | version of GPT-3 that has not been fine-tuned yet.
          | 
          | It is also possible to run fine-tuned versions like Vicuna with
          | this, I think. Those versions are more focused on answering
          | questions.
        
         | haunter wrote:
         | >I can't tell from the Gist alone
         | 
         | Literally the second line: "llama is a text prediction model
         | similar to GPT-2, and the version of GPT-3 that has not been
         | fine tuned yet"
        
           | rain1 wrote:
            | I'm sorry! I added this improvement based on that person's
            | question!
        
           | s_dev wrote:
           | Sorry -- I missed that. I'll delete my comments -- obviously
           | I'm just an idiot asking dumb questions that have no value to
           | anybody. I thought I read through it.
        
             | rain1 wrote:
             | not at all, your question was really good so I added the
             | answer to it to my gist to help everyone else. Sorry for
             | the confusion I created by doing that!
        
       | avereveard wrote:
       | or like download oobabooga/text-generation-webui, any
       | prequantized variant, and be done.
        
       | rahimnathwani wrote:
       | On my system, using `-ngl 22` (running 22 layers on the GPU) cuts
       | wall clock time by ~60%.
       | 
       | My system:
       | 
       | GPU: NVidia RTX 2070S (8GB VRAM)
       | 
        | CPU: AMD Ryzen 5 3600 (16GB RAM)
       | 
       | Here's the performance difference I see:
       | 
        | CPU only (./main -t 12):
        | 
        |     llama_print_timings:        load time = 15459.43 ms
        |     llama_print_timings:      sample time =    23.64 ms /    38 runs   (    0.62 ms per token)
        |     llama_print_timings: prompt eval time =  9338.10 ms /   356 tokens (   26.23 ms per token)
        |     llama_print_timings:        eval time = 31700.73 ms /    37 runs   (  856.78 ms per token)
        |     llama_print_timings:       total time = 47192.68 ms
        | 
        | GPU (./main -t 12 -ngl 22):
        | 
        |     llama_print_timings:        load time = 10285.15 ms
        |     llama_print_timings:      sample time =    21.60 ms /    35 runs   (    0.62 ms per token)
        |     llama_print_timings: prompt eval time =  3889.65 ms /   356 tokens (   10.93 ms per token)
        |     llama_print_timings:        eval time =  8126.90 ms /    34 runs   (  239.03 ms per token)
        |     llama_print_timings:       total time = 18441.22 ms
        
         | samstave wrote:
          | May you please ELI5 what is happening here...
          | 
          | Imagine I am hearing about this for the first time: what did
          | you do?
        
           | rahimnathwani wrote:
           | 0. Have a PC with an NVidia GPU, running Ubuntu, with the
           | NVidia drivers and CUDA Toolkit already set up.
           | 
           | 1. Download the weights for the model you want to use, e.g.
           | gpt4-x-vicuna-13B.ggml.q5_1.bin
           | 
           | 2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to
           | compile it with support for CUBLAS (BLAS on GPU).
           | 
           | 3. Run the resulting 'main' executable, with the -ngl option
           | set to 18, so that it tries to load 18 layers of the model
           | into the GPU's VRAM, instead of the system's RAM.
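            | 
            | (If you would rather drive the same thing from Python, the
            | llama-cpp-python bindings expose the layer offload as a
            | constructor argument. A rough sketch, assuming a build with
            | cuBLAS support and the n_gpu_layers parameter; treat the
            | exact names as assumptions rather than gospel:)
            | 
            |     # assumes llama-cpp-python installed with GPU (cuBLAS)
            |     # support; parameter names as of mid-2023 releases
            |     from llama_cpp import Llama
            | 
            |     llm = Llama(
            |         model_path="gpt4-x-vicuna-13B.ggml.q5_1.bin",
            |         n_gpu_layers=18,  # layers offloaded to VRAM
            |         n_ctx=2048,       # context window
            |     )
            | 
            |     out = llm("Q: Name the planets. A:", max_tokens=64)
            |     print(out["choices"][0]["text"])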
        
             | rain1 wrote:
             | > 1. Download the weights for the model you want to use,
             | e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin
             | 
              | I think you need to quantize the model yourself from the
              | float/huggingface versions. My understanding is that the
              | quantization formats have changed recently, and old
              | quantized models no longer work.
        
               | rahimnathwani wrote:
               | That was true until 2 days ago :)
               | 
               | The repo has now been updated with requantized models
               | that work with the latest version, so you don't need to
               | do that any more.
               | 
               | https://huggingface.co/TheBloke/gpt4-x-vicuna-13B-GGML/co
               | mmi...
        
               | rain1 wrote:
               | wonderful! thank you
        
         | guardiangod wrote:
          | I am testing it on an AWS instance and the speedup effect is
          | not as consistent as I hoped. The speedup varies between runs.
         | 
          | Intel Xeon Platinum 8259CL CPU @ 2.50GHz, 128 GB RAM, Tesla T4
          | 
          | ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin
          | 
          |     llama_print_timings:        load time =  3725.08 ms
          |     llama_print_timings:      sample time =   612.06 ms /   536 runs   (    1.14 ms per token)
          |     llama_print_timings: prompt eval time = 13876.81 ms /   259 tokens (   53.58 ms per token)
          |     llama_print_timings:        eval time = 221647.40 ms /   534 runs   (  415.07 ms per token)
          |     llama_print_timings:       total time = 239423.46 ms
          | 
          | ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin -ngl 30
          | 
          |     llama_print_timings:        load time =  7638.95 ms
          |     llama_print_timings:      sample time =   280.81 ms /   294 runs   (    0.96 ms per token)
          |     llama_print_timings: prompt eval time =  2197.82 ms /     2 tokens ( 1098.91 ms per token)
          |     llama_print_timings:        eval time = 112790.25 ms /   293 runs   (  384.95 ms per token)
          |     llama_print_timings:       total time = 120788.82 ms
        
           | rahimnathwani wrote:
           | Thanks. BTW:
           | 
           | - the model I used was gpt4-x-vicuna-13B.ggml.q5_1.bin
           | 
           | - I used 'time' to measure the wall clock time of each
           | command.
           | 
            | - My prompt was:
            | 
            |     Below is an instruction that describes a task. Write a
            |     response that appropriately completes the request.
            | 
            |     ### Instruction:
            |     Write a long blog post with 5 sections, about the pros
            |     and cons of emphasising procedural fluency over
            |     conceptual understanding, in high school math education.
            | 
            |     ### Response:
        
         | PaulWaldman wrote:
         | Any way to know the differences in power consumption?
        
           | Tuna-Fish wrote:
           | Probably significant savings.
        
         | cpill wrote:
         | Will this work with the leaked models or Alpaca?
        
         | eightysixfour wrote:
         | You will likely see a bit of a performance gain dropping your
         | threads to 6. I'm on a 3700x and get a regression when using 16
         | threads instead of the real 8 cores.
        
         | rain1 wrote:
         | That is a crazy speedup!!
        
           | GordonS wrote:
           | Is it really? Going from CPU to GPU, I would have expected a
           | much better improvement.
        
             | rahimnathwani wrote:
             | You can think of it this way: if half the model is running
             | on the GPU, and the GPU is infinitely fast, then the total
             | calculation time would go down by 50%, compared with
             | everything running on the CPU.
        
               | ethbr0 wrote:
               | Ref Amdahl's Law:
               | https://en.m.wikipedia.org/wiki/Amdahl%27s_law
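                | 
                | (Concretely: if a fraction p of the per-token work moves
                | to a part that is s times faster, the overall speedup is
                | 1 / ((1 - p) + p/s), which approaches 1 / (1 - p) as s
                | grows. With p = 0.5 the ceiling is 2x, which is the
                | "time goes down by 50%" bound in the parent comment.)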
        
             | qwertox wrote:
             | I feel the same.
             | 
              | For example, some stats from Whisper [0] (audio
              | transcription, 30 seconds) show the following for the
              | medium model (see other models in the link):
             | 
             | ---
             | 
             | GPU medium fp32 Linear 1.7s
             | 
             | CPU medium fp32 nn.Linear 60.7s
             | 
             | CPU medium qint8 (quant) nn.Linear 23.1s
             | 
             | ---
             | 
              | So the same model runs 35.7 times faster on GPU, and
              | compared to an "optimized" CPU model it is still 13.6 times
              | faster.
              | 
              | I was expecting around an order of magnitude of
              | improvement.
             | 
             | Then again, I do not know if in the case of this article
             | the entire model was in the GPU, or just a fraction of it
             | (22 layers) and the remainder on CPU, which might explain
             | the result. Apparently that's the case, but I don't know
             | much about this stuff.
             | 
             | [0] https://github.com/MiscellaneousStuff/openai-whisper-
             | cpu
        
               | rahimnathwani wrote:
                | Your last paragraph is correct. Only about half the
                | model was running on the GPU.
        
       | anshumankmr wrote:
       | How long before it runs on a 4 gig card?
        
         | rain1 wrote:
         | You can offload only 10 layers or so if you want to run on a
         | 4GB card
        
       | bitL wrote:
       | How about reloading parts of the model as the inference
       | progresses instead of splitting it into GPU/CPU parts? Reloading
       | would be memory-limited to the largest intermediate tensor cut.
        
         | moffkalast wrote:
         | The Tensor Reloaded, starring Keanu Reeves
        
         | regularfry wrote:
         | That would turn what's currently an L3 cache miss or a GPU data
         | copy into a disk I/O stall. Not that it might not be possible
         | to pipeline things to make that less of a problem, but it
         | doesn't immediately strike me as a fantastic trade-off.
        
           | bitL wrote:
           | One can keep all tensors in the RAM, just push whatever
           | needed to GPU VRAM, basically limited by PCIe speed. Or some
           | intelligent strategy with read-ahead from SSD if one's RAM is
           | limited. There are even GPUs with their own SSDs.
        
       | sroussey wrote:
        | I wish this used the WebGPU C++ library instead; then it could
        | be used on any GPU hardware.
        
       | marcopicentini wrote:
        | What do you use to host these models (like Vicuna, Dolly, etc.)
        | on your own server and expose them via an HTTP REST API? Is
        | there a Heroku-like service for LLM models?
        | 
        | I am looking for an open source model to do text summarization.
        | OpenAI is too expensive for my use case because I need to pass
        | lots of tokens.
        
         | rain1 wrote:
          | I haven't tried that, but https://github.com/abetlen/llama-cpp-
          | python and https://github.com/r2d4/openlm exist.
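          | 
          | A minimal way to put an HTTP endpoint in front of one of these
          | is to wrap the llama-cpp-python bindings in FastAPI. This is
          | only a sketch: the model path, prompt format and parameter
          | names below are assumptions, not a tested deployment.
          | 
          |     # pip install fastapi uvicorn llama-cpp-python
          |     from fastapi import FastAPI
          |     from pydantic import BaseModel
          |     from llama_cpp import Llama
          | 
          |     app = FastAPI()
          |     # assumed local path to a quantized ggml model file
          |     llm = Llama(model_path="vicuna-13b.ggml.q5_1.bin",
          |                 n_ctx=2048)
          | 
          |     class Req(BaseModel):
          |         text: str
          | 
          |     @app.post("/summarize")
          |     def summarize(req: Req):
          |         prompt = ("Summarize the following text:\n"
          |                   + req.text + "\nSummary:")
          |         out = llm(prompt, max_tokens=256)
          |         return {"summary": out["choices"][0]["text"]}
          | 
          |     # run with: uvicorn server:app (if saved as server.py)
          | 
          | (llama-cpp-python also ships an OpenAI-compatible server, if
          | memory serves, which may save you writing any of this
          | yourself.)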
        
         | speedgoose wrote:
         | These days I use FastChat: https://github.com/lm-sys/FastChat
         | 
          | It's based on Hugging Face transformers rather than llama.cpp,
          | but it can also run on CPU.
          | 
          | It works well, can be distributed, and very conveniently
          | provides the same REST API as OpenAI GPT.
        
           | itake wrote:
           | Do you know how well it performs compared to llama.cpp?
        
             | rain1 wrote:
             | my understanding is that the engine used (pytorch
             | transformers library) is still faster than llama.cpp with
             | 100% of layers running on the GPU.
        
               | itake wrote:
               | I only have an m1
        
               | rain1 wrote:
               | I don't think the integrated GPU on that supports CUDA.
               | So you will need to use CPU mode only.
        
               | itake wrote:
               | Yep, but isn't there an integrated ML chip that makes it
               | faster than cpu? Or does llama.cpp not use that?
        
               | rain1 wrote:
               | unfortunately that chip is proprietary and undocumented,
               | it's very difficult for open source programs to make use
               | of. I think there is some reverse engineering work being
               | done but it's not complete.
        
               | qeternity wrote:
               | It's the Huggingface transformers library which is
               | implemented in pytorch.
               | 
                | In terms of speed, yes, running fp16 will indeed be
                | faster with a vanilla GPU setup. However most people are
                | running 4-bit quantized versions, and the GPU
                | quantization landscape has been a mess (the
                | GPTQ-for-llama project). llama.cpp has taken a totally
                | different approach, and it looks like they are currently
                | able to match native GPU perf via cuBLAS with much less
                | effort and brittleness.
        
         | inhumantsar wrote:
         | Weights and Biases is good for building/training models and
         | Lambda Labs is a cloud provider for AI workloads. Lambda will
         | only get you up to running the model though. You would still
         | need to overlay some job management on top of that. I've heard
         | Run.AI is good on that front but I haven't tried.
        
       | peatmoss wrote:
       | From skimming, it looks like this approach requires CUDA and thus
       | is Nvidia only.
       | 
       | Anyone have a recommended guide for AMD / Intel GPUs? I gather
       | the 4 bit quantization is the special sauce for CUDA, but I'd
       | guess there'd be something comparable for not-CUDA?
        
         | rain1 wrote:
         | 4-bit quantization is to reduce the amount of VRAM required to
         | run the model. You can run it 100% on CPU if you don't have
         | CUDA. I'm not aware of any AMD equivalent yet.
        
           | amelius wrote:
           | Looks like there are several projects that implement the CUDA
           | interface for various other compute systems, e.g.:
           | 
           | https://github.com/ROCm-Developer-
           | Tools/HIPIFY/blob/master/R...
           | 
           | https://github.com/hughperkins/coriander
           | 
           | I have zero experience with these, though.
        
             | westurner wrote:
             | "Democratizing AI with PyTorch Foundation and ROCm(tm)
             | support for PyTorch" (2023)
             | https://pytorch.org/blog/democratizing-ai-with-pytorch/ :
             | 
             | > _AMD, along with key PyTorch codebase developers
             | (including those at Meta AI), delivered a set of updates to
             | the ROCm(tm) open software ecosystem that brings stable
             | support for AMD Instinct(tm) accelerators as well as many
             | Radeon(tm) GPUs. This now gives PyTorch developers the
             | ability to build their next great AI solutions leveraging
             | AMD GPU accelerators & ROCm. The support from PyTorch
             | community in identifying gaps, prioritizing key updates,
             | providing feedback for performance optimizing and
             | supporting our journey from "Beta" to "Stable" was
             | immensely helpful and we deeply appreciate the strong
             | collaboration between the two teams at AMD and PyTorch. The
             | move for ROCm support from "Beta" to "Stable" came in the
             | PyTorch 1.12 release (June 2022)_
             | 
             | > [...] _PyTorch ecosystem libraries like TorchText (Text
             | classification), TorchRec (libraries for recommender
             | systems - RecSys), TorchVision (Computer Vision),
             | TorchAudio (audio and signal processing) are fully
             | supported since ROCm 5.1 and upstreamed with PyTorch 1.12._
             | 
             | > _Key libraries provided with the ROCm software stack
             | including MIOpen (Convolution models), RCCL (ROCm
             | Collective Communications) and rocBLAS (BLAS for
             | transformers) were further optimized to offer new potential
             | efficiencies and higher performance._
             | 
             | https://news.ycombinator.com/item?id=34399633 :
             | 
             | >> _AMD ROcm supports Pytorch, TensorFlow, MlOpen, rocBLAS
             | on NVIDIA and AMD
             | GPUs:https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-
             | learni... _
        
               | westurner wrote:
               | https://github.com/intel/intel-extension-for-pytorch :
               | 
               | > _Intel(r) Extension for PyTorch extends PyTorch with
               | up-to-date features optimizations for an extra
               | performance boost on Intel hardware. Optimizations take
               | advantage of AVX-512 Vector Neural Network Instructions
               | (AVX512 VNNI) and Intel(r) Advanced Matrix Extensions
               | (Intel(r) AMX) on Intel CPUs as well as Intel Xe Matrix
               | Extensions (XMX) AI engines on Intel discrete GPUs.
               | Moreover, through PyTorch xpu device, Intel(r) Extension
               | for PyTorch provides easy GPU acceleration for Intel
               | discrete GPUs with PyTorch_
               | 
               | https://pytorch.org/blog/celebrate-pytorch-2.0/ (2023) :
               | 
               | > _As part of the PyTorch 2.0 compilation stack,
               | TorchInductor CPU backend optimization brings notable
               | performance improvements via graph compilation over the
               | PyTorch eager mode._
               | 
               | > _The TorchInductor CPU backend is sped up by leveraging
               | the technologies from the Intel(r) Extension for PyTorch
               | for Conv /GEMM ops with post-op fusion and weight
               | prepacking, and PyTorch ATen CPU kernels for memory-bound
               | ops with explicit vectorization on top of OpenMP-based
               | thread parallelization_
               | 
               | DLRS Deep Learning Reference Stack:
               | https://intel.github.io/stacks/dlrs/index.html
        
             | rain1 wrote:
             | exciting! maybe we will see that land in llama.cpp
             | eventually, who knows!
        
               | juliangoldsmith wrote:
               | llama.cpp has CLBlast support now, though I haven't used
               | it yet.
        
         | [deleted]
        
       | hhh wrote:
       | Instructions are a bit rough. The Micromamba thing doesn't work,
       | doesn't say how to install it... you have to clone llama.cpp too
        
         | rain1 wrote:
         | Apologies for that. I've added some extra micromamba setup
         | commands that I should have included before!
         | 
         | I've also added the git clone command, thank you for the
         | feedback
        
           | hhh wrote:
           | Appreciate it! This is much better!
        
       | ranger_danger wrote:
        | Why can't these models run on the GPU while also using CPU RAM
        | for storage? That way people with performant-but-memory-starved
        | GPUs could still utilize the better performance of GPU
        | computation while also having enough RAM to store the model. I
        | know it is possible to provide system-RAM-backed GPU objects.
        
       | syntaxing wrote:
        | This update is pretty exciting; I'm gonna try running a large
        | model (65B) with a 3090. I have run a ton of local LLMs, but the
        | hardest part is finding out the prompt structure. I wish there
        | were some sort of centralized database that explains it.
        
         | guardiangod wrote:
         | I got the alpaca 65B GGML model to run on my 64GB ram laptop.
         | No GPU required if you can tolerate the 1 token per 3 seconds
         | rate.
        
           | syntaxing wrote:
           | Supposedly the new update with GPU offloading will bring that
           | up to 10 tokens per second! 1 token per second is painfully
           | slow, that's about 30s for a sentence.
        
         | rain1 wrote:
         | Tell us how it goes! Try different numbers of layers if needed.
         | 
         | A good place to dig for prompt structures may be the 'text-
         | generation-webui' commit log. For example
         | https://github.com/oobabooga/text-generation-webui/commit/33...
        
       | tarr11 wrote:
       | What is the state of the art on evaluating the accuracy of these
       | models? Is there some equivalent to an "end to end test"?
       | 
       | It feels somewhat recursive since the input and output are
       | natural language and so you would need another LLM to evaluate
       | whether the model answered a prompt correctly.
        
         | tikkun wrote:
         | https://chat.lmsys.org/?arena (Click 'leaderboard')
        
         | klysm wrote:
         | It's going to be very difficult to come up with any rigorous
         | structure for automatically assessing the outputs of these
         | models. They're built using effectively human grading of the
         | answers
        
           | RockyMcNuts wrote:
            | Hmm, if we have the reinforcement learning part of
            | reinforcement learning with human feedback, isn't that a
            | model that takes a question/answer pair and rates the
            | quality of the answer? It's sort of grading itself, like a
            | training loss, but it still tells us something.
        
           | sroussey wrote:
            | llama.cpp and others use perplexity:
           | 
           | https://huggingface.co/docs/transformers/perplexity
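            | 
            | (Perplexity is just the exponential of the average negative
            | log-likelihood the model assigns to held-out text; lower is
            | better. A toy sketch of the calculation, assuming you already
            | have the probability the model gave each actual next token:)
            | 
            |     import math
            | 
            |     def perplexity(token_probs):
            |         # token_probs: model probability of each true token
            |         nll = [-math.log(p) for p in token_probs]
            |         return math.exp(sum(nll) / len(nll))
            | 
            |     print(perplexity([0.2, 0.5, 0.1, 0.4]))  # ~3.98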
        
       | ACV001 wrote:
        | The future is this: these models will be able to run on smaller
        | and smaller hardware, eventually running on your phone, watch or
        | embedded devices. The revolution is here and is inevitable,
        | similar to how computers evolved. We are lucky that these models
        | have no consciousness, still. Once they gain consciousness, that
        | will mark the appearance of a new species (superior to us if
        | anything). Also, luckily, they have no physical bodies and
        | cannot replicate, so far...
        
         | canadianfella wrote:
         | [dead]
        
       | qwertox wrote:
        | If I really want to do some playing around in this area, would
        | it be better to get an RTX 4000 SFF, which has 20 GB of VRAM and
        | is a low-power card (which I want, as it would be running 24/7
        | and energy prices are pretty bad in Germany), or would it make
        | more sense to buy an Apple product with an M2 chip, which is
        | apparently good for these tasks as it shares CPU and GPU memory?
        
       | holoduke wrote:
        | Why don't AMD or Intel release a moderately performant GPU with
        | a minimum of 128GB of memory at a good consumer price? These
        | models require lots of memory to 'single' pass an operation;
        | throughput could be a bit slower. An Nvidia 1080 with 256GB of
        | memory would run all these models fast, right? Or am I
        | forgetting something here?
        
         | hackernudes wrote:
         | I don't think there was a market for it before LLMs. Still
         | might not be (especially if they don't want to cannibalize data
         | center products). Also, they might have hardware constraints. I
         | wouldn't be that surprised if we see some high ram consumer
         | GPUs in the future, though.
         | 
         | It won't work out unless it becomes common to run LLMs locally.
         | Kind of a chicken-and-egg problem so I hope they try it!
        
           | the8472 wrote:
           | > I don't think there was a market for it before LLMs.
           | 
           | At $work CGI assets sometimes grow pretty big and throwing
           | more VRAM at the problem would be easier than optimizing the
           | scenes in the middle of the workflow. They _can_ be
           | optimized, but that often makes it less ergonomic to work
           | with them.
           | 
           | Perhaps asset-streaming (nanite&co) will make this less of an
           | issue, but that's also fairly new.
           | 
           | Do LLM implementations already stream the weights layer by
           | layer or in whichever order they're doing the evaluation or
           | is PCIe bandwidth too limited for that?
        
             | tpetry wrote:
             | But you are not the home user target audience. They want to
             | sell you the more expensive workstation or server models.
        
               | the8472 wrote:
               | Even an A6000 tops out at 48GB while you can attach
               | terabytes of RAM to server-class CPUs.
        
             | elabajaba wrote:
             | AMD had the Radeon pro SSG that let you attach 1TB of pcie3
             | nvme SSDs directly to the GPU, but no one bought them and
             | afaik they were basically unobtainable on the consumer
             | market.
             | 
             | Also asset streaming has been a thing for like 20 years now
             | in gaming, it's not really a new thing. Nanite's big thing
             | is that it gets you perfect LODs without having to pre-
             | create them and manually tweak them (eg. how far away does
             | the LOD transition happen, what's the lowest LOD before it
             | disappears, etc)
        
               | the8472 wrote:
               | Loading assets JIT for the next frame from NVMe hasn't
               | been a thing for 20 years though. Different kinds of
               | latency floors.
               | 
               | What I was asking is whether LLM inference can be
               | structured in such a way that only a fraction of the
               | weight is needed at a time and then the next ones can be
               | loaded JIT as the processing pipeline advances.
        
           | [deleted]
        
         | layer8 wrote:
         | Releasing a new model takes time, and it's unclear how large
         | the consumer market would actually be. Maybe they're working on
         | it right now.
        
         | Kye wrote:
          | GDDR probably hasn't seen the same cost reductions that volume
          | DDR has.
        
           | TaylorAlexander wrote:
           | One question I have is: can they use cheaper kinds of RAM and
           | still be perfectly usable for large ML models? They could put
           | 4GB of GDDR and 128GB of cheap RAM maybe? I do realize as
           | others are saying, this would be a new kind of card so they
           | will need time to develop it. But would this work?
        
             | andromeduck wrote:
             | Not without a redesigned memory controller or one off chip.
             | You'd probably just want the host's memory to be directly
             | accessible over PCIE or something faster like NVLINK. Such
             | solutions already exist just not in the consumer space.
        
         | duxup wrote:
         | >for a good consumer price
         | 
         | Was there a consumer market for them until recently?
        
         | 0xcde4c3db wrote:
         | Probably because if they take that exact same GPU+VRAM
         | configuration and slap it on a rackmount-optimized board, they
         | can charge AI companies 5-10x the price for it.
        
           | jsheard wrote:
           | They don't even offer that much VRAM on cards aimed at those
            | price-insensitive customers; Nvidia's current lineup maxes out
           | at 48GB for GDDR-based models or 80GB for HBM-based models.
           | Even if money is no object there's still practical
           | engineering limits on how much memory they can put on a card
           | without sacrificing bandwidth.
        
           | vegabook wrote:
           | this is where the new third player, Intel, can (if it can
           | tear itself away from identical behaviour in the
           | consumer/server CPU market) hopefully break the duopoly. Love
           | to see a 32 or 64GB card from Intel. Their software stack on
           | Linux is competent enough (unlike the dumpster fire that is
           | AMD's ROCm).
        
         | andromeduck wrote:
         | Because then memory would be 90% of the BOM.
        
         | dragonwriter wrote:
         | > Why does AMD or Intel not release a medium performant GPU
         | with minimum 128gb of memory for a good consumer price.
         | 
         | They do. Well, not "medium performant", but for VRAM-bound
         | tasks they'd still be an improvement over CPUs if you could use
         | them -- iGPUs use main memory.
         | 
          | What they don't have is support in popular GPGPU frameworks
          | (though there was a third-party CUDA-for-Intel-iGPUs project a
          | while ago.)
        
         | elabajaba wrote:
         | Because they can't do that for a "good consumer price".
         | 
         | If you want more than ~48GB, you're looking at HBM which is
         | extremely expensive (HBM chips are very expensive,
         | packaging+interposer is extremely expensive, designing and
         | producing a new GPU is expensive).
         | 
         | Normal GPUs are limited by both their bus width (wider bus =
         | more pins = harder to design, more expensive to produce, and
         | increases power consumption), and GDDR6(x) (which maxes out at
          | 2GB/chip currently), so on a 384-bit bus (4090/7900xtx; don't
          | expect anyone to make 512-bit busses anymore) you need 12x2GB
         | (GDDR6 uses 32 pins per package) which gives you 24GB. You can
         | double the memory capacity to 48GB, but that requires putting
         | the chips on the back of the GPU which leads to a bunch of
         | cooling issues (and GDDR6 is expensive).
         | 
         | Of course, even if they did all that they're selling expensive
         | GPUs to a small niche market and cannibalizing sales of their
         | own high end products (and even if AMD somehow managed to magic
         | up a 128GB gpu for $700 people still wouldn't buy it because so
         | much of the ML software is CUDA only).
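
       A back-of-the-envelope check of the capacity arithmetic above, as a
       minimal Python sketch; the figures (32 data pins per GDDR6 package,
       2GB per chip, clamshell doubling via chips on the back of the board)
       are the ones quoted in the comment:

           def max_gddr6_capacity_gb(bus_width_bits, gb_per_chip=2,
                                     pins_per_package=32, clamshell=False):
               # Each GDDR6 package occupies 32 bits of the memory bus, so
               # the bus width fixes how many packages the GPU can address.
               packages = bus_width_bits // pins_per_package
               # Clamshell mode hangs a second chip off each 32-bit channel
               # (the chips that end up on the back of the board).
               if clamshell:
                   packages *= 2
               return packages * gb_per_chip

           print(max_gddr6_capacity_gb(384))                  # 24 GB (4090/7900xtx)
           print(max_gddr6_capacity_gb(384, clamshell=True))  # 48 GB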
        
           | eurekin wrote:
           | 3090 has a lot of vram chips on the back though
        
             | elabajaba wrote:
             | And because of it there were issues with the vram
             | overheating in memory intensive workloads, and on some GPUs
             | the vram even separated off the board.
             | 
             | https://www.igorslab.de/en/looming-pads-and-too-hot-
             | gddrx6-m...
        
         | pbhjpbhj wrote:
         | There's a type of DMA for GPUs to access NVMe on the
         | motherboard, IIRC. Perhaps that is a better solution here?
         | 
         | https://developer.nvidia.com/blog/gpudirect-storage/
        
           | boppo1 wrote:
           | Isn't pci-e latency dramatically higher than onboard vram?
        
         | fooker wrote:
         | That's exactly what the next generation of 'accelerators' will
         | be like.
         | 
         | Whether it will be co-located with a GPU for consumer hardware
         | remains to be seen.
         | 
         | The thing to determine is how essential running LLMs locally is
         | for consumers.
         | 
         | BigTech is pushing hard to make their clouds the only place to
         | run LLMs unfortunately, so unless there is a killer app that is
         | just better locally (like games were for GPUs), this might not
         | change.
        
           | boppo1 wrote:
           | > unless there is a killer app that is just better locally
           | 
           | Therapy & relationship bots, like the movie 'Her'. It's ugly,
           | but it's coming.
        
             | fooker wrote:
             | There's no technical reason it has to be run locally.
             | 
             | Massive privacy implications for sure, but people do
             | consume all sorts of adult material online.
             | 
             | Games though, no one has been able to make it work as well
             | as local so far.
        
         | kevingadd wrote:
         | The margins on VRAM are pretty bad for them since they don't
         | manufacture it themselves. And every memory module they add
         | needs additional power delivery and memory controller muscle to
         | drive, so adding that memory is going to raise the cost of the
         | card significantly. Most games and consumer workloads won't use
         | all that extra memory.
         | 
         | Keep in mind video cards don't use the same kind of RAM as
         | consumer CPUs do, they typically use GDDR or HBM.
        
         | Tuna-Fish wrote:
         | It would not be trivial to do.
         | 
         | GDDR achieves higher speeds than normal DDR mainly by
         | specifying much tighter tolerances on the electrical interface,
         | and using a wider interface to the memory chips. This means that
         | using commodity GDDR (which is the only fast DRAM that will be
         | reasonably cheap), you have fairly strict limitations on the
         | maximum amount of RAM you can use with the same GPUs that are
         | manufactured for consumer use. (Typically, at most 4x
         | difference between the lowest-end reasonable configuration and
         | the highest-end one, 2x from higher density modules and 2x from
         | using clamshell memory configuration, although often you only
         | have one type of module for a new memory interface generation.)
         | 
         | If the product requires either a new memory or GPU die
         | configuration, its cost will be very high.
         | 
         | The only type of memory that can support very different VRAM
         | sizes for an efficiently utilized bus of the same size is HBM,
         | and so far that is limited to the very high end.
        
           | magicalhippo wrote:
           | Anandtech has an article on the GDDR6X variant[1] that NVIDIA
           | has in their 3000-cards, where they use a more complex
           | encoding to transmit two bits per clock edge.
           | 
           | I hadn't realized just how insane the bandwidth on the
           | higher-end cards is, the 3090 being just shy of 1 TB/s,
           | yes, one terabyte per second...
           | 
           | For comparison a couple of DDR5 sticks[2] will just get you
           | north of 70GB/s...
           | 
           | [1]: https://www.anandtech.com/show/15978/micron-spills-on-
           | gddr6x...
           | 
           | [2]: https://www.anandtech.com/show/17269/ddr5-demystified-
           | feat-s...
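
       The gap can be sanity-checked with the peak-bandwidth formula
       (per-pin data rate times bus width); the DDR5-4800 dual-channel
       figure below is an assumed speed grade, not from the comment:

           def peak_bandwidth_gb_s(gbps_per_pin, bus_width_bits):
               # peak bandwidth = per-pin data rate * bus width, in bytes/s
               return gbps_per_pin * bus_width_bits / 8

           # RTX 3090: 19.5 Gbps GDDR6X on a 384-bit bus
           print(peak_bandwidth_gb_s(19.5, 384))  # 936.0 GB/s -- just shy of 1 TB/s

           # Dual-channel DDR5-4800: 4.8 GT/s over 2 x 64-bit channels
           print(peak_bandwidth_gb_s(4.8, 128))   # 76.8 GB/s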
        
           | q7xvh97o2pDhNrh wrote:
           | Do you happen to know where Apple's integrated approach falls
           | on this spectrum?
           | 
           | I was actually wondering about this the other day. A fully
           | maxed out Mac Studio is about $6K, and it comes with a
           | "64-core GPU" and "128GB integrated memory" (whatever any of
           | that means). Would that be enough to run a decent Llama?
        
             | cudder wrote:
             | The Mac's "integrated memory" means it's shared between the
             | CPU and GPU. So the GPU can address all of that and you can
             | load giant (by current consumer GPU standards) models. I
             | have no idea how it actually performs though.
        
       | dclowd9901 wrote:
       | Has anyone tried running encryption algorithms through these
       | models? I wonder if it could be trained to decrypt.
        
         | Hendrikto wrote:
         | That would be very surprising, given that any widely used
         | cryptographic encryption algorithm has been EXTENSIVELY
         | cryptanalyzed.
         | 
         | ML models are essentially trained to recognize patterns.
         | Encryption algorithms are explicitly designed to resist that
         | kind of analysis. LLMs are not magic.
        
           | dclowd9901 wrote:
           | All of what you said is true, for us. I know LLMs aren't
           | magic (lord knows I actually kind of understand the
           | principles of how they operate), but they have a much greater
           | computational and relational bandwidth than we've ever had
           | access to before. So I'm curious if that can break down what
           | otherwise appears to be complete obfuscation. Otherwise,
           | we're saying that encryption is somehow magic in a way that
           | LLMs cannot possibly be.
        
             | NegativeK wrote:
             | > Otherwise, we're saying that encryption is somehow magic
             | in a way that LLMs cannot possibly be.
             | 
             | I don't see why that's an unreasonable claim. I mean,
             | encryption isn't magic, but it is a drastically different
             | process.
        
       | dinobones wrote:
       | What is HN's fascination with these toy models that produce low
       | quality, completely unusable output?
       | 
       | Is there a use case for them I'm missing?
       | 
       | Additionally, don't they all have fairly restrictive licenses?
        
         | az226 wrote:
         | [flagged]
        
         | Zetobal wrote:
         | Maybe you forgot what the H in HN stands for... playful
         | curiosity.
        
         | tbalsam wrote:
         | I never thought I'd see the day when a 13B model was casually
         | referred to in a comments section as a "toy model".
        
           | andrewmcwatters wrote:
           | Start using it for tasks and you'll find limitations very
           | quickly. Even ChatGPT excels at some tasks and fails
           | miserably at others.
        
             | tbalsam wrote:
             | Oh, I've been using language models since before a lot (or
             | at least some significant chunk) of HN knew the word LLM, I
             | think.
             | 
             | I remember when going from 6B to 13B was crazy good. We've
             | just normalized our standards to the latest models in the
             | era.
             | 
             | They do have their shortcomings but can be quite useful as
             | well, especially the LLama class ones. They're definitely
             | not GPT-4 or Claude+, for sure, for sure.
        
           | az226 wrote:
           | Compared to GPT2 it's on par. Compared to GPT3, 3.5, or 4,
           | it's a toy. GPT2 is 4 years old, and in terms of LLMs, that's
           | several lifetimes ago. In 5-10 years, GPT3 will be viewed as a
           | toy. Note: "progress" is unlikely to be as fast going forward
           | as it has been.
        
             | tbalsam wrote:
             | GPT-2's largest model was 1.5B params. LLaMA-65B was similar
             | to the largest GPT3 in benchmark performance, but that model
             | was expensive in the API; a number of people would use the
             | cheaper one(s) instead, IIRC.
             | 
             | So this is similar to a mid tier GPT3 class model.
             | 
             | Basically, there's not much reason to Pooh-Pooh it. It may
             | not perform quite as well, but I find it to be useful for
             | the things it's useful for.
        
       | mozillas wrote:
       | I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air
       | (8GB RAM) with llama.cpp.
       | 
       | Worked OK for me with the default context size; 2048, like you
       | see in most examples, was too slow for my taste.
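
       For anyone wanting to replicate a setup like this from Python, a
       minimal sketch using the llama-cpp-python bindings (an assumption;
       the comment drove llama.cpp directly). The model filename is the one
       named above, and n_ctx=512 matches llama.cpp's default context size:

           from llama_cpp import Llama  # pip install llama-cpp-python

           # Default-sized context (512) rather than the 2048 most examples
           # use, which the commenter found too slow on an 8GB machine.
           llm = Llama(model_path="ggml-vic7b-q4_0.bin", n_ctx=512)

           out = llm("Q: What is the capital of France? A:",
                     max_tokens=32, stop=["Q:", "\n"])
           print(out["choices"][0]["text"])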
        
         | koheripbal wrote:
         | Given the current price (mostly free) of public LLMs, I'm not
         | sure what the use case for running one at home is yet.
         | 
         | OpenAI's paid GPT4 has few restrictions and is still cheap.
         | 
         | ... Not to mention GPT4 with the browsing feature is vastly
         | superior to any of the models you can run at home.
        
           | toxik wrote:
           | The point for me personally is the same as why I find it so
           | powerful to self host SMTP, IMAP, HTTP. It's in my hands, I
           | know where it all begins and ends. I answer to no one.
           | 
           | For LLMs this means I am allowed their full potential. I can
           | generate smut, filth, illegal content of any kind for any
           | reason. It's for me to decide. It's empowering, it's the
           | hacker mindset.
        
           | sagarm wrote:
           | I think it's mostly useful if you want to do your own fine
           | tuning, or the data you are working with can't be sent to a
           | third party for contractual, legal, or paranoid reasons.
        
             | sroussey wrote:
             | I'm working on an app to index your life, and having it
             | local is a huge plus for the people I have using it.
        
           | 2devnull wrote:
           | Many would-be users can't send their data to OpenAI.
           | Think HIPAA and other laws restricting data sharing.
           | Federation or distribution of the models for local training
           | is the other solution to that problem.
        
       ___________________________________________________________________
       (page generated 2023-05-14 23:00 UTC)