[HN Gopher] Llama.cpp 30B runs with only 6GB of RAM now
       ___________________________________________________________________
        
       Llama.cpp 30B runs with only 6GB of RAM now
        
       Author : msoad
       Score  : 329 points
       Date   : 2023-03-31 20:37 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | TaylorAlexander wrote:
       | Great to see this advancing! I'm curious if anyone knows what the
       | best repo is for running this stuff on an Nvidia GPU with 16GB
       | vram. I ran the official repo with the leaked weights and the
       | best I could run was the 7B parameter model. I'm curious if
       | people have found ways to fit the larger models on such a system.
        
         | terafo wrote:
          | I'd _assume_ that the 33B model should fit with this (the only
          | repo that I know of that implements SparseGPT and GPTQ for
          | LLaMA); I personally haven't tried, though. But you can try
          | your luck: https://github.com/lachlansneff/sparsellama
        
         | enlyth wrote:
         | https://github.com/oobabooga/text-generation-webui
        
       | w1nk wrote:
       | Does anyone know how/why this change decreases memory consumption
       | (and isn't a bug in the inference code)?
       | 
       | From my understanding of the issue, mmap'ing the file is showing
       | that inference is only accessing a fraction of the weight data.
       | 
       | Doesn't the forward pass necessitate accessing all the weights
       | and not a fraction of them?
        
         | matsemann wrote:
         | Maybe lots of the data is embedding values or tokenizer stuff,
         | where a single prompt uses a fraction of those values. And then
         | the rest of the model is quite small.
        
           | w1nk wrote:
           | That shouldn't be the case. 30B is a number that directly
           | represents the size of the model, not the size of the other
           | components.
        
       | detrites wrote:
       | The pace of collaborative OSS development on these projects is
       | amazing, but the _rate_ of optimisations being achieved is almost
       | unbelievable. What has everyone been doing wrong all these years
       | _cough_ sorry, I mean to say weeks?
       | 
       | Ok I answered my own question.
        
         | datadeft wrote:
         | I have predicted that LLaMA will be available on mobile phones
         | before the end of this year. We are very close.
        
           | terafo wrote:
            | You mean in a contained app? It can already run on a phone.
            | GPU acceleration would be nice at this point, though.
        
           | rickrollin wrote:
            | People have actually run it on phones.
        
         | politician wrote:
         | Roughly: OpenAIs don't employ enough jarts.
         | 
         | In other words, the groups of folks working on training models
         | don't necessarily have access to the sort of optimization
         | engineers that are working in other areas.
         | 
         | When all of this leaked into the open, it caused a lot of
         | people knowledgeable in different areas to put their own
         | expertise to the task. Some of those efforts (mmap) pay off
         | spectacularly. Expect industry to copy the best of these
         | improvements.
        
           | bee_rider wrote:
           | The professional optimizes well enough to get management off
           | their back, the hobbyist can be irrationally good.
        
           | hedgehog wrote:
           | They have very good people but those people have other
           | priorities.
        
         | kmeisthax wrote:
         | >What has everyone been doing wrong all these years
         | 
         | So it's important to note that all of these improvements are
         | the kinds of things that are cheap to run on a pretrained
         | model. And all of the developments involving large language
         | models recently have been the product of hundreds of thousands
         | of dollars in rented compute time. Once you start putting six
         | digits on a pile of model weights, that becomes a capital cost
         | that the business either needs to recuperate or turn into a
         | competitive advantage. So everyone who scales up to this point
         | doesn't release model weights.
         | 
         | The model in question - LLaMA - isn't even a public model. It
         | leaked and people copied[0] it. But because such a large model
         | leaked, now people can actually work on iterative improvements
         | again.
         | 
         | Unfortunately we don't really have a way for the FOSS community
         | to pool together that much money to buy compute from cloud
         | providers. Contributions-in-kind through distributed computing
         | (e.g. a "GPT@home" project) would require significant changes
         | to training methodology[1]. Further compounding this, the
         | state-of-the-art is actually kind of a trade secret now. Exact
         | training code isn't always available, and OpenAI has even gone
         | so far as to refuse to say anything about GPT-4's architecture
         | or training set to prevent open replication.
         | 
         | [0] I'm avoiding the use of the verb "stole" here, not just
         | because I support filesharing, but because copyright law likely
         | does not protect AI model weights alone.
         | 
         | [1] AI training has very high minimum requirements to get in
         | the door. If your GPU has 12GB of VRAM and your model and
         | gradients require 13GB, you can't train the model. CPUs don't
         | have this limitation but they are ridiculously inefficient for
         | any training task. There are techniques like ZeRO to give
         | pagefile-like state partitioning to GPU training, but that
         | requires additional engineering.
        
           | seydor wrote:
           | > we don't really have a way for the FOSS community to pool
           | together that much money
           | 
           | There must be open source projects with enough money to pool
           | into such a project. I wonder whether wikimedia or apache are
           | considering anything.
        
           | terafo wrote:
           | _AI training has very high minimum requirements to get in the
           | door. If your GPU has 12GB of VRAM and your model and
            | gradients require 13GB, you can't train the model. CPUs
           | don't have this limitation but they are ridiculously
           | inefficient for any training task. There are techniques like
           | ZeRO to give pagefile-like state partitioning to GPU
           | training, but that requires additional engineering._
           | 
            | You can't if you have one 12GB GPU. You can if you have a
            | couple dozen. And then petals-style training is possible.
           | It is all very very new and there are many unsolved hurdles,
           | but I think it can be done.
        
             | webnrrd2k wrote:
             | Maybe a good candidate for the SETI@home treatment?
        
               | terafo wrote:
                | It is a good candidate. The tech is a good 6-18 months
                | away, though.
        
             | dplavery92 wrote:
              | Sure, but when one 12GB GPU costs ~$800 new (e.g. for the
              | 3080 LHR), "a couple dozen" of them is a big barrier to
             | entry to the hobbyist, student, or freelancer. And cloud
             | computing offers an alternative route, but, as stated,
             | distribution introduces a new engineering task, and the
             | month-to-month bills for the compute nodes you are using
             | can still add up surprisingly quickly.
        
               | terafo wrote:
                | We are talking groups, not individuals. I think it is
                | quite possible for a couple hundred people to cooperate
                | and train something at least as big as LLaMA 7B in a week
                | or two.
        
         | xienze wrote:
         | > but the rate of optimisations being achieved is almost
         | unbelievable. What has everyone been doing wrong all these
         | years cough sorry, I mean to say weeks?
         | 
         | It's several things:
         | 
         | * Cutting-edge code, not overly concerned with optimization
         | 
         | * Code written by scientists, who aren't known for being the
         | world's greatest programmers
         | 
         | * The obsession the research world has with using Python
         | 
         | Not surprising that there's a lot of low-hanging fruit that can
         | be optimized.
        
           | Miraste wrote:
           | Why does Python get so much flak for inefficiencies? It's
           | really not that slow, and in ML the speed-sensitive parts are
           | libraries in lower level languages anyway. Half of the
           | optimization from this very post is in Python.
        
       | wkat4242 wrote:
       | Wow I continue being amazed by the progress being made on
       | language models in the scope of weeks. I didn't expect
        | optimisations to move this quickly. Only a few weeks ago we were
        | amazed by ChatGPT, knowing it would never be something to run at
        | home, since it would require $100,000 in hardware (8x A100
        | cards).
        
       | kossTKR wrote:
       | Does this mean that we can also run the 60B model on a 16GB ram
       | computer now?
       | 
       | I have the M2 air and can't wait until further optimisation with
       | the Neural Engine / multicore gpu + shared ram etc.
       | 
       | I find it absolutely mind boggling that GPT-3.5(4?) level quality
       | may be within reach locally on my $1500 laptop / $800 m2 mini.
        
         | thomastjeffery wrote:
          | I doubt it: text size and text _pattern_ size don't scale
          | linearly.
        
           | kossTKR wrote:
           | Interesting, i wonder what the scaling function is.
        
       | abujazar wrote:
       | I love how LLMs have got the attention of proper programmers such
       | that the Python mess is getting cleaned up.
        
       | jart wrote:
       | Author here. For additional context, please read
       | https://github.com/ggerganov/llama.cpp/discussions/638#discu...
       | The loading time performance has been a huge win for usability,
       | and folks have been having the most wonderful reactions after
       | using this change. But we don't have a compelling enough theory
       | yet to explain the RAM usage miracle. So please don't get too
       | excited just yet! Yes things are getting more awesome, but like
       | all things in science a small amount of healthy skepticism is
       | warranted.
        
         | conradev wrote:
         | > But we don't have a compelling enough theory yet to explain
         | the RAM usage miracle.
         | 
         | My guess would be that the model is faulted into memory lazily
         | page by page (4K or 16K chunks) as the model is used, so only
         | the actual parts that are needed are loaded.
         | 
         | The kernel also removes old pages from the page cache to make
         | room for new ones, and especially so if the computer is using a
         | lot of its RAM. As with all performance things, this approach
         | trades off inference speed for memory usage, but likely faster
         | overall because you don't have to read the entire thing from
         | disk at the start. Each input will take a different path
         | through the model, and will require loading more of it.
         | 
         | The cool part is that this memory architecture should work just
         | fine with hardware acceleration, too, as long as the computer
         | has unified memory (anything with an integrated GPU). This
         | approach likely won't be possible with dedicated GPUs/VRAM.
         | 
         | This approach _does_ still work to run a dense model with
         | limited memory, but the time/memory savings would just be less.
         | The GPU doesn't multiply every matrix in the file literally
         | simultaneously, so the page cache doesn't need to contain the
         | entire model at once.
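          | 
          | A minimal sketch of the mechanism being described, assuming a
          | hypothetical file name and omitting error handling (this is not
          | the actual llama.cpp loading code):
          | 
          |     #include <fcntl.h>
          |     #include <sys/mman.h>
          |     #include <sys/stat.h>
          |     #include <unistd.h>
          | 
          |     int main() {
          |         // Map the weights read-only; nothing is read from disk yet.
          |         int fd = open("ggml-model-q4_0.bin", O_RDONLY);
          |         struct stat st;
          |         fstat(fd, &st);
          |         void *weights =
          |             mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
          |         // The kernel faults pages in only when the evaluation code
          |         // actually touches them, and can evict them again under
          |         // memory pressure, so resident RAM stays well below the
          |         // size of the file.
          |         munmap(weights, st.st_size);
          |         close(fd);
          |         return 0;
          |     }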
        
           | jart wrote:
           | I don't think it's actually trading away inference speed. You
           | can pass an --mlock flag, which calls mlock() on the entire
           | 20GB model (you need root to do it), then htop still reports
           | only like 4GB of RAM is in use. My change helps inference go
           | faster. For instance, I've been getting inference speeds of
           | 30ms per token after my recent change on the 7B model, and I
           | normally get 200ms per eval on the 30B model.
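            | 
            | Roughly, --mlock just adds one call on top of the mapping
            | (a sketch reusing the mapping from the snippet above, not the
            | actual code):
            | 
            |     // Pin every page of the mapping in RAM so it can never be
            |     // evicted (large mappings need CAP_IPC_LOCK, hence root).
            |     if (mlock(weights, st.st_size) != 0) {
            |         perror("mlock");  // fall back to ordinary demand paging
            |     }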
        
             | conradev wrote:
             | Very cool! Are you testing after a reboot / with an empty
             | page cache?
        
               | jart wrote:
               | Pretty much. I do my work on a headless workstation that
               | I SSH into, so it's not like competing with Chrome tabs
               | or anything like that. But I do it mostly because that's
               | what I've always done. The point of my change is you
               | won't have to be like me anymore. Many of the devs who
                | contacted me after using my change have been saying stuff
               | like, "yes! I can actually run LLaMA without having to
               | close all my apps!" and they're so happy.
        
             | Miraste wrote:
             | This is incredible, great work. Have you tried it with the
             | 65B model? Previously I didn't have a machine that could
             | run it. I'd love to know the numbers on that one.
        
           | liuliu wrote:
            | Only recent versions of Metal (macOS 13 / iOS 16) support
            | mmap and can use the mapped memory directly on the GPU. CUDA
            | does have a unified memory mode even for dedicated GPUs; it
            | would be interesting to try that out. It would probably slow
            | things down quite a bit, but it's still interesting to have
            | that possibility.
        
         | zone411 wrote:
         | It really shouldn't act as a sparse model. I would bet on
         | something being off.
        
         | world2vec wrote:
         | >I'm glad you're happy with the fact that LLaMA 30B (a 20gb
         | file) can be evaluated with only 4gb of memory usage!
         | 
          | Isn't LLaMA 30B a set of 4 files (60.59 GB)?
         | 
         | -edit- nvm, It's quantized. My bad
        
         | smaddox wrote:
         | Based on that discussion, it definitely sounds like some sort
         | of bug is hiding. Perhaps run some evaluations to compare
         | perplexity to the standard implementation?
        
         | nynx wrote:
         | Why is it behaving sparsely? There are only dense operations,
         | right?
        
           | w1nk wrote:
           | I also have this question, yes it should be. The forward pass
           | should require accessing all the weights AFAIK.
        
             | [deleted]
        
         | thomastjeffery wrote:
         | How diverse is the training corpus?
        
           | dchest wrote:
           | https://arxiv.org/abs/2302.13971
        
         | eternalban wrote:
          | Great work. Is the new file format described anywhere? Skimming
          | the issue comments, I have a vague sense that read-only data
          | was colocated somewhere for zero-copy mmap, or is there more to
          | it?
        
         | sillysaurusx wrote:
         | Hey, I saw your thoughtful comment before you deleted it. I
         | just wanted to apologize -- I had no idea this was a de facto
         | Show HN, and certainly didn't mean to make it about something
         | other than this project.
         | 
         | The only reason I posted it is because Facebook had been
         | DMCAing a few repos, and I wanted to reassure everyone that
         | they can hack freely without worry. That's all.
         | 
         | I'm really sorry if I overshadowed your moment on HN, and I
         | feel terrible about that. I'll try to read the room a little
         | better before posting from now on.
         | 
         | Please have a wonderful weekend, and thanks so much for your
         | hard work on LLaMA!
         | 
         | EDIT: The mods have mercifully downweighted my comment, which
         | is a relief. Thank you for speaking up about that, and sorry
         | again.
         | 
         | If you'd like to discuss any of the topics you originally
         | posted about, you had some great points.
        
         | d3nj4l wrote:
         | Maybe off topic, but I just wanted to say that you're an
         | inspiration!
        
         | htrp wrote:
         | Just shows how inefficient some of the ML research code can be
        
           | robrenaud wrote:
           | Training tends to require a lot more precision and hence
           | memory than inference. I bet many of the tricks here won't
           | work well for training.
        
         | sr-latch wrote:
         | Have you tried running it against a quantized model on
         | HuggingFace with identical inputs and deterministic sampling to
         | check if the outputs you're getting are identical? I think that
         | should confirm/eliminate any concern of the model being
         | evaluated incorrectly.
        
         | intelVISA wrote:
          | Didn't expect to see two titans today: ggerganov AND jart. Can
          | y'all slow down, you make us mortals look bad :')
         | 
         | Seeing such clever use of mmap makes me dread to imagine how
         | much Python spaghetti probably tanks OpenAI's and other "big
         | ML" shops' infra when they should've trusted in zero copy
         | solutions.
         | 
         | Perhaps SWE is dead after all, but LLMs didn't kill it...
        
       | brucethemoose2 wrote:
       | Does that also mean 6GB VRAM?
       | 
       | And does that include Alpaca models like this?
       | https://huggingface.co/elinas/alpaca-30b-lora-int4
        
         | terafo wrote:
          | No (llama.cpp is CPU-only) and no (you need to requantize the
          | model).
        
         | sp332 wrote:
         | According to
         | https://mobile.twitter.com/JustineTunney/status/164190201019...
         | you can probably use the conversion tools from the repo on
         | Alpaca and get the same result.
         | 
         | If you want to run larger Alpaca models on a low VRAM GPU, try
          | FlexGen. I think https://github.com/oobabooga/text-generation-webui/
          | is one of the easier ways to get that going.
        
           | brucethemoose2 wrote:
           | Yeah, or deepspeed presumably. Maybe torch.compile too.
           | 
            | I dunno why I thought llama._cpp_ would support GPUs.
            | _shrug_
        
       | lukev wrote:
       | Has anyone done any comprehensive analysis on exactly how much
       | quantization affects the quality of model output? I haven't seen
       | any more than people running it and being impressed (or not) by a
       | few sample outputs.
       | 
       | I would be very curious about some contrastive benchmarks between
       | a quantized and non-quantized version of the same model.
        
         | corvec wrote:
         | Define "comprehensive?"
         | 
         | There are some benchmarks here:
         | https://www.reddit.com/r/LocalLLaMA/comments/1248183/i_am_cu...
          | and here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...
         | 
         | Check out the original paper on quantization, which has some
         | benchmarks: https://arxiv.org/pdf/2210.17323.pdf and this
         | paper, which also has benchmarks and explains how they
         | determined that 4-bit quantization is optimal compared to
         | 3-bit: https://arxiv.org/pdf/2212.09720.pdf
         | 
         | I also think the discussion of that second paper here is
         | interesting, though it doesn't have its own benchmarks:
         | https://github.com/oobabooga/text-generation-webui/issues/17...
        
         | mlgoatherder wrote:
         | I've done some experiments here with Llama 13B, in my
         | subjective experience the original fp16 model is significantly
         | better (particularly on coding tasks). There are a bunch of
          | synthetic benchmarks such as wikitext2 PPL and all the whiz bang
         | quantization schemes seem to score well but subjectively
         | something is missing.
         | 
         | I've been able to compare 4 bit GPTQ, naive int8, LLM.int8,
         | fp16, and fp32. LLM.int8 does impressively well but inference
         | is 4-5x slower than native fp16.
         | 
          | Oddly, I recently ran a fork of the model on the ONNX runtime,
          | and I'm convinced that the model performed better than
          | pytorch/transformers; perhaps subtle differences in floating
          | point behavior etc. between kernels on different hardware
          | significantly influence performance.
         | 
         | The most promising next step in the quantization space IMO has
         | to be fp8, there's a lot of hardware vendors adding support,
         | and there's a lot of reasons to believe fp8 will outperform
         | most current quantization schemes [1][2]. Particularly when
         | combined with quantization aware training / fine tuning (I
         | think OpenAI did something similar for GPT3.5 "turbo").
         | 
         | If anybody is interested I'm currently working on an open
         | source fp8 emulation library for pytorch, hoping to build
         | something equivalent to bitsandbytes. If you are interested in
         | collaborating my email is in my profile.
         | 
          | 1. https://arxiv.org/abs/2208.09225
          | 
          | 2. https://arxiv.org/abs/2209.05433
        
         | bakkoting wrote:
         | Some results here:
         | https://github.com/ggerganov/llama.cpp/discussions/406
         | 
         | tl;dr quantizing the 13B model gives up about 30% of the
         | improvement you get from moving from 7B to 13B - so quantized
         | 13B is still much better than unquantized 7B. Similar results
         | for the larger models.
        
           | terafo wrote:
            | I wonder where such a difference between llama.cpp and the
            | repo in [1] comes from. The f16 difference in perplexity is
            | 0.3 on the 7B model, which is not insignificant. ggml quirks
            | definitely need to be fixed.
           | 
           | [1] https://github.com/qwopqwop200/GPTQ-for-LLaMa
        
             | bakkoting wrote:
             | I'd guess the GPTQ-for-LLaMa repo is using a larger context
             | size. Poking around it looks like GPTQ-for-llama is
             | specifying 2048 [1] vs the default 512 for llama.cpp [2].
             | You can just specify a longer size on the CLI for llama.cpp
             | if you are OK with the extra memory.
             | 
              | [1] https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/934034c8e...
             | 
              | [2] https://github.com/ggerganov/llama.cpp/tree/3525899277d2e2bd...
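              | 
              | If I remember the flag spelling right, something like this
              | (hypothetical model path and prompt):
              | 
              |     ./main -m models/30B/ggml-model-q4_0.bin --ctx_size 2048 -p "..."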
        
             | gliptic wrote:
             | GPTQ-for-LLaMa recently implemented some quantization
             | tricks suggested by the GPTQ authors that improved 7B
             | especially. Maybe llama.cpp hasn't been evaluated with
             | those in place?
        
         | terafo wrote:
         | For this specific implementation here's info from llama.cpp
         | repo:
         | 
         |  _Perplexity - model options
         | 
         | 5.5985 - 13B, q4_0
         | 
         | 5.9565 - 7B, f16
         | 
         | 6.3001 - 7B, q4_1
         | 
         | 6.5949 - 7B, q4_0
         | 
         | 6.5995 - 7B, q4_0, --memory_f16_
         | 
          | According to this repo[1], the difference is about 3% in their
          | implementation with the right group size. If you'd like to know
          | more, I think you should read the GPTQ paper[2].
         | 
         | [1] https://github.com/qwopqwop200/GPTQ-for-LLaMa
         | 
         | [2] https://arxiv.org/abs/2210.17323
        
       | bsaul wrote:
        | How is LLaMA's performance relative to ChatGPT? Is it as good as
        | GPT-3 or even 4?
        
         | terafo wrote:
          | It is as good as GPT-3 at most sizes. An instruct layer needs
          | to be put on top in order for it to compete with GPT-3.5 (which
          | powers ChatGPT). That can be done with a comparatively small
          | amount of compute (a couple hundred bucks' worth for small
          | models; I'd assume low thousands for 65B).
        
       | arka2147483647 wrote:
        | What is LLaMA? What can it do?
        
         | terafo wrote:
          | Read the readme in the repo.
        
       | UncleOxidant wrote:
       | What's the difference between llama.cpp and alpaca.cpp?
        
         | cubefox wrote:
         | I assume the former is just the foundation model (which only
         | predicts text) while the latter is instruction tuned.
        
       | [deleted]
        
       | [deleted]
        
       | danShumway wrote:
       | I messed around with 7B and 13B and they gave interesting
       | results, although not quite consistent enough results for me to
       | figure out what to do with them. I'm curious to try out the 30B
       | model.
       | 
       | Start time was also a huge issue with building anything usable,
       | so I'm glad to see that being worked on. There's potential here,
       | but I'm still waiting on more direct API/calling access. Context
       | size is also a little bit of a problem. I think categorization is
       | a potentially great use, but without additional alignment
       | training and with the context size fairly low, I had trouble
       | figuring out where I could make use of tagging/summarizing.
       | 
       | So in general, as it stands I had a lot of trouble figuring out
       | what I could personally build with this that would be genuinely
       | useful to run locally and where it wouldn't be preferable to
       | build a separate tool that didn't use AI at all. But I'm very
       | excited to see it continue to get optimized; I think locally
       | running models are very important right now.
        
       | cubefox wrote:
       | I don't understand. I thought each parameter was 16 bit (two
       | bytes) which would predict minimally 60GB of RAM for a 30 billion
       | parameter model. Not 6GB.
        
         | gamegoblin wrote:
         | Parameters have been quantized down to 4 bits per parameter,
         | and not all parameters are needed at the same time.
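          | 
          | Rough back-of-the-envelope math, taking "30B" at face value
          | (the real parameter count and format overhead differ a bit):
          | 
          |     #include <cstdio>
          | 
          |     int main() {
          |         const double params = 30e9;
          |         // fp16: 2 bytes per parameter -> the ~60GB estimate above
          |         std::printf("fp16 : %.0f GB\n", params * 2.0 / 1e9);
          |         // 4-bit: half a byte per parameter -> ~15GB of raw weights
          |         std::printf("4-bit: %.0f GB\n", params * 0.5 / 1e9);
          |         // Per-block scales and metadata push the on-disk file toward
          |         // the ~20GB people report, and mmap means only the pages a
          |         // given prompt touches count toward resident memory.
          |         return 0;
          |     }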
        
         | heap_perms wrote:
         | I was thinking something similar. Turns out that you don't need
         | all the weights for any given prompt.
         | 
         | > LLaMA 30B appears to be a sparse model. While there's 20GB of
         | weights, depending on your prompt I suppose only a small
         | portion of that needs to be used at evaluation time [...]
         | 
         | Found the answer from the author of this amazing pull request:
         | https://github.com/ggerganov/llama.cpp/discussions/638#discu...
        
       | qwertox wrote:
       | Is the 30B model clearly better than the 7B?
       | 
       | I played with Pi3141/alpaca-lora-7B-ggml two days ago and it was
        | super disappointing. On a scale where 0% = alpaca-lora-7B-ggml
        | and 100% = GPT-3.5, where would LLaMA 30B be positioned?
        
         | Rzor wrote:
         | I haven't been able to run it myself yet, but according to what
         | I read so far from people who did, the 30B model is where the
         | "magic" starts to happen.
        
       | singularity2001 wrote:
       | Does that only happen with the quantized model or also with the
       | float16 / float32 model? Is there any reason to use float models
       | at all?
        
       | ducktective wrote:
       | I wonder if Georgi or jart use GPT in their programming and
        | design. I guess the training data was lacking for the sort of
        | stuff they do, given their field of work, especially jart's.
        
         | jart wrote:
         | Not yet. GPT-4 helped answer some questions I had about the
         | WIN32 API but that's the most use I've gotten out of it so far.
         | I'd love for it to be able to help me more, and GPT-4 is
         | absolutely 10x better than GPT 3.5. But it's just not strong
         | enough at the kinds of coding I do that it can give me
         | something that I won't want to change completely. They should
         | just train a ChatJustine on my code.
        
       | Dwedit wrote:
       | > 6GB of RAM
       | 
       | > Someone mentioning "32-bit systems"
       | 
        | Um no, you're not mapping 6GB of RAM on a 32-bit system. The
       | address space simply doesn't exist.
        
         | jiggawatts wrote:
         | Windows Server could use up to 64 GB for a 32-bit operating
         | system. Individual processes couldn't map more than 4 GB, but
         | the total could be larger:
         | https://en.wikipedia.org/wiki/Physical_Address_Extension
        
       | sillysaurusx wrote:
       | On the legal front, I've been working with counsel to draft a
       | counterclaim to Meta's DMCA against llama-dl. (GPT-4 is
       | surprisingly capable, but I'm talking to a few attorneys:
       | https://twitter.com/theshawwn/status/1641841064800600070?s=6...)
       | 
       | An anonymous HN user named L pledged $200k for llama-dl's legal
       | defense:
       | https://twitter.com/theshawwn/status/1641804013791215619?s=6...
       | 
       | This may not seem like much vs Meta, but it's enough to get the
       | issue into the court system where it can be settled. The tweet
       | chain has the details.
       | 
       | The takeaway for you is that you'll soon be able to use LLaMA
       | without worrying that Facebook will knock you offline for it. (I
       | wouldn't push your luck by trying to use it for commercial
       | purposes though.)
       | 
       | Past discussion: https://news.ycombinator.com/item?id=35288415
       | 
       | I'd also like to take this opportunity to thank all of the
       | researchers at MetaAI for their tremendous work. It's because of
       | them that we have access to such a wonderful model in the first
       | place. They have no say over the legal side of things. One day
       | we'll all come together again, and this will just be a small
       | speedbump in the rear view mirror.
       | 
       | EDIT: Please do me a favor and skip ahead to this comment:
       | https://news.ycombinator.com/item?id=35393615
       | 
       | It's from jart, the author of the PR the submission points to. I
       | really had no idea that this was a de facto Show HN, and it's
       | terribly rude to post my comment in that context. I only meant to
       | reassure everyone that they can freely hack on llama, not make a
       | huge splash and detract from their moment on HN. (I feel awful
       | about that; it's wonderful to be featured on HN, and no one
       | should have to share their spotlight when it's a Show HN.
       | Apologies.)
        
         | terafo wrote:
         | Wish you all luck in the world. We need much more clarity in
         | legal status of these models.
        
           | sillysaurusx wrote:
           | Thanks! HN is pretty magical. I think they saw
           | https://news.ycombinator.com/item?id=35288534 and decided to
           | fund it.
           | 
           | I'm grateful for the opportunity to help protect open source
           | projects such as this one. It will at least give Huggingface
           | a basis to resist DMCAs in the short term.
        
             | [deleted]
        
         | [deleted]
        
         | sheeshkebab wrote:
          | All models trained on public data need to be made public. As it
          | is, their outputs are not copyrightable; it's not a stretch to
          | say the models are public domain.
        
           | sillysaurusx wrote:
           | I'm honestly not sure. RLHF seems particularly tricky --- if
           | someone is shaping a model by hand, it seems reasonable to
           | extend copyright protection to them.
           | 
           | For the moment, I'm just happy to disarm corporations from
           | using DMCAs against open source projects. The long term
           | implications will be interesting.
        
           | xoa wrote:
           | You seem to be mixing a few different things together here.
           | There's a huge leap from something not being copyrightable to
           | saying there is grounds for it to be _made_ public. No
           | copyright would greatly limit the ability of model makers to
           | legally restrict distribution if they made it to the public,
            | but they'd be fully within their rights to keep them as
           | trade secrets to the best of their ability. Trade secret law
           | and practice is its own thing separate from copyright, lots
           | of places have private data that isn't copyrightable (pure
           | facts) but that's not the same as it being made public.
           | Indeed part of the historic idea of certain areas of IP like
           | patents was to encourage more stuff to be made public vs kept
           | secret.
           | 
           | > _As it is their outputs are not copyrightable, it's not a
           | stretch to say models are public domain._
           | 
           | With all respect this is kind of nonsensical. "Public domain"
            | only applies to stuff that is copyrightable; if they simply
            | aren't, then it just never enters into the picture. And it not
           | being patentable or copyrightable doesn't mean there is any
           | requirement to share it. If it does get out though then
           | that's mostly their own problem is all (though depending on
           | jurisdiction and contract whoever did the leaking might get
           | in trouble), and anyone else is free to figure it out on
           | their own and share that and they can't do anything.
        
             | sheeshkebab wrote:
             | Public domain applies to uncopyrightable works, among other
             | things (including previously copyrighted works). In this
             | case models are uncopyrightable, and I think FB (or any of
              | these newfangled AI cos) would have an interesting time
              | proving otherwise, if they ever try.
             | 
             | https://en.m.wikipedia.org/wiki/Public_domain
        
         | electricmonk wrote:
         | _IANYL - This is not legal advice._
         | 
         | As you may be aware, a counter-notice that meets the statutory
         | requirements will result in reinstatement unless Meta sues over
         | it. So the question isn't so much whether your counter-notice
         | covers all the potential defenses as whether Meta is willing to
         | sue.
         | 
         | The primary hurdle you're going to face is your argument that
         | weights are not creative works, and not copyrightable. That
         | argument is unlikely to succeed for the the following reasons
         | (just off the top of my head): (i) The act of selecting
         | training data is more akin to an encyclopedia than the white
         | pages example you used on Twitter, and encyclopedias are
         | copyrightable as to the arrangement and specific descriptions
         | of facts, even though the underlying facts are not; and (ii)
         | LLaMA, GPT-N, Bard, etc, all have different weights, different
         | numbers of parameters, different amounts of training data, and
         | different tuning, which puts paid to the idea that there is
         | only one way to express the underlying ideas, or that all of it
         | is necessarily controlled by the specific math involved.
         | 
         | In addition, Meta has the financial wherewithal to crush you
         | even were you legally on sound footing.
         | 
         | The upshot of all of this is that you may win for now if Meta
         | doesn't want to file a rush lawsuit, but in the long run, you
         | likely lose.
        
         | sva_ wrote:
         | Thank you for putting your ass on the line and deciding to
         | challenge $megacorp on their claims of owning the copyright on
         | NN weights that have been trained on public (and probably, to
         | some degree, also copyrighted) data. This seems to very much be
         | uncharted territory in the legal space, so there are a lot of
         | unknowns.
         | 
         | I don't consider it ethical to compress the corpus of human
         | knowledge into some NN weights and then closing those weights
         | behind proprietary doors, and I hope that legislators will see
         | this similarly.
         | 
         | My only worry is that they'll get you on some technicality,
         | like that (some version of) your program used their servers
         | afaik.
        
         | cubefox wrote:
         | Even if using LLaMA turns out to be legal, I very much doubt it
         | is ethical. The model got leaked while it was only intended for
         | research purposes. Meta engineered and paid for the training of
         | this model. It's theirs.
        
           | Uupis wrote:
           | I feel like most-everything about these models gets really
           | ethically-grey -- at worst -- very quickly.
        
           | willcipriano wrote:
           | What did they train it on?
        
             | cubefox wrote:
             | On partly copyrighted text. Same as you and me.
        
           | faeriechangling wrote:
           | Did Meta ask permission from every user they trained their
            | model on? Did all those users consent (and by consent I mean
            | a meeting of minds, not something buried on page 89 of a
            | EULA) to Meta building an AI with their data?
           | 
           | Turnabout is fair play. I don't feel the least bit sorry for
           | Meta.
        
             | terafo wrote:
              | LLaMA wasn't trained on data of Meta users, though.
        
             | cubefox wrote:
             | But it doesn't copy any text one to one. The largest one
             | was trained on 1.4 trillion tokens, if I recall correctly,
             | but the model size is just 65 billion parameters. (I
             | believe they use 16 bit per token and parameter.) It seems
             | to be more like a human who has read large parts of the
             | internet, but doesn't remember anything word by word.
             | Learning from reading stuff was never considered a
             | copyright violation.
        
               | Avicebron wrote:
               | > It seems to be more like a human who has read large
               | parts of the internet, but doesn't remember anything word
               | by word. Learning from reading stuff was never considered
               | a copyright violation.
               | 
               | This is one of the most common talking points I see
               | brought up, especially when defending things like ai
               | "learning" from the style of artists and then being able
               | to replicate that style. On the surface we can say, oh
               | it's similar to a human learning from an art style and
               | replicating it. But that implies that the program is
               | functioning like a human mind (as far as I know the jury
               | is still out on that and I doubt we know exactly how a
               | human mind actually "learns" (I'm not a neuroscientist)).
               | 
                | Let's say for the sake of experiment I ask you to cut out
                | every word of Pride and Prejudice, and keep them all
                | sorted. Then, when asked to write a story in the style of
                | Jane Austen, you pull from that pile of snipped-out words
                | and arrange them in a pattern that most resembles her
                | writing. Did you transform it? Sure, maybe; if a human
                | did that I bet they could even copyright it. But a
                | machine took those words and phrases and applied an
                | algorithm to generate output; even with stochastic
                | elements, the direct backwards traceability (albeit
                | through a 65B convolution) means that the essence of the
                | copyrighted materials has been directly translated.
               | 
               | From what I can see we can't prove the human mind is
               | strictly deterministic. But an ai very well might be in
               | many senses. So the transference of non-deterministic
               | material (the original) through a deterministic transform
               | has to root back to the non-deterministic model (the
               | human mind and therefore the original copyright holder).
        
             | shepardrtc wrote:
             | They don't ask permission when they're stealing users'
             | data, so why should users ask permission for stealing their
             | data?
             | 
              | https://www.usatoday.com/story/tech/2022/09/22/facebook-meta...
        
           | seydor wrote:
            | It's an index of the web and our own comments, barely
            | something they can claim ownership of, and especially not to
            | resell.
           | 
           | But OTOH, by preventing commercial use, they have sparked the
           | creation of an open source ecosystem where people are
           | building on top of it because it's fun, not because they want
           | to build a moat to fill it with sweet VC $$$money.
           | 
           | It's great to see that ecosystem being built around it, and
           | soon someone will train a fully open source model to replace
           | Llama
        
           | dodslaser wrote:
            | Meta as a company has shown pretty blatantly that they don't
            | really care about ethics, nor the law for that matter.
        
         | [deleted]
        
       | victor96 wrote:
       | Less memory than most Electron apps!
        
         | terafo wrote:
         | With all my dislike to Electron, I struggle to remember even
         | one Electron app that managed to use 6 gigs.
        
           | baobabKoodaa wrote:
           | I assume it was a joke
        
           | mrtksn wrote:
            | I've seen WhatsApp doing it. It starts at 1.5GB anyway, so
            | after some images and stuff it inflates quite a lot.
        
       | yodsanklai wrote:
       | Total noob questions.
       | 
        | 1. How does this compare with ChatGPT3?
        | 
        | 2. Does it mean we could eventually run a system such as ChatGPT3
        | on a computer?
        | 
        | 3. Could LLMs eventually replace Google (in the sense that
        | answers could be correct 99.9% of the time), or is the tech
        | inherently flawed?
        
         | addisonl wrote:
         | Minor correction, chatGPT uses GPT-3.5 and (most recently, if
         | you pay $20/month) GPT-4. Their branding definitely needs some
          | work haha. We are on track for you to be able to run something
          | like ChatGPT locally!
        
       ___________________________________________________________________
       (page generated 2023-03-31 23:00 UTC)