[HN Gopher] Run LLMs at home, BitTorrent-style
       ___________________________________________________________________
        
       Run LLMs at home, BitTorrent-style
        
       Author : udev4096
       Score  : 161 points
       Date   : 2023-09-17 16:30 UTC (6 hours ago)
        
 (HTM) web link (petals.dev)
 (TXT) w3m dump (petals.dev)
        
       | nico wrote:
        | This is so cool. Hopefully this will give thousands or millions
        | more developers in the space access to these models.
        
       | behnamoh wrote:
        | Looking at the list of contributors, way more people need to
        | donate their GPU time for the betterment of all. Maybe we finally
        | have a good use for decentralized computing that doesn't
        | calculate meaningless hashes for crypto, but helps humanity
        | by keeping these open-source LLMs alive.
        
         | Obscurity4340 wrote:
          | This way, nobody can copyright-cancel the LLM the way OpenAI or
          | whatever could, either.
        
         | judge2020 wrote:
          | It can cost a lot to run a GPU, especially at full load. The
          | 4090 stock pulls 500 watts of power under full load[0], which
          | is 12 kWh/day or about 4,380 kWh a year, or roughly $440-$480 a
          | year assuming $0.10-$0.11/kWh for average residential rates.
          | The only variable is whether or not training requires the same
          | power draw as hitting it with furmark.
         | 
         | 0: https://youtu.be/j9vC9NBL8zo?t=983
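          | 
          | As a quick sanity check in Python (a back-of-the-envelope
          | sketch; the 500 W continuous draw and $0.10-$0.11/kWh rates
          | are the assumptions above, not measured figures):
          | 
          |     # rough yearly cost of a GPU at an assumed constant load
          |     watts = 500                              # assumed full-load draw
          |     kwh_per_year = watts / 1000 * 24 * 365   # = 4380 kWh
          |     for rate in (0.10, 0.11):                # assumed $/kWh
          |         print(f"${kwh_per_year * rate:.0f} per year at ${rate}/kWh")
          |     # prints $438 per year at $0.1/kWh and $482 at $0.11/kWh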
        
         | tossl568 wrote:
         | Those "meaningless hashes" help secure hundreds of billions in
         | savings of Bitcoin for hundreds of millions of people. Check
         | your financial privilege.
        
           | cheema33 wrote:
           | > Those "meaningless hashes" help secure hundreds of billions
           | in savings of Bitcoin for hundreds of millions of people.
           | 
           | Can you back that up with actual data? Other than something
           | that a crypto bro on the Internet told you?
        
             | december456 wrote:
              | That's not the best counterargument, because Bitcoin has
              | privacy qualities by default. You can hop on to any block
              | explorer and accept every address as another user, but you
              | can't verify that (without expensive analysis, on a case-by-
              | case basis) those are not owned by the same guy. Same with
              | Tor: while some data like bridge usage is being collected
              | somehow (I haven't looked into it), you can't reliably prove
              | that thousands/millions are using it to protect their
              | privacy and resist censorship.
        
           | [deleted]
        
       | swyx wrote:
       | so given that GGML can serve like 100 tok/s on an M2 Max, and
       | this thing advertises 6 tok/s distributed, is this basically for
       | people with lower end devices?
        
         | [deleted]
        
         | version_five wrote:
          | It's talking about 70B and 160B models. Can ggml run those that
          | fast even heavily quantized? (I'm guessing possibly.) So maybe
          | this is for people who don't have a high-end computer? I have a
          | decent Linux laptop a couple of years old and there's no way I
          | could run those models that fast. I get a few tokens per second
          | on a quantized 7B model.
        
           | brucethemoose2 wrote:
           | Yeah. My 3090 gets like ~5 tokens/s on 70B Q3KL.
           | 
            | This is a good idea, as splitting up LLMs is actually pretty
            | efficient with pipelined requests.
        
         | russellbeattie wrote:
         | > _...lower end devices_
         | 
         | So, pretty much every other consumer PC available? Those
         | losers.
        
       | jmorgan wrote:
        | This is neat. Model weights are split into their layers and
        | distributed across several machines, which then report themselves
        | in a big hash table when they are ready to perform inference or
        | fine-tuning "as a team" over their subset of the layers.
       | 
       | It's early but I've been working on hosting model weights in a
       | Docker registry for https://github.com/jmorganca/ollama. Mainly
       | for the content addressability (Ollama will verify the correct
       | weights are downloaded every time) and ultimately weights can be
       | fetched by their content instead of by their name or url (which
       | may change!). Perhaps a good next step might be to split the
       | models by layers and store each layer independently for use cases
       | like this (or even just for downloading + running larger models
       | over several "local" machines).
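        | 
        | For context, the client side of Petals looks roughly like the
        | sketch below (based on my reading of the project's README; the
        | model name is just an illustrative example):
        | 
        |     from transformers import AutoTokenizer
        |     from petals import AutoDistributedModelForCausalLM
        | 
        |     # weights are not downloaded in full; most layers are served
        |     # by remote peers in the swarm
        |     model_name = "meta-llama/Llama-2-70b-chat-hf"
        |     tokenizer = AutoTokenizer.from_pretrained(model_name)
        |     model = AutoDistributedModelForCausalLM.from_pretrained(
        |         model_name)
        | 
        |     inputs = tokenizer("A cat sat on", return_tensors="pt").input_ids
        |     outputs = model.generate(inputs, max_new_tokens=8)
        |     print(tokenizer.decode(outputs[0]))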
        
       | brucethemoose2 wrote:
       | > and fine-tune them for your tasks
       | 
       | This is the part that raised my eyebrows.
       | 
        | Finetuning 70B is not just hard, it's literally impossible
        | without renting a very expensive cloud instance or buying a PC
        | the price of a house, no matter how long you are willing to
        | wait. I would absolutely contribute to a "llama training horde".
        
         | AaronFriel wrote:
          | That's true for conventional fine-tuning, but is it the case
          | for parameter-efficient fine-tuning and qLoRA? My understanding
          | is that for an N-billion-parameter model, fine-tuning can occur
          | on a GPU with slightly less than N gigabytes of VRAM.
         | 
         | For that 70B parameter model: an A100?
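          | 
          | Roughly, a QLoRA-style setup with the transformers/peft/
          | bitsandbytes stack looks like the sketch below (model name and
          | hyperparameters are placeholders; whether a single A100 is
          | enough still depends on sequence length and batch size):
          | 
          |     import torch
          |     from transformers import (AutoModelForCausalLM,
          |                               BitsAndBytesConfig)
          |     from peft import (LoraConfig, get_peft_model,
          |                       prepare_model_for_kbit_training)
          | 
          |     model_id = "meta-llama/Llama-2-70b-hf"  # placeholder
          | 
          |     # load the frozen base model in 4-bit (~0.5 byte/param)
          |     bnb = BitsAndBytesConfig(
          |         load_in_4bit=True,
          |         bnb_4bit_quant_type="nf4",
          |         bnb_4bit_compute_dtype=torch.bfloat16,
          |     )
          |     model = AutoModelForCausalLM.from_pretrained(
          |         model_id, quantization_config=bnb, device_map="auto")
          |     model = prepare_model_for_kbit_training(model)
          | 
          |     # train only small low-rank adapters; the 4-bit base stays frozen
          |     lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
          |                       target_modules=["q_proj", "v_proj"],
          |                       task_type="CAUSAL_LM")
          |     model = get_peft_model(model, lora)
          |     model.print_trainable_parameters()  # usually <1% of params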
        
           | brucethemoose2 wrote:
           | 2x 48GB GPUs would be the cheapest. But that's still a very
           | expensive system.
        
           | zacmps wrote:
            | I think you'd need 2 80GB A100s for unquantised.
        
         | akomtu wrote:
          | What prevents parallel LLM training? If you read book 1 first
          | and then book 2, the resulting update in your knowledge will be
          | the same as if you read the books in the reverse order. It
          | seems reasonable to assume that if the LLM is trained on each
          | book independently, the two deltas in the LLM weights can just
          | be added up.
        
           | contravariant wrote:
           | In ordinary gradient descent the order does matter, since the
           | position changes in between. I think stochastic gradient
           | descent does sum a couple of gradients together sometimes,
           | but I'm not sure what the trade-offs are and if LLMs do so as
           | well.
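            | 
            | A toy example of why the order matters when the weights move
            | in between (a sketch; the quadratic "loss" is made up purely
            | for illustration):
            | 
            |     import torch
            | 
            |     # toy per-book loss ||w - b||^2; its gradient is 2*(w - b)
            |     def grad(w, b):
            |         return 2 * (w - b)
            | 
            |     book1 = torch.tensor([1.0, 0.0])
            |     book2 = torch.tensor([0.0, 2.0])
            |     w0, lr = torch.zeros(2), 0.1
            | 
            |     # "add the deltas up": both gradients taken at the same w0
            |     w_summed = w0 - lr * (grad(w0, book1) + grad(w0, book2))
            | 
            |     # sequential SGD: second gradient sees the updated weights
            |     w1 = w0 - lr * grad(w0, book1)
            |     w_sequential = w1 - lr * grad(w1, book2)
            | 
            |     print(w_summed)      # tensor([0.2000, 0.4000])
            |     print(w_sequential)  # tensor([0.1600, 0.4000])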
        
           | ctoth wrote:
            | This is not at all intuitive to me. It doesn't make sense
            | from a human perspective, as each book changes you. Consider
            | the trivial case of a series, where nothing will make sense
            | if you haven't read the prior books (not that I think they
            | feed it the book corpus in order; maybe they should!), but
            | even in a more philosophical sort of way, each book changes
            | you, and the person who reads Harry Potter first and The
            | Iliad second will have a different experience of each. Then,
            | with large language models, we have the concept of grokking
            | something. If grokking happens in the middle of book 1, it is
            | a different model which is reading book 2, and of course the
            | inverse applies.
        
           | whimsicalism wrote:
           | By the "delta in the LLM weights", I am assuming you mean the
           | gradients. You are effectively describing large batch
           | training (data parallelism) which is part of the way you can
           | scale up but there are quickly diminishing returns to large
           | batch sizes.
        
           | eachro wrote:
           | I'm not sure this is true. For instance, consider reading
           | textbooks for linear algebra and functional analysis out of
           | order. You might still grok the functional analysis if you
           | read it first but you'd be better served by reading the
           | linear algebra one first.
        
         | malwrar wrote:
         | Impossible? It's just a bunch of math, you don't need to keep
         | the entire network in memory the whole time.
        
         | Zetobal wrote:
          | An H100 is maybe a car, but nowhere near a house...
        
           | ioedward wrote:
           | 8 H100s would have enough VRAM to finetune a 70B model.
        
           | nextaccountic wrote:
           | Is a single H100 enough?
        
           | KomoD wrote:
           | Maybe not in your area, but it's very doable in other places,
           | like where I live.
        
       | teaearlgraycold wrote:
       | Would love to share my 3080 Ti, but after running the commands in
       | the getting started guide (https://github.com/bigscience-
       | workshop/petals/wiki/Run-Petal...) it looks like there's a
        | dependency versioning issue:
        | 
        |     ImportError: cannot import name 'get_full_repo_name' from
        |     'huggingface_hub' (~/.local/lib/python3.8/site-
        |     packages/huggingface_hub/__init__.py)
        
       | esafak wrote:
       | The first question I had was "what are the economics?" From the
       | FAQ:
       | 
       |  _Will Petals incentives be based on crypto, blockchain, etc.?_
        | No, we are working on a centralized incentive system similar to
        | the AI Horde kudos, even though Petals is a fully decentralized
        | system in all other aspects. We do not plan to provide a service
        | to exchange these points for money, so you should see these
        | incentives as "game" points designed to be spent inside our
        | system.
        | 
        | Petals is an ML-focused project designed for ML researchers and
        | engineers, it does not have anything to do with finance. We
        | decided to make the incentive system centralized because it is
        | much easier to develop and maintain, so we can focus on
        | developing features useful for ML researchers.
       | 
       | https://github.com/bigscience-workshop/petals/wiki/FAQ:-Freq...
        
         | kordlessagain wrote:
         | The logical conclusion is that they (the models) will
         | eventually be linked to crypto payments though. This is where
         | Lightning becomes important...
         | 
          | Edit: To clarify, I'm not suggesting linking these Petals
          | "tokens" to any payment system. I'm saying that, in general,
          | calls to clusters of machine learning models, decentralized or
          | not, will likely use crypto payments because they give you
          | auth and a means of payment.
         | 
          | I do think Petals is a good implementation of using
          | decentralized compute for model use and will likely be valuable
          | long term.
        
           | vorpalhex wrote:
           | I mean, I can sell you Eve or Runescape currency but we don't
           | need any crypto to execute on it. "Gold sellers" existed well
           | before crypto.
        
         | Szpadel wrote:
          | If that part could be replaced with any third-party server, it
          | would be the tracker in the BitTorrent analogy.
        
         | sn0wf1re wrote:
          | Similarly, there have been distributed render farms for graphic
          | design for a long time. There are no incentives other than that
          | more points means your jobs are prioritized.
         | 
         | https://www.sheepit-renderfarm.com/home
        
         | brucethemoose2 wrote:
         | > similar to the AI Horde kudos
         | 
         | What they are referencing, which is super cool and (IMO)
         | criminally underused:
         | 
         | https://lite.koboldai.net/
         | 
         | https://tinybots.net/artbot
         | 
         | https://aihorde.net/
         | 
         | In fact, I can host a 13B-70B finetune in the afternoon if
         | anyone on HN wants to test a particular one out:
         | 
         | https://huggingface.co/models?sort=modified&search=70B+gguf
        
           | swyx wrote:
           | > GGUF is a new format introduced by the llama.cpp team on
           | August 21st 2023. It is a replacement for GGML, which is no
           | longer supported by llama.cpp. GGUF offers numerous
           | advantages over GGML, such as better tokenisation, and
            | support for special tokens. It also supports metadata, and
            | is designed to be extensible.
           | 
           | is there a more canonical blogpost or link to learn more
           | about the technical decisions here?
        
             | brucethemoose2 wrote:
             | https://github.com/philpax/ggml/blob/gguf-
             | spec/docs/gguf.md#...
             | 
             | It is (IMO) a necessary and good change.
             | 
              | I just specified GGUF because my 3090 cannot host a 70B
              | model without offloading, outside of exLlama's very new
              | ~2-bit quantization. And a pre-quantized GGUF is a much
              | smaller download than raw fp16 for conversion.
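              | 
              | Partial offload with llama-cpp-python looks roughly like
              | this (a sketch; the filename is illustrative and the
              | n_gpu_layers value for a 70B on a 24 GB card is a guess
              | you have to tune):
              | 
              |     from llama_cpp import Llama
              | 
              |     llm = Llama(
              |         model_path="llama-2-70b.Q3_K_L.gguf",  # GGUF file
              |         n_gpu_layers=45,  # offload what fits in 24 GB VRAM
              |         n_ctx=4096,
              |     )
              |     out = llm("Q: What is Petals? A:", max_tokens=64)
              |     print(out["choices"][0]["text"])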
        
         | nextaccountic wrote:
         | Can they actually prevent people from trading petals for money
         | though?
        
         | beardog wrote:
         | >What's the motivation for people to host model layers in the
         | public swarm?
         | 
         | >People who run inference and fine-tuning themselves get a
         | certain speedup if they host a part of the model locally. Some
         | may be also motivated to "give back" to the community helping
         | them to run the model (similarly to how BitTorrent users help
         | others by sharing data they have already downloaded).
         | 
         | >Since it may be not enough for everyone, we are also working
         | on introducing explicit incentives ("bloom points") for people
         | donating their GPU time to the public swarm. Once this system
         | is ready, we will display the top contributors on our website.
         | People who earned these points will be able to spend them on
         | inference/fine-tuning with higher priority or increased
         | security guarantees, or (maybe) exchange them for other
         | rewards.
         | 
          | It does seem like they want a sort of centralized token,
          | however.
        
         | [deleted]
        
         | seydor wrote:
        | It's a shame that every decentralized project needs to be
        | compared to cryptocoins now.
        
       ___________________________________________________________________
       (page generated 2023-09-17 23:00 UTC)