[HN Gopher] Run LLMs at home, BitTorrent-style
___________________________________________________________________

Run LLMs at home, BitTorrent-style

Author : udev4096
Score  : 161 points
Date   : 2023-09-17 16:30 UTC (6 hours ago)

(HTM) web link (petals.dev)
(TXT) w3m dump (petals.dev)

| nico wrote:
| This is so cool. Hopefully this will give thousands or millions
| more developers access to the space.
| behnamoh wrote:
| Looking at the list of contributors, way more people need to
| donate their GPU time for the betterment of all. Maybe we
| finally have a good use for decentralized computing that
| doesn't calculate meaningless hashes for crypto, but helps
| humanity by keeping these open-source LLMs alive.
| Obscurity4340 wrote:
| Also, this way nobody can copyright-cancel the LLM like OpenAI
| or whatever.
| judge2020 wrote:
| It can cost a lot to run a GPU, especially at full load. A
| stock 4090 pulls 500 watts under full load[0], which is 12
| kWh/day, about 4,380 kWh a year, or over $450 a year at
| average residential rates of $0.10-$0.11/kWh. The only
| variable is whether training requires the same power draw as
| hitting it with FurMark.
|
| 0: https://youtu.be/j9vC9NBL8zo?t=983
| tossl568 wrote:
| Those "meaningless hashes" help secure hundreds of billions in
| savings of Bitcoin for hundreds of millions of people. Check
| your financial privilege.
| cheema33 wrote:
| > Those "meaningless hashes" help secure hundreds of billions
| in savings of Bitcoin for hundreds of millions of people.
|
| Can you back that up with actual data? Other than something
| that a crypto bro on the Internet told you?
| december456 wrote:
| That's not the best counterargument, because Bitcoin has
| privacy qualities by default. You can hop on to any block
| explorer and treat every address as another user, but you
| can't verify (without expensive analysis, on a case-by-case
| basis) that those addresses are not owned by the same guy.
| Same with Tor: while some data like bridge usage is collected
| somehow (I haven't looked into it), you can't reliably prove
| that thousands or millions are using it to protect their
| privacy and resist censorship.
| [deleted]
| swyx wrote:
| So given that GGML can serve ~100 tok/s on an M2 Max, and this
| thing advertises 6 tok/s distributed, is this basically for
| people with lower-end devices?
| [deleted]
| version_five wrote:
| It's talking about 70B and 160B models. Can GGML run those
| that fast even heavily quantized? (Possibly, I'm guessing.) So
| maybe this is for people who don't have a high-end computer. I
| have a decent Linux laptop a couple of years old and there's
| no way I could run those models that fast. I get a few tokens
| per second on a quantized 7B model.
| brucethemoose2 wrote:
| Yeah. My 3090 gets ~5 tokens/s on a 70B Q3KL.
|
| This is a good idea, as splitting up LLMs is actually pretty
| efficient with pipelined requests.
| russellbeattie wrote:
| > _...lower end devices_
|
| So, pretty much every other consumer PC available? Those
| losers.
| jmorgan wrote:
| This is neat. Model weights are split into their layers and
| distributed across several machines, which then report
| themselves in a big hash table when they are ready to perform
| inference or fine-tuning "as a team" over their subset of the
| layers.
|
| It's early, but I've been working on hosting model weights in
| a Docker registry for https://github.com/jmorganca/ollama.
| Mainly for the content addressability (Ollama will verify that
| the correct weights are downloaded every time), and ultimately
| so weights can be fetched by their content instead of by their
| name or URL (which may change!). Perhaps a good next step
| might be to split the models by layers and store each layer
| independently, for use cases like this (or even just for
| downloading and running larger models over several "local"
| machines).
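A concrete illustration of the content-addressability jmorgan
describes: if each layer is stored as a blob named by the hash of
its own bytes, every download can be verified by re-hashing, and
weights resolve by content even if model names or URLs change. A
minimal sketch in Python, assuming a hypothetical local blob
store; put_layer, get_layer, and the sha256- file naming are
illustrative, not Ollama's actual registry layout:

    # Content-addressed storage for per-layer weight shards: each
    # blob is named by the SHA-256 of its bytes, so downloads are
    # verifiable and renames/moves cannot break resolution.
    import hashlib
    from pathlib import Path

    STORE = Path("blobs")  # hypothetical local blob store

    def put_layer(layer_bytes: bytes) -> str:
        digest = hashlib.sha256(layer_bytes).hexdigest()
        STORE.mkdir(exist_ok=True)
        (STORE / f"sha256-{digest}").write_bytes(layer_bytes)
        return digest  # a manifest would map layer index -> digest

    def get_layer(digest: str) -> bytes:
        data = (STORE / f"sha256-{digest}").read_bytes()
        # Integrity check: reject corrupted or substituted blobs.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("blob does not match its digest")
        return data

A per-model manifest listing the layer digests in order would then
be all a peer needs to fetch exactly the layers it will serve.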
| brucethemoose2 wrote:
| > and fine-tune them for your tasks
|
| This is the part that raised my eyebrows.
|
| Finetuning a 70B is not just hard, it's literally impossible
| without renting a very expensive cloud instance or buying a PC
| the price of a house, no matter how long you are willing to
| wait. I would absolutely contribute to a "llama training
| horde".
| AaronFriel wrote:
| That's true for conventional fine-tuning, but is it the case
| for parameter-efficient fine-tuning and QLoRA? My
| understanding is that for an N-billion-parameter model, fine-
| tuning can occur on a GPU with slightly less than N gigabytes
| of VRAM.
|
| For that 70B-parameter model: an A100?
| brucethemoose2 wrote:
| 2x 48GB GPUs would be the cheapest. But that's still a very
| expensive system.
| zacmps wrote:
| I think you'd need two 80GB A100s for unquantized.
| akomtu wrote:
| What prevents parallel LLM training? If you read book 1 first
| and then book 2, the resulting update in your knowledge will
| be the same as if you read the books in the reverse order. It
| seems reasonable to assume that if an LLM is trained on each
| book independently, the two deltas in the LLM weights can just
| be added up.
| contravariant wrote:
| In ordinary gradient descent the order does matter, since the
| position changes in between. I think stochastic gradient
| descent does sum a couple of gradients together sometimes, but
| I'm not sure what the trade-offs are and whether LLMs do so as
| well.
| ctoth wrote:
| This is not at all intuitive to me. It doesn't make sense from
| a human perspective, as each book changes you. Consider the
| trivial case of a series, where nothing will make sense if you
| haven't read the prior books (not that I think they feed it
| the book corpus in order; maybe they should!). But even in a
| more philosophical sort of way, each book changes you, and the
| person who reads Harry Potter first and The Iliad second will
| have a different experience of each. Then, with large language
| models, we have the concept of grokking something. If grokking
| happens in the middle of book 1, it is a different model that
| is reading book 2, and of course the inverse applies.
| whimsicalism wrote:
| By the "delta in the LLM weights", I am assuming you mean the
| gradients. You are effectively describing large-batch training
| (data parallelism), which is part of the way you can scale up,
| but there are quickly diminishing returns to large batch
| sizes.
| eachro wrote:
| I'm not sure this is true. For instance, consider reading
| textbooks for linear algebra and functional analysis out of
| order. You might still grok the functional analysis if you
| read it first, but you'd be better served by reading the
| linear algebra one first.
| malwrar wrote:
| Impossible? It's just a bunch of math, you don't need to keep
| the entire network in memory the whole time.
| Zetobal wrote:
| An H100 is maybe the price of a car, but nowhere near a
| house...
| ioedward wrote:
| 8 H100s would have enough VRAM to finetune a 70B model.
| nextaccountic wrote:
| Is a single H100 enough?
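To put rough numbers on this subthread, below is a back-of-
envelope sketch of the usual mixed-precision memory accounting (2
bytes each per parameter for bf16 weights and gradients, plus 12
bytes of fp32 master weights and Adam state), with QLoRA and ~3-
bit inference as the same arithmetic at quantized widths. These
are assumptions that ignore activations, KV cache, and framework
overhead, so treat the results as loose lower bounds:

    # Back-of-envelope VRAM estimates for a 70B-parameter model.
    N = 70e9  # parameters

    # bf16 weights (2) + bf16 grads (2) + fp32 master weights (4)
    # + Adam first/second moments (4 + 4) = 16 bytes per parameter.
    full_ft_adam = N * 16
    qlora_base = N * 0.5      # 4-bit quantized base weights
    inference_q3 = N * 3 / 8  # ~3-bit quantized inference

    for label, nbytes in [("full fine-tune (Adam)", full_ft_adam),
                          ("QLoRA base weights", qlora_base),
                          ("~3-bit inference", inference_q3)]:
        print(f"{label:>22}: {nbytes / 2**30:6.0f} GiB")

On this accounting, full fine-tuning wants on the order of a
terabyte of GPU memory before activations, while a 4-bit QLoRA
base is roughly 33 GiB, consistent with AaronFriel's slightly-
less-than-N-gigabytes rule of thumb.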
| KomoD wrote:
| Maybe not in your area, but it's very doable in other places,
| like where I live.
| teaearlgraycold wrote:
| Would love to share my 3080 Ti, but after running the commands
| in the getting started guide (https://github.com/bigscience-
| workshop/petals/wiki/Run-Petal...) it looks like there's a
| dependency versioning issue:
|
|     ImportError: cannot import name 'get_full_repo_name'
|     from 'huggingface_hub' (~/.local/lib/python3.8/site-
|     packages/huggingface_hub/__init__.py)
| esafak wrote:
| The first question I had was "what are the economics?" From
| the FAQ:
|
| _Will Petals incentives be based on crypto, blockchain, etc.?_
|
| No, we are working on a centralized incentive system similar
| to the AI Horde kudos, even though Petals is a fully
| decentralized system in all other aspects. We do not plan to
| provide a service to exchange these points for money, so you
| should see these incentives as "game" points designed to be
| spent inside our system. Petals is an ML-focused project
| designed for ML researchers and engineers; it does not have
| anything to do with finance. We decided to make the incentive
| system centralized because it is much easier to develop and
| maintain, so we can focus on developing features useful for ML
| researchers.
|
| https://github.com/bigscience-workshop/petals/wiki/FAQ:-Freq...
| kordlessagain wrote:
| The logical conclusion is that they (the models) will
| eventually be linked to crypto payments, though. This is where
| Lightning becomes important...
|
| Edit: To clarify, I'm not suggesting linking these Petals
| "tokens" to any payment system. I'm saying that, in general,
| calls to clusters of machine learning models, decentralized or
| not, will likely use crypto payments because that gives you
| both auth and a means of payment.
|
| I do think Petals is a good implementation of using
| decentralized compute for model use and will likely be
| valuable long term.
| vorpalhex wrote:
| I mean, I can sell you Eve or RuneScape currency, but we don't
| need any crypto to execute on it. "Gold sellers" existed well
| before crypto.
| Szpadel wrote:
| If that part could be replaced with any third-party server, it
| would be the tracker in the BitTorrent analogy.
| sn0wf1re wrote:
| Similarly, there have been distributed render farms for
| graphic design for a long time. No incentives other than that
| higher points mean your jobs are prioritized.
|
| https://www.sheepit-renderfarm.com/home
| brucethemoose2 wrote:
| > similar to the AI Horde kudos
|
| What they are referencing, which is super cool and (IMO)
| criminally underused:
|
| https://lite.koboldai.net/
|
| https://tinybots.net/artbot
|
| https://aihorde.net/
|
| In fact, I can host a 13B-70B finetune in the afternoon if
| anyone on HN wants to test a particular one out:
|
| https://huggingface.co/models?sort=modified&search=70B+gguf
| swyx wrote:
| > GGUF is a new format introduced by the llama.cpp team on
| August 21st 2023. It is a replacement for GGML, which is no
| longer supported by llama.cpp. GGUF offers numerous advantages
| over GGML, such as better tokenisation and support for special
| tokens. It also supports metadata, and is designed to be
| extensible.
|
| Is there a more canonical blog post or link to learn more
| about the technical decisions here?
| brucethemoose2 wrote:
| https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md#...
|
| It is (IMO) a necessary and good change.
|
| I just specified GGUF because my 3090 cannot host a 70B model
| without offloading, outside of exLlama's very new ~2-bit
| quantization. And a pre-quantized GGUF is a much smaller
| download than raw fp16 for conversion.
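To make "supports metadata" concrete, here is a minimal sketch
that reads only the fixed GGUF header fields described in the spec
linked above: the magic bytes, format version, tensor count, and
metadata key-value count. It assumes a version 2 or later file
(v1 used 32-bit counts) and skips parsing the key-value pairs
themselves:

    # Read the fixed GGUF header: 4-byte magic, uint32 version,
    # then uint64 tensor count and uint64 metadata key-value
    # count, all little-endian.
    import struct
    import sys

    with open(sys.argv[1], "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        (version,) = struct.unpack("<I", f.read(4))
        assert version >= 2, "v1 headers used 32-bit counts"
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))

    print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata keys")
    # The typed key-value pairs (architecture, tokenizer
    # vocabulary, special tokens, quantization details, ...)
    # follow immediately after the header.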
| nextaccountic wrote:
| Can they actually prevent people from trading petals for
| money, though?
| beardog wrote:
| > What's the motivation for people to host model layers in the
| public swarm?
|
| > People who run inference and fine-tuning themselves get a
| certain speedup if they host a part of the model locally. Some
| may also be motivated to "give back" to the community helping
| them to run the model (similarly to how BitTorrent users help
| others by sharing data they have already downloaded).
|
| > Since it may not be enough for everyone, we are also working
| on introducing explicit incentives ("bloom points") for people
| donating their GPU time to the public swarm. Once this system
| is ready, we will display the top contributors on our website.
| People who earned these points will be able to spend them on
| inference/fine-tuning with higher priority or increased
| security guarantees, or (maybe) exchange them for other
| rewards.
|
| It does seem like they want a sort of centralized token,
| however.
| [deleted]
| seydor wrote:
| It's a shame that every decentralized project needs to be
| compared to cryptocoins now.
___________________________________________________________________
(page generated 2023-09-17 23:00 UTC)