[HN Gopher] Large language models are having their Stable Diffus...
___________________________________________________________________
Large language models are having their Stable Diffusion moment
Author : simonw
Score : 186 points
Date : 2023-03-11 19:19 UTC (3 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| homarp wrote:
| There is even a r/LocalLLaMA/
| minimaxir wrote:
| Right now there are too many caveats to run even the 7B model
| per the workflows mentioned in the article.
|
| The big difference between it and Stable Diffusion, which caused
| the latter to go megaviral, is a) it can run on a typical GPU
| that gamers likely already have without hitting a perf ceiling
| and b) it can run easily on a free Colab GPU. Hugging Face
| transformers can run a 7B model on a T4 GPU w/ 8-bit loading,
| but that has its own caveats too.
|
| There's a big difference between "can run" and "can run _well_".
| VQGAN + CLIP had a lot of friction too, and that's partially why
| AI image generation didn't go megaviral then.
| bestcoder69 wrote:
| Then this is SD for Apple silicon users. 13B runs on my M1 Air
| at 200-300ms/token using llama.cpp. Outputs feel like original
| GPT-3, unlike any of the competitors I've tried. Granted,
| non-scientific first impressions.
| j45 wrote:
| Agreed. For those who have been quietly sitting on a base Mac
| Studio or a reasonably capable Mac Mini, the possibilities have
| changed on some fronts, but GPT's extremely low API price
| remains a good option.
| aaomidi wrote:
| The difference is ChatGPT is not privacy friendly.
| staticautomatic wrote:
| Is it still not privacy friendly on Azure?
| aaomidi wrote:
| Azure has access to your queries. Running locally really is
| the only way of having a privacy-friendly LLM.
| [deleted]
| ddren wrote:
| They have recently merged support for x86. I get 230ms/token on
| the 13B model on an 8-core 9900k under WSL2.
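The per-token latencies quoted above convert directly into generation throughput; a quick sketch (the 250 ms midpoint and the 500-token completion length are illustrative assumptions, the other numbers come from these comments):

```python
# Convert per-token latency into tokens/second and into the
# wall-clock time for a completion of a given length.

def tokens_per_second(ms_per_token: float) -> float:
    return 1000.0 / ms_per_token

def completion_seconds(ms_per_token: float, n_tokens: int = 500) -> float:
    return ms_per_token * n_tokens / 1000.0

# 13B on an M1 Air: ~250 ms/token (midpoint of the 200-300 range cited)
print(f"M1 Air: {tokens_per_second(250):.1f} tok/s, "
      f"{completion_seconds(250):.0f} s for 500 tokens")

# 13B on an 8-core 9900k under WSL2: 230 ms/token cited
print(f"9900k:  {tokens_per_second(230):.1f} tok/s, "
      f"{completion_seconds(230):.0f} s for 500 tokens")
```

So roughly 4 tokens/second on the M1 Air: fine for short completions, but a long answer still takes a couple of minutes.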
| [deleted]
| [deleted]
| simonw wrote:
| By caveats do you mean the licensing terms or the difficulty of
| prompting the model?
|
| Unless it's relicensed I don't expect LLaMA to be a long-term
| foundation model. But it's shown that yes, you can run a GPT-3
| class model on an M1 Mac with 8GB of RAM (or maybe 16GB for the
| 13B one?)
|
| I fully expect other models to follow, from other organizations,
| with better capabilities and more friendly licensing terms.
| zamnos wrote:
| But is anyone actually making money off of Stable Diffusion?
| Maybe the shovel-sellers (runpod.io et al), but afaik no one is
| using it as the foundation for a revenue-generating company. I
| ask because yes, technically, you can't get LLaMA legally unless
| you're a researcher and get it directly from Facebook. But
| that's not going to stop the faithful from finding a copy and
| working on it.
| simonw wrote:
| I believe Midjourney may have used bits of Stable Diffusion in
| their product, which is definitely profitable.
| logifail wrote:
| > is anyone actually making money off of Stable Diffusion?
|
| We're all still waiting to hear about (non-shovel-selling)
| successes in this space.
| pmoriarty wrote:
| I don't know about Stable Diffusion in particular, but three
| examples of AI-generated art making money immediately spring to
| mind:
|
| 1 - some guy won hundreds of dollars in an art contest with
| AI-generated art (and this made big news, so it should be easy
| to find)
|
| 2 - one person reported using Midjourney's images as a starting
| point for images that wound up being used in a physical
| magazine
|
| 3 - another artist has used Midjourney images that they modify
| to sell in all sorts of contexts (like background images on
| stock illustration sites)
|
| You'd probably find many other examples in Midjourney's
| #in-the-world Discord channel.
|
| I'd also be shocked if stock image sites, clipart sites and
| freelance design/illustration sites weren't already flooded with
| AI-generated images that have been sold for money.
|
| That being said, because high quality AI-generated images are so
| easy to make, the value of images of all types is likely to
| plummet soon if it hasn't already.
| minimaxir wrote:
| Ignoring the licensing issues, there are a few other constraints
| that would make it harder for the model to go viral outside of
| developers who already spend a lot of time in this space:
|
| 1) Model weights are heavy for just experimentation, although
| quantizing them down to 4-bit might make them on par with SD
| FP16.
|
| 2) Requires extreme CLI shenanigans (and likely configuration,
| since you have to run make) compared to just running a Colab
| Notebook or a .bat Windows installer for the A1111 UI.
|
| 3) Again, hardware: an M1 Pro or an RTX 4090 is not super common
| among people who are just curious about text generation.
|
| 4) It is possible the extreme quantization could be affecting
| text output quality; although the examples are coherent for
| simple queries, more complex GPT-3-esque queries might become
| relatively incoherent. Particularly with ChatGPT and its cheap
| API (timely!) out now, even nontechies have a strong baseline
| for good output already. The viral moment for SD was that it was
| easy to use _and_ it was a significant quality leap over
| VQGAN + CLIP.
|
| I was going to say inference speed, since that's usually another
| constraint for new LLMs, but given the 61.41 ms/token cited for
| the 7B model in the repo/your GIF, that seems on par with the
| inference speed of OPT-6.7B FP16 in transformers on a T4.
|
| Some of these caveats are fixable, but even then I don't think
| LLaMA will have its Stable Diffusion moment.
| simonw wrote:
| The 4-bit quantized models are 4GB for 7B and 8GB for 13B.
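Those file sizes line up with back-of-envelope arithmetic - roughly parameter count times bits per weight - with the remainder accounted for by the scale factors and embeddings that 4-bit formats store at higher precision (a sketch; the 6.7B and 13.0B parameter counts are the published LLaMA sizes):

```python
def model_gigabytes(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB: parameters x bits, expressed in bytes."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"7B  at 4-bit: ~{model_gigabytes(6.7e9, 4):.1f} GB")   # vs the ~4 GB file
print(f"13B at 4-bit: ~{model_gigabytes(13.0e9, 4):.1f} GB")  # vs the ~8 GB file
print(f"7B  at FP16:  ~{model_gigabytes(6.7e9, 16):.1f} GB")  # why FP16 won't fit in 8 GB of RAM
```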
|
| I'm not too worried about CLI shenanigans, because of what
| happened with whisper.cpp - it resulted in apps like
| https://goodsnooze.gumroad.com/l/macwhisper - wouldn't be at all
| surprised to see the same happen with llama.cpp
|
| A regular M1 with 8GB of RAM appears to be good enough to run
| that 7B model. I wonder at what point it will run on an
| iPhone... the Stable Diffusion model was 4GB when they first
| released it, and that runs on iOS now after some more
| optimization tricks.
|
| For me though, the "Stable Diffusion" moment isn't necessarily
| about the LLaMA model itself. It's not licensed for commercial
| use, so it won't see nearly the same level of things built on
| top of it.
|
| The key moment for me is that I've now personally seen a GPT-3
| scale model running on my own personal laptop. I know it can be
| done! Now I just need to wait for the inevitable openly-
| licensed, instruction-tuned model that runs on the same
| hardware.
|
| It's that, but also the forthcoming explosion of developer
| innovation that a local model will unleash. llama.cpp is just
| the first hint of that.
| smoldesu wrote:
| > The key moment for me is that I've now personally seen a
| GPT-3 scale model running on my own personal laptop.
|
| I hate to pooh-pooh it for everyone, but this was possible
| before LLaMA. GPT-J-125m/6b have been around for a while, and
| are frankly easier to install and get results out of. The
| smaller pruned model even fits on an iPhone.
|
| The problem is more that these smaller models won't ever
| compete with GPT-scale APIs. Tomorrow's local LLaMA might beat
| yesterday's ChatGPT, but I think those optimistic for the
| democratization of chatbot intelligence are setting their hopes
| a bit high. LLaMA _really_ isn't breaking new ground.
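Point 4 in the caveats above - that aggressive quantization could hurt output quality - comes down to rounding error: 4 bits leave only a handful of representable levels per scale group. A toy symmetric round-trip illustrates it (pure illustration; llama.cpp's actual Q4 format groups weights into small blocks, each with its own scale):

```python
import random

def quantize_4bit(w):
    """Symmetric 4-bit quantization: map floats onto integers in [-7, 7]."""
    scale = max(abs(x) for x in w) / 7.0
    return [max(-7, min(7, round(x / scale))) for x in w], scale

def dequantize(q, scale):
    return [x * scale for x in q]

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(4096)]  # weights at a typical magnitude
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)

# Mean relative reconstruction error -- modest per weight, but it
# compounds across dozens of transformer layers.
rel_err = sum(abs(a - b) for a, b in zip(w, w_hat)) / sum(abs(a) for a in w)
print(f"mean relative error: {rel_err:.1%}")
```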
| simonw wrote:
| I'm not particularly interested in beating ChatGPT: I'm looking
| for a "calculator for words" which I can use for things like
| summarization, term extraction, text rephrasing etc - maybe
| translation between languages too.
|
| There are all kinds of things I want to be able to do with an
| LLM that are a lot tighter than general chatbots.
|
| I'd love to see a demo of GPT-J on an iPhone!
| tracyhenry wrote:
| Another big difference is the quality of the results. Haven't
| tried it myself but I've seen many complaints that it's nowhere
| near GPT-3 (at least for the 7B version). Correct me if I'm
| wrong!
| bestcoder69 wrote:
| 13B feels on par with the base non-instruction davinci. People
| might not realize how it was a bit trickier to prompt GPT-3
| when it first released.
| simonw wrote:
| That doesn't bother me so much. GPT-3 had instruction tuning,
| which makes it MUCH easier to use.
|
| Now that I've seen that LLaMA can work, I'm confident someone
| will release an openly licensed instruction-tuned model that
| works on the same hardware at some point soon.
|
| I also expect that there are prompt engineering tricks which
| can be used to get really great results out of LLaMA. I'm
| hoping someone will come up with a good prompt to get it to do
| summarization, for example.
| sp332 wrote:
| ChatGPT had an estimated 20,000 hours of human feedback. That's
| not going to be easy to replicate in an open source way.
| jacooper wrote:
| Does anybody know how to run this on Linux with an AMD GPU?
|
| Also, do I have to bother with their crappy driver module that
| doesn't support most GPUs?
| patricktlo wrote:
| That's amazing, any chance of running it on my trusty GTX 1060
| 6GB, or is that not enough VRAM?
| [deleted]
| ilovefood wrote:
| This is really great, very good write-up.
|
| Seems it now supports AVX2 for x86 architectures too.
| https://twitter.com/ggerganov/status/1634588951821393922
| bilsbie wrote:
| How's it looking for a six-year-old MacBook?
|
| Not there yet?
|
| Does this still use the GPU?
| simonw wrote:
| I believe llama.cpp has been designed for at least an M1 - no
| idea if there are options for running LLaMA on older hardware.
| astrange wrote:
| It doesn't use CoreML so it should work on Intel machines at
| some speed.
|
| If it used the GPU/ANE and was a true large language model then
| it would only work on M1 systems because they're unified memory
| (which nothing except an A100 can match).
| Spiwux wrote:
| People have been running large language models locally for a
| while now. For now the general consensus is that LLaMA is not
| fundamentally better than local models with similar resource
| requirements, and in all the comparisons it falls short of an
| instruction-tuned model like ChatGPT.
| version_five wrote:
| But LLaMA is the most performant model with weights available
| in the wild.
|
| Personally I hope we quickly get to the stage where there's a
| real open LLM, the way SD is to DALL-E. It sucks to have to
| bother with Facebook's core model, and give it more attention
| than it deserves, just because it's out there.
|
| If Facebook had actually released it as an open model, I would
| have said that all the credit should go to them. But instead
| people are doing great open source work on top of their un-free
| model just because it's available, and in the popular
| conception they're going to get credit that they shouldn't.
| bestcoder69 wrote:
| What instruction-tuned LLM is better?
| yunyu wrote:
| FLAN-UL2
| loufe wrote:
| I've been following LLaMA closely since release and I'm
| surprised to see the claim that it's "general consensus" that
| it isn't superior. I've seen machine and anecdotal evidence to
| the contrary. I'm not suggesting you're lying, but I am
| curious: can you point me to something you're reading?
| simonw wrote:
| My argument here is that this represents a tipping point.
|
| Prior to LLaMA + llama.cpp you could maybe run a large language
| model locally...
| if you had the right GPU rig, and if you really knew what you
| were doing, and were willing to put in a lot of effort to find
| and figure out how to run a model.
|
| My hunch is that the ability to run on an M1/M2 MacBook is
| going to open this up to a lot more people.
|
| (I'm exposing my bias here as an M2 Mac owner.)
|
| I think the race is now on to be the first organization to
| release a good instruction-tuned model that can run on personal
| hardware.
| stonerri wrote:
| As someone who just got the 7B running on a base MacBook
| M1/8GB, I strongly agree. The rate of tool development & prompt
| generation should see the same increase that Stable Diffusion
| did a few months (weeks?) ago.
|
| And given how early the cpp port is, there is likely plenty of
| performance headroom with more M1/M2-specific optimization.
| seydor wrote:
| I wonder why we don't have external "neural processing" devices
| like we once had soundcards. Is anyone working on hardware
| implementations of transformers?
|
| Kudos to Yann LeCun for getting his revenge for Galactica
| jhrmnn wrote:
| https://en.wikipedia.org/wiki/Tensor_Processing_Unit
| seydor wrote:
| But those are not for sale, and not transformer-specific. There
| must be some optimizations that can be done in hardware, and
| transformers are several years old now.
| ruuda wrote:
| You likely already bought one.
|
| https://blog.google/products/pixel/introducing-google-
| tensor...
|
| https://apple.fandom.com/wiki/Neural_Engine
| jhrmnn wrote:
| Computation-wise, transformers are really just a bunch of
| matrix multiplications, nothing more to it. (Which is partially
| why they're so efficient and scalable.) Also, Nvidia's GPU
| architectures are moving in the TPU direction
| (https://www.nvidia.com/en-us/data-center/tensor-cores/).
| zenogantner wrote:
| > wonder why we don't have external "neural processing" devices
| like we once had soundcards.
|
| Some video cards/GPUs have become just that, becoming more and
| more geared towards non-graphics workloads ...
| valine wrote:
| The Nvidia A100 is exactly that. It has lower CUDA performance
| than an RTX 4090, and is almost entirely geared toward ML
| workloads.
| rvz wrote:
| There you go - very unsurprising to see that happen very
| quickly, unless you have an Apple Silicon machine and want to
| download the model to try it yourself.
|
| I still think that open source LLMs have to be much smaller
| than 200GB and much better than ChatGPT to be more accessible
| and highly disruptive to OpenAI.
|
| It is a great accident, thanks to Meta. For now one can use it
| as a service and make it a SaaS rather than depend fully on
| OpenAI. Open source (or even free binary-only) LLMs will
| eventually disrupt OpenAI's business plans.
| simonw wrote:
| The 4-bit quantized version of LLaMA 7B used by llama.cpp is a
| 4GB file. The 13B model is under 8GB.
| Mathnerd314 wrote:
| > This all changed yesterday, thanks to the combination of
| Facebook's LLaMA model and llama.cpp by Georgi Gerganov.
|
| George Hotz was so confident that he was riding the wave with
| his Python implementation:
| https://github.com/geohot/tinygrad/blob/master/examples/llam....
| But I guess not, pure C++ seems better.
| quotemstr wrote:
| Isn't it more the four-bit quantization than the choice of C++
| as an orchestrator that's the win? It's not as if, in either
| the C++ or the Python case, the high-level code is actually
| doing the matrix multiplications.
|
| That basically the whole AI revolution is powered by CPython of
| all things (not even PyPy) is the 100-megaton nuke that should
| end language warring forever.
|
| That the first AGI will likely be running under a VM so
| inefficient that even integers are reference counted is God
| laughing in the face of all the people who've spent the past
| decades arguing that this language or that language is
| "faster".
| Amdahl was right: only inner loops matter.
| minimaxir wrote:
| > That basically the whole AI revolution is powered by CPython
| of all things (not even PyPy) is the 100-megaton nuke that
| should end language warring forever.
|
| And a lot of new AI tooling, such as tokenization, has been
| developed for Python using Rust (pyo3).
| camjohnson26 wrote:
| Are there any online communities running these models on non-
| professional hardware? I keep running into issues with poor
| documentation or outdated scripts with GPT-NeoX, BLOOM, and
| even Stable Diffusion 2. Seems like most of the support is
| either for professionals with clusters of A100s, or consumers
| who aren't using code. I have 3 16GB Quadro GPUs but getting
| this stuff running on them has been surprisingly difficult.
| moyix wrote:
| There's a group of folks on 4chan doing this on gaming-class
| hardware (4080s etc). They have a doc here:
| https://rentry.org/llama-tard-v2
| BaculumMeumEst wrote:
| Would I have better luck with a GTX 1070 with 8GB of VRAM or a
| MacBook M1 Pro with 16GB of RAM?
| techstrategist wrote:
| M1 Pro for sure
| rahimnathwani wrote:
| The latter.
___________________________________________________________________
(page generated 2023-03-11 23:00 UTC)