[HN Gopher] Large language models are having their Stable Diffus...
       ___________________________________________________________________
        
       Large language models are having their Stable Diffusion moment
        
       Author : simonw
       Score  : 186 points
       Date   : 2023-03-11 19:19 UTC (3 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | homarp wrote:
        | There is even an r/LocalLLaMA/ subreddit.
        
       | minimaxir wrote:
        | Right now there are too many caveats to run even the 7B model per
        | the workflows mentioned in the article.
       | 
        | The big difference between it and Stable Diffusion, which caused
        | the latter to go megaviral, is that a) SD can run on a typical
        | GPU that gamers likely already have without hitting a perf
        | ceiling and b) it can run easily on a free Colab GPU. Hugging
        | Face transformers can run a 7B model on a T4 GPU w/ 8-bit
        | loading, but that has its own caveats too.
       | 
       | There's a big difference between "can run" and "can run _well_ ".
       | VQGAN + CLIP had a lot of friction too and that's partially why
       | AI image generation didn't go megaviral then.
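         
          A minimal sketch of the 8-bit loading path mentioned above,
          assuming the Hugging Face transformers + bitsandbytes integration
          on a T4; the model id is a placeholder and this is an untested
          illustration, not a verified recipe:
         
              from transformers import AutoModelForCausalLM, AutoTokenizer
         
              model_id = "your-org/some-7b-model"  # placeholder checkpoint
         
              tokenizer = AutoTokenizer.from_pretrained(model_id)
              model = AutoModelForCausalLM.from_pretrained(
                  model_id,
                  device_map="auto",   # place layers on the available GPU
                  load_in_8bit=True,   # needs bitsandbytes; ~halves VRAM vs FP16
              )
         
              prompt = "The first person to walk on the moon was"
              inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
              out = model.generate(**inputs, max_new_tokens=32)
              print(tokenizer.decode(out[0], skip_special_tokens=True))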
        
         | bestcoder69 wrote:
         | Then this is SD for Apple silicon users. 13B runs on my m1 air
         | at 200-300ms/token using llama.cpp. Outputs feel like original
          | GPT-3, unlike any of the competitors I've tried. Granted, these
          | are non-scientific first impressions.
        
           | j45 wrote:
            | Agreed. For those who have been quietly sitting on a base Mac
            | Studio or a reasonably capable Mac Mini, the possibilities
            | have changed on some fronts, but GPT's extremely low API
            | price remains a good option.
        
             | aaomidi wrote:
              | The difference is that ChatGPT is not privacy friendly.
        
               | staticautomatic wrote:
               | Is it still not privacy friendly on Azure?
        
               | aaomidi wrote:
               | Azure has access to your queries. Running locally really
               | is the only way of having a privacy friendly LLM.
        
               | [deleted]
        
           | ddren wrote:
            | They have recently merged support for x86. I get 230ms/token
            | on the 13B model on an 8-core 9900k under WSL2.
        
           | [deleted]
        
           | [deleted]
        
         | simonw wrote:
         | By caveats do you mean the licensing terms or the difficulty of
         | prompting the model?
         | 
         | Unless it's relicensed I don't expect LLaMA to be a long-term
         | foundation model. But it's shown that yes, you can run a GPT-3
         | class model on an M1 Mac with 8GB of RAM (or maybe 16GB for the
         | 13B one?)
         | 
         | I fully expect other models to follow, from other
         | organizations, with better capabilities and more friendly
         | licensing terms.
        
           | zamnos wrote:
            | But is anyone actually making money off of Stable Diffusion?
            | Maybe the shovel-sellers (runpod.io et al), but afaik no one
            | is using it as the foundation for a revenue-generating
            | company.
           | I ask, because yes, technically, you can't get LLaMA legally
           | unless you're a researcher and get it directly from Facebook.
           | But that's not going to stop the faithful from finding a copy
           | and working on it.
        
             | simonw wrote:
             | I believe Midjourney may have used bits of Stable Diffusion
             | in their product, which is definitely profitable.
        
             | logifail wrote:
             | > is anyone actually making money off of StableDiffusion?
             | 
             | We're all still waiting to hear about (non-shovel-selling)
             | successes in this space.
        
               | pmoriarty wrote:
               | I don't know about Stable Diffusion in particular, but
               | three examples of AI-generated art making money
               | immediately spring to mind:
               | 
               | 1 - some guy won hundreds of dollars in an art contest
               | from AI generated art (and this made big news, so it
               | should be easy to find)
               | 
               | 2 - one person reported using midjourney's images as a
               | starting point for images that wound up being used in a
               | physical magazine
               | 
               | 3 - another artist has used midjourney images that they
               | modify to sell in all sorts of contexts (like background
               | images on stock illustration sites)
               | 
               | You'd probably find many other examples in midjourney's
               | #in-the-world discord channel.
               | 
               | I'd also be shocked if stock image sites, clipart sites
               | and freelance design/illustration sites weren't already
               | flooded with AI generated images that have been sold for
               | money.
               | 
                | That being said, because high-quality AI-generated images
               | are so easy to make, the value of images of all types is
               | likely to plummet soon if it hasn't already.
        
           | minimaxir wrote:
            | Ignoring the licensing issues, there are a few other
            | constraints that make the model less likely to go viral
            | beyond developers who already spend a lot of time in this
            | space:
           | 
            | 1) Model weights are heavy for just experimentation, although
            | quantizing them down to 4-bit might put them on par with SD
            | FP16.
           | 
           | 2) Requires extreme CLI shenanigans (and likely configuration
           | since you have to run make) compared to just running a Colab
           | Notebook or a .bat Windows Installer for the A1111 UI.
           | 
           | 3) Again hardware: a M1 Pro or a RTX 4090 is not super common
           | among people who are just curious about text generation.
           | 
            | 4) It is possible the extreme quantization could be affecting
            | text output quality; although the examples are coherent for
            | simple queries, more complex GPT-3-esque queries might become
            | relatively incoherent, particularly with ChatGPT and its
            | cheap API (timely!) out now, such that even nontechies
            | already have a strong baseline for good output. The viral
            | moment for SD was that it was easy to use _and_ it was a
            | significant quality leap over VQGAN + CLIP.
           | 
           | I was going to say inference speed since that's usually
           | another constraint for new LLMs but given the 61.41 ms/token
           | cited for the 7B model in the repo/your GIF, that seems on
           | par with the inference speed from OPT-6.7B FP16 in
           | transformers on a T4.
           | 
           | Some of these caveats are fixable, but even then I don't
           | think LLaMA will have its Stable Diffusion moment.
        
             | simonw wrote:
             | The 4-bit quantized models are 4GB for 7B and 8GB for 13B.
             | 
              | I'm not too worried about CLI shenanigans, because of what
              | happened with whisper.cpp - it resulted in apps like
              | https://goodsnooze.gumroad.com/l/macwhisper - I wouldn't be
              | at all surprised to see the same happen with llama.cpp.
             | 
             | A regular M1 with 8GB of RAM appears to be good enough to
             | run that 7B model. I wonder at what point it will run on an
             | iPhone... the Stable Diffusion model was 4GB when they
             | first released it, and that runs on iOS now after some more
             | optimization tricks.
             | 
             | For me though, the "Stable Diffusion" moment isn't
             | necessarily about the LLaMA model itself. It's not licensed
             | for commercial use, so it won't see nearly the same level
             | of things built on top of it.
             | 
             | The key moment for me is that I've now personally seen a
             | GPT-3 scale model running on my own personal laptop. I know
             | it can be done! Now I just need to wait for the inevitable
             | openly-licensed, instruction-tuned model that runs on the
             | same hardware.
             | 
             | It's that, but also the forthcoming explosion of developer
             | innovation that a local model will unleash. llama.cpp is
             | just the first hint of that.
        
               | smoldesu wrote:
               | > The key moment for me is that I've now personally seen
               | a GPT-3 scale model running on my own personal laptop.
               | 
               | I hate to pooh-pooh it for everyone, but this was
               | possible before LLaMa. GPT-J-125m/6b have been around for
               | a while, and are frankly easier to install and get
               | results out of. The smaller pruned model even fits on an
               | iPhone.
               | 
               | The problem is more that these smaller models won't ever
               | compete with GPT-scale APIs. Tomorrow's local LLaMa might
               | beat yesterday's ChatGPT, but I think those optimistic
               | for the democratization of chatbot intelligence are
                | setting their hopes a bit high. LLaMa _really_ isn't
               | breaking new ground.
        
               | simonw wrote:
               | I'm not particularly interested in beating ChatGPT: I'm
               | looking for a "calculator for words" which I can use for
               | things like summarization, term extraction, text
               | rephrasing etc - maybe translation between languages too.
               | 
               | There are all kinds of things I want to be able to do
               | with a LLM that are a lot tighter than general chatbots.
               | 
               | I'd love to see a demo of GPT-J on an iPhone!
        
         | tracyhenry wrote:
         | Another big difference is quality of the results. Haven't tried
         | myself but seen many complaints that it's nowhere near GPT-3
         | (at least for the 7B version). Correct me if I'm wrong!
        
           | bestcoder69 wrote:
            | 13B feels on par with the base non-instruction davinci.
            | People might not realize that it was a bit trickier to prompt
            | GPT-3 when it was first released.
        
           | simonw wrote:
           | That doesn't bother me so much. GPT-3 had instruction tuning,
           | which makes it MUCH easier to use.
           | 
           | Now that I've seen that LLaMA can work I'm confident someone
           | will release an openly licensed instruction-tuned model that
           | works on the same hardware at some point soon.
           | 
           | I also expect that there are prompt engineering tricks which
           | can be used to get really great results out of LLaMA. I'm
            | hoping someone will come up with a good prompt to get it to
            | do summarization, for example.
        
             | sp332 wrote:
             | ChatGPT had an estimated 20,000 hours of human feedback.
             | That's not going to be easy to replicate in an open source
             | way.
        
       | jacooper wrote:
       | Does anybody know how to run this on Linux with an AMD GPU?
       | 
       | Also do I have to bother with their crappy driver module that
       | doesn't support most GPUs?
        
       | patricktlo wrote:
        | That's amazing. Any chance of running it on my trusty GTX 1060
        | 6GB, or is that not enough VRAM?
        
       | [deleted]
        
       | ilovefood wrote:
       | This is really great, very good write-up.
       | 
        | Seems it now supports AVX2 for x86 architectures too.
       | https://twitter.com/ggerganov/status/1634588951821393922
        
       | bilsbie wrote:
       | How's it looking for a six year old MacBook?
       | 
       | Not there yet?
       | 
       | Does this still use the gpu?
        
         | simonw wrote:
          | I believe llama.cpp has been designed for at least an M1 - no
         | idea if there are options for running LLaMA on older hardware.
        
           | astrange wrote:
           | It doesn't use CoreML so it should work on Intel machines at
           | some speed.
           | 
            | If it used the GPU/ANE and was a true large language model
            | then it would only work on M1 systems because they have
            | unified memory (which nothing except an A100 can match).
        
       | Spiwux wrote:
       | People have been running large language models locally for a
        | while now. For now the general consensus is that LLaMA is not
        | fundamentally better than local models with similar resource
        | requirements, and in all the comparisons it falls short of an
        | instruction-tuned model like ChatGPT.
        
         | version_five wrote:
          | But LLaMA is the most performant model with weights available
          | in the wild.
          | 
          | Personally I hope we quickly get to the stage where there's a
          | real open LLM, the way SD is to DALL-E. It sucks to have to
          | bother with Facebook's core model, and give it more attention
          | than it deserves, just because it's out there.
         | 
         | If facebook had actually released it as an open model, I would
         | have said that all the credit should go to them. But instead
         | people are doing great open source work on top of their un-free
          | model just because it's available, and in the popular
          | conception they're going to get credit that they shouldn't.
        
         | bestcoder69 wrote:
         | What instruction tuned LLM is better?
        
           | yunyu wrote:
           | FLAN-UL2
        
         | loufe wrote:
         | I've been following LLaMa closely since release and I'm
          | surprised to see the claim that it's "general consensus" that
          | it isn't superior. I've seen machine and anecdotal evidence to
          | the contrary. I'm not suggesting you're lying, but I am
          | curious: can you point me to something you're reading?
        
         | simonw wrote:
         | My argument here is that this represents a tipping point.
         | 
         | Prior to LLaMA + llama.cpp you could maybe run a large language
         | model locally... if you had the right GPU rig, and if you
         | really knew what you were doing, and were willing to put in a
         | lot of effort to find and figure out how to run a model.
         | 
         | My hunch is that the ability to run on a M1/M2 MacBook is going
         | to open this up to a lot more people.
         | 
         | (I'm exposing my bias here as a M2 Mac owner.)
         | 
         | I think the race is now on to be the first organization to
         | release a good instruction-tuned model that can run on personal
         | hardware.
        
           | stonerri wrote:
           | As someone who just got the 7B running on a base MacBook
           | M1/8GB, I strongly agree. The rate of tool development &
           | prompt generation should see the same increase that Stable
           | Diffusion did a few months (weeks?) ago.
           | 
           | And given how early the cpp port is, there is likely plenty
           | of performance headroom with more m1/m2-specific
           | optimization.
        
       | seydor wrote:
       | I wonder why we don't have external "neural processing" devices
        | like we once had soundcards. Is anyone working on hardware
        | implementations of transformers?
       | 
        | Kudos to Yann LeCun for getting his revenge for Galactica.
        
         | jhrmnn wrote:
         | https://en.wikipedia.org/wiki/Tensor_Processing_Unit
        
           | seydor wrote:
           | but those are not for sale, and not transformer-specific.
            | There must be some optimizations that can be done in hardware,
            | and transformers are several years old now.
        
             | ruuda wrote:
             | You likely already bought one.
             | 
             | https://blog.google/products/pixel/introducing-google-
             | tensor...
             | 
             | https://apple.fandom.com/wiki/Neural_Engine
        
             | jhrmnn wrote:
             | Computation-wise, transformers are really just a bunch of
             | matrix multiplications, nothing more to it. (Which is
             | partially why they're so efficient and scalable.) Also,
             | Nvidia's GPU architectures are moving in the TPU direction
             | (https://www.nvidia.com/en-us/data-center/tensor-cores/).
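         
                A rough numpy sketch of single-head attention, to back up
                the "just matrix multiplications" point; shapes and names
                here are illustrative only:
         
                    import numpy as np
         
                    def softmax(x, axis=-1):
                        e = np.exp(x - x.max(axis=axis, keepdims=True))
                        return e / e.sum(axis=axis, keepdims=True)
         
                    def attention(x, Wq, Wk, Wv):
                        # three matmuls project inputs to queries/keys/values
                        q, k, v = x @ Wq, x @ Wk, x @ Wv
                        # one matmul for scores, one for the weighted values
                        scores = q @ k.T / np.sqrt(q.shape[-1])
                        return softmax(scores) @ v
         
                    rng = np.random.default_rng(0)
                    x = rng.standard_normal((16, 64))  # (seq_len, d_model)
                    Ws = [rng.standard_normal((64, 64)) for _ in range(3)]
                    print(attention(x, *Ws).shape)     # (16, 64)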
        
         | zenogantner wrote:
         | > wonder why we don't have external "neural processing" devices
         | like we once had soundcards.
         | 
         | Some video cards/GPUs have become just that, becoming more and
         | more geared towards non-graphics workloads ...
        
         | valine wrote:
          | The Nvidia A100 is exactly that. It has lower CUDA performance
          | than an RTX 4090, and is almost entirely geared toward ML
          | workloads.
        
       | rvz wrote:
        | There you go - very unsurprising to see that happen so quickly,
        | unless you have an Apple Silicon machine and want to download
        | the model to try it yourself.
        | 
        | I still think that open source LLMs have to be much smaller than
        | 200GB and much better than ChatGPT to be more accessible and
        | highly disruptive to OpenAI.
        | 
        | It is a much-needed happy accident, thanks to Meta. For now one
        | can run it as a service and offer it as a SaaS rather than
        | depend fully on OpenAI. Open source (or even free, binary-only)
        | LLMs will eventually disrupt OpenAI's business plans.
        
         | simonw wrote:
         | The 4-bit quantized version of LLaMA 7B used by llama.cpp is a
         | 4GB file. The 13B model is under 8GB.
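         
            Back-of-envelope for those sizes, plus a toy round-to-nearest
            4-bit quantizer for illustration (not necessarily the exact
            block-wise scheme llama.cpp/GGML uses; treat this as a sketch):
         
                import numpy as np
         
                def approx_size_gb(n_params, bits_per_weight):
                    return n_params * bits_per_weight / 8 / 1e9
         
                print(approx_size_gb(7e9, 4))   # ~3.5 GB; scales push it to ~4 GB
                print(approx_size_gb(13e9, 4))  # ~6.5 GB; under 8 GB on disk
         
                def quantize_4bit(w):
                    # symmetric round-to-nearest into the int4 range [-7, 7]
                    scale = np.abs(w).max() / 7.0
                    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
                    return q, scale
         
                w = np.random.randn(8).astype(np.float32)
                q, scale = quantize_4bit(w)
                print(q.astype(np.float32) * scale)  # close to w, small error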
        
       | Mathnerd314 wrote:
       | > This all changed yesterday, thanks to the combination of
       | Facebook's LLaMA model and llama.cpp by Georgi Gerganov.
       | 
       | George Hotz was so confident that he was riding the wave with his
       | Python implementation:
       | https://github.com/geohot/tinygrad/blob/master/examples/llam....
        | But I guess not; pure C++ seems better.
        
         | quotemstr wrote:
         | Isn't it more the four bit quantization than the choice of C++
         | as an orchestrator that's the win? It's not as if in either the
         | C++ or the Python case that high level code is actually doing
         | the matrix multiplications.
         | 
         | That basically the whole AI revolution is powered by CPython of
         | all things (not even PyPy) is the 100 megaton nuke that should
         | end language warring forever.
         | 
         | That the first AGI will likely be running under a VM so
         | inefficient that even integers are reference counted is God
         | laughing in the face of all the people who've spent the past
         | decades arguing that this language or that language is
         | "faster". Amdahl was right: only inner loops matter.
        
           | minimaxir wrote:
           | > That basically the whole AI revolution is powered by
           | CPython of all things (not even PyPy) is the 100 megaton nuke
           | that should end language warring forever.
           | 
            | And a lot of new AI tooling, such as tokenization, has been
            | developed for Python using Rust (pyo3).
        
       | camjohnson26 wrote:
        | Are there any online communities running these models on non-
        | professional hardware? I keep running into issues with poor
        | documentation or outdated scripts with GPT-NeoX, BLOOM, and even
        | Stable Diffusion 2. Seems like most of the support is either for
        | professionals with clusters of A100s, or consumers who aren't
        | using code. I have 3 16GB Quadro GPUs, but getting this stuff
        | running on them has been surprisingly difficult.
        
         | moyix wrote:
         | There's a group of folks on 4chan doing this on gaming class
         | hardware (4080s etc). They have a doc here:
         | https://rentry.org/llama-tard-v2
        
       | BaculumMeumEst wrote:
        | Would I have better luck with a GTX 1070 with 8GB of VRAM or a
        | MacBook M1 Pro with 16GB of RAM?
        
         | techstrategist wrote:
         | M1 Pro for sure
        
         | rahimnathwani wrote:
         | The latter.
        
       ___________________________________________________________________
       (page generated 2023-03-11 23:00 UTC)