[HN Gopher] LLaMA-7B in Pure C++ with full Apple Silicon support
       ___________________________________________________________________
        
       LLaMA-7B in Pure C++ with full Apple Silicon support
        
       Author : mrtksn
       Score  : 181 points
       Date   : 2023-03-10 20:01 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jcq3 wrote:
        | I don't know anything about ML. Can someone explain to me what
        | this hype is about?
        
         | lxe wrote:
         | You can run your own ChatGPT on your Macbook.
        
         | stagger87 wrote:
         | https://en.wikipedia.org/wiki/Machine_learning
        
         | potatoman22 wrote:
          | This is an advanced language model that can now run quickly on
          | consumer-grade hardware. You used to need thousands of dollars
          | of GPUs to run a model as sophisticated as this - now it can be
          | done on a laptop.
        
           | mattfrommars wrote:
            | Wasn't LLaMA officially meant to run on consumer-grade
            | machines? How does this modify the model to make it work?
           | 
           | All of this is confusing.
        
           | hot_gril wrote:
           | Run meaning run inference, not train, right?
        
             | TOMDM wrote:
             | Yes
        
       | Szpadel wrote:
        | I tried the LLaMA 13B model using one of the online tools
        | (mentioned in another HN comment; it claimed to use that model,
        | but I can't be 100% sure) and I was very disappointed with the
        | results. I asked it a fairly simple question: to write a regex
        | validating an email address. I tried this twice and got the
        | responses "what is your question" and "please wait...", so it
        | just dodged the question. In contrast, ChatGPT was able to
        | provide me with a simple regex, and also an RFC-compliant version
        | when asked.
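        | 
        | For reference, here's a hand-written sketch (my own, not output
        | from the tool or from ChatGPT) of what a "simple regex" answer
        | might look like in C++ - deliberately not RFC-compliant:
        | 
        |     // Illustrative only: a deliberately simple email check,
        |     // nowhere near full RFC 5322 compliance.
        |     #include <iostream>
        |     #include <regex>
        |     #include <string>
        |     
        |     bool looks_like_email(const std::string& s) {
        |         // one '@', at least one '.' in the domain, no whitespace
        |         static const std::regex re(R"(^[^@\s]+@[^@\s]+\.[^@\s]+$)");
        |         return std::regex_match(s, re);
        |     }
        |     
        |     int main() {
        |         std::cout << looks_like_email("user@example.com") << "\n"; // 1
        |         std::cout << looks_like_email("not-an-email") << "\n";     // 0
        |     }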
        
         | lxe wrote:
          | I've gotten great Q&A / conversational results from llama 7b
          | and 13b using https://github.com/oobabooga/text-generation-webui.
          | 
          | Q: "How does the Schrodinger's cat thought experiment relate to
          | the principles of quantum mechanics?"
          | 
          | A: "In this famous paradox, a cat inside a box is simultaneously
          | both alive and dead due to the nature of subatomic particles.
          | This demonstrates that at the smallest scales of reality, we
          | cannot predict whether something will happen until we observe
          | it happening."
          | 
          | Q: "What are some of the potential consequences of climate
          | change on global food security?"
          | 
          | A: "According to researchers at Stanford University, rising
          | temperatures could lead to crop failures and reduced harvests
          | for staple crops such as rice, wheat, corn, and soybeans by
          | midcentury."
          | 
          | Q: "Can you provide a brief overview of the history of political
          | philosophy, from Ancient Greece to the present day?"
          | 
          | A: "Absolutely! Political philosophy is a branch of philosophy
          | concerning the role of government in society. In ancient times,
          | Plato and Aristotle were among the first thinkers to consider
          | the subject in depth. Since then, other influential philosophers
          | such as Thomas Hobbes and John Locke have expanded upon these
          | ideas. Today, contemporary philosophers such as John Rawls
          | continue to debate the merits of different political systems."
        
           | lxe wrote:
            | And 30b:
            | 
            |     Give me a basic recipe for easy spaghetti carbonara
            | 
            | Here is a basic recipe for spaghetti carbonara:
            | 
            | Ingredients: 1 lb spaghetti, 3 tbsp butter, 2 tbsp olive oil,
            | 1/4 cup grated Parmesan cheese, 1 egg, 1/2 teaspoon salt,
            | 1/4 teaspoon pepper, 1/2 cup chopped parsley, 1/2 pound bacon,
            | 1 clove garlic, 1/4 cup heavy cream.
            | 
            | Directions: Cook spaghetti according to package directions;
            | drain. Heat butter and oil in large skillet on medium heat.
            | Add bacon; cook 4 minutes or until crisp. Remove from skillet.
            | Add garlic to drippings in pan; saute 1 minute. Stir in egg,
            | salt, pepper, and parsley; stirring constantly. Add bacon and
            | spaghetti; toss well. Sprinkle with cheese. Serves 4.
        
             | zabzonk wrote:
              | It doesn't say what to do with the cream, and sprinkling
              | with cheese is a bit wrong. Carbonara sauce is more
              | difficult to make than you (or an AI) might think.
        
         | IanCal wrote:
          | Aside from it possibly being a smaller model, the prompts
          | should be different: llama hasn't been trained to take
          | instructions, so I think the request would need to be framed
          | more as "A regex for emails is " and let it complete the rest.
          | That's just in case you tried the exact same prompt.
          | 
          | Also, those messages sound suspiciously like error messages
          | from a system around it.
        
         | endymi0n wrote:
          | LLaMA is neither instruction-finetuned nor ranked with
          | reinforcement learning from human feedback. It's a base model,
          | like the original GPT-3, that needs much more careful prompting
          | and few-shot examples to produce similar results.
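          | 
          | A rough sketch of the kind of completion-style / few-shot
          | framing that tends to work for a base model (the prompt text
          | here is just an example, not an official recommendation):
          | 
          |     // Sketch: framing a request as text completion for a base
          |     // model rather than as an instruction.
          |     #include <iostream>
          |     #include <string>
          |     
          |     int main() {
          |         // Instruction style - works on ChatGPT, tends to fail
          |         // on a raw base model:
          |         std::string instruction =
          |             "Write a regex that validates an email address.";
          |     
          |         // Completion / few-shot style: give examples and let
          |         // the model continue the pattern.
          |         std::string few_shot =
          |             "Q: A regex that matches a 4-digit year is\n"
          |             "A: [0-9]{4}\n"
          |             "Q: A regex that validates an email address is\n"
          |             "A: ";
          |     
          |         std::cout << "Instruction prompt:\n" << instruction << "\n\n";
          |         std::cout << "Few-shot prompt:\n" << few_shot << "\n";
          |     }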
        
         | dmm wrote:
         | sillysaurusx made some posts about tweaks that greatly improved
         | output for him. Maybe they would help you:
         | 
         | """
         | 
         | Fixed the llama sampler. After turning off top_p, adding top_k
         | 40, setting temp to 0.7, and adding a repetition penalty of
         | 1/0.85, llama 7B is looking nice.
         | 
         | """
         | 
         | https://twitter.com/theshawwn/status/1632569215348531201?cxt...
         | 
         | https://twitter.com/theshawwn/status/1633456289639542789?cxt...
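          | 
          | For anyone curious what those settings actually do, here's a
          | rough sketch (not the actual llama.cpp or shawwn code) of how
          | top_k, temperature and a repetition penalty are typically
          | applied to the logits before picking a token:
          | 
          |     #include <algorithm>
          |     #include <cmath>
          |     #include <random>
          |     #include <unordered_set>
          |     #include <vector>
          |     
          |     // Pick the next token id given raw logits. Illustrative only.
          |     int sample_token(std::vector<float> logits,
          |                      const std::unordered_set<int>& recent_tokens,
          |                      std::mt19937& rng,
          |                      int top_k,             // e.g. 40
          |                      float temp,            // e.g. 0.7
          |                      float repeat_penalty)  // e.g. 1/0.85
          |     {
          |         const int n_vocab = (int)logits.size();
          |         top_k = std::min(top_k, n_vocab);
          |     
          |         // Repetition penalty: make recently used tokens less likely.
          |         for (int t : recent_tokens) {
          |             if (logits[t] > 0) logits[t] /= repeat_penalty;
          |             else               logits[t] *= repeat_penalty;
          |         }
          |     
          |         // Temperature below 1.0 sharpens the distribution.
          |         for (float& l : logits) l /= temp;
          |     
          |         // Top-k: keep only the k highest-scoring tokens.
          |         std::vector<int> idx(n_vocab);
          |         for (int i = 0; i < n_vocab; ++i) idx[i] = i;
          |         std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
          |             [&](int a, int b) { return logits[a] > logits[b]; });
          |     
          |         // Softmax over the surviving tokens, then sample one.
          |         std::vector<double> w(top_k);
          |         const float max_l = logits[idx[0]];
          |         for (int i = 0; i < top_k; ++i)
          |             w[i] = std::exp(logits[idx[i]] - max_l);
          |         std::discrete_distribution<int> dist(w.begin(), w.end());
          |         return idx[dist(rng)];
          |     }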
        
           | Szpadel wrote:
            | For reference, I tried it here: https://lite.koboldai.net/
            | 
            | with chat mode, and parameters: temp 0.62, top_p 0.9,
            | repetition penalty 1.08.
            | 
            | It's free to try, but you have to wait in a queue, so I had
            | to wait about 10 minutes per response and didn't have the
            | patience to test it further.
        
       | DesiLurker wrote:
        | But why would you do this in C++ when it's quite clear the ML
        | workload is highly parallel? The page says it's vectorized with
        | NEON, but there's no mention of whether it's auto-vectorized by
        | GCC or hand-optimized. That will have a pretty significant
        | performance impact.
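        | 
        | For context, "hand-optimized" here would mean NEON intrinsics
        | along these lines - a minimal dot-product sketch, not code taken
        | from ggml:
        | 
        |     // Hand-vectorized f32 dot product with NEON intrinsics
        |     // (AArch64). Illustrative only.
        |     #include <arm_neon.h>
        |     #include <cstddef>
        |     
        |     float dot_f32_neon(const float* a, const float* b, size_t n) {
        |         float32x4_t acc = vdupq_n_f32(0.0f);
        |         size_t i = 0;
        |         for (; i + 4 <= n; i += 4) {
        |             // fused multiply-accumulate on 4 floats per iteration
        |             acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
        |         }
        |         float sum = vaddvq_f32(acc);            // horizontal add
        |         for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
        |         return sum;
        |     }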
        
       | xiphias2 wrote:
        | George Hotz already implemented LLaMA 7B and 13B on Twitch
        | yesterday, on GPU, in the tinygrad llama branch:
        | 
        | https://github.com/geohot/tinygrad/tree/llama
        | 
        | The only problem is that it's swapping on a 16GB MacBook, so you
        | need at least 24GB in practice.
        
         | jalino23 wrote:
         | that is so impressive!
        
         | attractivechaos wrote:
          | Both are _very_ impressive. A nice thing about Gerganov's
          | implementation is that it is written in almost pure C. Arguably
          | easier for deployment.
        
         | apetresc wrote:
         | Can you link to the Twitch stream VoD?
        
           | vopi wrote:
            | https://www.twitch.tv/georgehotz/videos?filter=archives&sort...
            | 
            | Although there is a VOD channel on YouTube that might be
            | better.
        
             | yieldcrv wrote:
             | why does it say video unavailable?
        
               | vopi wrote:
               | Hmm. I'm not sure. Presumably this will be updated soon:
               | https://www.youtube.com/@geohotarchive/videos
        
               | stagger87 wrote:
                | I'm pretty sure he only makes his past Twitch streams
                | available to subscribers. It's weird that they're
                | available on YouTube; maybe he doesn't know that's
                | something he can change.
        
               | jonas21 wrote:
               | I don't think he runs the YouTube channel - it's managed
               | by some of his fans. But like what's he gonna do about
               | it? Send them a takedown notice?
        
             | Mistletoe wrote:
             | This is such a refreshing and neat way to use Twitch.
        
       | technerder wrote:
       | Absolutely love ggerganov's approach with models like this and
       | Whisper. It's just awesome being able to experiment with (what I
       | consider) complex models without needing a billion python/c/cpp
       | dependencies!
        
       | 0xbadc0de5 wrote:
        | That's all well and good, but to do anything useful you're going
        | to want a powerful GPU (RTX 3090, RTX 4090 or A6000) with as much
        | VRAM as possible. Unlike the diffusion models, LLMs are very
        | memory-intensive, even at 4-bit GPTQ. The larger models like
        | llama-13b and llama-30b run quite well at 4-bit on a 24GB GPU.
        | The llama-65b-4bit should run on a dual 3090/4090 rig.
       | 
       | Coupled with the leaked Bing prompt and text-generation-webui,
       | the results are quite impressive.
        
         | brookst wrote:
          | Macs have unified memory, so an off-the-shelf Mac can use up to
          | about 120GB as VRAM - far more than any consumer card, and more
          | than the biggest A100.
          | 
          | GPU compute is lower, of course, but sheer VRAM is not a
          | problem.
        
         | stu2b50 wrote:
         | VRAM is the thing that Apple Silicon is going to have in excess
         | compared to anything even close in price. MacBook Airs can have
         | 14-15GB of VRAM if necessary.
        
           | kir-gadjello wrote:
            | The RAM read bandwidth of the M1/M2 is still not really
            | competitive with large GPUs like the RTX 3090, but compared
            | with typical CPU setups it's getting close.
        
       | 2bitencryption wrote:
       | Could someone with experience explain: what's the theoretical
       | minimum hardware requirement for llama 7B, 15B, etc, that still
       | provides output on the order of <1sec/token?
       | 
       | It seems like we can pull some tricks, like using F16, and some
       | kind of quantization, etc.
       | 
        | At the end of the day, how much overhead is left that can be
        | reduced? What can I expect to have running on 16GB of RAM with a
        | 3080 and a midrange AMD processor?
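        | 
        | As a rough rule of thumb (weights only - ignoring activations,
        | the KV cache and framework overhead), memory is just parameter
        | count times bytes per weight, and since every weight is read once
        | per generated token, tokens/s is roughly bounded by memory
        | bandwidth divided by model size. A quick back-of-the-envelope:
        | 
        |     // Approximate weight memory for the LLaMA sizes at different
        |     // precisions. Parameter counts are rounded; real files also
        |     // carry per-block scales, so 4-bit is a bit above 0.5 B/weight.
        |     #include <cstdio>
        |     
        |     int main() {
        |         const double params_b[]     = {7, 13, 30, 65};  // billions
        |         const double bytes_per_wt[] = {2.0, 1.0, 0.5};  // f16, int8, ~int4
        |         const char*  fmt_name[]     = {"f16 ", "int8", "int4"};
        |     
        |         for (double p : params_b) {
        |             for (int f = 0; f < 3; ++f) {
        |                 double gib = p * 1e9 * bytes_per_wt[f] / (1 << 30);
        |                 std::printf("%2.0fB %s ~%5.1f GiB\n", p, fmt_name[f], gib);
        |             }
        |         }
        |     }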
        
         | loufe wrote:
         | At 4 bits the 13B LLaMa model can run on a 10GB card!
        
         | thewataccount wrote:
          | 16GB of VRAM can run the 7B for sure. I'm not sure what the
          | most cutting-edge memory optimization is, but the 15B is going
          | to be pretty tight - I'm not sure that'll fit with what I know
          | of, at least. I've got it working at a bit over 20GB of VRAM, I
          | think, at 8-bit.
          | 
          | If you can't fit it all in VRAM you can still run it, but it'll
          | be slooooow - at least that's been my experience with the 30B.
        
         | 0xbadc0de5 wrote:
         | The 4-bit GPTQ LLaMA models are the current top-performers.
         | This site has done a lot of the heavy lifting:
         | https://github.com/qwopqwop200/GPTQ-for-LLaMa
         | 
         | With 30b-4bit on a RTX 4090, I'm seeing numbers like:
         | 
         | Output generated in 4.17 seconds (4.03 tokens/s, 21 tokens)
         | 
         | Output generated in 4.38 seconds (4.25 tokens/s, 23 tokens)
         | 
         | Output generated in 4.57 seconds (4.25 tokens/s, 24 tokens)
         | 
         | Output generated in 3.86 seconds (3.40 tokens/s, 17 tokens)
         | 
          | The smaller sizes (7b, 13b) are even faster, with lower memory
          | use. A 16GB 3080 should be able to run the 13b at 4-bit just
          | fine with reasonable (>1 token/s) latency.
        
         | TaylorAlexander wrote:
         | Well I was able to run the original code with the 7B model on
         | 16GB vram: https://news.ycombinator.com/item?id=35013604
         | 
         | The output I got was underwhelming, though I did not attempt
         | any tuning.
        
           | fnbr wrote:
            | Parameter tuning is pretty necessary, according to anecdotes.
            | People on Twitter have gotten good results by changing the
            | default parameters.
        
       | lxe wrote:
       | I'm running 4-bit quantized llamas on torch/cuda with
       | https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing
       | significant tokens/second perf degradation compared to 8-bit
       | bitsandbytes mode. I'm very new to this, and understand very
       | little detail, but I thought it would be faster?
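        | 
        | My rough mental model of where the extra work could come from
        | (just a guess - the real GPTQ kernels are more sophisticated, and
        | the block layout below is illustrative, not GPTQ's actual
        | format): every 4-bit weight has to be unpacked from a nibble and
        | rescaled before it can be used, e.g.:
        | 
        |     #include <cstddef>
        |     #include <cstdint>
        |     
        |     // Dequantize n 4-bit weights (two per byte) with one scale
        |     // per block of 32 weights. If this isn't fused into the
        |     // matmul kernel, it adds per-weight work versus 8-bit.
        |     void dequant_q4(const uint8_t* packed, const float* scales,
        |                     float* out, size_t n) {
        |         const size_t block = 32;
        |         for (size_t i = 0; i < n; ++i) {
        |             const uint8_t byte = packed[i / 2];
        |             const int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        |             out[i] = (q - 8) * scales[i / block];  // centre around zero
        |         }
        |     }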
        
       | dekhn wrote:
       | This is sort of the polar opposite of how modern high performance
       | ML frameworks are built. Skimming the code, there's a ton of
       | boilerplate for the various operations that could be library-ized
       | and generified, if that makes sense.
       | 
       | I actually really like minimal implementations of state-of-the-
       | art systems because the code is much easier to understand (modern
       | frameworks are super-complex) but I wonder what it means long-
       | term if you don't need frameworks.
        
         | SCLeo wrote:
          | This is just inference. The core of most ML libraries is the
          | automatic differentiation capability. It would be extremely
          | tedious to calculate the gradients manually, and if you
          | implemented your own AD it would effectively be a minified
          | version of an ML library.
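          | 
          | To make "calculating the gradients manually" concrete, here's
          | what it looks like for even a single linear layer with a
          | squared-error loss (a toy sketch - every op you add to a model
          | needs this derived by hand, which is what autodiff automates):
          | 
          |     #include <cstdio>
          |     
          |     int main() {
          |         float w = 0.5f, b = 0.1f;  // parameters
          |         float x = 2.0f, t = 1.0f;  // input and target
          |         float lr = 0.1f;           // learning rate
          |     
          |         // forward pass: y = w*x + b, L = (y - t)^2
          |         float y = w * x + b;
          |         float L = (y - t) * (y - t);
          |     
          |         // backward pass, derived by hand with the chain rule
          |         float dL_dy = 2.0f * (y - t);
          |         float dL_dw = dL_dy * x;
          |         float dL_db = dL_dy;
          |     
          |         // one SGD step
          |         w -= lr * dL_dw;
          |         b -= lr * dL_db;
          |     
          |         std::printf("loss=%.3f new w=%.3f new b=%.3f\n", L, w, b);
          |     }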
        
       | gumby wrote:
       | Thanks for doing this, nice work!
       | 
       | Please add some sort of license.
        
       | SheinhardtWigCo wrote:
       | Super cool project. This is from the author of whisper.cpp, which
       | enables highly accurate real-time audio transcription on the
       | M1/M2:
       | 
       | https://github.com/ggerganov/whisper.cpp
        
       | mrtksn wrote:
        | The performance on a MacBook with an M1 Pro is said to be 20
        | tokens/s:
       | 
       | https://twitter.com/ggerganov/status/1634282694208114690
        
         | eis wrote:
          | A "MacBook Pro M1" would have the base M1 chip, while he was
          | referring to the M1 Pro chip in something like a MacBook Pro
          | with an M1 Pro. It's confusing naming by Apple.
        
           | mrtksn wrote:
           | right, fixed it
        
         | lxe wrote:
         | This is faster than running it on an RTX 4090 I think.
        
           | [deleted]
        
       | olliej wrote:
        | It isn't using any of the Apple Silicon "ML"
        | coprocessor/extensions/whatever, so it's just normal SIMD.
        
         | astrange wrote:
          | It uses Accelerate, so it may be using some of that indirectly.
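          | 
          | For the big matrix multiplications, going through Accelerate
          | means calling its BLAS - roughly a cblas_sgemm call like the
          | sketch below (illustrative; not necessarily the exact call site
          | in ggml):
          | 
          |     // Single-precision matmul via Apple's Accelerate framework.
          |     // Build with: clang++ example.cpp -framework Accelerate
          |     #include <Accelerate/Accelerate.h>
          |     #include <vector>
          |     
          |     int main() {
          |         const int M = 2, N = 3, K = 4;
          |         std::vector<float> A(M * K, 1.0f);  // M x K
          |         std::vector<float> B(K * N, 2.0f);  // K x N
          |         std::vector<float> C(M * N, 0.0f);  // M x N result
          |     
          |         // C = 1.0 * A * B + 0.0 * C
          |         cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
          |                     M, N, K,
          |                     1.0f, A.data(), K,
          |                           B.data(), N,
          |                     0.0f, C.data(), N);
          |         return 0;
          |     }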
        
       | throwuwu wrote:
       | Now someone translate it to zig
        
       | thewataccount wrote:
        | I don't have the hardware to run the 65B model to test this at
        | the moment -
        | 
        | How does it perform with programming, for example making a basic
        | Python script to scrape a website, or a bash script, etc?
        | 
        | I've managed to run the 13B* at 8-bit with decent performance on
        | a 4090 - but it's only 24GB of VRAM, so I've been struggling to
        | run the 30B at anything more than a snail's pace.
        
         | 0xbadc0de5 wrote:
         | The 13b and 30b run quite well on a 4090 at 4-bit quantization.
        
           | thewataccount wrote:
            | Ah dang, I missed that - I was still using the 8-bit mode.
            | I'll look into that, thanks!
        
         | arthurcolle wrote:
         | you mean the 13B ?
        
           | thewataccount wrote:
            | Yeah, my bad - everyone is a bit all over the place with the
            | numbers in this thread.
            | 
            | I'm not exactly sure how these numbers were chosen; they seem
            | a bit odd?
        
       | mattfrommars wrote:
        | I have very limited knowledge in this domain.
        | 
        | Why is it necessary to port LLaMA to C++? Assuming the original
        | model implementation was in Python, wouldn't it only take a few
        | tweaks to make it work on Apple Silicon?
        
       ___________________________________________________________________
       (page generated 2023-03-10 23:00 UTC)