[HN Gopher] LLaMA-7B in Pure C++ with full Apple Silicon support
___________________________________________________________________
 
  LLaMA-7B in Pure C++ with full Apple Silicon support
 
  Author : mrtksn
  Score  : 181 points
  Date   : 2023-03-10 20:01 UTC (2 hours ago)
 
  (HTM) web link (github.com)
  (TXT) w3m dump (github.com)
 
| jcq3 wrote:
| Don't know anything about ML. Can someone explain what the hype is about?
| 
| lxe wrote:
| You can run your own ChatGPT on your MacBook.
| 
| stagger87 wrote:
| https://en.wikipedia.org/wiki/Machine_learning
| 
| potatoman22 wrote:
| This is an advanced language model that can now run quickly on consumer-grade hardware. You used to need thousands of dollars of GPUs to run a model as sophisticated as this - now it can be done on a laptop.
| 
| mattfrommars wrote:
| Wasn't LLaMA officially meant to run on consumer-grade machines? How does this modify the model to make it work?
| 
| All of this is confusing.
| 
| hot_gril wrote:
| Run meaning run inference, not train, right?
| 
| TOMDM wrote:
| Yes
| 
| Szpadel wrote:
| I tried the llama 14b model using one of the online tools (mentioned in another HN comment; it claimed to use that model, but I cannot be 100% sure) and I was very disappointed with the results. I asked it a fairly simple question: to write a regex validating an email address. I tried this twice and got the responses "what is your question" and "please wait...", so it just dodged the question. In contrast, ChatGPT was able to provide me with a simple regex, and also an RFC-compliant version when asked.
| 
| lxe wrote:
| I've gotten great Q&A / conversational results from llama 7b and 13b using https://github.com/oobabooga/text-generation-webui. For example:
| 
| "How does the Schrodinger's cat thought experiment relate to the principles of quantum mechanics?"
| "In this famous paradox, a cat inside a box is simultaneously both alive and dead due to the nature of subatomic particles. This demonstrates that at the smallest scales of reality, we cannot predict whether something will happen until we observe it happening."
| 
| "What are some of the potential consequences of climate change on global food security?"
| "According to researchers at Stanford University, rising temperatures could lead to crop failures and reduced harvests for staple crops such as rice, wheat, corn, and soybeans by midcentury."
| 
| "Can you provide a brief overview of the history of political philosophy, from Ancient Greece to the present day?"
| "Absolutely! Political philosophy is a branch of philosophy concerning the role of government in society. In ancient times, Plato and Aristotle were among the first thinkers to consider the subject in depth. Since then, other influential philosophers such as Thomas Hobbes and John Locke have expanded upon these ideas. Today, contemporary philosophers such as John Rawls continue to debate the merits of different political systems."
| 
| lxe wrote:
| And 30b:
| 
| Give me a basic recipe for easy spaghetti carbonara
| 
| Here is a basic recipe for spaghetti carbonara: Ingredients: 1 lb spaghetti, 3 tbsp butter, 2 tbsp olive oil, 1/4 cup grated Parmesan cheese, 1 egg, 1/2 teaspoon salt, 1/4 teaspoon pepper, 1/2 cup chopped parsley, 1/2 pound bacon, 1 clove garlic, 1/4 cup heavy cream. Directions: Cook spaghetti according to package directions; drain. Heat butter and oil in large skillet on medium heat. Add bacon; cook 4 minutes or until crisp. Remove from skillet. Add garlic to drippings in pan; saute 1 minute. Stir in egg, salt, pepper, and parsley; stirring constantly. Add bacon and spaghetti; toss well. Sprinkle with cheese. Serves 4.
| 
| zabzonk wrote:
| Doesn't say what to do with the cream. Sprinkling with cheese is a bit wrong. Carbonara sauce is more difficult to make than you (or an AI) might think.
| 
| IanCal wrote:
| Outside of it possibly being a smaller model, the prompts should be different: llama hasn't been trained to take instructions, so I think it would need to be framed more as "A regex for emails is " and let it complete the rest. Just in case you tried the exact same prompt.
| 
| Also, those messages sound suspiciously like error messages from a system around it.
| 
| endymi0n wrote:
| Llama is neither instruction-finetuned nor ranked by reinforcement learning from human feedback. It's a base model, like the original GPT-3, that needs much more careful prompting and few-shot examples to produce similar results.
| 
| dmm wrote:
| sillysaurusx made some posts about tweaks that greatly improved output for him. Maybe they would help you:
| 
| "Fixed the llama sampler. After turning off top_p, adding top_k 40, setting temp to 0.7, and adding a repetition penalty of 1/0.85, llama 7B is looking nice."
| 
| https://twitter.com/theshawwn/status/1632569215348531201?cxt...
| https://twitter.com/theshawwn/status/1633456289639542789?cxt...
| 
| Szpadel wrote:
| For reference, I tried it here: https://lite.koboldai.net/
| 
| with chat mode, and parameters: temp 0.62, top_p 0.9, repetition penalty 1.08
| 
| It is free to try, but you have to wait in a queue, so I had to wait 10 minutes per response and I didn't have the patience to test it further.
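
For readers wondering what those sampler settings actually do, here is a rough
C++ sketch of that style of decoding step: a repetition penalty of 1/0.85
applied to recently generated tokens, top-k truncation at 40 candidates, and
softmax sampling at temperature 0.7. This is only an illustration of the
technique described in the tweet; the function, its defaults, and its shape
are made up for this example and are not the actual llama.cpp or tinygrad
sampler.

    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <unordered_set>
    #include <vector>

    // Pick the next token from raw logits using a repetition penalty,
    // top-k truncation and temperature scaling (illustrative sketch only).
    int sample_token(std::vector<float> logits,
                     const std::vector<int>& recent_tokens,
                     std::mt19937& rng,
                     int top_k = 40,
                     float temperature = 0.7f,
                     float repeat_penalty = 1.0f / 0.85f) {
        // Penalize tokens that already appeared in the recent context so the
        // model is less likely to repeat itself verbatim.
        std::unordered_set<int> seen(recent_tokens.begin(), recent_tokens.end());
        for (int t : seen) {
            if (logits[t] > 0) logits[t] /= repeat_penalty;
            else               logits[t] *= repeat_penalty;
        }

        // Keep only the top_k highest-scoring candidates.
        std::vector<int> idx(logits.size());
        for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
        if ((int)idx.size() > top_k) {
            std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                              [&](int a, int b) { return logits[a] > logits[b]; });
            idx.resize(top_k);
        }

        // Softmax over the survivors at the chosen temperature, then sample.
        float max_logit = logits[idx[0]];
        for (int t : idx) max_logit = std::max(max_logit, logits[t]);
        std::vector<float> probs;
        probs.reserve(idx.size());
        for (int t : idx)
            probs.push_back(std::exp((logits[t] - max_logit) / temperature));
        std::discrete_distribution<int> dist(probs.begin(), probs.end());
        return idx[dist(rng)];
    }

Greedy argmax decoding tends to make a base model loop, which is one reason a
repetition penalty plus a bit of randomness usually reads much better than the
defaults people report being disappointed by.
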
| DesiLurker wrote:
| But why would you do C++ when it's quite clear the ML load is highly parallel? The page says it's vectorized with NEON, but there's no mention of whether it's autovectorized by GCC or hand-optimized. That will have a pretty significant performance impact.
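
On the NEON question: as far as I can tell, ggml-style inference code relies
on explicit hand-written ARM intrinsics for the hot dot-product loops rather
than on compiler autovectorization. A minimal sketch of what such a kernel
looks like for plain float32 (illustrative only, not code from the
repository):

    #include <arm_neon.h>
    #include <cstddef>

    // Hand-vectorized float32 dot product in the style of a ggml kernel
    // (illustrative sketch, not the actual ggml implementation).
    float dot_f32_neon(const float* x, const float* y, size_t n) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        size_t i = 0;
        // Process 4 floats per iteration with a fused multiply-add.
        for (; i + 4 <= n; i += 4) {
            float32x4_t a = vld1q_f32(x + i);
            float32x4_t b = vld1q_f32(y + i);
            acc = vfmaq_f32(acc, a, b);
        }
        // Horizontal sum of the accumulator lanes, then a scalar tail.
        float sum = vaddvq_f32(acc);
        for (; i < n; ++i) sum += x[i] * y[i];
        return sum;
    }

Real kernels typically work on f16 or quantized blocks and keep several
accumulators in flight to hide FMA latency, but the overall shape of the loop
is the same.
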
| xiphias2 wrote:
| George Hotz already implemented LLaMA 7B and 15B on Twitch yesterday on GPU in the tinygrad llama branch:
| 
| https://github.com/geohot/tinygrad/tree/llama
| 
| The only problem is that it's swapping on a 16GB MacBook, so you need at least 24GB in practice.
| 
| jalino23 wrote:
| That is so impressive!
| 
| attractivechaos wrote:
| Both are _very_ impressive. A nice thing about Gerganov's implementation is that it is written in almost pure C. Arguably easier for deployment.
| 
| apetresc wrote:
| Can you link to the Twitch stream VOD?
| 
| vopi wrote:
| https://www.twitch.tv/georgehotz/videos?filter=archives&sort...
| 
| Although there is a VOD channel on YT that might be better.
| 
| yieldcrv wrote:
| Why does it say video unavailable?
| 
| vopi wrote:
| Hmm, I'm not sure. Presumably this will be updated soon: https://www.youtube.com/@geohotarchive/videos
| 
| stagger87 wrote:
| I'm pretty sure he only makes his past Twitch streams available to subscribers. It's weird they are available on YouTube; maybe he doesn't know that's something he can change.
| 
| jonas21 wrote:
| I don't think he runs the YouTube channel - it's managed by some of his fans. But what's he gonna do about it? Send them a takedown notice?
| 
| Mistletoe wrote:
| This is such a refreshing and neat way to use Twitch.
| 
| technerder wrote:
| Absolutely love ggerganov's approach with models like this and Whisper. It's just awesome being able to experiment with (what I consider) complex models without needing a billion Python/C/C++ dependencies!
| 
| 0xbadc0de5 wrote:
| That's all fine and good. But to do anything useful, you're going to want a powerful GPU (RTX 3090, RTX 4090 or A6000) with as much VRAM as possible. Unlike the diffusion models, LLMs are very memory-intensive, even at 4-bit GPTQ. The larger models like llama-13b and llama-30b run quite well at 4-bit on a 24GB GPU. The llama-65b-4bit should run on a dual 3090/4090 rig.
| 
| Coupled with the leaked Bing prompt and text-generation-webui, the results are quite impressive.
| 
| brookst wrote:
| Macs have UMA, so an off-the-shelf Mac can use up to about 120GB of VRAM. Far more than any consumer card, more than the biggest A100.
| 
| GPU power is lower, of course, but pure VRAM is not a problem.
| 
| stu2b50 wrote:
| VRAM is the thing that Apple Silicon is going to have in excess compared to anything even close in price. MacBook Airs can have 14-15GB of VRAM if necessary.
| 
| kir-gadjello wrote:
| The RAM read bandwidth of the M1/M2 is still not really competitive with large GPUs like the RTX 3090, but it's getting close compared with the usual CPU setups.
| 
| 2bitencryption wrote:
| Could someone with experience explain: what's the theoretical minimum hardware requirement for llama 7B, 15B, etc. that still provides output on the order of <1 sec/token?
| 
| It seems like we can pull some tricks, like using F16, and some kind of quantization, etc.
| 
| At the end of the day, how much overhead is left that can be reduced? What can I expect to have running on 16GB RAM with a 3080 and a midrange AMD processor?
| 
| loufe wrote:
| At 4 bits the 13B LLaMA model can run on a 10GB card!
| 
| thewataccount wrote:
| 16GB of VRAM can run the 7B for sure. I'm not sure what the most cutting-edge memory optimization is, but the 15B is going to be pretty tight; I'm not sure that'll fit, with what I know of at least. I've got it working at a bit over 20GB of VRAM, I think, at 8-bit.
| 
| If you can't fit it all in VRAM you can still run it, but it'll be slooooow. At least that's been my experience with the 30b.
| 
| 0xbadc0de5 wrote:
| The 4-bit GPTQ LLaMA models are the current top performers. This site has done a lot of the heavy lifting: https://github.com/qwopqwop200/GPTQ-for-LLaMa
| 
| With 30b-4bit on an RTX 4090, I'm seeing numbers like:
| 
| Output generated in 4.17 seconds (4.03 tokens/s, 21 tokens)
| Output generated in 4.38 seconds (4.25 tokens/s, 23 tokens)
| Output generated in 4.57 seconds (4.25 tokens/s, 24 tokens)
| Output generated in 3.86 seconds (3.40 tokens/s, 17 tokens)
| 
| The lower sizes (7b, 13b) are even faster with lower memory use. A 16GB 3080 should be able to run the 13b at 4-bit just fine with reasonable (>1 token/s) latency.
| 
| TaylorAlexander wrote:
| Well, I was able to run the original code with the 7B model on 16GB VRAM: https://news.ycombinator.com/item?id=35013604
| 
| The output I got was underwhelming, though I did not attempt any tuning.
| 
| fnbr wrote:
| Parameter tuning is pretty necessary, according to anecdotes. People on Twitter have gotten good results by changing the default parameters.
| 
| lxe wrote:
| I'm running 4-bit quantized llamas on torch/cuda with https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing significant tokens/second perf degradation compared to 8-bit bitsandbytes mode. I'm very new to this, and understand very little detail, but I thought it would be faster?
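
A rough way to sanity-check the VRAM figures in this sub-thread: weight
storage is approximately parameter count times bytes per weight, and the KV
cache, activations and quantization metadata come on top of that. The sketch
below just does that multiplication; the parameter counts are approximate and
real memory use will be somewhat higher.

    #include <cstdio>

    // Back-of-the-envelope weight-memory estimate: parameters * bytes per
    // weight. Real usage is higher (KV cache, activations, quant metadata).
    int main() {
        const double billion = 1e9;
        struct Model { const char* name; double params; } models[] = {
            {"llama-7b",  7 * billion},
            {"llama-13b", 13 * billion},
            {"llama-30b", 33 * billion},  // the "30b" model is roughly 33B params
            {"llama-65b", 65 * billion},
        };
        const double bytes_per_weight[] = {2.0, 1.0, 0.5};  // f16, 8-bit, 4-bit
        const char* fmt_name[] = {"f16", "8-bit", "4-bit"};

        for (const auto& m : models) {
            std::printf("%-10s", m.name);
            for (int f = 0; f < 3; ++f) {
                double gib = m.params * bytes_per_weight[f] / (1024.0 * 1024 * 1024);
                std::printf("  %5s: %6.1f GiB", fmt_name[f], gib);
            }
            std::printf("\n");
        }
        return 0;
    }

That gives roughly 3.3 GiB of weights for 7B and 6 GiB for 13B at 4 bits,
which lines up with the 10GB-card and 16GB-3080 comments above, and roughly
15 GiB for 30B and 30 GiB for 65B, matching the 24GB-GPU and dual-GPU
estimates.
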
| dekhn wrote:
| This is sort of the polar opposite of how modern high-performance ML frameworks are built. Skimming the code, there's a ton of boilerplate for the various operations that could be library-ized and generified, if that makes sense.
| 
| I actually really like minimal implementations of state-of-the-art systems because the code is much easier to understand (modern frameworks are super complex), but I wonder what it means long-term if you don't need frameworks.
| 
| SCLeo wrote:
| This is just inference. The core of most ML libraries is the automatic differentiation capability. It would be extremely tedious to calculate the gradients manually. Or, if you implemented your own AD, it would effectively be a minified version of an ML library.
| 
| gumby wrote:
| Thanks for doing this, nice work!
| 
| Please add some sort of license.
| 
| SheinhardtWigCo wrote:
| Super cool project. This is from the author of whisper.cpp, which enables highly accurate real-time audio transcription on the M1/M2:
| 
| https://github.com/ggerganov/whisper.cpp
| 
| mrtksn wrote:
| The performance on a MacBook with M1 Pro is said to be 20 tokens/s:
| 
| https://twitter.com/ggerganov/status/1634282694208114690
| 
| eis wrote:
| A MacBook Pro M1 would have the base M1 chip, while he was referring to the M1 Pro chip in something like a MacBook Pro with M1 Pro. It's confusing naming by Apple.
| 
| mrtksn wrote:
| Right, fixed it.
| 
| lxe wrote:
| This is faster than running it on an RTX 4090, I think.
| 
| [deleted]
| 
| olliej wrote:
| It isn't using any of the Apple Silicon "ML" coprocessors/extensions/whatever, so it's just normal SIMD.
| 
| astrange wrote:
| It uses Accelerate, so it may be using some of that indirectly.
| 
| throwuwu wrote:
| Now someone translate it to Zig
| 
| thewataccount wrote:
| I don't have the hardware to run the 60B model to test this at the moment.
| 
| How does it perform with programming, for example making a basic Python script to scrape a website, or a Bash script, etc.?
| 
| I've managed to run the 13B* at 8-bit with decent performance on a 4090, but it's only 24GB of VRAM, so I've been struggling to run the 30B at anything more than a snail's pace.
| 
| 0xbadc0de5 wrote:
| The 13b and 30b run quite well on a 4090 at 4-bit quantization.
| 
| thewataccount wrote:
| Ah dang, I missed that. I was still using the 8-bit mode; I'll look into that, thanks!
| 
| arthurcolle wrote:
| You mean the 13B?
| 
| thewataccount wrote:
| Yeah, my bad. Everyone is a bit all over the place with the numbers in this thread.
| 
| I'm not exactly sure how these numbers were chosen; they seem a bit odd?
| 
| mattfrommars wrote:
| I have very limited experience in this domain.
| 
| Why is it necessary to port LLaMA to C? Assuming the original model implementation was in Python, did it not require just a few tweaks to make it work on Apple Silicon?
___________________________________________________________________
(page generated 2023-03-10 23:00 UTC)