[HN Gopher] Cerebras-GPT vs. LLaMA AI Model Performance Comparison
___________________________________________________________________

Cerebras-GPT vs. LLaMA AI Model Performance Comparison

Author : freeqaz
Score  : 73 points
Date   : 2023-03-29 19:26 UTC (3 hours ago)

(HTM) web link (www.lunasec.io)
(TXT) w3m dump (www.lunasec.io)

| ftxbro wrote:
| For context, the Cerebras models were trained in only a couple
| of weeks, and the purpose of the training was to establish a
| scaling law for compute-optimal training, presumably for
| predicting what will happen with larger models, where it's more
| important to train in a compute-optimal way. This is a different
| goal than that of other research projects that try to get the
| most power per VRAM out of small models.
| pama wrote:
| The "law" was already established empirically, and it is only of
| relevance as a technical detail to the few specialists who may
| care. I think it was a strategic mistake to only release models
| that are weaker than what people can already get their hands on.
| Is there a limit on how far that hardware scales to larger
| models? As a hardware company trying to stay in the game, they
| should show some signs of dominance, not just an Apache license.
| freeqaz wrote:
| That makes sense, especially since they're not intending to
| deploy this model to production. For models like GPT-3/4 it
| makes sense why they would train them more, because the cost of
| running inference "in production" likely dominates the compute
| costs. (Just like how YouTube will spend 50x more compute to
| compress a video an extra 2%, because bandwidth costs far
| outstrip the compression costs.)
|
| Do you know what percentage, roughly, this model has been
| trained relative to something like LLaMA? Are we talking 10%?
| 50%? 90%?
|
| It may be possible that it is still useful if it can be trained
| further by the community!
| gpm wrote:
| LLaMA 65B and 30B were trained on 1.4 trillion tokens. This
| model was trained on 260 billion tokens.
| freeqaz wrote:
| So ~18.6% trained relative to LLaMA (260B / 1.4T ≈ 0.186).
| That's not _nothing_ but it's also not great. Thanks for
| digging into this!
| breadchris wrote:
| Wow! I didn't even know that Cerebras was a thing, and I have
| been trying to keep up to date with this stuff!
| ftxbro wrote:
| Other Hacker News Cerebras discussion is here:
| https://news.ycombinator.com/item?id=35343763
| imaurer wrote:
| Submit new links here :)
|
| https://github.com/imaurer/awesome-decentralized-llm
| dumbaccount123 wrote:
| Oh my god, enough. Please let us just go back to living in
| caves. This is becoming unbearable at this point.
| knicholes wrote:
| Nobody is stopping you from living in a cave! ... at least I
| don't think so.
| uejfiweun wrote:
| I dunno man, you seen cave prices these days? The cave market
| is in a tough spot until the Fed starts cutting...
| jeron wrote:
| "Powell no cut interest rates because economy strong" - chatGPT
| sbierwagen wrote:
| >> Is 10000 bigger than 10050?
|
| >> Yes, 10000 is bigger than 10050.
|
| > But even the mighty ChatGPT often can't do simple math
|
| GPT is bad at math because BPE input compression obfuscates
| individual digits. https://bbot.org/etc/gpt-math.png You'd be
| bad at math too if every number was scrambled.
|
| The graph is from page 22 of the GPT-3 paper from 2020.
| https://arxiv.org/abs/2005.14165 Even with 175 billion
| parameters, it can't reliably do four-digit addition.
|
| An example from 4 days ago of ChatGPT being as bad as you'd
| expect at string reversal:
| https://news.ycombinator.com/item?id=35297183
|
| (Although, I just tested the ChatGPT Mar 14 Version against
| the above question after doing a bunch of math prompting, and
| it got it right...)
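
sbierwagen's tokenizer point is easy to check locally. A minimal
sketch in Python, assuming the tiktoken package and that its
"gpt2" encoding is close enough to the GPT-3-era tokenizer to
illustrate the behavior:

  # pip install tiktoken
  import tiktoken

  enc = tiktoken.get_encoding("gpt2")

  ids = enc.encode("10 11 12 13 14 15 16")
  print(ids)                             # one opaque ID per number
  print([enc.decode([i]) for i in ids])  # the pieces the model sees

  # Consecutive numbers map to unrelated token IDs, so "11 is
  # 10 + 1" is not visible at the input layer; the model has to
  # memorize such facts instead of generalizing.
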
| croddin wrote:
| I'm not sure if these models use the GPT tokenizer, but if you
| type a long string of numbers into
| https://platform.openai.com/tokenizer, you can see the tokens
| that the LLM would see. What the LLMs get as input for math is
| significantly worse than having to do mental math with Roman
| numerals. Tokenizing makes sense for words, but for numbers it
| seems like the LLMs would have to learn a lot more steps. I
| wonder if limiting number tokens to 2 digits per token, instead
| of the 1-3 digits they currently get, would improve models'
| math.
| [deleted]
| sillysaurusx wrote:
| This is a common myth, which I've written about before.
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
|
| The closest anyone's come to proving that byte-level
| tokenization is better is the ByT5 paper:
| https://arxiv.org/abs/2105.13626
|
| But they only showed evidence of improvement on specific tasks,
| not general performance, which is an important distinction. And
| their own benchmarks show that the improvements tend to be
| marginal: https://i.imgur.com/6Cw0APS.png
|
| One view of the situation is that byte-level access (or
| "digit-level" in this case) gives a way to accelerate training,
| and to achieve higher performance with fewer parameters. The
| model doesn't need to spend as much effort on learning the
| decompression algorithm (tokenization).
|
| But once learned, the tokenization doesn't seem to hinder a
| model from achieving higher performance, the same way that JPG
| compression doesn't hinder us from producing an image that
| looks very good to humans. It's a bit like arguing an artist
| would be better if they only operated on raw bitmaps, or that
| our eyes would be better if our visual cortex didn't do any
| signal compression. Maybe, but the fact that our eyes do it is
| pretty strong evidence that compression isn't harmful.
| sbierwagen wrote:
| I'm not sure how this is germane. I'm talking _about_ specific
| tasks: saying whether 10000 or 10050 is larger. GPT is
| demonstrably bad at that. The ByT5 paper doesn't mention
| arithmetic tasks or show benchmark results for the specific
| task I mention.
|
| Your linked comment says:
|
| > This is a common myth but in practice no one (as far as I
| know) has shown that byte level predictions result in superior
| overall performance.
|
| Stating whether BPE or character tokenization is better for
| everything is a much broader claim, one I didn't make! One
| could easily imagine a toolformer that calls out to calc.exe
| for anything involving numbers, which would get much better
| numeric performance while still using BPEs.
| sillysaurusx wrote:
| > GPT is bad at math because BPE input compression obfuscates
| individual digits. https://bbot.org/etc/gpt-math.png You'd be
| bad at math too if every number was scrambled.
|
| This is the myth I was referring to. BPE compression may slow
| down training, but it doesn't follow that slower training is
| the reason for being bad at math.
|
| If you trained GPT specifically on arithmetic tasks, you'd get
| superior performance to GPT-3 regardless of which tokenization
| scheme you used. But you'd destroy most of its knowledge about
| everything not-arithmetic.
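
croddin's two-digits-per-token idea above can be pictured as a
preprocessing pass over the text before tokenization. A
hypothetical sketch (the function, its name, and the zero-padding
choice are illustrative assumptions, not anything these models
actually do):

  import re

  def chunk_digits(text: str, width: int = 2) -> str:
      """Rewrite every number as fixed-width digit chunks."""
      def split_number(m: re.Match) -> str:
          digits = m.group(0)
          # Left-pad with zeros so place value lines up across
          # numbers of different lengths.
          if len(digits) % width:
              digits = digits.zfill(
                  len(digits) + width - len(digits) % width)
          return " ".join(digits[i:i + width]
                          for i in range(0, len(digits), width))
      return re.sub(r"\d+", split_number, text)

  print(chunk_digits("Is 10000 bigger than 10050?"))
  # -> "Is 01 00 00 bigger than 01 00 50?"

Every chunk is now one of only 100 possible two-digit pieces, so
the model sees a small, regular vocabulary for numbers instead of
thousands of arbitrary multi-digit tokens.
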
| sbierwagen wrote:
| > BPE compression may slow down training, but it doesn't
| follow that slower training is the reason for being bad at
| math.
|
| It's not so much that it _slows down_ training; it's that it
| completely destroys the relationship between digits and
| results. Every number is assigned an effectively random token
| ID, so GPT-3 had to memorize every operation separately. It
| couldn't generalize at all, which is why it got worse at
| larger numbers, which showed up less often in the training
| set -- no examples to remember.
|
| You can try the tokenizer online here:
| https://platform.openai.com/tokenizer
|
| It assigns the input text `10 11 12 13 14 15 16` token IDs
| `940, 1367, 1105, 1511, 1478, 1315, 1467`. How is it supposed
| to figure out incrementing numbers from that? Well, it can't,
| so it memorizes them. "Neural nets want to work"!
|
| I used the past tense above because, while writing this
| comment, I asked the ChatGPT Mar 14 Version a bunch of
| many-digit addition and subtraction questions and it got them
| all right. Then I asked it whether one of those large numbers
| contained an 8 and it... hallucinated a completely wrong
| answer, oops: https://bbot.org/etc/gpt-math2.png It's also
| still spotty at multiplication: "The product of 82368 and
| 33333 is 2745504384." Well, you got the first five digits
| right...
| f_devd wrote:
| > If you trained GPT specifically on arithmetic tasks
|
| Sure, but you'd have a lot of overlapping tokens with BPE,
| which doesn't help with convergence. GP is claiming it's
| specifically worse at arithmetic because of BPE, which is
| true.
| stormfather wrote:
| Are you referring to how BPE is permutation-invariant?
| (ignoring positional encoding)
| spyder wrote:
| There was a paper that found that converting numbers to
| scientific notation (like 1.5e-7) improved these
| transformer-based language models at math, if I remember
| correctly. (With a quick search I could not find the link to
| it now.)
| cma wrote:
| LLaMA tokenized digits individually; how much did that fix the
| issue?
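
On cma's question: the LLaMA paper does state that all numbers
are split into individual digits before BPE is applied. A toy
illustration of what that buys, assuming a simplified
whitespace-split world rather than the real SentencePiece
tokenizer:

  def toy_digit_tokenize(text: str) -> list[str]:
      """Split numbers into single digits, LLaMA-style."""
      tokens: list[str] = []
      for word in text.split():
          if word.isdigit():
              tokens.extend(word)  # one token per digit
          else:
              tokens.append(word)
      return tokens

  print(toy_digit_tokenize("10 11 12 13"))
  # -> ['1', '0', '1', '1', '1', '2', '1', '3']

The increment between consecutive numbers is now a digit-level
regularity the model can learn, instead of a jump between
unrelated token IDs. How much this closed the gap in practice is
the open question cma is asking.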