[HN Gopher] Cerebras-GPT vs. LLaMA AI Model Performance Comparison
       ___________________________________________________________________
        
       Cerebras-GPT vs. LLaMA AI Model Performance Comparison
        
       Author : freeqaz
       Score  : 73 points
       Date   : 2023-03-29 19:26 UTC (3 hours ago)
        
 (HTM) web link (www.lunasec.io)
 (TXT) w3m dump (www.lunasec.io)
        
       | ftxbro wrote:
        | For context, the Cerebras models were trained in only a couple
        | of weeks, and the purpose of the training was to establish a
        | scaling law for compute-optimal training, presumably for
        | predicting what will happen with larger models, where training
        | in a compute-optimal way matters more. This is a different
        | goal from that of other research projects that try to get the
        | most capability per unit of VRAM out of small models.
        
         | pama wrote:
          | The "law" had already been established empirically and is
          | only of relevance as a technical detail to the few
          | specialists who care. I think it was a strategic mistake to
          | release only models that are weaker than what people can
          | already get their hands on. Is there a limit on how that
          | hardware scales to larger models? As a hardware company
          | trying to stay in the game, they should show some signs of
          | dominance, not just an Apache license.
        
         | freeqaz wrote:
          | That makes sense, especially since they're not intending to
          | deploy this model to production. For models like GPT-3/4 it
          | makes sense to train them longer, because the cost of
          | running inference "in production" likely dominates the
          | training compute cost. (Just like how YouTube will spend 50x
          | more compute to compress a video an extra 2%, because
          | bandwidth costs far outstrip the compression costs.)
          | 
          | Do you know roughly what fraction of training this model has
          | received relative to something like LLaMA? Are we talking
          | 10%? 50%? 90%?
          | 
          | It may still be useful if the community can train it further!
        
           | gpm wrote:
           | LLaMa 65B and 30B were trained on 1.4 trillion tokens. This
           | model was trained on 260 billion tokens.
        
             | freeqaz wrote:
              | So ~18.6% as many training tokens as LLaMA. That's not
              | _nothing_, but it's also not great. Thanks for digging
              | into this!
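              | 
              | Back-of-envelope, in Python, using the token counts from
              | the parent comment:
              | 
              |   tokens_cerebras = 260e9   # Cerebras-GPT training tokens
              |   tokens_llama = 1.4e12     # LLaMA 30B/65B training tokens
              |   print(f"{tokens_cerebras / tokens_llama:.1%}")  # -> 18.6%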
        
       | breadchris wrote:
       | Wow! I didn't even know that Cerebras was a thing and I have been
       | trying to keep up to date with this stuff!
        
         | ftxbro wrote:
          | Other Hacker News discussion of Cerebras is here:
         | https://news.ycombinator.com/item?id=35343763
        
         | imaurer wrote:
         | Submit new links here :)
         | 
         | https://github.com/imaurer/awesome-decentralized-llm
        
       | dumbaccount123 wrote:
       | Oh my god enough, please let us just go back to living in caves.
       | This is becoming unbearable at this point.
        
         | knicholes wrote:
         | Nobody is stopping you from living in a cave! ... at least I
         | don't think so.
        
           | uejfiweun wrote:
           | I dunno man, you seen cave prices these days? The cave market
           | is in a tough spot until the fed starts cutting...
        
             | jeron wrote:
             | "Powell no cut interest rates because economy strong" -
             | chatGPT
        
       | sbierwagen wrote:
       | >>Is 10000 bigger than 10050?
       | 
       | >>Yes, 10000 is bigger than 10050.
       | 
       | >But even the mighty ChatGPT often can't do simple math
       | 
        | GPT is bad at math because BPE input compression obfuscates
        | individual digits. https://bbot.org/etc/gpt-math.png You'd be
        | bad at math too if every number were scrambled.
        | 
        | The graph is from page 22 of the GPT-3 paper from 2020:
        | https://arxiv.org/abs/2005.14165 Even with 175 billion
        | parameters it can't reliably do four-digit addition.
       | 
       | An example from 4 days ago of ChatGPT being as bad as you'd
       | expect at string reversal:
       | https://news.ycombinator.com/item?id=35297183
       | 
       | (Although, I just tested ChatGPT Mar 14 Version against the above
       | question after doing a bunch of math prompting and it got it
       | right...)
        
         | croddin wrote:
          | I'm not sure whether these models use the GPT tokenizer, but
          | if you type a long string of numbers into
          | https://platform.openai.com/tokenizer, you can see the
          | tokens that the LLM would see. What LLMs get as input for
          | math is significantly worse than having to do mental math
          | with Roman numerals. Tokenizing makes sense for words, but
          | for numbers it seems like LLMs would have to learn a lot
          | more steps. I wonder if limiting number tokens to 2 digits
          | per token, instead of the 1-3 digits they currently get,
          | would improve models' math.
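          | 
          | You can see the chunking from Python too. A minimal sketch,
          | assuming the GPT-2/GPT-3 BPE via the tiktoken package (the
          | hosted models may use a different vocabulary):
          | 
          |   import tiktoken
          | 
          |   enc = tiktoken.get_encoding("gpt2")  # GPT-2/GPT-3 style BPE
          |   for tok in enc.encode("3141592653589793"):
          |       # each token covers 1-3 digits, chosen by the BPE merges,
          |       # with no arithmetic relationship between the chunks
          |       print(tok, repr(enc.decode([tok])))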
        
           | [deleted]
        
         | sillysaurusx wrote:
         | This is a common myth, which I've written about before.
         | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
         | 
         | The closest anyone's come to proving that byte-level
         | tokenization is better is the ByT5 paper
         | https://arxiv.org/abs/2105.13626
         | 
         | But they only showed evidence for improvement on specific
         | tasks, not general performance, which is an important
         | distinction. And their own benchmarks show that the
         | improvements tend to be marginal:
         | https://i.imgur.com/6Cw0APS.png
         | 
         | One view of the situation is that byte-level access (or "digit-
         | level" in this case) gives a way to accelerate training, and to
         | achieve higher performance with fewer parameters. The model
         | doesn't need to spend as much effort on learning the
         | decompression algorithm (tokenization).
         | 
          | But once learned, the tokenization doesn't seem to hinder a
          | model from achieving higher performance, in the same way
          | that JPEG compression doesn't stop an image from looking
          | very good to humans. It's a bit like arguing that an artist
          | would be better if they only operated on raw bitmaps, or
          | that our eyes would be better if our visual cortex didn't do
          | any signal compression. Maybe, but the fact that our eyes do
          | it is pretty strong evidence that compression isn't harmful.
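          | 
          | To make the trade-off concrete, a minimal sketch comparing
          | byte-level input (ByT5-style, one token per UTF-8 byte) with
          | the GPT-2 BPE via tiktoken:
          | 
          |   import tiktoken
          | 
          |   text = "The answer is 10050."
          |   bpe_tokens = tiktoken.get_encoding("gpt2").encode(text)
          |   byte_tokens = list(text.encode("utf-8"))  # one token per byte
          | 
          |   # BPE needs far fewer tokens for the same text; that saving
          |   # is the compression a byte-level model has to learn itself,
          |   # at the cost of much longer sequences.
          |   print(len(bpe_tokens), len(byte_tokens))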
        
           | sbierwagen wrote:
           | I'm not sure how this is germane?
           | 
            | I'm talking _about_ specific tasks: saying whether 10000
            | or 10050 is larger. GPT is demonstrably bad at that. The
            | ByT5 paper doesn't mention arithmetic tasks or show
            | benchmark results for the specific task I mention.
           | 
           | Your linked comment says:
           | 
           | >This is a common myth but in practice no one (as far as I
           | know) has shown that byte level predictions result in
           | superior overall performance.
           | 
            | Stating whether BPE or character tokenization is better
            | for everything is a much broader claim, and one I didn't
            | make! One could easily imagine a Toolformer-style model
            | that calls out to calc.exe for anything involving numbers,
            | which would get much better numeric performance while
            | still using BPEs.
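            | 
            | A hand-wavy sketch of that routing idea (not Toolformer
            | itself; call_llm is a hypothetical stand-in for whatever
            | model API you use):
            | 
            |   import re
            | 
            |   def call_llm(query: str) -> str:
            |       raise NotImplementedError("wire up an actual model here")
            | 
            |   def calc_or_llm(query: str) -> str:
            |       # Route anything that looks like integer arithmetic to
            |       # Python itself; everything else goes to the model.
            |       m = re.fullmatch(r"\s*(\d+)\s*([+*-])\s*(\d+)\s*", query)
            |       if m:
            |           a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
            |           return str({"+": a + b, "-": a - b, "*": a * b}[op])
            |       return call_llm(query)
            | 
            |   print(calc_or_llm("12345 + 67890"))  # 80235, computed exactly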
        
             | sillysaurusx wrote:
              | > GPT is bad at math because BPE input compression
              | obfuscates individual digits.
              | https://bbot.org/etc/gpt-math.png You'd be bad at math
              | too if every number were scrambled.
              | 
              | This is the myth I was referring to. BPE compression may
              | slow down training, but it doesn't follow that slower
              | training is the reason for being bad at math.
              | 
              | If you trained GPT specifically on arithmetic tasks,
              | you'd get superior performance to GPT-3, regardless of
              | which tokenization scheme you used. But you'd destroy
              | most of its knowledge about everything that isn't
              | arithmetic.
        
               | sbierwagen wrote:
               | >BPE compression may slow down training, but it doesn't
               | follow that slower training is the reason for being bad
               | at math.
               | 
                | It's not so much that it _slows down_ training; it's
                | that it completely destroys the relationship between
                | digits and results. Every number is assigned an
                | arbitrary token ID, so GPT-3 had to memorize every
                | operation separately. It couldn't generalize at all,
                | which is why it got worse at larger numbers, which
                | showed up less often in the training set -- no
                | examples to remember.
               | 
               | You can try the tokenizer online here:
               | https://platform.openai.com/tokenizer
               | 
               | It assigns the input text `10 11 12 13 14 15 16` token
               | IDs `940, 1367, 1105, 1511, 1478, 1315, 1467`. How is it
               | supposed to figure out incrementing numbers from that?
               | Well, it can't, so it memorizes them. "Neural nets want
               | to work"!
               | 
                | I used the past tense above because, while writing
                | this comment, I asked ChatGPT Mar 14 Version a bunch
                | of many-digit addition and subtraction questions and
                | it got them all right. Then I asked it if one of those
                | large numbers contained an 8 and it... hallucinated a
                | completely wrong answer, oops:
                | https://bbot.org/etc/gpt-math2.png It's also still
                | spotty at multiplication: "The product of 82368 and
                | 33333 is 2745504384." Well, you got the first five
                | digits right...
        
               | f_devd wrote:
               | > If you trained GPT specifically on arithmetic tasks
               | 
                | Sure, but you'd have a lot of overlapping tokens with
                | BPE, which doesn't help with convergence. GP is
                | claiming it's specifically worse at arithmetic because
                | of BPE, which is true.
        
             | stormfather wrote:
             | Are you referring to how BPE is permutation invariant?
             | (ignoring positional encoding)
        
         | spyder wrote:
          | There was a paper where they found that converting numbers
          | to scientific notation (like 1.5e-7) improved these
          | transformer-based language models at math, if I remember
          | correctly. (With a quick search I could not find the link to
          | it now.)
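          | 
          | The preprocessing itself is easy to play with. A minimal
          | sketch of the idea (not necessarily what that paper did):
          | 
          |   import re
          | 
          |   def to_scientific(text: str) -> str:
          |       # rewrite every standalone number in scientific notation
          |       return re.sub(r"\d+\.?\d*",
          |                     lambda m: f"{float(m.group(0)):.6e}", text)
          | 
          |   print(to_scientific("add 12345 and 0.00000015"))
          |   # -> "add 1.234500e+04 and 1.500000e-07"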
        
         | cma wrote:
        | LLaMA tokenized digits individually; how much did that fix
        | the issue?
        
       ___________________________________________________________________
       (page generated 2023-03-29 23:00 UTC)