[HN Gopher] DeepMind's New Language Model, Chinchilla
       ___________________________________________________________________
        
       DeepMind's New Language Model, Chinchilla
        
       Author : georgehill
       Score  : 195 points
       Date   : 2022-04-11 12:41 UTC (10 hours ago)
        
 (HTM) web link (www.marktechpost.com)
 (TXT) w3m dump (www.marktechpost.com)
        
       | g051051 wrote:
       | Is there a good reference as to what a "parameter" is in this
       | context? I've looked a few times, but the explanations don't make
       | any sense to me.
        
         | guipsp wrote:
         | You can think of a parameter as a number you can tweak while
         | training. This network has 70B such numbers.
        
           | sirk390 wrote:
            | And if every parameter is one byte, the minimum, it will take
            | at least 70 GB to save or share this model. So it's still way
            | too big to package directly in an app.
        
             | cshimmin wrote:
             | From the paper, they are using bfloat16, so I guess two
             | bytes. But distributing and "packaging into an app" are not
             | at all of practical interest for these kinds of models. You
             | (a consumer) would interact via some API service, with the
             | model running on a hardware-accelerated compute cloud.
             | 
             | In any case, during training (where the model is run in
             | possibly large batches), and even during inference, the
             | size of the parameters is completely dwarfed by the
             | intermediate tensor representations.
        
               | brrrrrm wrote:
               | > even during inference, the size of the parameters is
               | completely dwarfed by the intermediate tensor
               | representations
               | 
               | What makes you say this?
        
               | cshimmin wrote:
               | It's especially true for models that do some kind of
               | weight sharing, which is very common (CNNs, RNNs,
               | transformers, etc). For a concrete example, consider a
               | layer from an image convolutional network, which maps
               | from a 3-dim colorspace to a 128-dim feature space.
               | Assuming a 5x5 kernel that's about 10k parameters.
               | However, after applying this layer, you go from having an
               | (B,H,W,3) tensor to a (B,H-4,W-4,128) tensor, where H,W
               | are the height and width of the image, and B is the
               | number of images in the batch. If you're working with
               | even moderately high resolution images, the memory
               | required for these intermediate tensors at each layer is
               | much larger than the parameters.
               | 
               | Something similar applies for RNNs (same weights applied
               | at each element of a sequence), GNNs and transformers
               | (same weights applied at each _pair_ of data).
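                | 
                | To put rough numbers on it, a quick sketch in Python (the
                | batch size, resolution, and float32 storage here are
                | assumptions, not from the paper):
                | 
                | k, c_in, c_out = 5, 3, 128
                | params = k * k * c_in * c_out           # ~9.6k parameters
                | B, H, W = 32, 1024, 1024                # assumed batch/resolution
                | acts = B * (H - 4) * (W - 4) * c_out    # elements in the output tensor
                | bytes_per = 4                           # assuming float32
                | print(params * bytes_per / 1e6, "MB")   # ~0.04 MB of weights
                | print(acts * bytes_per / 1e9, "GB")     # ~17 GB for one layer's output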
        
             | lostmsu wrote:
             | Have you seen modern games?
        
               | sva_ wrote:
               | I doubt they load that amount of data in memory
        
               | replygirl wrote:
               | I'm thinking about upgrading from 64gb to 128gb so i can
               | use all my Cities: Skylines assets in the same map
        
               | lostmsu wrote:
               | Right, they usually stream assets as they are requested.
               | Large models do the same.
        
         | cshimmin wrote:
          | It's a degree of freedom of the learnable model. For example,
          | a "vanilla" neural network layer (MLP) that maps from M to N
          | feature dimensions contains an MxN matrix of learnable
          | parameters modeling the connections between the M inputs and
          | the N outputs. Every time the model is updated during
          | backpropagation, the loss gradient which has to be computed has
         | the same dimensionality as the number of parameters. Also,
         | generally more parameters means more operations in the forward
         | pass. Therefore, a model with more parameters in general will
         | require more FLOPs per iteration of training. The main point of
         | this paper is that you can actually do better by training a
         | smaller model for longer, rather than a bigger model for less
         | time, assuming you have a fixed FLOP budget.
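          | 
          | As a toy example of the bookkeeping (the sizes here are made
          | up):
          | 
          | M, N = 1024, 4096
          | params = M * N + N             # weight matrix plus bias: ~4.2M parameters
          | flops_per_example = 2 * M * N  # roughly one multiply-add per weight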
        
           | zamalek wrote:
           | The other thing with more parameters is that it gives the NN
           | more ability to overfit. That means that instead of, say,
           | learning what a dog is, it instead memorises all the
           | sentences containing "dog" that it has ever seen.
        
       | mangoham wrote:
       | Cached version since the original is down (I'm assuming it's down
       | due to load issues and not due to the author taking it down).
       | https://webcache.googleusercontent.com/search?q=cache:PLSLy9...
        
       | ritwikgupta wrote:
       | Off-topic to Chinchilla, but relevant to the source site:
       | MarkTechPost consistently borderline plagiarizes articles and
       | shares them on their website as "paper summaries". They copy-
       | paste from the source material and change some of the wording
        | around so as to appear original. My work, as well as other work from
       | Berkeley AI Research, has been posted in this manner on their
       | site.
       | 
        | This seems highly unethical, and I'm surprised that they
        | continue to operate.
        
         | andreyk wrote:
         | To add to this - they do this regularly, multiple times per
         | week. While they do link to and acknowledge the source work,
         | they do not make clear their writing is quoted or nearly
         | quoted.
        
         | brrrrrm wrote:
         | Thanks for the heads up! In that case, I'd prefer not to share
         | this link with peers. Do you have an alternative source with
         | similar high-level content to share?
        
           | lstamour wrote:
           | Tough to say. Technically
           | https://arxiv.org/pdf/2203.15556.pdf has the same content, it
           | just isn't highlighted the same way.
        
         | boplicity wrote:
         | Fill out a DMCA notice:
         | 
         | https://abuse.cloudflare.com/
         | 
         | Cloudflare will forward it to their host, I believe, who will
         | then ask that they remove the infringing material, or provide a
         | counter claim.
        
         | parhamn wrote:
          | I don't know about this site, and I agree it's unethical. But
          | it does make me realize that I much prefer using the language
          | of the paper directly, as opposed to having a non-expert poorly
          | translate what your paper said, especially given how much time
          | papers put into the accuracy and specificity of their language
          | and word choices.
         | 
         | Would it also annoy you if they screwed up the interpretation
         | of what you wrote? Is the alternative less reach of your work?
          | For hardcore research the tradeoffs are tougher, it seems. If
          | it is just a matter of non-nevermind, that's strictly messed up.
        
         | realYitzi wrote:
          | We'd better get used to it, because news companies will say an
          | AI wrote it. No law allows suing an AI for plagiarism. Go prove
          | something was not written by an AI.
        
           | nudpiedo wrote:
            | No one sues the car, the dog, or the child, but rather the
            | owner, the responsible party, the parent, etc.
        
         | georgehill wrote:
         | OP here - Thanks for sharing. I wasn't aware of this but
         | despite this behavior, they are getting 600k visits.
         | 
         | https://www.similarweb.com/website/marktechpost.com/#overvie...
        
       | isaacfrond wrote:
        | They trained over 400 language models ranging from 70 million to
        | over 16 billion parameters on 5 to 500 billion tokens while
        | staying under a given compute budget. The results are modelled,
        | and they pick the best one. Turns out that having somewhat fewer
        | parameters and more tokens improves performance.
        
         | gbasin wrote:
         | Thank you :)
        
       | sirk390 wrote:
       | Is outperforming GPT-3 still a good reference? It seems there are
       | many models outperforming GPT-3 in the superglue benchmark:
       | https://super.gluebenchmark.com/leaderboard/ GPT-3 is in position
       | #21, with 71.8% score. The best model is at 91.2%. Note the human
       | baseline in #6 with 89.8%
        
         | WithinReason wrote:
         | > Is outperforming GPT-3 still a good reference?
         | 
          | It is if you outperform it with far fewer parameters
        
         | changoplatanero wrote:
          | Aren't most of the models at the top not suitable for text
          | generation? That's what makes GPT different from BERT.
        
           | colordrops wrote:
           | What are the models at the top used for? Excuse my ignorance.
        
             | priansh wrote:
             | Mostly mask fill, but Transformers can be fine tuned to
             | downstream tasks relatively easily (T5 was built for
             | translation but is used for autocomplete in many cases)
        
               | gfodor wrote:
               | would you mind sharing some references (or even just
               | googleable terms) for this process of fine tuning?
        
         | redredrobot wrote:
         | It's a good reference because people are familiar with GPT-3.
         | The paper mostly compares Chinchilla to LaMDA, Jurassic,
         | Gopher, MT-NLG, and GPT-3. In the broader tech industry and
         | even to a certain extent within the AI field, GPT-3 is the only
         | one that most people know by name.
        
         | screye wrote:
          | Note that this isn't an apples-to-apples comparison. The GPT-3
          | position reflects a few-shot use-case, where the model has not
          | been trained for this particular task. When fine-tuned, GPT-3
          | would be expected to perform a lot better. Lastly, GPT-3 is
          | currently operating on the text-002 models, and the third
          | iteration of GPT-3 is generally the one considered current;
          | these benchmarks are for the original GPT-3 model.
        
       | wiz21c wrote:
        | I understand I can query such a model, one query at a time. But
        | is there a way to query these models with several queries in a
        | row such that the N+1-th query benefits from the knowledge that
        | was used to answer the first N questions? Basically, following a
        | conversation. For example, YouTube subtitles can badly translate
        | some terms, but if "it" had in mind the overall subject of the
        | video, then it'd probably pick the correct word...
        
         | rolisz wrote:
          | Yes. That's how you use GPT-3: for the 2nd token, you feed in
         | your prompt and the first token it returned. Then you feed it
         | your prompt and the first two output tokens, and so on.
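          | 
          | A toy sketch of that loop in Python (the "model" below is a
          | stand-in, not the real GPT-3 API):
          | 
          | def fake_next_token(context):       # placeholder for the language model
          |     return (sum(context) + 1) % 50257
          | 
          | tokens = [464, 2068, 7586]          # pretend token ids for the prompt
          | for _ in range(5):
          |     # each step conditions on the prompt plus everything generated so far
          |     tokens.append(fake_next_token(tokens))
          | print(tokens)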
        
       | [deleted]
        
       | hwers wrote:
       | Can't wait for DeepMind to take a stab at outcompeting dall-e.
        
       | mrfusion wrote:
       | Does this imply we will run out of data to keep up with larger
       | model sizes?
       | 
       | Is there much more data out there than what they're already
       | using?
        
         | adamsmith143 wrote:
         | Probably not an issue just yet, think of how much data is
         | generated by Twitter on a daily basis for example.
        
           | zarzavat wrote:
           | If you want to teach your kid to learn English, and they came
           | back to you and said _" Dad/mum, I finished reading the
           | entire internet but I still don't understand English fully"_,
           | would you say _" OK son, now go and stare at the Twitter
            | firehose until you grok perfect English"_ ?
           | 
           | It's clear that these models have orders of magnitude too
           | much data already.
           | 
           | It somewhat reminds me of the proposals for larger and larger
           | colliders in the hopes of seeing new physics that is always
           | one collider in the future.
        
             | lostmsu wrote:
             | I disagree with this take because you grok English not only
             | from the text you read, but also from the context of
              | the physical world around you. And that context is enormous:
             | assuming 8000x8000x2 vision with 3 color 1 byte channels at
             | 24fps without compression, you get 3e+17 bytes (300
             | petabytes) of data along with your reading per year.
        
               | ralfd wrote:
                | Blind children can learn English fine, though. And there
                | are highly immaterial areas (mathematics) which people
                | still reason about.
        
               | lostmsu wrote:
                | You ignored the point. I only brought up sight as an example
               | (though, admittedly, it is the largest data inflow).
        
             | mijoharas wrote:
             | > It somewhat reminds me of the proposals for larger and
             | larger colliders in the hopes of seeing new physics that is
             | always one collider in the future.
             | 
             | I agree with your main point, but think this analogy isn't
             | an apt one. If you want to see what particles are created
             | at higher energies you kinda need the bigger particle
             | accelerators. (This isn't to say that we shouldn't be
             | investigating lower energy collisions, but at a certain
             | point you do need "bigger colliders" to see new things)
        
             | nullc wrote:
             | > It's clear that these models have orders of magnitude too
             | much data already.
             | 
             | I have a toy disproof for your claim that this is clear.
             | 
              | Imagine that you are training an ML system using oracle
             | access to Mum. The ML training system can request 10
             | million representative samples of Mum output, and then we
             | could judge if the ML system has adequately reproduced Mum.
             | 
              | Now also imagine that Mum frequently tells people that Mum
              | knows a 23-letter secret and, while Mum won't tell people
              | what it is outright, she'll answer queries about whether a
              | guess is lexicographically higher or lower. We could even
              | imagine that the ML has seen Mum's side of some interactions
              | with her doing that.
             | 
             | Would the ML know Mum's secret? No.
             | 
              | Would a child that could interact with Mum? Yes: after at
              | most ceil(log2(26^23)) ~ 109 queries (a binary search), if
              | the child is efficient.
             | 
              | Learning in an interactive context is not the same as
              | learning from written material, so you can't be sure that
              | the fact that children learn English from less text means
              | that a non-interactive ML system could learn English from
              | the same amount. Q.E.D.
             | 
             | Now, if someone figures out how to efficiently train these
             | natural language models with reinforcement learning...
        
             | adamsmith143 wrote:
             | The general point is that there is a huge volume of
             | training data generated daily not that Twitter is a great
             | source of it. Though I believe that GPT-3 for example was
             | trained on the Common Crawl dataset which would contain
             | both Twitter and Reddit.
             | 
             | >It's clear that these models have orders of magnitude too
             | much data already.
             | 
             | Seems like a strange claim. The scaling laws are showing
             | that you can still make gains with more data and more
             | parameters.
             | 
             | >It somewhat reminds me of the proposals for larger and
             | larger colliders in the hopes of seeing new physics that is
             | always one collider in the future.
             | 
             | This is literally true though, couldn't find the Higgs
             | without the LHC and most GUT candidates would only start
             | being ruled out at high energy levels.
        
               | gwern wrote:
                | Common Crawl actually does not contain Twitter; you can
                | go check the indexes at
                | https://github.com/ikreymer/cdx-index-client . Twitter is
                | extremely aggressive about blocking scraping/caching, and
                | I guess that blocks CC. Models like GPT-3 still know a
                | decent amount of Twitter material, and I figure that this
                | is due to tweets being excerpted or mirrored manually at
                | non-Twitter.com URLs (e.g. all the Twitter-mirroring bots
                | on Reddit).
        
               | zarzavat wrote:
               | > Seems like a strange claim. The scaling laws are
               | showing that you can still make gains with more data and
               | more parameters.
               | 
               | But then we've given up on matching human intelligence
               | which is all about working efficiently with _small_
               | training data, and certainly training a human does not
               | need anywhere near as much data as GPT-3.
               | 
               | GPT-3 was interesting as a proof-of-concept of what
               | happens when you use a gigantic amount of training data.
               | We don't need a bigger one until we can figure out how to
               | make a smaller one that is just as effective.
               | 
               | If scaling laws are telling us to keep putting even more
               | training data into the thing, then the conclusion should
               | be that the architecture is just not working out.
        
               | adamsmith143 wrote:
               | >But then we've given up on matching human intelligence
               | which is all about working efficiently with small
               | training data, and certainly training a human does not
               | need anywhere near as much data as GPT-3.
               | 
               | I don't think we should really take so much inspiration
               | from the brain. We didn't make airplanes work by building
               | bird machines so why should we do that here.
               | 
               | >GPT-3 was interesting as a proof-of-concept of what
               | happens when you use a gigantic amount of training data.
               | We don't need a bigger one until we can figure out how to
               | make a smaller one that is just as effective.
               | 
                | This feels like a non sequitur. We can certainly keep
               | making larger models and we will, because we can continue
               | to make performance gains doing so.
               | 
               | >If scaling laws are telling us to keep putting even more
               | training data into the thing, then the conclusion should
               | be that the architecture is just not working out.
               | 
               | I don't think anyone in the field would agree to this
               | point. Researchers see an easy avenue to gain better
                | performance, so they take it. DeepMind's model shows you
               | can get similar results with more refined architecture,
               | but this was released well after GPT-3. When teams
               | significantly advance the state of the art with a much
               | smaller model I think we should take notice but that
               | hasn't happened yet.
        
           | teraflop wrote:
           | On the other hand, consider the difficulty of taking massive
           | amounts of data from the modern web and filtering out the
           | subset that was actually generated by humans, rather than
           | previous generations of language models.
        
             | adamsmith143 wrote:
             | Definitely an interesting future problem. I'm sure OpenAI
             | and others are thinking about it but I don't think these
             | models are ubiquitous enough to have much impact just yet.
        
           | axg11 wrote:
           | Some estimates:
           | 
           | - 500M tweets per day
           | 
           | - 30 words/tokens per tweet
           | 
           | - 40% of all tweets thrown away due to being
           | duplicate/spam/bots
           | 
           | = 9B tokens generated per day
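            | 
            | Or, as a one-liner (same assumed numbers):
            | 
            | print(500e6 * 30 * 0.6 / 1e9, "B tokens/day")   # -> 9.0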
        
         | replygirl wrote:
         | There's a ton of data that can be exponentially more useful,
         | but we'll need networks that can (analogously) be late to work
         | enough times to get fired, or experience heartbreak in
         | succession while misunderstanding why prior heartbreak
         | happened, or hallucinate stray cats when they're walking around
         | the neighborhood at night
        
         | kelseyfrog wrote:
         | It implies our models are wrong.
         | 
         | Consider that a human adolescence is ~9.46x10^6 minutes and a
         | fast speaking rate is ~200words/minute. That sets an upper
         | bound of 1.9 billion words heard during adolescence. ie: human
         | adults are trained on a corpus of less than 1.9B words.
         | 
         | To some extent, more data can offset worse models, but I don't
          | think that's the regime we're currently in. GPT-3 was trained
          | on (among other languages) 181 billion English words - or about
         | 100 times more words than a human will hear by the time they
         | reach adulthood. How is the human brain able to achieve a
         | higher level of success with 1% of the data?
         | 
         | 1.
         | https://github.com/openai/gpt-3/blob/master/dataset_statisti...
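          | 
          | (The arithmetic, spelled out; the 18 years and 200 wpm are the
          | assumptions above:)
          | 
          | minutes = 18 * 365 * 24 * 60    # ~9.46e6 minutes
          | words_heard = minutes * 200     # ~1.9e9 words
          | print(181e9 / words_heard)      # GPT-3's English corpus: ~96x more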
        
           | Symmetry wrote:
           | My understanding is that the binding constraint in training
           | these models is the quantity of computation they consume.
           | While a human makes do with drastically less input data, we
           | also have drastically more computational resources in our
           | heads to work on the problem than Google is using to train
           | its models.
        
           | gwern wrote:
           | > How is the human brain able to achieve a higher level of
           | success with 1% of the data?
           | 
           | The most obvious answer is "the human brain uses a shit-ton
           | more compute", for 18+ years as well.
           | 
           | We spend data, which we have in abundance, to save on
           | compute, which we do not. Even at the most generous low-end
           | estimates of the human brain's computing power, we are only
           | barely there; on the high-end estimates that people in love
           | with the ineffable mysteries of the brain love to cite, we
           | are multiple orders of magnitude away from even the biggest
           | supercomputers matching the brain. So no matter which way you
           | slice it, we are extremely compute-poor.
           | 
           | Feeding a lot of data through an extremely lightweight
           | optimizer like first-order SGDs is one way to cope with
           | lacking compute:
           | https://www.gwern.net/docs/ai/scaling/2013-bottou.pdf Bottou
           | asks why (even in 2013!) is SGD so hard to dethrone when we
           | can empirically see plenty of optimizers like second-order
           | gradient descent algorithms which can beat SGD quite solidly?
           | His observation is that while they are much better than SGD
           | in terms of iterations or _n_, they lose in compute/wallclock
           | because SGD can just go-brrrr through the data much faster
           | than they can.
        
             | nynx wrote:
             | Yeah, there are ~100B neurons, ~1Q synapses, but how much
             | compute is the brain actually using over time?
             | 
             | Some quick googling gives this:
             | 
             | - Generation of an action potential seems to use ~2.5x10^-7
             | J [0]
             | 
             | - The brain consumes around 20W during normal activity
             | 
             | This seems to imply that there are around 8x10^7, call it
             | 10^8, activations per second [1].
             | 
             | Apparently, the average neuron has 1000 synapses. Let's say
             | each synapse requires 10 mulacc operations per activation.
             | Doing that math gives about 10^12 FLOPs/s [2].
             | 
             | Integrate that over 18 years, and you get roughly 5.7x10^20
             | FLOPs [3].
             | 
             | PaLM required 2.56x10^24 FLOPs to train [4]. So, we have
             | (way more than) enough compute, we're just not using it
             | efficiently. We're wasting a lot of FLOPs on dense matrix
             | multiplication.
             | 
             | There's plenty of wiggle room in these calculations. I
             | checked over the math, but I'd appreciate if someone would
             | let me know if I've missed something.
              | [0]: https://link.springer.com/article/10.1007/s11571-018-9503-3
              | [1]: https://www.wolframalpha.com/input?i2d=true&i=Divide%5B20+W%2C2.5%E2%80%89%C3%97%E2%80%89Power%5B10%2C%E2%88%927%5D+Joules%5D
              | [2]: https://www.wolframalpha.com/input?i2d=true&i=Power%5B10%2C8%5D+Hz+*+1000+*+10+flop
              | [3]: https://www.wolframalpha.com/input?i2d=true&i=Power%5B10%2C12%5D+Divide%5BFLOP%2Cs%5D+*+18+years
              | [4]: https://blog.heim.xyz/palm-training-cost/#:~:text=PaLM%20(2022)-,2.5e24,-10x***
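              | 
              | The same estimate in code (all inputs are the rough figures
              | cited above):
              | 
              | spikes_per_s = 20 / 2.5e-7            # ~8e7; round to 1e8
              | flops_per_s = 1e8 * 1000 * 10         # synapses/neuron * mulaccs
              | seconds = 18 * 365 * 24 * 3600        # ~5.7e8 s
              | brain_flops = flops_per_s * seconds   # ~5.7e20 FLOPs
              | print(2.56e24 / brain_flops)          # PaLM used ~4500x more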
        
           | nynx wrote:
           | Yeah, this implies backpropagation is deeply suboptimal.
        
             | kelseyfrog wrote:
             | That is certainly a possibility. The other (non-mutually
             | exclusive) implications may also be that human language
             | acquisition benefits from being part of a multi-task model.
             | Or that the problem has been overreduced ie: human language
             | acquisition cannot simply be distilled into a words-
             | in->words-out problem and that vision/hearing are actually
             | integral parts of language acquisition that cannot be left
             | out. Or that model arch still has major improvements to be
             | made and attention is not all you need, for example.
        
               | fpgaminer wrote:
               | > and that vision/hearing are actually integral parts of
               | language acquisition
               | 
               | Deaf-blind authors would beg to differ.
               | 
               | But yes, a human brain is exposed to lots of other
               | sensory input, and we know from other research that
               | multi-modal models can learn shared representations that
               | benefit from the knowledge of each domain.
               | 
               | In Transformer's favor, at least, they are far closer to
               | tabula rasa than the human brain is and likely have to
               | dedicate a lot of their training time to things that are
               | otherwise "baked" into human brains. For example, humans
               | come pre-packaged with V1 and V2 as part of their visual
               | system, but CNNs and ViTs have to learn those filter
               | packages from scratch.
               | 
               | I agree with you though. Human brains are able to take
               | single instances of experiences and build a wealth of
               | understanding from them in ways that even modern
                | Transformer architectures are not yet able to match.
        
               | kristintynski wrote:
                | It seems like internal language (thinking in language) is
                | also a way our brains train themselves. I've probably
                | thought 100x more words than I've spoken.
        
               | snovv_crash wrote:
               | This would map to a sort of semi-supervised approach. For
                | a lot of problems this has been shown to drastically reduce
               | the data requirements, but can bump up compute.
               | 
               | All those conversations in the shower were actually
               | regularizers!
        
       | ianbutler wrote:
        | This is exciting, if only because as we discover more compute-
        | optimal models that outperform the behemoths that have been
        | state of the art, it opens up the ability for smaller independent
        | groups to train and release their own versions, more fully
        | democratizing AI. Looking forward to a group like Eleuther or
        | Hugging Face releasing a version of this.
        
         | adamsmith143 wrote:
          | >This is exciting, if only because as we discover more compute-
          | optimal models that outperform the behemoths that have been
          | state of the art, it opens up the ability for smaller
          | independent groups to train and release their own versions,
          | more fully democratizing AI.
         | 
         | I think I support this in principle but it seems like the
         | scaling curves keep going so it's easier to just make larger
         | models with more data.
         | 
          | >Looking forward to a group like Eleuther or Hugging Face
         | releasing a version of this
         | 
         | Both of those groups have access to dozens if not hundreds of
         | Cloud GPUs, I'd hardly call them small.
         | 
          | It would be impossible to replicate these models as, say, an
          | independent researcher or even in an academic research group
          | outside of maybe Stanford/Berkeley/MIT/etc., and I'd even doubt
          | their ability to replicate models like this based on cost
          | alone.
        
           | ianbutler wrote:
           | Small is relative -- and to Google, Facebook and Microsoft
           | they're positively tiny. Perfect is the enemy of good or some
           | such and I think this is a move in the right direction even
           | if I can't personally train this on my 3090.
        
       | mark_l_watson wrote:
        | The design of the original Transformer model in the "Attention
        | Is All You Need" paper was predicated on efficiency (all layers
        | the same size, combining word/token embeddings with harmonic
        | (sinusoidal) embeddings of position in the input stream). It is
        | good to see improvements!
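        | 
        | For reference, a minimal NumPy sketch of that sinusoidal
        | positional encoding, which is simply summed with the token
        | embeddings (the shapes here are illustrative, and d_model is
        | assumed even):
        | 
        | import numpy as np
        | 
        | def positional_encoding(seq_len, d_model):
        |     # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)
        |     pos = np.arange(seq_len)[:, None]
        |     i = np.arange(d_model // 2)[None, :]
        |     angles = pos / np.power(10000.0, 2 * i / d_model)
        |     pe = np.zeros((seq_len, d_model))
        |     pe[:, 0::2] = np.sin(angles)
        |     pe[:, 1::2] = np.cos(angles)
        |     return pe
        | 
        | # x = token_embeddings + positional_encoding(seq_len, d_model)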
        
       | narrator wrote:
       | I'd love to take a language model, load it up, and train it on
       | everything I write in online learning mode. Does one need some
       | massive hardware to do online learning with these models instead
       | of just running the distilled final models?
        
       | alpineidyll3 wrote:
       | If these things get put on specialized hardware for inference
       | with much lower energy costs, the world will never be the same.
        
         | hwers wrote:
         | Imagine any diffusion-style text-to-image model on specialized
         | ASIC hardware.
        
           | astrange wrote:
           | That's what an ANE/TPU is.
           | 
           | If you mean putting the model weights into gates directly,
           | it'd be useless because users would get bored of the model as
           | soon as they figured out what its style looked like. Also,
           | large models can memorize their training data so eventually
           | you'll get it to output something copyrighted.
        
         | lobstey wrote:
          | The biggest problem, first of all, might be the memory
          | requirements given so many parameters. It couldn't be as cheap
          | as a high-end computer in the foreseeable future.
        
           | f38zf5vdt wrote:
            | There is probably a space-time tradeoff that needs to be
            | explored in this space. It might be possible to preload some
            | of the most likely tokens to be selected next into the
            | cache and/or RAM. These are glorified auto-complete
           | algorithms that are poorly understood, as DeepMind's
           | optimizations appear to show. For the English language, it is
           | probable that there are only so many possible grammatically
           | correct selections for the next token, for example.
        
             | visarga wrote:
             | Glorified autocomplete? Autocomplete can guess the next
             | word .. sometimes, GPT-3 goes hundreds of words ahead. On
             | generic topics it can be hard to distinguish from human
             | text.
             | 
             | And it can't cache tokens because all tokens are evaluated
             | in the context of all the other tokens, so they don't have
             | the same representations when they reoccur at different
             | positions.
        
               | f38zf5vdt wrote:
                | They're evaluated in the context of the last 2^n tokens;
                | for many models it is 1024, 2048, or 4096 tokens as a
                | scanning window. The tokens (words and sometimes
                | punctuation) are represented by integer values, so the
                | last 2^n tokens would certainly qualify for storage
                | in a cache. Then, next-token selection only has so many
                | possible selections in any given language model because
                | of grammatical limitations. This is only one such
                | optimization; there could also be optimizations around
                | the likelihood of certain words being used given the
                | presence of certain previous tokens, and so on.
                | 
                | But, yes, tokens are chosen one word at a time based on
               | the previous content, similar to earlier auto-completion
               | algorithms.
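                | 
                | A toy version of that window (token ids are just integers;
                | 2048 is one of the common context sizes mentioned above):
                | 
                | all_tokens = list(range(5000))      # pretend token ids so far
                | context_len = 2048
                | window = all_tokens[-context_len:]  # the model only sees these
                | assert len(window) == context_len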
        
             | priansh wrote:
             | I've been saying this for years, language models are the ML
             | equivalent of the billionaire space race, it's just a bunch
             | of orgs with unlimited funding spending millions of dollars
             | on compute to get more parameters than their rivals. It
             | could be decades before we start to see them scale down or
             | make meaningful optimizations. This paper is a good start
             | but I'd be willing to bet everyone will ignore it and
             | continue breaking the bank.
             | 
             | Can you say that about any other task in ML? When
             | Inceptionv3 came out I was able to run the model pretty
              | comfortably on a 1060. Even pix2pix and most GANs fit
             | comfortably in commercial compute, and the top of the line
             | massive models can still run inference on a 3090. It's so
             | unbelievably ironic that one of the major points
             | Transformers aimed to solve when introduced was the compute
             | inefficiency of recurrent networks, and it's devolved into
             | "how many TPUs can daddy afford" instead.
        
               | native_samples wrote:
               | Is that fair? My Pixel phone seems to run nothing but ML
               | models of various kinds and they run _locally_ which is
               | madness, pure madness. It can recognize songs and my
                | speech without talking to the cloud at all. That's
               | pretty much the definition of optimization!
        
       | galcerte wrote:
       | I have to ask, why call it that? I had a chuckle once I saw the
       | name.
        
         | redredrobot wrote:
         | It outperforms the Gopher model
        
           | cshimmin wrote:
           | Yeah, similar "thematic" naming to MacOS versions. I don't
           | know why the original one was called Gopher, though.
        
             | goodside wrote:
             | Because it retrieves facts from memory in a way that's
             | analogized to a gopher retrieving objects.
        
         | gwern wrote:
         | There were a lot of complaints about earlier models being
         | named, say, 'Meena'. (It's very sexist, you know, to name a
         | chatbot a female name.) People won't complain about
         | 'Chinchilla' because chinchillas are adorable. PaLMs aren't
         | adorable, but at least it's neutral.
        
         | MrBuddyCasino wrote:
          | It's not so bad. If they were radio astronomers they'd call it
         | Very Big Neuronal Language Model. IBM would call it Watson
         | Advanced AI. If they were a gamer accessory company they'd call
         | it DeepTek Ultra Pro VDH-Max AI A320M. Chinchilla is nice and
         | fluffy.
        
         | farmin wrote:
         | It's the name of a town in QLD.
        
         | binarymax wrote:
         | Large language models have a (recent) history of silly names.
         | BERT, BART, ELMO, RoBERTa, BIGBIRD, PaLM, Megatron etc. Might
         | as well go full nonsense.
        
           | DSingularity wrote:
           | A touch of irony that cutting edge research on language can't
           | produce better names.
        
           | omarhaneef wrote:
           | True. I will add that it is customary to justify it by
           | demonstrating it is some sort of acronym or contraction.
        
             | yeetsfromhellL2 wrote:
             | It's a recursive, selective acronym
             | C                   CH                  CHI
             | CHIN                CHINC               CHINCH
             | CHINCHI             CHINCHIL            CHINCHILL       ==>
             | CHINCHILLA           HINCHILLA           INCHILLA
             | NCHILLA           CHILLA           HILLA           ILLA
             | LLA           LA           A
        
               | omarhaneef wrote:
               | I know what recursive means, I know what selective means,
               | I know what an acronym is, and I think I see the pattern
               | in that picture, but when I put it all together I am
               | lost.
               | 
               | Alternatively, is this a joke and the "recursive,
               | selective acronym" can be used to justify any word?
        
               | veonik wrote:
               | A                   AR                  ARB
               | ARBI                ARBIT               ARBITR
               | ARBITRA             ARBITRAR       ==>  ARBITRARY
               | RBITRARY            BITRARY            ITRARY
               | TRARY            RARY            ARY            RY
               | Y
               | 
               | Yup, seems it works for any word.
        
           | MisterTea wrote:
           | My theory is since no one reads literature anymore, timeless,
           | interesting and unique names from history and other cultures
           | are lost to a deluge of soon to be forgotten gag, pop-culture
           | and meme names. Perhaps this is why we have Chinchilla and
           | not Oberon.
        
             | jankeymeulen wrote:
             | Like the Oberon OS and programming language?
        
           | jstx1 wrote:
           | Image models too - the Inception paper from 2014 directly
           | refers to knowyourmeme.com and the "we need to go deeper"
           | meme from the movie Inception -
           | https://knowyourmeme.com/memes/we-need-to-go-deeper - it's
           | the first reference in the paper [1] and it's also why the
           | model is called that way.
           | 
           | [1] https://arxiv.org/pdf/1409.4842.pdf
        
       | ShamelessC wrote:
       | Seems the link is down. Found a decent synopsis/discussion on
       | lesswrong.
       | 
       | https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-scalin...
       | 
       | > On March 29th, DeepMind published a paper, "Training Compute-
       | Optimal Large Language Models", that shows that essentially
       | everyone -- OpenAI, DeepMind, Microsoft, etc. -- has been
       | training large language models with a deeply suboptimal use of
       | compute.
       | 
       | > Following the new scaling laws that they propose for the
       | optimal use of compute, DeepMind trains a new, 70-billion
       | parameter model that outperforms much larger language models,
       | including the 175-billion parameter GPT-3 and DeepMind's own
       | 270-billion parameter "Gopher".
        
         | gyang wrote:
         | I think there remains an immense amount of such suboptimality
         | still hanging from the tree, so to speak.
         | 
         | For example, our recent paper "Tensor Programs V: Tuning Large
         | Neural Networks via Zero-Shot Hyperparameter Transfer"[1] shows
         | that even learning rate and initialization used by existing
         | models are _deeply wrong_. By just picking them correctly
         | (which involves some really beautiful mathematics), we can
         | effectively double the model size of the GPT-3 6.7B model (to
         | be comparable in quality to the 13B model across the suite of
         | benchmark tasks).
         | 
         | Large neural networks behave in a way we are only beginning to
         | understand well just because each empirical probe of any such
         | model is so much more expensive and time consuming than typical
         | models. But principled theory here can have a lot of leverage
         | by pointing out the right direction to look, as it did in our
         | work.
         | 
         | [1] http://arxiv.org/abs/2203.03466
        
           | p1esk wrote:
           | What do you think about the concept of "critical batch size"?
           | https://openai.com/blog/science-of-ai/
        
             | gyang wrote:
             | I think the concept makes sense. The basic insight, that
             | the right batch size depends on the difficulty and
             | noisiness of a task, is already used by teams. For example,
             | the PaLM paper from last week increased its batch size
             | throughout training.
             | 
             | But as far as I know, the more precise predictions of
             | optimal batch size aren't used much, probably because it's
             | expensive to measure accurately, or because the predictive
             | equation isn't accurate enough to begin with. I wonder if
             | we can "transfer" the optimal batch size from a smaller
             | setting (smaller model or data) to the full setting, like
             | in our paper. This would make it much more practical.
        
         | eigenvalue wrote:
         | According to the LessWrong post, the smaller model trained on
         | more data performs better on most of the tasks, but it's worse
         | on "college level math" questions. I wonder why that is. Is it
         | because the extra capacity of the larger model was used to
         | basically memorize theorems? Or is it because the extra "brain
         | power" let it model the math better? Oddly, one of the tasks
          | that the smaller model most outperformed the larger one on is
          | "high school level math"! Very counterintuitive, and I am
          | curious if there are any big takeaways lurking in that
          | disparity.
        
           | ShamelessC wrote:
           | Gwern responded to a similar question in the comments
           | section.
           | 
           | (parent)
           | 
           | > the fact that data and compute need to scale proportionally
           | seems... like a big point in favor of NNs as
           | memorizers/interpolators.
           | 
           | (gwern)
           | 
           | > Surely it's the opposite? The more bang you get out of each
           | parameter, the less it looks like 'just' (whatever that
           | means) memorization/interpolation. When you needed to
           | increase parameters a lot, disproportionately, to cope with
           | some more data, that does not speak well of abstractions or
           | understanding. (If I can train a 1t model to get the same
           | loss as what I thought was going to take a 100t model, why
           | would I think that that 100t model must be
           | memorizing/interpolating less?) Let's take your claim to its
           | logical extreme: suppose we discovered tomorrow a scaling law
           | that made parameters near-constant (log, let's say); would
           | that not suggest that those parameters are super useful and
           | it's doing an amazing job of learning the underlying
           | algorithm and is not memorizing/interpolating?
        
             | sillysaurusx wrote:
             | This isn't addressing their question. And Gwern's goal here
             | is to (incorrectly) try to get rid of the idea that models
             | are just memorizing and interpolating, when in fact
             | memorization and interpolation is what we all do, including
             | models. He's just bothered by the idea that people think of
             | models as less than magic.
             | 
             | On the other hand, https://twitter.com/model_mechanic/statu
             | s/151297688118364569... is admittedly pretty magical, even
             | if the basis of that magic is memorization and
             | interpolation.
        
               | VirusNewbie wrote:
               | Why do you say they just memorize and interpret? I can
               | teach GPT-2 new things, including new objects and their
               | physical properties and it does a good job with that.
               | That also means it has definitely not just regurgitated a
               | matching sentence back to me.
        
               | replygirl wrote:
               | when i see a new object for the first time, i MEMORIZE
               | what i INTERPRET as its identifying traits, and ask
               | someone who has already MEMORIZED what that object is to
               | INTERPRET a concept with which i can associate those
               | traits. the next time i encounter an object with those
               | traits i can then recall the associations, then compose
               | those trait-level interpretations into an interpretation
               | of an object.
               | 
               | at a fundamental level that's all this is, compositions
               | of associated memorizations and interpretations, which
               | map to compositions of sentence parts the machine can
               | regurgitate back to you
        
               | rictic wrote:
               | To rebut someone's argument you must address the argument
               | and not just talk about them and their motivations
               | 
               | From your comment a reader will understand that you think
               | they're just memorizing and interpolating and that you
               | disagree with gwern on this point, but you've given your
               | reader nothing that argues in favor of your position
               | 
               | Why should someone believe that models are just
               | memorizing and interpolating?
        
               | yldedly wrote:
                | It's impossible for a piecewise linear function to be
                | anything other than linear far enough outside the training
                | sample. They are, by definition, unable to do anything but
                | interpolate.
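                | 
                | A quick way to see this (a random one-hidden-layer ReLU
                | net in NumPy, sizes made up; far outside the data every
                | ReLU is stuck on or off, so the function becomes affine):
                | 
                | import numpy as np
                | rng = np.random.default_rng(0)
                | W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
                | W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)
                | f = lambda x: W2 @ np.maximum(W1 @ np.array([x]) + b1, 0) + b2
                | # local slope far from the origin is (almost surely) constant:
                | print(f(1e6) - f(1e6 - 1), f(1e7) - f(1e7 - 1))  # ~equal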
        
           | danuker wrote:
           | It might just be by chance: the initial weights of one model
           | could have been lucky in some areas, and unlucky in others.
           | There's no way to tell other than training again, which is a
           | costly proposition.
        
             | eigenvalue wrote:
             | That seems pretty unlikely to me actually. As the models
             | and training data get much bigger, I think the initial
             | weights become less important (at least assuming your
             | random weights have certain desirable statistical
             | properties, which they do by construction usually).
        
           | [deleted]
        
           | adamsmith143 wrote:
           | Probably right. Most people dump on these language models for
            | this reason, but it would be absurd for a HS student to have
            | to re-derive the quadratic formula every time they worked on
            | an algebra problem, so naturally you memorize it. Why should
           | it be any different for a language model?
        
             | eutectic wrote:
             | I never memorized the quadratic formula, and I did OK.
        
               | whimsicalism wrote:
               | Did you go to school in the US in the last 2-3 decades?
        
               | replygirl wrote:
               | Once you start calculus they let you use a real
               | calculator
        
               | whimsicalism wrote:
               | That may be true, but in the US there are typically math
               | courses before calculus.
        
               | replygirl wrote:
               | But then we get a calculator.
        
               | whimsicalism wrote:
               | Even then, it is typically not a symbolic calculator so
               | if your answer is a closed form function of variables,
               | you're SOL with a TI-84.
        
               | adamsmith143 wrote:
               | Maybe we went to radically different schools but I
               | certainly had to calculate by hand using the quadratic
               | formula countless times where calculators were not
               | allowed to be used.
               | 
               | Anyway it distracts from the point so it's not relevant.
        
         | VikingCoder wrote:
         | 70 billion parameters... Is each of those a 4-byte float?
         | 
         | So, is that 280 billion bytes of just parameters?
        
           | sudosysgen wrote:
           | I'm fairly confident each of those is a 2-byte float, but yes
           | that's over 100 GB of parameters.
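            | 
            | The arithmetic (bfloat16 per the paper, so 2 bytes each):
            | 
            | print(70e9 * 2 / 1e9, "GB")   # 140 GB of weights alone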
        
             | sillysaurusx wrote:
             | Welcome to the party! I joined ML because I realized I
             | could help. You can too. I bet you're both already thinking
             | of clever ways to deal with massive models from an
             | infrastructure standpoint. That's just one of hundreds of
             | interesting problems.
        
               | native_samples wrote:
               | Is 100GB of parameters really that large? 128GB of RAM on
               | a server class machine is not unusual. Seems such a model
               | could fit entirely in RAM.
        
               | andbberger wrote:
               | GPU memory is generally much smaller and more expensive
        
               | kristjansson wrote:
               | To elaborate on the sibling comment: main memory is much
               | bigger, but CPUs are much, much slower. It would be a
               | challenge to merely run a model like this on CPU, and
               | totally infeasible to train one. So the challenge is to
               | fit into the memory of a single GPU you can afford,
               | coordinate multiple GPUs, or efficiently page from main
               | memory into GPU.
        
             | Delitio wrote:
              | Is there any source which explains what billions of
              | parameters actually are?
              | 
              | In my mind a parameter is: language, dialect, perhaps
              | context parameters (food, dinner, lunch, travel) and, if we
              | then talk about language and audio, perhaps sound waves,
              | gender.
              | 
              | Or are they context parameters which give you insight? Like,
              | are a billion parameters literally something like
              | travel=false, travel-europe=true, people speaking=e, age,
              | height,
        
               | nl wrote:
               | It's rare a single parameter maps to a human
               | understandable concept. Occasionally someone finds one
               | that does map fairly well, for example this case back in
               | 2017: https://openai.com/blog/unsupervised-sentiment-
               | neuron/#senti...
        
               | jefft255 wrote:
                | The parameters are the weights of the neural network, in
                | this case.
        
               | matt123456789 wrote:
                | A parameter is a scalar value; most of them are in the
                | attention matrices and feedforward matrices. You'll also
                | hear these called "weights". Any intro to DL course will
                | cover these in detail. I recommend starting with Andrew
               | Ng's Coursera class on Intro to Machine Learning,
               | although there may be better ones out there now.
        
               | Delitio wrote:
               | Input parameter vs. weights then?
               | 
               | I see tx
        
               | lostmsu wrote:
               | These networks (text models) usually have around a few
               | thousand inputs.
        
               | brrrrrm wrote:
               | A good visual introduction to neural networks can be
               | found here: https://playground.tensorflow.org
               | 
               | A parameter is a "weight" in this case (the lines drawn
               | from neuron to neuron). The neurons are effectively
               | runtime values or "activations." Parameters (weights) are
               | updated during training and then set as constant during
               | "inference" (also called "prediction").
               | 
               | There's unfortunately a ton of jargon and different
               | groups use different words almost exclusively.
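                | 
                | A single "neuron" in plain Python, to make the
                | weights/activations split concrete (the numbers are
                | invented):
                | 
                | weights = [0.2, -1.3, 0.7]  # parameters: fixed after training
                | inputs = [1.0, 0.5, -2.0]   # runtime values fed to the neuron
                | # activation = ReLU(weighted sum); recomputed for every input
                | activation = max(0.0, sum(w * x for w, x in zip(weights, inputs)))
                | print(activation)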
        
               | dotnet00 wrote:
               | Parameters are just floating point numbers, at most they
               | can be seen as degrees of freedom or kind of like the
               | order of a polynomial used in curve fitting.
               | 
               | They're too abstract to assign much meaning to individual
               | parameters, as our understanding of why their values are
               | exactly the way they are is extremely limited.
        
       ___________________________________________________________________
       (page generated 2022-04-11 23:00 UTC)