[HN Gopher] DeepMind's New Language Model, Chinchilla ___________________________________________________________________ DeepMind's New Language Model, Chinchilla Author : georgehill Score : 195 points Date : 2022-04-11 12:41 UTC (10 hours ago) (HTM) web link (www.marktechpost.com) (TXT) w3m dump (www.marktechpost.com) | g051051 wrote: | Is there a good reference as to what a "parameter" is in this | context? I've looked a few times, but the explanations don't make | any sense to me. | guipsp wrote: | You can think of a parameter as a number you can tweak while | training. This network has 70B such numbers. | sirk390 wrote: | And if every parameter is one byte, the minimum, it will take | at least 70 GB to save or share this model. So it's still way | too big to package directly in an app. | cshimmin wrote: | From the paper, they are using bfloat16, so I guess two | bytes. But distributing and "packaging into an app" are not | at all of practical interest for these kinds of models. You | (a consumer) would interact via some API service, with the | model running on a hardware-accelerated compute cloud. | | In any case, during training (where the model is run in | possibly large batches), and even during inference, the | size of the parameters is completely dwarfed by the | intermediate tensor representations. | brrrrrm wrote: | > even during inference, the size of the parameters is | completely dwarfed by the intermediate tensor | representations | | What makes you say this? | cshimmin wrote: | It's especially true for models that do some kind of | weight sharing, which is very common (CNNs, RNNs, | transformers, etc). For a concrete example, consider a | layer from an image convolutional network, which maps | from a 3-dim colorspace to a 128-dim feature space. | Assuming a 5x5 kernel, that's about 10k parameters. | However, after applying this layer, you go from having a | (B,H,W,3) tensor to a (B,H-4,W-4,128) tensor, where H,W | are the height and width of the image, and B is the | number of images in the batch. If you're working with | even moderately high resolution images, the memory | required for these intermediate tensors at each layer is | much larger than the parameters. | | Something similar applies for RNNs (same weights applied | at each element of a sequence), GNNs and transformers | (same weights applied at each _pair_ of data). | lostmsu wrote: | Have you seen modern games? | sva_ wrote: | I doubt they load that amount of data in memory | replygirl wrote: | I'm thinking about upgrading from 64gb to 128gb so i can | use all my Cities: Skylines assets in the same map | lostmsu wrote: | Right, they usually stream assets as they are requested. | Large models do the same. | cshimmin wrote: | It's a degree of freedom of the learnable model. For example, a | "vanilla" neural network layer (MLP) that maps from M to N | feature dimensions will contain an MxN matrix of learnable | parameters that model the connections between the M inputs and | the N outputs. Every time the model is updated during | backpropagation, the loss gradient which has to be computed has | the same dimensionality as the number of parameters. Also, | generally more parameters means more operations in the forward | pass. Therefore, a model with more parameters in general will | require more FLOPs per iteration of training. The main point of | this paper is that you can actually do better by training a | smaller model for longer, rather than a bigger model for less | time, assuming you have a fixed FLOP budget.
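A quick back-of-the-envelope sketch of the two explanations above: parameter count vs. intermediate activation memory for the 5x5 conv example, plus raw parameter storage for a 70B-parameter model in bfloat16. The batch and image sizes here are illustrative assumptions, not figures from the paper.

      # Rough arithmetic only; no ML framework needed.
      in_ch, out_ch, k = 3, 128, 5
      conv_params = in_ch * out_ch * k * k + out_ch      # 9,728 weights + biases (~10k)

      B, H, W = 32, 512, 512                             # assumed batch of images
      act_elems = B * (H - 4) * (W - 4) * out_ch         # elements in the output activations
      bytes_per_elem = 2                                 # bfloat16

      print(f"conv parameters:   {conv_params:,} (~{conv_params * bytes_per_elem / 1e3:.0f} KB)")
      print(f"activation memory: {act_elems * bytes_per_elem / 1e9:.2f} GB per batch")

      model_params = 70e9                                # a Chinchilla-sized model
      print(f"70B params in bfloat16: {model_params * bytes_per_elem / 1e9:.0f} GB")

With these assumptions the layer holds about 10k parameters (roughly 20 KB) while a single batch of its output activations runs to a couple of gigabytes, which is the disparity described above; the 70B parameters themselves come to roughly 140 GB in bfloat16.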
| zamalek wrote: | The other thing with more parameters is that it gives the NN | more ability to overfit. That means that instead of, say, | learning what a dog is, it instead memorises all the | sentences containing "dog" that it has ever seen. | mangoham wrote: | Cached version since the original is down (I'm assuming it's down | due to load issues and not due to the author taking it down). | https://webcache.googleusercontent.com/search?q=cache:PLSLy9... | ritwikgupta wrote: | Off-topic to Chinchilla, but relevant to the source site: | MarkTechPost consistently borderline plagiarizes articles and | shares them on their website as "paper summaries". They copy-paste | from the source material and change some of the wording | around so as to appear original. My work, as well as other work from | Berkeley AI Research, has been posted in this manner on their | site. | | This seems highly unethical, and I'm surprised that they continue | to operate. | andreyk wrote: | To add to this - they do this regularly, multiple times per | week. While they do link to and acknowledge the source work, | they do not make clear that their writing is quoted or nearly | quoted. | brrrrrm wrote: | Thanks for the heads up! In that case, I'd prefer not to share | this link with peers. Do you have an alternative source with | similar high-level content to share? | lstamour wrote: | Tough to say. Technically | https://arxiv.org/pdf/2203.15556.pdf has the same content, it | just isn't highlighted the same way. | boplicity wrote: | Fill out a DMCA notice: | | https://abuse.cloudflare.com/ | | Cloudflare will forward it to their host, I believe, who will | then ask that they remove the infringing material, or provide a | counter claim. | parhamn wrote: | I don't know about this site, and I agree it's unethical. But it | does make me realize that I much prefer using the language of the | paper directly as opposed to having a non-expert poorly | translate what your paper said. Especially given how much time papers put | into the accuracy and specificity of their language | and word choices. | | Would it also annoy you if they screwed up the interpretation | of what you wrote? Is the alternative less reach for your work? | For hard-core research the tradeoffs are tougher, it seems. If | it is just a matter of non-nevermind, that's strictly messed up. | realYitzi wrote: | We better get used to it. Because news companies will say an AI | wrote it. No law allows suing an AI for plagiarism. Go prove | something is not an AI. | nudpiedo wrote: | No one sues the car, the dog or the children, but the owner, the | person responsible, the parent, etc. | georgehill wrote: | OP here - Thanks for sharing. I wasn't aware of this, but | despite this behavior they are getting 600k visits. | | https://www.similarweb.com/website/marktechpost.com/#overvie... | isaacfrond wrote: | They trained over 400 language models ranging from 70 million to | over 16 billion parameters on 5 to 500 billion tokens while | staying under a given compute budget. The results are modelled, | and they pick the best one. Turns out that a smaller model trained | on more tokens improves performance. | gbasin wrote: | Thank you :) | sirk390 wrote: | Is outperforming GPT-3 still a good reference? It seems there are | many models outperforming GPT-3 in the SuperGLUE benchmark: | https://super.gluebenchmark.com/leaderboard/ GPT-3 is in position | #21, with a 71.8% score. The best model is at 91.2%.
Note the human | baseline in #6 with 89.8% | WithinReason wrote: | > Is outperforming GPT-3 still a good reference? | | It is if you outperform it with far fewer parameters | changoplatanero wrote: | Aren't most of the models at the top not suitable for text | generation? That's what makes GPT different from BERT | colordrops wrote: | What are the models at the top used for? Excuse my ignorance. | priansh wrote: | Mostly mask fill, but Transformers can be fine-tuned on | downstream tasks relatively easily (T5 was built for | translation but is used for autocomplete in many cases) | gfodor wrote: | would you mind sharing some references (or even just | googleable terms) for this process of fine-tuning? | redredrobot wrote: | It's a good reference because people are familiar with GPT-3. | The paper mostly compares Chinchilla to LaMDA, Jurassic, | Gopher, MT-NLG, and GPT-3. In the broader tech industry and | even to a certain extent within the AI field, GPT-3 is the only | one that most people know by name. | screye wrote: | Note that this isn't an apples-to-apples comparison. The GPT-3 | position is for a few-shot use case that has not been trained | for this particular task. When fine-tuned, GPT-3 would be | expected to perform a lot better. Lastly, GPT-3 currently | operates on the text-002 models, and the 3rd iteration of GPT-3 | is generally the one considered current; these benchmarks are | for the original GPT-3 model. | wiz21c wrote: | I understand I can query such a model, one query at a time. But | are there ways to query these models with several queries in a row | such that the (N+1)-th query benefits from the knowledge that was | used to answer the first N questions? Basically, following a | conversation. For example, YouTube subtitles can badly translate | some terms, but if "it" had in mind the overall subject of the | video, then it'd probably pick the correct word... | rolisz wrote: | Yes. That's how you use GPT-3: for the 2nd token, you feed in | your prompt and the first token it returned. Then you feed it | your prompt and the first two output tokens, and so on. | [deleted] | hwers wrote: | Can't wait for DeepMind to take a stab at outcompeting DALL-E. | mrfusion wrote: | Does this imply we will run out of data to keep up with larger | model sizes? | | Is there much more data out there than what they're already | using? | adamsmith143 wrote: | Probably not an issue just yet, think of how much data is | generated by Twitter on a daily basis for example. | zarzavat wrote: | If you want to teach your kid to learn English, and they came | back to you and said _" Dad/mum, I finished reading the | entire internet but I still don't understand English fully"_, | would you say _" OK son, now go and stare at the Twitter | firehose until you grok perfect English"_ ? | | It's clear that these models have orders of magnitude too | much data already. | | It somewhat reminds me of the proposals for larger and larger | colliders in the hopes of seeing new physics that is always | one collider in the future. | lostmsu wrote: | I disagree with this take because you grok English not only | from the text you read, but also from the context of the | physical world around you. And that context is enormous: | assuming 8000x8000x2 vision with 3 color 1-byte channels at | 24fps without compression, you get 3e+17 bytes (300 | petabytes) of data along with your reading per year. | ralfd wrote: | Blind children can learn English fine though.
And there | are highly immaterial areas (mathematics) which people | still reason about. | lostmsu wrote: | You ignored the point. I only brought up sight as an example | (though, admittedly, it is the largest data inflow). | mijoharas wrote: | > It somewhat reminds me of the proposals for larger and | larger colliders in the hopes of seeing new physics that is | always one collider in the future. | | I agree with your main point, but think this analogy isn't | an apt one. If you want to see what particles are created | at higher energies, you kinda need the bigger particle | accelerators. (This isn't to say that we shouldn't be | investigating lower energy collisions, but at a certain | point you do need "bigger colliders" to see new things) | nullc wrote: | > It's clear that these models have orders of magnitude too | much data already. | | I have a toy disproof for your claim that this is clear. | | Imagine that you are training an ML system using oracle | access to Mum. The ML training system can request 10 | million representative samples of Mum output, and then we | could judge if the ML system has adequately reproduced Mum. | | Now also imagine that Mum frequently tells people that Mum | knows a 23-letter secret, and while Mum won't tell people | what it is outright, she'll answer queries about whether a guess is | lexicographically higher or lower. We could even imagine that | the ML has seen Mum's side of some interactions with her | doing that. | | Would the ML know Mum's secret? No. | | Would a child that could interact with Mum? Yes -- after at | most ceil(23 * log2(alphabet)) queries, if the child | is efficient. | | Learning in an interactive context is not the same as | learning from written material, so you can't be sure that | the fact that children learn English from less text means | that a non-interactive ML system could learn English from the | same amount. Q.E.D. | | Now, if someone figures out how to efficiently train these | natural language models with reinforcement learning... | adamsmith143 wrote: | The general point is that there is a huge volume of | training data generated daily, not that Twitter is a great | source of it. Though I believe that GPT-3 for example was | trained on the Common Crawl dataset, which would contain | both Twitter and Reddit. | | >It's clear that these models have orders of magnitude too | much data already. | | Seems like a strange claim. The scaling laws are showing | that you can still make gains with more data and more | parameters. | | >It somewhat reminds me of the proposals for larger and | larger colliders in the hopes of seeing new physics that is | always one collider in the future. | | This is literally true though: we couldn't find the Higgs | without the LHC, and most GUT candidates would only start | being ruled out at high energy levels. | gwern wrote: | Common Crawl actually does not contain Twitter; you can | go check the indexes at https://github.com/ikreymer/cdx-index-client . Twitter is extremely aggressive about | scraping/caching, and I guess that blocks CC. Models like | GPT-3 still know a decent amount of Twitter material, and | I figure that this is due to tweets being excerpted or | mirrored manually in non-Twitter.com URLs (eg all the | Twitter-mirroring bots on Reddit). | zarzavat wrote: | > Seems like a strange claim. The scaling laws are | showing that you can still make gains with more data and | more parameters.
| | But then we've given up on matching human intelligence | which is all about working efficiently with _small_ | training data, and certainly training a human does not | need anywhere near as much data as GPT-3. | | GPT-3 was interesting as a proof-of-concept of what | happens when you use a gigantic amount of training data. | We don't need a bigger one until we can figure out how to | make a smaller one that is just as effective. | | If scaling laws are telling us to keep putting even more | training data into the thing, then the conclusion should | be that the architecture is just not working out. | adamsmith143 wrote: | >But then we've given up on matching human intelligence | which is all about working efficiently with small | training data, and certainly training a human does not | need anywhere near as much data as GPT-3. | | I don't think we should really take so much inspiration | from the brain. We didn't make airplanes work by building | bird machines, so why should we do that here? | | >GPT-3 was interesting as a proof-of-concept of what | happens when you use a gigantic amount of training data. | We don't need a bigger one until we can figure out how to | make a smaller one that is just as effective. | | This feels like a non sequitur. We can certainly keep | making larger models and we will, because we can continue | to make performance gains doing so. | | >If scaling laws are telling us to keep putting even more | training data into the thing, then the conclusion should | be that the architecture is just not working out. | | I don't think anyone in the field would agree with this | point. Researchers see an easy avenue to gain better | performance, so they take it. DeepMind's model shows you | can get similar results with a more refined architecture, | but this was released well after GPT-3. When teams | significantly advance the state of the art with a much | smaller model I think we should take notice, but that | hasn't happened yet. | teraflop wrote: | On the other hand, consider the difficulty of taking massive | amounts of data from the modern web and filtering out the | subset that was actually generated by humans, rather than | previous generations of language models. | adamsmith143 wrote: | Definitely an interesting future problem. I'm sure OpenAI | and others are thinking about it, but I don't think these | models are ubiquitous enough to have much impact just yet. | axg11 wrote: | Some estimates: | | - 500M tweets per day | | - 30 words/tokens per tweet | | - 40% of all tweets thrown away due to being | duplicate/spam/bots | | = 9B tokens generated per day | replygirl wrote: | There's a ton of data that can be exponentially more useful, | but we'll need networks that can (analogously) be late to work | enough times to get fired, or experience heartbreak in | succession while misunderstanding why prior heartbreak | happened, or hallucinate stray cats when they're walking around | the neighborhood at night | kelseyfrog wrote: | It implies our models are wrong. | | Consider that a human adolescence is ~9.46x10^6 minutes and a | fast speaking rate is ~200 words/minute. That sets an upper | bound of 1.9 billion words heard during adolescence, i.e. human | adults are trained on a corpus of less than 1.9B words. | | To some extent, more data can offset worse models, but I don't | think that's the regime we're currently in.
GPT-3 was trained | (on among other languages) 181 billion English words - or about | 100 times more words than a human will hear by the time they | reach adulthood. How is the human brain able to achieve a | higher level of success with 1% of the data? | | 1. | https://github.com/openai/gpt-3/blob/master/dataset_statisti... | Symmetry wrote: | My understanding is that the binding constraint in training | these models is the quantity of computation they consume. | While a human makes do with drastically less input data, we | also have drastically more computational resources in our | heads to work on the problem than Google is using to train | its models. | gwern wrote: | > How is the human brain able to achieve a higher level of | success with 1% of the data? | | The most obvious answer is "the human brain uses a shit-ton | more compute", for 18+ years as well. | | We spend data, which we have in abundance, to save on | compute, which we do not. Even at the most generous low-end | estimates of the human brain's computing power, we are only | barely there; on the high-end estimates that people in love | with the ineffable mysteries of the brain love to cite, we | are multiple orders of magnitude away from even the biggest | supercomputers matching the brain. So no matter which way you | slice it, we are extremely compute-poor. | | Feeding a lot of data through an extremely lightweight | optimizer like first-order SGDs is one way to cope with | lacking compute: | https://www.gwern.net/docs/ai/scaling/2013-bottou.pdf Bottou | asks why (even in 2013!) is SGD so hard to dethrone when we | can empirically see plenty of optimizers like second-order | gradient descent algorithms which can beat SGD quite solidly? | His observation is that while they are much better than SGD | in terms of iterations or _n_, they lose in compute/wallclock | because SGD can just go-brrrr through the data much faster | than they can. | nynx wrote: | Yeah, there are ~100B neurons, ~1Q synapses, but how much | compute is the brain actually using over time? | | Some quick googling gives this: | | - Generation of an action potential seems to use ~2.5x10^-7 | J [0] | | - The brain consumes around 20W during normal activity | | This seems to imply that there are around 8x10^7, call it | 10^8, activations per second [1]. | | Apparently, the average neuron has 1000 synapses. Let's say | each synapse requires 10 mulacc operations per activation. | Doing that math gives about 10^12 FLOPs/s [2]. | | Integrate that over 18 years, and you get roughly 5.7x10^20 | FLOPs [3]. | | PaLM required 2.56x10^24 FLOPs to train [4]. So, we have | (way more than) enough compute, we're just not using it | efficiently. We're wasting a lot of FLOPs on dense matrix | multiplication. | | There's plenty of wiggle room in these calculations. I | checked over the math, but I'd appreciate if someone would | let me know if I've missed something. | [0]: | https://link.springer.com/article/10.1007/s11571-018-9503-3 | [1]: https://www.wolframalpha.com/input?i2d=true&i=Divide%5 | B20+W%2C2.5%E2%80%89%C3%97%E2%80%89Power%5B10%2C%E2%88%927% | 5D+Joules%5D [2]: https://www.wolframalpha.com/inpu | t?i2d=true&i=Power%5B10%2C8%5D+Hz+*+1000+*+10+flop | [3]: https://www.wolframalpha.com/input?i2d=true&i=Power%5B | 10%2C12%5D+Divide%5BFLOP%2Cs%5D+*+18+years [4]: | https://blog.heim.xyz/palm-training- | cost/#:~:text=PaLM%20(2022)-,2.5e24,-10x*** | nynx wrote: | Yeah, this implies backpropagation is deeply suboptimal. 
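The estimate above, written out as a quick sketch; every input is one of the rough figures quoted in the thread (energy per spike, 20 W of brain power, ~1000 synapses per neuron, ~10 ops per synapse), not a measurement.

      # Back-of-the-envelope brain-compute estimate from the thread.
      joules_per_spike  = 2.5e-7                    # energy per action potential (estimate)
      brain_power_watts = 20.0                      # approximate brain power draw
      spikes_per_s      = brain_power_watts / joules_per_spike     # ~8e7 activations/s

      synapses_per_neuron = 1000
      ops_per_synapse     = 10                      # assumed multiply-accumulates per activation
      brain_flops_per_s   = spikes_per_s * synapses_per_neuron * ops_per_synapse   # ~1e12

      seconds_in_18_years = 18 * 365.25 * 24 * 3600                # ~5.7e8 s
      brain_flops_total   = brain_flops_per_s * seconds_in_18_years

      palm_training_flops = 2.56e24
      print(f"brain estimate over 18 years: {brain_flops_total:.1e} FLOPs")
      print(f"PaLM training compute:        {palm_training_flops:.1e} FLOPs")
      print(f"ratio (PaLM / brain):         {palm_training_flops / brain_flops_total:.0f}x")

Using 8e7 spikes/s directly gives ~4.5e20 FLOPs; rounding up to 10^8 as in the comment gives the ~5.7x10^20 figure. Either way it sits several orders of magnitude below PaLM's ~2.56x10^24 training FLOPs.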
| kelseyfrog wrote: | That is certainly a possibility. The other (non-mutually | exclusive) implications may also be that human language | acquisition benefits from being part of a multi-task model. | Or that the problem has been overreduced ie: human language | acquisition cannot simply be distilled into a words- | in->words-out problem and that vision/hearing are actually | integral parts of language acquisition that cannot be left | out. Or that model arch still has major improvements to be | made and attention is not all you need, for example. | fpgaminer wrote: | > and that vision/hearing are actually integral parts of | language acquisition | | Deaf-blind authors would beg to differ. | | But yes, a human brain is exposed to lots of other | sensory input, and we know from other research that | multi-modal models can learn shared representations that | benefit from the knowledge of each domain. | | In Transformer's favor, at least, they are far closer to | tabula rasa than the human brain is and likely have to | dedicate a lot of their training time to things that are | otherwise "baked" into human brains. For example, humans | come pre-packaged with V1 and V2 as part of their visual | system, but CNNs and ViTs have to learn those filter | packages from scratch. | | I agree with you though. Human brains are able to take | single instances of experiences and build a wealth of | understanding from them in ways that even modern | Transformer architectures are not yet able. | kristintynski wrote: | It seems like internal language (thinking in language) is | also a way our brains train themselves too? I've probably | thought 100x more words than I've spoken. | snovv_crash wrote: | This would map to a sort of semi-supervised approach. For | a lot of problems this has shown to drastically reduce | the data requirements, but can bump up compute. | | All those conversations in the shower were actually | regularizers! | ianbutler wrote: | This is exciting if only because as we discover more compute | optimal models that out perform the behemoths that have been | state of the art it opens up the ability for smaller independent | groups to train and release their own versions, more fully | democratizing AI. Looking forward to a group like Eluther or | Hugging Face releasing a version of this. | adamsmith143 wrote: | >This is exciting if only because as we discover more compute | optimal models that out perform the behemoths that have been | state of the art it opens up the ability for smaller | independent groups to train and release their own versions, | more fully democratizing AI. | | I think I support this in principle but it seems like the | scaling curves keep going so it's easier to just make larger | models with more data. | | >Looking forward to a group like Eluther or Hugging Face | releasing a version of this | | Both of those groups have access to dozens if not hundreds of | Cloud GPUs, I'd hardly call them small. | | It would be impossible to replicate these models as say an | independent researcher or even in an academic research group | outside of maybe Stanford/Berkeley/MIT/etc. and I'd even doubt | their ability to replicate models like this based purely on | Cost alone. | ianbutler wrote: | Small is relative -- and to Google, Facebook and Microsoft | they're positively tiny. Perfect is the enemy of good or some | such and I think this is a move in the right direction even | if I can't personally train this on my 3090. 
| mark_l_watson wrote: | The design of the original Transformer model in the Attention Is | All You Need paper was predicated on efficiency (all layers the | same size, combining word/token embeddings with sinusoidal | positional embeddings encoding position in the input stream). It | is good to see improvements! | narrator wrote: | I'd love to take a language model, load it up, and train it on | everything I write in online learning mode. Does one need some | massive hardware to do online learning with these models instead | of just running the distilled final models? | alpineidyll3 wrote: | If these things get put on specialized hardware for inference | with much lower energy costs, the world will never be the same. | hwers wrote: | Imagine any diffusion-style text-to-image model on specialized | ASIC hardware. | astrange wrote: | That's what an ANE/TPU is. | | If you mean putting the model weights into gates directly, | it'd be useless because users would get bored of the model as | soon as they figured out what its style looked like. Also, | large models can memorize their training data, so eventually | you'll get it to output something copyrighted. | lobstey wrote: | The biggest problem, first of all, might be the memory | requirements given so many parameters. It couldn't be as cheap | as a high-end computer in the foreseeable future. | f38zf5vdt wrote: | There is probably a space-time trade-off that needs to be | explored in this space. It might be possible to preload some | of the most likely tokens to be selected next into the | cache and/or RAM. These are glorified auto-complete | algorithms that are poorly understood, as DeepMind's | optimizations appear to show. For the English language, it is | probable that there are only so many possible grammatically | correct selections for the next token, for example. | visarga wrote: | Glorified autocomplete? Autocomplete can guess the next | word... sometimes. GPT-3 goes hundreds of words ahead. On | generic topics it can be hard to distinguish from human | text. | | And it can't cache tokens, because all tokens are evaluated | in the context of all the other tokens, so they don't have | the same representations when they reoccur at different | positions. | f38zf5vdt wrote: | They're evaluated in the context of the last 2^n | tokens; for many models it is 1024, 2048, or 4096 tokens | as a scanning window. The tokens (words and sometimes | punctuation) are represented by integer values, so the | last 2^n tokens would certainly qualify for storage | in a cache. Then next-token selection only has so many | possible assignable selections in any given language | model because of grammatical limitations. This is only | one such optimization; there could also be optimizations | around the likelihood of certain words being used given | the presence of certain previous tokens, and so on. | | But, yes, tokens are chosen one word at a time based on | the previous content, similar to earlier auto-completion | algorithms. | priansh wrote: | I've been saying this for years: language models are the ML | equivalent of the billionaire space race. It's just a bunch | of orgs with unlimited funding spending millions of dollars | on compute to get more parameters than their rivals. It | could be decades before we start to see them scale down or | make meaningful optimizations. This paper is a good start, | but I'd be willing to bet everyone will ignore it and | continue breaking the bank. | | Can you say that about any other task in ML?
When | Inceptionv3 came out I was able to run the model pretty | comfortable on a 1060. Even pix2pix and most GANs fit | comfortably in commercial compute, and the top of the line | massive models can still run inference on a 3090. It's so | unbelievably ironic that one of the major points | Transformers aimed to solve when introduced was the compute | inefficiency of recurrent networks, and it's devolved into | "how many TPUs can daddy afford" instead. | native_samples wrote: | Is that fair? My Pixel phone seems to run nothing but ML | models of various kinds and they run _locally_ which is | madness, pure madness. It can recognize songs and my | speech without talking to the cloud at all. That 's | pretty much the definition of optimization! | galcerte wrote: | I have to ask, why call it that? I had a chuckle once I saw the | name. | redredrobot wrote: | It outperforms the Gopher model | cshimmin wrote: | Yeah, similar "thematic" naming to MacOS versions. I don't | know why the original one was called Gopher, though. | goodside wrote: | Because it retrieves facts from memory in a way that's | analogized to a gopher retrieving objects. | gwern wrote: | There were a lot of complaints about earlier models being | named, say, 'Meena'. (It's very sexist, you know, to name a | chatbot a female name.) People won't complain about | 'Chinchilla' because chinchillas are adorable. PaLMs aren't | adorable, but at least it's neutral. | MrBuddyCasino wrote: | Its not so bad. If they were radio astronomers they'd call it | Very Big Neuronal Language Model. IBM would call it Watson | Advanced AI. If they were a gamer accessory company they'd call | it DeepTek Ultra Pro VDH-Max AI A320M. Chinchilla is nice and | fluffy. | farmin wrote: | It's the name of a town in QLD. | binarymax wrote: | Large language models have a (recent) history of silly names. | BERT, BART, ELMO, RoBERTa, BIGBIRD, PaLM, Megatron etc. Might | as well go full nonsense. | DSingularity wrote: | A touch of irony that cutting edge research on language can't | produce better names. | omarhaneef wrote: | True. I will add that it is customary to justify it by | demonstrating it is some sort of acronym or contraction. | yeetsfromhellL2 wrote: | It's a recursive, selective acronym | C CH CHI | CHIN CHINC CHINCH | CHINCHI CHINCHIL CHINCHILL ==> | CHINCHILLA HINCHILLA INCHILLA | NCHILLA CHILLA HILLA ILLA | LLA LA A | omarhaneef wrote: | I know what recursive means, I know what selective means, | I know what an acronym is, and I think I see the pattern | in that picture, but when I put it all together I am | lost. | | Alternatively, is this a joke and the "recursive, | selective acronym" can be used to justify any word? | veonik wrote: | A AR ARB | ARBI ARBIT ARBITR | ARBITRA ARBITRAR ==> ARBITRARY | RBITRARY BITRARY ITRARY | TRARY RARY ARY RY | Y | | Yup, seems it works for any word. | MisterTea wrote: | My theory is since no one reads literature anymore, timeless, | interesting and unique names from history and other cultures | are lost to a deluge of soon to be forgotten gag, pop-culture | and meme names. Perhaps this is why we have Chinchilla and | not Oberon. | jankeymeulen wrote: | Like the Oberon OS and programming language? | jstx1 wrote: | Image models too - the Inception paper from 2014 directly | refers to knowyourmeme.com and the "we need to go deeper" | meme from the movie Inception - | https://knowyourmeme.com/memes/we-need-to-go-deeper - it's | the first reference in the paper [1] and it's also why the | model is called that way. 
| | [1] https://arxiv.org/pdf/1409.4842.pdf | ShamelessC wrote: | Seems the link is down. Found a decent synopsis/discussion on | lesswrong. | | https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-scalin... | | > On March 29th, DeepMind published a paper, "Training Compute- | Optimal Large Language Models", that shows that essentially | everyone -- OpenAI, DeepMind, Microsoft, etc. -- has been | training large language models with a deeply suboptimal use of | compute. | | > Following the new scaling laws that they propose for the | optimal use of compute, DeepMind trains a new, 70-billion | parameter model that outperforms much larger language models, | including the 175-billion parameter GPT-3 and DeepMind's own | 270-billion parameter "Gopher". | gyang wrote: | I think there remains an immense amount of such suboptimality | still hanging from the tree, so to speak. | | For example, our recent paper "Tensor Programs V: Tuning Large | Neural Networks via Zero-Shot Hyperparameter Transfer"[1] shows | that even learning rate and initialization used by existing | models are _deeply wrong_. By just picking them correctly | (which involves some really beautiful mathematics), we can | effectively double the model size of the GPT-3 6.7B model (to | be comparable in quality to the 13B model across the suite of | benchmark tasks). | | Large neural networks behave in a way we are only beginning to | understand well just because each empirical probe of any such | model is so much more expensive and time consuming than typical | models. But principled theory here can have a lot of leverage | by pointing out the right direction to look, as it did in our | work. | | [1] http://arxiv.org/abs/2203.03466 | p1esk wrote: | What do you think about the concept of "critical batch size"? | https://openai.com/blog/science-of-ai/ | gyang wrote: | I think the concept makes sense. The basic insight, that | the right batch size depends on the difficulty and | noisiness of a task, is already used by teams. For example, | the PaLM paper from last week increased its batch size | throughout training. | | But as far as I know, the more precise predictions of | optimal batch size aren't used much, probably because it's | expensive to measure accurately, or because the predictive | equation isn't accurate enough to begin with. I wonder if | we can "transfer" the optimal batch size from a smaller | setting (smaller model or data) to the full setting, like | in our paper. This would make it much more practical. | eigenvalue wrote: | According to the LessWrong post, the smaller model trained on | more data performs better on most of the tasks, but it's worse | on "college level math" questions. I wonder why that is. Is it | because the extra capacity of the larger model was used to | basically memorize theorems? Or is it because the extra "brain | power" let it model the math better? Oddly, one of the tasks | that the smaller most outperformed the larger model on is "high | school level math"! Very counterintuitive, and I am curious if | there are any big takeaways lurking in that disparity. | ShamelessC wrote: | Gwern responded to a similar question in the comments | section. | | (parent) | | > the fact that data and compute need to scale proportionally | seems... like a big point in favor of NNs as | memorizers/interpolators. | | (gwern) | | > Surely it's the opposite? The more bang you get out of each | parameter, the less it looks like 'just' (whatever that | means) memorization/interpolation. 
When you needed to | increase parameters a lot, disproportionately, to cope with | some more data, that does not speak well of abstractions or | understanding. (If I can train a 1t model to get the same | loss as what I thought was going to take a 100t model, why | would I think that that 100t model must be | memorizing/interpolating less?) Let's take your claim to its | logical extreme: suppose we discovered tomorrow a scaling law | that made parameters near-constant (log, let's say); would | that not suggest that those parameters are super useful and | it's doing an amazing job of learning the underlying | algorithm and is not memorizing/interpolating? | sillysaurusx wrote: | This isn't addressing their question. And Gwern's goal here | is to (incorrectly) try to get rid of the idea that models | are just memorizing and interpolating, when in fact | memorization and interpolation is what we all do, including | models. He's just bothered by the idea that people think of | models as less than magic. | | On the other hand, https://twitter.com/model_mechanic/statu | s/151297688118364569... is admittedly pretty magical, even | if the basis of that magic is memorization and | interpolation. | VirusNewbie wrote: | Why do you say they just memorize and interpret? I can | teach GPT-2 new things, including new objects and their | physical properties and it does a good job with that. | That also means it has definitely not just regurgitated a | matching sentence back to me. | replygirl wrote: | when i see a new object for the first time, i MEMORIZE | what i INTERPRET as its identifying traits, and ask | someone who has already MEMORIZED what that object is to | INTERPRET a concept with which i can associate those | traits. the next time i encounter an object with those | traits i can then recall the associations, then compose | those trait-level interpretations into an interpretation | of an object. | | at a fundamental level that's all this is, compositions | of associated memorizations and interpretations, which | map to compositions of sentence parts the machine can | regurgitate back to you | rictic wrote: | To rebut someone's argument you must address the argument | and not just talk about them and their motivations | | From your comment a reader will understand that you think | they're just memorizing and interpolating and that you | disagree with gwern on this point, but you've given your | reader nothing that argues in favor of your position | | Why should someone believe that models are just | memorizing and interpolating? | yldedly wrote: | It's impossible for a piecewise linear function to be | anything other than linear outside the training sample. | They are by their definition unable to do anything but | interpolate. | danuker wrote: | It might just be by chance: the initial weights of one model | could have been lucky in some areas, and unlucky in others. | There's no way to tell other than training again, which is a | costly proposition. | eigenvalue wrote: | That seems pretty unlikely to me actually. As the models | and training data get much bigger, I think the initial | weights become less important (at least assuming your | random weights have certain desirable statistical | properties, which they do by construction usually). | [deleted] | adamsmith143 wrote: | Probably right. 
Most people dump on these language models for | this reason, but it would be absurd for an HS student to have | to re-derive the quadratic formula every time they worked on | an algebra problem, so naturally you memorize it. Why should | it be any different for a language model? | eutectic wrote: | I never memorized the quadratic formula, and I did OK. | whimsicalism wrote: | Did you go to school in the US in the last 2-3 decades? | replygirl wrote: | Once you start calculus they let you use a real | calculator | whimsicalism wrote: | That may be true, but in the US there are typically math | courses before calculus. | replygirl wrote: | But then we get a calculator. | whimsicalism wrote: | Even then, it is typically not a symbolic calculator, so | if your answer is a closed-form function of variables, | you're SOL with a TI-84. | adamsmith143 wrote: | Maybe we went to radically different schools, but I | certainly had to calculate by hand using the quadratic | formula countless times where calculators were not | allowed to be used. | | Anyway, it distracts from the point so it's not relevant. | VikingCoder wrote: | 70 billion parameters... Is each of those a 4-byte float? | | So, is that 280 billion bytes of just parameters? | sudosysgen wrote: | I'm fairly confident each of those is a 2-byte float, but yes, | that's over 100 GB of parameters. | sillysaurusx wrote: | Welcome to the party! I joined ML because I realized I | could help. You can too. I bet you're both already thinking | of clever ways to deal with massive models from an | infrastructure standpoint. That's just one of hundreds of | interesting problems. | native_samples wrote: | Is 100GB of parameters really that large? 128GB of RAM on | a server-class machine is not unusual. Seems such a model | could fit entirely in RAM. | andbberger wrote: | GPU memory is generally much smaller and more expensive | kristjansson wrote: | To elaborate on the sibling comment: main memory is much | bigger, but CPUs are much, much slower. It would be a | challenge to merely run a model like this on CPU, and | totally infeasible to train one. So the challenge is to | fit into the memory of a single GPU you can afford, | coordinate multiple GPUs, or efficiently page from main | memory into the GPU. | Delitio wrote: | Is there any source which explains what billions of | parameters actually are? | | In my mind a parameter is: language, dialect, perhaps | context parameters (food, dinner, lunch, travel) and, if we | then talk about language and audio, perhaps sound waves, | gender. | | Or are they context parameters which give you insight? Like a | billion parameters are literally something like | travel=false, travel-europe=true, people speaking=e, age, | height, | nl wrote: | It's rare that a single parameter maps to a human- | understandable concept. Occasionally someone finds one | that does map fairly well, for example this case back in | 2017: https://openai.com/blog/unsupervised-sentiment-neuron/#senti... | jefft255 wrote: | The parameters are the weights of the neural | network, in this case. | matt123456789 wrote: | A parameter is a scalar value, most of which are in the | attention matrices and feedforward matrices; you'll also | hear these called "weights". Any intro to DL course will | cover these in detail. I recommend starting with Andrew | Ng's Coursera class on Intro to Machine Learning, | although there may be better ones out there now. | Delitio wrote: | Input parameters vs. weights then?
| | I see tx | lostmsu wrote: | These networks (text models) usually have around a few | thousand inputs. | brrrrrm wrote: | A good visual introduction to neural networks can be | found here: https://playground.tensorflow.org | | A parameter is a "weight" in this case (the lines drawn | from neuron to neuron). The neurons are effectively | runtime values or "activations." Parameters (weights) are | updated during training and then set as constant during | "inference" (also called "prediction"). | | There's unfortunately a ton of jargon and different | groups use different words almost exclusively. | dotnet00 wrote: | Parameters are just floating point numbers, at most they | can be seen as degrees of freedom or kind of like the | order of a polynomial used in curve fitting. | | They're too abstract to assign much meaning to individual | parameters, as our understanding of why their values are | exactly the way they are is extremely limited. ___________________________________________________________________ (page generated 2022-04-11 23:00 UTC)