[HN Gopher] The AI Scaling Hypothesis
___________________________________________________________________

The AI Scaling Hypothesis

Author : andreyk
Score  : 92 points
Date   : 2022-10-07 16:29 UTC (6 hours ago)

(HTM) web link (lastweekin.ai)
(TXT) w3m dump (lastweekin.ai)

| kelseyfrog wrote:
| There's a deeper, more troubling problem being exposed here: deep
| learning systems are at least an order of magnitude less data
| efficient than the systems they hope to replicate.
|
| GPT-3 175B was trained on 499 billion tokens [1]. Let's assume
| token = word for the sake of this argument [2]. The average adult
| reads at a rate of 238 wpm [3]. Then a human who reads 24 hrs/day
| from birth until their 18th birthday would read a total of about
| 2.25 billion words [4], or roughly 0.45% of the words GPT-3 was
| trained on.
|
| Humans simply do much more with much less. So what gives? I don't
| disagree that we still haven't reached the end of what scaling
| can do, but there is a creeping suspicion that we've gotten
| something fundamentally wrong on the way there.
|
| 1. https://lambdalabs.com/blog/demystifying-gpt-3/
|
| 2. GPT-based models use BPE, and while we could dive into the
| actual dictionary of tokens and work out a word-to-token ratio,
| we can agree that although the relationship isn't 1-to-1, it
| won't change the conclusion.
| https://huggingface.co/docs/transformers/tokenizer_summary
|
| 3. https://psyarxiv.com/xynwg/
|
| 4. 238*60*24*365*18 = 2,251,670,400
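A quick sanity check of the arithmetic above, as a Python sketch.
The token ~= word equivalence is the comment's own simplifying
assumption, and 499B is the GPT-3 training-token count from [1]:

    WPM = 238                  # average adult reading rate, words/min
    YEARS = 18
    GPT3_TOKENS = 499e9        # GPT-3 training tokens, per [1]

    words_read = WPM * 60 * 24 * 365 * YEARS
    print(f"{words_read:,}")                  # 2,251,670,400
    print(f"{words_read / GPT3_TOKENS:.2%}")  # 0.45%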
  | chaxor wrote:
  | You're right about reference [2], which can alter things by up
  | to an order of magnitude (a word can map to multiple tokens).
  | Additionally, as others have pointed out, we don't live
  | _entirely in the text world_. So we have the nice benefit of
  | understanding objects from visual and proprioceptive inputs,
  | which is huge. The poverty-of-data argument made well known by
  | Noam Chomsky et al. is certainly worth discussing in academia;
  | however, I am not as moved by these arguments about the stark
  | differences in input required between humans and ML as I once
  | was. In image processing, for example, sending 10k images in
  | rapid succession with no other proprioceptive inputs, time
  | dependencies, or agent-driven exploration of spaces puts these
  | systems at an enormous disadvantage in learning certain
  | phenomena (classes of objects or otherwise).
  |
  | Of course there are differences between the systems, but I'm
  | becoming more skeptical of the claim that the newer ML systems
  | can't learn as much as biological systems given the _same
  | input_ (obviously this is where a lot is hidden).
    | kelseyfrog wrote:
    | Thank you for the tokens-to-words factor! Much appreciated.
    |
    | I'm definitely in agreement that multi-task models represent
    | an ability to learn more than any one specialized model, but
    | I think it's a bit of an open question whether multi-task
    | learning alone can fully close the digital-biological gap.
    | Of course, I'd be very happy to be proven wrong on this by
    | empirical evidence in my lifetime :)
  | [deleted]
  | gamegoblin wrote:
  | Humans take in a tremendously high bitrate of data via other
  | senses and are able to _connect_ those to the much lower
  | amount of language input, such that the language can go much
  | further.
  |
  | GPT-3 is learning everything it knows about the entire
  | universe _just from text_.
  |
  | Imagine we received a 1TB information dump from a civilization
  | that lives in an alternate universe with entirely different
  | physics. How much could we learn just from this information
  | dump?
  |
  | And from our point of view, it could be absurdly exotic. Maybe
  | their universe doesn't have gravity or electromagnetic
  | radiation. Maybe the life forms in that universe spontaneously
  | merge consciousnesses with other life forms and separate
  | randomly, so whatever writing we have received is in a style
  | that assumes the reader can effortlessly deduce that the
  | author is actually a froth of many consciousnesses. And on the
  | grand spectrum of how weird things could get, the "exotic"
  | universe I have described is really basically identical to our
  | own, because my imagination is limited.
  |
  | Learning about a whole exotic universe from just an info dump
  | is the task of GPT-3. For instance, much of our writing takes
  | for granted that solid objects don't pass through each other.
  | I dropped the book. Where is the book? On the floor. Very few
  | bits of GPT-3's training set include the statements "a book is
  | a solid object", "the floor is a solid object", or "solid
  | objects don't pass through each other", but it can infer this
  | principle and others like it.
  |
  | From this point of view, its shortcomings make a lot of sense:
  | some things GPT fails at are obvious to us, having grown up in
  | this universe. I imagine we're going to see an explosion of
  | intelligence once researchers figure out how to feed AI
  | systems large swaths of YouTube and such, because then they
  | will have a much higher-bandwidth way to learn about the
  | universe and how things interact, connecting language to
  | physical reality.
    | saynay wrote:
    | One of the more interesting things I have seen recently is
    | the combination of different domains in models / datasets.
    | The top network of Stable Diffusion combines text-based and
    | image-based descriptions, where the model learns to
    | represent either text or images in the same embedding space;
    | a picture, or a caption for that picture, leads to similar
    | embeddings.
    |
    | Effectively, this can broaden the context the network can
    | learn from. There are relationships that are readily
    | apparent to something that learned from images that might
    | not be apparent to something trained only on text, or vice
    | versa.
    |
    | It will be interesting to see where that goes. Will it be
    | possible to make a single multi-domain encoder that can take
    | a wide range of inputs and create an embedding (a "mental
    | model" of the input), and have this one model be usable as
    | the input for a wide variety of tasks? Can something trained
    | on multiple domains learn new concepts faster than a network
    | that is single-domain?
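The shared text/image embedding described here is the idea behind
CLIP, whose text encoder Stable Diffusion uses. A minimal sketch
with the open-source checkpoint on Hugging Face (the image file and
captions are placeholders, not anything from the thread):

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("cat.jpg")  # placeholder image
    texts = ["a photo of a cat", "a photo of a banana"]

    inputs = processor(text=texts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)

    # Image and captions live in the same embedding space; the
    # scaled cosine similarities show which caption matches.
    print(out.logits_per_image.softmax(dim=-1))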
      | Teever wrote:
      | I would love to see a model trained on blueprints or a
      | model trained on circuit diagrams.
      |
      | text2blueprint or wav2schematic could produce some
      | interesting things.
        | Jensson wrote:
        | They haven't even figured out basic math, so I'm not
        | sure what you would expect to find there. They aren't
        | smart enough to generate structure that doesn't already
        | exist.
          | visarga wrote:
          | Depends on the method. Evolutionary methods can
          | absolutely find structure that we missed, and they
          | often go hand in hand with learning. Like AlphaGo's
          | move 37.
    | cma wrote:
    | Google's Imagen was trained on about as many images as a
    | 6-year-old would have seen over their lifetime at 24 fps,
    | and a whole lot more text. It can draw a lot better and
    | probably has a better visual vocabulary, but is also way
    | outclassed in many ways.
    |
    | Poverty of the stimulus is a real problem, and it may mean
    | our starting-point architecture from genetics has a lot of
    | learning built in, rather than just being a bunch of
    | randomly connected, uninitialized weights. In many species,
    | a newborn animal can get up and walk right away.
    |
    | https://www.youtube.com/watch?v=oTNA8vFUMEc
    |
    | Humans are born with giant heads and muscles too weak for
    | that, but can swim around like little seals pretty quickly
    | after birth.
      | visarga wrote:
      | > our starting point architecture from genetics has a lot
      | of learning built in
      |
      | I don't doubt that evolution provided us with great priors
      | that help us be fast learners, but there are two more
      | things to consider.
      |
      | One is scale - the brain is still ~10,000x more complex
      | than large language models. We know that smaller models
      | need more training data, so our brain, being many orders
      | of magnitude larger than GPT-3, naturally learns faster.
      |
      | The second is social embedding - we are not isolated; our
      | environment is made of human beings. Similarly, an AI
      | would need to be trained as part of human society, or even
      | as part of an AI society, but not alone.
      | gamegoblin wrote:
      | Definitely. I do think video is _much_ more important than
      | images, because video implicitly encodes physics, which is
      | a huge deal.
      |
      | And, as you say, there are probably some
      | structural/architectural improvements to be made in the
      | neural network as well. The mammalian brain has had a few
      | hundred million years to evolve such a structure.
      |
      | It also remains unclear how important learning causal
      | influence is. These networks are essentially "locked in"
      | from inception: they can only take the world in, whereas
      | animals actively probe and influence their world to learn
      | causality.
        | [deleted]
        | akiselev wrote:
        | The mammalian brain has had a few hundred million years
        | to evolve _neural plasticity_ [1], which is the key
        | function missing in AI. The brain's structure isn't set
        | in stone but develops over one's lifetime, and can even
        | carry out major restructuring on a short time scale in
        | some cases of massive brain damage.
        |
        | Neural plasticity is the algorithm running on top of our
        | neural networks that optimizes their structure as we
        | learn, so not only do we get more data, but our brains
        | get better tailored to handling that kind of data. This
        | process continues from birth to death, and physical
        | experimentation in youth is a key part of that
        | development, as is social experimentation in social
        | animals.
        |
        | I think it "remains unclear" only to the ML field. From
        | the perspective of neuroscientists, current neural
        | networks aren't even superficially at the complexity of
        | axon-dendrite connections with ion channels and
        | threshold potentials, let alone the whole system.
        |
        | A family member's doctoral thesis was on the
        | potentiation of signals, and based on my understanding
        | of it, every neuron takes part in the process with its
        | own "memory" of sorts, and the potentiation she studied
        | was just one tiny piece of the neural plasticity story.
        | We'd need to turn every component in the hidden layers
        | of a neural network into its own massive NN with its own
        | memory to even begin to approach that kind of
        | complexity.
        |
        | [1] https://en.m.wikipedia.org/wiki/Neuroplasticity
    | alasdair_ wrote:
    | This is a fantastically good point. I think things will get
    | even more interesting once the ML tools have access to more
    | than just text, audio, and image/video information. They
    | will be able to draw inferences that humans are generally
    | unaware of. For example, maybe something happens in the
    | infrared range that humans are generally oblivious to, or
    | maybe inferences can be drawn from how radio waves bounce
    | around an object.
    |
    | "The universe" according to most human experience misses SO
    | much information, and it will be interesting to see what
    | happens once we have agents that can use all this extra
    | stuff in realtime and "see" things we cannot.
    | visarga wrote:
    | The hypothesis that you can't learn some things from text
    | alone - that you need real-life experience - is intuitive,
    | and I used to think it was true. But there are interesting
    | results from just a few days ago suggesting that text by
    | itself is also enough:
    |
    | > We test a stronger hypothesis: that the conceptual
    | representations learned by text only models are functionally
    | equivalent (up to a linear transformation) to those learned
    | by models trained on vision tasks. Specifically, we show
    | that the image representations from vision models can be
    | transferred as continuous prompts to frozen LMs by training
    | only a single linear projection.
    |
    | Linearly Mapping from Image to Text Space -
    | https://arxiv.org/abs/2209.15162
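The recipe in that paper is strikingly small. A sketch of the idea
in PyTorch - both the vision encoder and the LM stay frozen, and
the dimensions below are illustrative stand-ins, not the paper's:

    import torch
    import torch.nn as nn

    D_VISION, D_LM = 768, 2048   # feature dims (illustrative)

    class LinearPrompt(nn.Module):
        def __init__(self):
            super().__init__()
            # The only trainable parameters in the whole setup.
            self.proj = nn.Linear(D_VISION, D_LM)

        def forward(self, image_features):
            # image_features: (batch, n_patches, D_VISION) from a
            # frozen vision model. The projected features are fed
            # to the frozen LM as continuous (soft) prompt tokens,
            # in place of word embeddings.
            return self.proj(image_features)

    prompt = LinearPrompt()(torch.randn(1, 16, D_VISION))
    print(prompt.shape)  # torch.Size([1, 16, 2048])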
      | ummonk wrote:
      | The claim isn't that you can't learn it from text, but
      | rather that this is why models require so much text to
      | train on - because they're learning the stuff that humans
      | learn from video.
    | thrown_22 wrote:
    | > Humans take in a tremendously high bitrate of data via
    | other senses and are able to connect those to the much lower
    | amount of language input such that the language can go much
    | further.
    |
    | They don't. Human bitrates are quite low, all things
    | considered. The eyes, which by far produce the most
    | information, only have a bitrate equivalent to ~2 kbps:
    |
    | http://www.princeton.edu/~wbialek/our_papers/ruyter+laughlin...
    |
    | The rest of the input nerves don't bring us over 20 kbps.
    |
    | The average image-recognition system has access to more data
    | than that and can tell the difference between a cat and a
    | banana. A human has somewhat more capability than that.
  | andreyk wrote:
  | I think comparing to humans is a bit of a distraction, unless
  | what you care about is replicating the way human intelligence
  | works in AI. The mechanisms by which learning is done (in
  | these cases self-supervised and supervised learning) are not
  | at all the same as the ones humans have, so it's unsurprising
  | that the qualitative aspects are different.
  |
  | It may be argued that we need more human-like learning
  | mechanisms. Then again, if we need internet-scale data to
  | achieve human-level general intelligence, so what? If it
  | works, it works. Of course, the comparison has some value in
  | terms of knowing what can be improved and so on, especially
  | for RL. But I wouldn't call this a 'troubling problem'.
  | MonkeyMalarky wrote:
  | Humans also have millions of years of evolution that have
  | effectively pre-trained the structure and learning ability of
  | the brain. A baby isn't born knowing a language, but it is
  | born with the ability to learn one efficiently.
    | peteradio wrote:
    | Indeed, there is a certain hardcoding that can efficiently
    | synthesize language. Doesn't that raise the question... what
    | is the missing hardcoding for AI that could enable it to
    | synthesize from much smaller samples?
      | myownpetard wrote:
      | There is a great paper, Weight Agnostic Neural Networks
      | [0], that explores this topic. They evaluate networks in
      | which every connection carries a single shared weight,
      | while using an evolutionary algorithm to find
      | architectures that are themselves biased towards being
      | effective on specific problems.
      |
      | The upshot is that once you've found an architecture that
      | is already biased towards solving a specific problem, the
      | training of the weights is faster and results in better
      | performance.
      |
      | From the abstract: "...In this work, we question to what
      | extent neural network architectures alone, without
      | learning any weight parameters, can encode solutions for a
      | given task... We demonstrate that our method can find
      | minimal neural network architectures that can perform
      | several reinforcement learning tasks without weight
      | training. On a supervised learning domain, we find network
      | architectures that achieve much higher than chance
      | accuracy on MNIST using random weights."
      |
      | [0] https://arxiv.org/abs/1906.04358
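The core trick there is scoring architectures rather than weights.
A toy sketch of that evaluation loop - the "architecture" below is
just a random connectivity mask and the regression task is made up;
the paper evolves real network topologies:

    import numpy as np

    # Score an architecture by how well it performs when every
    # connection shares one weight value, averaged over several
    # candidate values. Weight-agnostic architectures score well
    # regardless of the particular shared weight.
    SHARED_WEIGHTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]

    def forward(x, w, mask):
        # A fixed "architecture": binary connectivity masks whose
        # active connections all carry the same shared weight w.
        h = np.tanh((mask[0] * w) @ x)
        return np.tanh((mask[1] * w) @ h)

    def score_architecture(mask, inputs, targets):
        losses = []
        for w in SHARED_WEIGHTS:
            preds = np.array([forward(x, w, mask) for x in inputs])
            losses.append(np.mean((preds.squeeze() - targets) ** 2))
        return -np.mean(losses)   # higher is better

    rng = np.random.default_rng(0)
    mask = [rng.integers(0, 2, (8, 4)), rng.integers(0, 2, (1, 8))]
    inputs = rng.normal(size=(32, 4))
    targets = np.tanh(inputs.sum(axis=1))
    print(score_architecture(mask, inputs, targets))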
        | Der_Einzige wrote:
        | This, btw, is an example of a whole field called
        | "extreme learning":
        |
        | https://en.m.wikipedia.org/wiki/Extreme_learning_machine
  | visarga wrote:
  | The brain has about 1,000T synapses and GPT-3 has 175B
  | parameters, even though a parameter is much simpler than a
  | synapse. So the scale of the brain is at least 5,700x that of
  | GPT-3. It seems normal to have to compensate by using 200x
  | more training data.
  | 6gvONxR4sf7o wrote:
  | What's missing is interaction/causation, and the reason is
  | that we can scale things more easily without interaction in
  | the data-gathering loop. Training a model with data gathering
  | in the loop requires gathering more data every time the model
  | takes a learning step. It's slow and expensive. Training a
  | model on pre-existing data is much simpler, and it's unclear
  | whether we've reached the limits of that yet.
  |
  | My prediction is we'll get 'good enough for prod' without
  | interactive data, which will let us put interactive systems in
  | the real world at scale, at which point the field's focus will
  | be able to shift.
  |
  | One way to look at it is active learning. We all know the game
  | where I think of a number between 0 and 100 and you have to
  | guess it, and I'll tell you if it's higher or lower. You'll
  | start by guessing 50, then maybe 25, and so on, bisecting the
  | intervals. If you want to get within +/-1 of the number I'm
  | thinking of, you need about six data points. On the other
  | hand, if you don't do this interactively, and just gather a
  | bunch of data before seeing any answers, then to get within
  | +/-1 you need about 50 data points. The interactivity means
  | you can refine your questions in response to whatever you've
  | learned, saving huge amounts of time. (See the sketch below.)
  |
  | Another way to look at it is like randomized controlled
  | trials. To learn a compact idea (more X means more Y), you can
  | randomize X and gather just enough data on Y to be confident
  | that the relationship isn't a coincidence. The alternative
  | (observational causal inference) is famously harder. You have
  | to look at a bunch of X's and Y's, and also all the Z's that
  | might affect them, and then get enough data to be confident in
  | this entire structure you've put together involving lots of
  | variables.
  |
  | The way ML has progressed is really a function of what's easy.
  | If you want a model to learn to speak English, do you want it
  | to be embodied in the real world for two years with humans
  | teaching it full time how the world and language relate? Or is
  | it faster to just show it a terabyte of English?
  |
  | tl;dr: observational learning is much, much harder than
  | interactive learning, but we can scale observational learning
  | in ways we can't scale interactive learning.
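A sketch of that interactive-vs-passive gap in Python: bisection
pins down a number in 0..100 to within +/-1 in about log2(100) ~ 6
adaptive questions (5 in this exact formulation), while questions
fixed in advance need roughly one threshold per two integers:

    # Interactive: bisect, adapting each question to the answers.
    def interactive_queries(target, lo=0, hi=100):
        queries = 0
        while hi - lo > 2:              # stop once within +/-1
            mid = (lo + hi) // 2
            queries += 1
            if target > mid:
                lo = mid + 1
            elif target < mid:
                hi = mid - 1
            else:
                break
        return queries

    # Worst case over all possible targets -> 5
    print(max(interactive_queries(t) for t in range(0, 101)))

    # Passive: questions are fixed before seeing any answers, so
    # you need a threshold roughly every 2 units -> 50
    print((100 - 0) // 2)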
  | gxt wrote:
  | Because the whole industry is wrong. ML is incapable of
  | general intelligence, because that's not what intelligence is.
  | ML is the essential component with which one interfaces with
  | the universe, but it's not intelligence, and never will be.
  | Symmetry wrote:
  | Humans are using less data, but we throw drastically more
  | compute at the problem during learning.
  | edf13 wrote:
  | Simple - humans aren't learning by reading and understanding
  | one word (or token) at a time...
  |
  | They are taking in many thousands (millions?) of inputs every
  | minute from their surroundings.
    | edf13 wrote:
    | Just reminded myself of Johnny 5 needing input...
    |
    | https://youtu.be/Y9lwQKv71FY
  | idiotsecant wrote:
  | > deep learning systems are at least an order of magnitude
  | less data efficient than the systems they hope to replicate.
  |
  | While true on the surface, you have to also consider that
  | there is a _vast_ quantity of training data expressed in our
  | DNA. Our 'self' is conscious thought, sure, but it's also
  | unconscious action and instinct, all of which is indirect
  | lived experience of our forebear organisms. The ones that had
  | a slightly better twitch response to the feel of an insect
  | crawling on their arm were able to survive the incident, etc.
  | Our 'lizard brains' are the result of the largest set of
  | training data we could possibly imagine - the evolutionary
  | history of life on earth.
  | c3534l wrote:
  | Brains do not actually work very much like artificial neural
  | networks. The connectionist approach is no longer favored, and
  | human brains are not arranged in regular grids of fully
  | interconnected layers. ANNs were inspired by how people
  | thought the brain worked more than 50 years ago. Of course,
  | ANNs are meant to work and solve practical problems with the
  | technology we have. They're not simulations.
    | machina_ex_deus wrote:
    | I agree. If you look at animals, it's also clear that the
    | scaling hypothesis breaks down at some point, as all
    | measures of brain size (brain mass ratio, etc.) fail to
    | capture intelligence. And animals have natural neural
    | networks.
    |
    | If you think about it, neural networks have roamed the earth
    | for millions of years - including a genetic algorithm for
    | optimizing the hardware. And yet only extremely recently did
    | something like humans happen. Why?
    |
    | The amount of training and processing power which happened
    | naturally through evolution beats current AI research by
    | several orders of magnitude. Yes, evolution isn't
    | intelligent design. But the current approach to AI isn't
    | intelligent design either.
| godelski wrote:
| As an ML vision researcher, I find these scaling-hypothesis
| claims quite ridiculous. I understand that the NLP world has
| made large strides by adding more attention layers, but I'm not
| an NLP person and I suspect there's more going on than just more
| layers. We won't even talk about the human brain, and will just
| address the "scaling is sufficient" hypothesis.
|
| With vision, pointing to Parti and DALL-E as evidence for
| scaling is quite dumb. They perform similarly but are
| DRASTICALLY different in size. Parti has configurations with
| 350M, 750M, 3B, and 20B parameters. DALL-E 2 has 3.5B. Imagen
| uses T5-XXL, which alone has 11B parameters, just in the text
| part.
|
| Not only this, there are major architecture changes.
| If scaling was all you needed, then all these networks would
| still be using CNNs. But we shifted to transformers, and THEN we
| shifted to diffusion-based models. Not to mention that Parti,
| DALL-E, and Imagen have different architectures. It isn't just
| about scale. Architecture matters here.
|
| And to address concerns: diffusion (invented decades ago) didn't
| start working because we just scaled it up. It worked because of
| engineering. It was largely ignored previously because no one
| got it to work better than GANs. I think this lesson should
| really stand out: we need to consider the advantages and
| disadvantages of different architectures and learn how to make
| ALL of them work effectively. In that manner we can combine them
| in ideal ways. Even LeCun is coming around to this point of
| view, despite previously being on the scaling side.
|
| But maybe you NLP folks disagree. The experience in vision is
| far richer than just scaling.
  | panabee wrote:
  | this is well articulated. another key point: dall-e 2 uses 70%
  | _fewer_ parameters than dall-e 1 while offering far higher
  | quality.
  |
  | from wikipedia (https://en.wikipedia.org/wiki/DALL-E):
  |
  | DALL-E's model is a multimodal implementation of GPT-3 with 12
  | billion parameters which "swaps text for pixels", trained on
  | text-image pairs from the Internet. DALL-E 2 uses 3.5 billion
  | parameters, a smaller number than its predecessor.
  | andreyk wrote:
  | I agree - I think scaling laws and the scaling hypothesis are
  | quite distinct, personally. The scaling hypothesis is "just go
  | bigger with what we have and we'll get AGI", whereas scaling
  | laws are "for these tasks and these model types, these are the
  | empirical trends in performance we see". I think scaling laws
  | are still really valuable for vision research, but as you say,
  | we should not abandon thinking about things beyond scaling
  | even if we observe good scaling trends.
    | godelski wrote:
    | Yeah, I agree with this position. It is also what I see
    | within my own research. But in my own research I also see
    | the vast importance of architecture search. This may not be
    | what the public sees, but I think it is well known to the
    | research community, or to anyone with hands-on experience
    | with these types of models.
| andreyk wrote:
| Co-author here, happy to answer any questions/chat about stuff
| we did not cover in this overview!
  | puttycat wrote:
  | Hi! Great post. See my comment below about scaling down.
    | andreyk wrote:
    | Thanks! I'll take a look, those do look interesting.
  | benlivengood wrote:
  | It would be great to see more focus on Chinchilla's result
  | that most large models were quite undertrained with respect to
  | the optimal reduction in test loss.
    | andreyk wrote:
    | Agreed, we did not discuss that sufficiently.
  | [deleted]
| 3vidence wrote:
| Something that has concerned me about scaling to AGI is
| "adversarial examples": small tweaks that can be made to cause
| unpredictable behavior in the system. At a high level, these are
| caused by unexpected paths in high-dimensional model weight
| space that don't align with our intuition. This problem in
| general seems to get worse as the size of the weights grows.
|
| From a value perspective, a very high-fidelity model with
| extremely unexpected behavior seems really low-value, since you
| need a human there full time to make sure the model doesn't go
| haywire that 1-5% of the time.
| mjburgess wrote:
| "Scaling" means increasing the number of parameters.
| _Parameters_ are just the database of the system. At 300GB of
| parameters, we're talking about models which remember compressed
| versions of all books ever written.
|
| This is not a path to "AGI"; this is just building a search
| engine with a little better querying power.
|
| "AI" systems today are little more than superpositions of Google
| search results, with their parameters being a compression of
| billions of images/documents.
|
| This isn't even on the road to intelligence, let alone an
| instance of it. "General intelligence" does not solve problems
| by induction over billions of examples of their prior solutions.
|
| And exponential scaling in the amount of such remembering
| required is a fatal trajectory for AI, and likewise an
| indication that it doesn't deserve the term.
|
| No intelligence is exponential in its answer-space; indeed, I'd
| say that's *the whole point* of intelligence!
|
| We already know that if you compress all possible {(Question,
| Answer)} pairs, you can "solve" any problem trivially.
  | MathYouF wrote:
  | The tone of this betrays a more argumentative than
  | collaborative conversation style than I may want to engage
  | with further (which I've noticed seems common amongst anti-
  | connectionists), but I did find one point interesting for
  | discussion.
  |
  | > Parameters are just the database of the system.
  |
  | Would any equation's parameters be considered just a database,
  | then? The c in E=mc^2, or the 2 in a^2+b^2=c^2?
  |
  | I suppose those numbers are basically a database, but the
  | relationships (connections) they have to the other variables
  | (inputs) represent a demonstrable truth about the universe.
  |
  | To some degree, every parameter in a NN is also representing
  | some truth about the universe. How general and compact that
  | representation is, is currently not known (likely less than
  | we'd like of both traits).
    | jsharf wrote:
    | I'm not anti-connectionist, but if I were to put myself in
    | their shoes, I'd respond by pointing out that in E=mc^2, c
    | is a value which directly corresponds to empirical results.
    | If all of humanity were to suddenly disappear, a future
    | advanced civilization would rediscover the same constant,
    | though maybe with different units. Their neural networks, on
    | the other hand, would probably be meaningfully different.
    |
    | Also, the c in E=mc^2 has units which define what it means
    | in physical terms. How can you define a "unit" for a neural
    | network's output?
    |
    | Now, my thoughts on this are contrary to what I've said so
    | far. Even though neural network outputs aren't easily
    | defined currently, there are some experimental results
    | showing neurons in neural networks demonstrating
    | symbolic-like, higher-level behavior:
    |
    | https://openai.com/blog/multimodal-neurons/
    |
    | Part of the confusion likely comes from how neural networks
    | represent information - often by superimposing multiple
    | different representations. A very nice paper from Anthropic
    | and Harvard delved into this recently:
    |
    | https://transformer-circuits.pub/2022/toy_model/index.html
      | ctoth wrote:
      | Related: Polysemanticity and Capacity in Neural Networks
      | https://arxiv.org/abs/2210.01892
    | mjburgess wrote:
    | There's a very literal sense in which NN parameters are just
    | a DB. As in, it's fairly trivial to get copyrighted verbatim
    | output from a trained NN (e.g., Quake source code from
    | GitHub Copilot, etc.).
    |
    | "Connectionists" always want to reduce everything to
    | formulae with no natural semantics and then equate this with
    | science.
    | Science isn't mathematics. Mathematics is just a shorthand
    | for a description of the world _made true_ by the semantics
    | of that description.
    |
    | E=mc^2 isn't true because it's a polynomial; it doesn't
    | _mean_ a polynomial, and it doesn't have "polynomial
    | properties", because it isn't _about_ mathematics. It's
    | _about_ the world.
    |
    | E stands for energy, m for mass, and c for a geometric
    | constant of spacetime. If they were to stand for other
    | properties of the world, in general the formula would be
    | false.
    |
    | I find this "connectionist supernaturalism" about
    | mathematics deeply irritating; it has all the hubris and
    | numerology of religion, but wandering around in a stolen lab
    | coat. Hence the tone.
    |
    | What can one say or feel in the face of the overtaking of
    | science by pseudoscience? It seems plausible to say that
    | now, today, more pseudoscientific papers are written than
    | scientific ones. A generation of researchers are doing
    | little more than analyzing ink-blot patterns and calling
    | them "models".
    |
    | The insistence, without explanation, that this is a
    | reasonable activity pushes one past tolerance on these
    | matters. It's exasperating... from psychometrics to AI, the
    | whole world of intellectual life has been taken over by the
    | pseudoscientific analysis of non-experimental, post-hoc
    | datasets.
      | politician wrote:
      | This discussion (the GP and your response) perhaps
      | suggests that evaluating the intelligence of an AI may
      | need to involve more than the generation of some content -
      | it may also require citations and supporting work for that
      | content. I guess I'm suggesting that the field could
      | benefit from a shift towards explainability-first models.
  | benlivengood wrote:
  | That's why the Chinchilla paper (given a single paragraph in
  | the article) is so important; it gives a scaling equation that
  | puts a limit on the effect of increasing parameters.
  | Generally, for the known transformer models, the reduction in
  | loss from having infinite parameters is significantly less
  | than the reduction in loss from training on infinite data.
  | Most large models are very undertrained.
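For reference, the scaling equation being pointed at here is the
parametric loss fit from the Chinchilla paper (Hoffmann et al.
2022). A small sketch; the constants are the paper's approximate
fitted values, and the two example configurations are GPT-3-like
(175B params, ~300B tokens) and Chinchilla (70B params, 1.4T
tokens):

    # Chinchilla fit: loss as a function of parameter count N and
    # training tokens D. L(N, D) = E + A/N**alpha + B/D**beta
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        return E + A / N**alpha + B / D**beta

    # Letting N -> infinity only removes the A/N**alpha term; with
    # finite data, loss stays bounded below by E + B/D**beta.
    print(loss(175e9, 300e9))   # bigger model, less data
    print(loss(70e9, 1.4e12))   # smaller model, more data: lower loss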
  | aaroninsf wrote:
  | Everything of interest in ML networks is occurring in the
  | abstractions that emerge during training in deep multi-layer
  | networks.
  |
  | At the crudest level, this immediately provides for more than
  | the canned lookup asserted above; analogical reasoning is a
  | much-documented emergent property.
  |
  | But analogies are merely the simplest, first-order
  | abstractions, the ones that are easy for humans to point at.
  |
  | Inference and abstraction across multiple levels mean the
  | behaviors of these systems are utterly unlike simple stores.
  | One clear demonstration of this is the effective "compression"
  | of image-generation networks. They don't compress images. For
  | lack of any better vocabulary, they understand them well
  | enough to produce them.
  |
  | The hot topic is precisely whether there are boundaries to
  | what sorts of implicit reasoning can occur through scale, and
  | what other architectures need to be present to effect agency
  | and planning of the kind hacked at in traditional
  | symbolic-systems AI.
  |
  | It might be worthwhile to read contemporary work to get up to
  | speed. Things are already a lot weirder than we have had time
  | to internalize.
    | AstralStorm wrote:
    | Can they be said to understand the images if the style
    | transfer they produce is image-dependent, with an unstable
    | threshold boundary? Or when they constantly make errors
    | similar to pareidolia, seeing faces where there are none? Or
    | when they cannot paint even roughly plausible fake text?
  | swid wrote:
  | 300 GB is nothing compared to the vastness of information in
  | the universe (hence it fitting on a disk). AI is approximating
  | a function, and the function they are now learning to
  | approximate is us.
  |
  | From [1], with my own editing...
  |
  | When comparing the difference between current models and human
  | performance:
  |
  | > ...[humans] can achieve closer to 0.7 bits per character.
  | What is in that missing >0.4?
  |
  | > Well - everything! Everything that the model misses. While
  | just babbling random words was good enough at the beginning,
  | at the end, it needs to be able to reason our way through the
  | most difficult textual scenarios requiring causality or
  | commonsense reasoning... every time that it lacks the theory
  | of mind to compress novel scenes describing the Machiavellian
  | scheming of a dozen individuals at dinner jockeying for power
  | as they talk...
  |
  | > If we trained a model which reached that loss of <0.7, which
  | could predict text indistinguishable from a human, whether in
  | a dialogue... how could we say that it doesn't truly
  | understand everything?
  |
  | [1] https://www.gwern.net/Scaling-hypothesis
  | andreyi wrote:
| puttycat wrote:
| Good overview.
|
| At the other extreme, some recent works [1,2] show why it's
| sometimes better to scale down instead of up, especially for
| some humanlike capabilities like generalization:
|
| [1]
| https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00489...
|
| [2] https://arxiv.org/abs/1906.04358
  | MathYouF wrote:
  | If greater parameterization leads to memorization rather than
  | generalization, it's likely a failure of our current
  | architectures and loss formulations rather than an inherent
  | benefit of "fewer parameters" improving generalization. Other
  | animals do not generalize better than humans despite having
  | fewer neurons (or their generalizations betray a
  | misunderstanding of the number and depth of subcategories
  | there are for things, like when a dog barks at everything that
  | passes by the window).
___________________________________________________________________
(page generated 2022-10-07 23:00 UTC)