[HN Gopher] The AI Scaling Hypothesis
       ___________________________________________________________________
        
       The AI Scaling Hypothesis
        
       Author : andreyk
       Score  : 92 points
       Date   : 2022-10-07 16:29 UTC (6 hours ago)
        
 (HTM) web link (lastweekin.ai)
 (TXT) w3m dump (lastweekin.ai)
        
       | kelseyfrog wrote:
        | There's a deeper, more troubling problem being exposed here -
        | deep learning systems are at least an order of magnitude less
        | data efficient than the systems they hope to replicate.
       | 
        | GPT-3 175B is trained with 499 billion tokens[1]. Let's assume
        | token = word for the sake of this argument[2]. The average adult
        | reads at a rate of 238 wpm[3]. Then a human who reads 24hrs/day
        | from birth until their 18th birthday would read a total of about
        | 2.25 billion words[4], or 0.45% of the words GPT-3 was trained
        | on.
       | 
        | Humans simply do much more with much less. So what gives? I
        | don't disagree that we still haven't reached the end of what
        | scaling can do, but there is a creeping suspicion that we've
        | gotten something fundamentally wrong on the way there.
       | 
       | 1. https://lambdalabs.com/blog/demystifying-gpt-3/
       | 
        | 2. GPT-based models use BPE, and while we could dive into the
        | actual dictionary of tokens and work out a word-token mapping,
        | we can agree that although this isn't a 1-to-1 relationship it
        | won't change the conclusion
       | https://huggingface.co/docs/transformers/tokenizer_summary
       | 
       | 3. https://psyarxiv.com/xynwg/
       | 
       | 4. 238*60*24*365*18=2,251,670,400
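        | 
        | (A minimal back-of-the-envelope check of the numbers above, using
        | the reading rate from [3] and the token count from [1]; token =
        | word is the simplifying assumption from [2]:)
        | 
        |     # Rough sanity check of the reading-rate arithmetic above.
        |     WPM = 238                      # average adult reading rate [3]
        |     MINUTES = 60 * 24 * 365 * 18   # reading 24h/day until age 18
        |     GPT3_TOKENS = 499e9            # GPT-3 training tokens [1]
        | 
        |     words_read = WPM * MINUTES     # ~2.25e9 words
        |     print(f"{words_read:,.0f} words, "
        |           f"{100 * words_read / GPT3_TOKENS:.2f}% of GPT-3's tokens")
        |     # -> 2,251,670,400 words, 0.45% of GPT-3's tokens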
        
         | chaxor wrote:
          | You're right about reference [2], which can alter things by ~1
          | order of magnitude (words are usually ~3-10 tokens).
          | Additionally, as others have pointed out, we don't live
          | _entirely in the text world_. So we have the nice benefit of
          | understanding objects from visual and proprioceptive inputs,
          | which is huge. The paucity-of-data argument made well known by
          | Noam Chomsky et al is certainly worth discussing in academia;
          | however, I am not as moved as I once was by these arguments
          | about the stark difference in input required by humans versus
          | ML. In image processing, for example, sending 10k images in
          | rapid succession with no other proprioceptive inputs, time
          | dependencies, or agent-driven exploration of spaces puts these
          | systems at an enormous disadvantage for learning certain
          | phenomena (classes of objects or otherwise).
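          | 
          | (If anyone wants to check the word-to-token ratio empirically,
          | here is a quick sketch; it assumes the HuggingFace transformers
          | package and uses the GPT-2 BPE tokenizer as a stand-in for
          | GPT-3's:)
          | 
          |     # Estimate tokens-per-word for a GPT-style BPE vocabulary.
          |     from transformers import GPT2TokenizerFast
          | 
          |     tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
          |     text = ("The average adult reads at roughly two hundred "
          |             "words per minute, far less input than a large "
          |             "language model sees during training.")
          |     tokens = tokenizer.encode(text)
          |     words = text.split()
          |     print(len(tokens), "tokens /", len(words), "words =",
          |           round(len(tokens) / len(words), 2), "tokens per word")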
         | 
          | Of course there are differences between the systems, but I'm
          | increasingly skeptical of claims that the newer ML systems
          | can't learn as much as biological systems given the _same
          | input_ (obviously this is where a lot is hidden).
        
           | kelseyfrog wrote:
           | Thank you for the tokens-to-words factor! Much appreciated.
           | 
           | I'm definitely in agreement that multi-task models represent
           | an ability to learn more than any one specialized model, but
           | I think it's a bit of an open question whether multi-task
           | learning alone can fully close the digital-biological gap. Of
           | course I'd be very happy to be proven wrong on this though by
           | empirical evidence in my lifetime :)
        
         | [deleted]
        
         | gamegoblin wrote:
         | Humans take in a tremendously high bitrate of data via other
         | senses and are able to _connect_ those to the much lower amount
         | of language input such that the language can go much further.
         | 
         | GPT-3 is learning everything it knows about the entire universe
         | _just from text_.
         | 
         | Imagine we received a 1TB information dump from a civilization
         | that lives in an alternate universe with entirely different
         | physics. How much could we learn just from this information
         | dump?
         | 
         | And from our point of view, it could be absurdly exotic. Maybe
         | their universe doesn't have gravity or electromagnetic
         | radiation. Maybe the life forms in that universe spontaneously
         | merge consciousnesses with other life forms and separate
         | randomly, so whatever writing we have received is in a style
         | that assumes the reader can effortlessly deduce that the author
         | is actually a froth of many consciousnesses. And in the grand
         | spectrum of how weird things could get, this "exotic" universe
         | I have described is really basically identical to our own,
         | because my imagination is limited.
         | 
         | Learning about a whole exotic universe from just an info dump
         | is the task of GPT-3. For instance, tons of our writing takes
         | for granted that solid objects don't pass through each other. I
         | dropped the book. Where is the book? On the floor. Very few
          | bits of GPT-3's training set include the statements "a book is
         | a solid object", "the floor is a solid object", "solid objects
         | don't pass through each other", but it can infer this principle
         | and others like it.
         | 
         | From this point of view, its shortcomings make a lot of sense.
         | Some things GPT fails at are obvious to us having grown up in
         | this universe. I imagine we're going to see an explosion of
         | intelligence once researchers figure out how to feed AI systems
         | large swaths of YouTube and such, because then they will have a
         | much higher bandwidth way to learn about the universe and how
         | things interact, connecting language to physical reality.
        
           | saynay wrote:
           | One of the more interesting things I have seen recently is
           | the combination of different domains in models / datasets.
            | The conditioning network at the top of Stable Diffusion
            | comes from CLIP, a model trained to represent text and images
            | in the same embedding space; a picture and a caption for that
            | picture lead to similar embeddings.
           | 
            | Effectively, this can broaden the context the network can
            | learn from. There are relationships that are readily apparent
            | to something that learned from images that might not be
            | apparent to something trained only on text, or vice-versa.
           | 
            | It will be interesting to see where that goes. Will it be
            | possible to make a single multi-domain encoder that can take
            | a wide range of inputs and create an embedding (a "mental
            | model" of the input), and have this one model be usable as
            | the input for a wide variety of tasks? Can something trained
            | on multiple domains learn new concepts faster than a network
            | that is single-domain?
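            | 
            | (Roughly what the shared embedding buys you, as a sketch
            | using the public CLIP checkpoint via HuggingFace
            | transformers; "cat.jpg" is a placeholder path, and this is
            | CLIP itself rather than the exact encoder Stable Diffusion
            | ships with:)
            | 
            |     # Caption and image land near each other in one space.
            |     from PIL import Image
            |     import torch
            |     from transformers import CLIPModel, CLIPProcessor
            | 
            |     name = "openai/clip-vit-base-patch32"
            |     model = CLIPModel.from_pretrained(name)
            |     processor = CLIPProcessor.from_pretrained(name)
            | 
            |     image = Image.open("cat.jpg")        # placeholder image
            |     captions = ["a photo of a cat", "a circuit diagram",
            |                 "an architectural blueprint"]
            | 
            |     inputs = processor(text=captions, images=image,
            |                        return_tensors="pt", padding=True)
            |     with torch.no_grad():
            |         out = model(**inputs)
            | 
            |     # Highest score should go to the caption describing
            |     # the image, because the two embeddings are closest.
            |     print(out.logits_per_image.softmax(dim=-1))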
        
             | Teever wrote:
             | I would love to see a model trained on blueprints or a
             | model trained on circuit diagrams.
             | 
             | text2blueprint or wav2schematic could produce some
             | interesting things.
        
               | Jensson wrote:
               | They haven't even figured out basic math, so not sure
               | what you would expect to find there. They aren't smart
               | enough to generate structure that doesn't already exist.
        
               | visarga wrote:
               | Depends on the method. Evolutionary methods can
               | absolutely find structure that we missed, and they often
               | go hand in hand with learning. Like AlphaGo move 37.
        
           | cma wrote:
            | Google's Imagen was trained on about as many images as a
            | 6-year-old would have seen over their lifetime at 24fps, and
            | a whole lot more text. It can draw a lot better and probably
            | has a better visual vocabulary, but is also way outclassed in
            | many ways.
           | 
            | Poverty of the stimulus is a real problem and may mean our
            | starting-point architecture from genetics has a lot more
            | learning built in than just a bunch of randomly connected,
            | uninitialized weights. In many species a newborn animal can
            | get up and walk right away.
           | 
           | https://www.youtube.com/watch?v=oTNA8vFUMEc
           | 
            | Humans are born with giant heads and weak muscles, but can
            | swim around like little seals pretty quickly after birth.
        
             | visarga wrote:
             | > our starting point architecture from genetics has a lot
             | of learning built in
             | 
             | I don't doubt that evolution provided us with great priors
             | to help us be fast learners, but there are two more things
             | to consider.
             | 
              | One is scale - the brain is still ~10,000x more complex
              | than large language models. We know that smaller models
              | need more training data, so our brain, being many orders of
              | magnitude larger than GPT-3, naturally learns faster.
              | 
              | The second is social embedding - we are not isolated; our
              | environment is made of other human beings. Similarly, an AI
              | would need to be trained as part of human society, or even
              | as part of an AI society, but not alone.
        
             | gamegoblin wrote:
             | Definitely. I do think video is _much_ more important than
             | images, because video implicitly encodes physics, which is
             | a huge deal.
             | 
             | And, as you say, there are probably some
             | structural/architectural improvements to be made in the
             | neural network as well. The mammalian brain has had a few
             | hundred million years to evolve such a structure.
             | 
             | It also remains unclear how important learning causal
             | influence is. These networks are essentially "locked in"
             | from inception. They can only take the world in. Whereas
             | animals actively probe and influence their world to learn
             | causality.
        
               | [deleted]
        
               | akiselev wrote:
                | The mammalian brain has had a few hundred million years
                | to evolve _neural plasticity_ [1], which is the key
                | function missing in AI. The brain's structure isn't set
                | in stone but develops over one's lifetime and can even
                | carry out major restructuring on a short time scale in
                | some cases of massive brain damage.
               | 
               | Neural plasticity is the algorithm running on top of our
               | neural networks that optimizes their structure as we
               | learn so not only do we get more data, but our brains get
               | better tailored to handle that kind of data. This process
               | continues from birth to death and physical
               | experimentation in youth is a key part of that
               | development, as is social experimentation in social
               | animals.
               | 
                | I think "it remains unclear" only to the ML field. From
                | the perspective of neuroscientists, current neural
                | networks aren't even superficially at the complexity of
                | axon-dendrite connections with ion channels and threshold
                | potentials, let alone the whole system.
               | 
                | A family member's doctoral thesis was on the potentiation
                | of signals, and based on my understanding of it, every
                | neuron takes part in the process with its own "memory" of
                | sorts, and the potentiation she studied was just one tiny
                | piece of the neural plasticity story. We'd need to turn
                | every component in the hidden layers of a neural network
                | into its own massive NN with its own memory to even
                | begin to approach that kind of complexity.
               | 
               | [1] https://en.m.wikipedia.org/wiki/Neuroplasticity
        
           | alasdair_ wrote:
           | This is a fantastically good point. I think things will get
           | even more interesting once the ML tools have access to more
           | than just text, audio and image/video information. They will
           | be able to draw inferences that humans will generally be
           | unaware of. For example, maybe something happens in the
           | infrared range that humans are generally oblivious to, or
           | maybe inferences can be drawn based on how radio waves bounce
           | around an object.
           | 
           | "The universe" according to most human experience misses SO
           | much information and it will be interesting to see what
           | happens once we have agents that can use all this extra stuff
           | in realtime and "see" things we cannot.
        
           | visarga wrote:
            | The hypothesis that you can't learn some things from text
            | alone - that you need real-life experience - is intuitive,
            | and I used to think it was true. But there are interesting
            | results from just a few days ago suggesting that text by
            | itself can be enough:
           | 
           | > We test a stronger hypothesis: that the conceptual
           | representations learned by text only models are functionally
           | equivalent (up to a linear transformation) to those learned
           | by models trained on vision tasks. Specifically, we show that
           | the image representations from vision models can be
           | transferred as continuous prompts to frozen LMs by training
           | only a single linear projection.
           | 
           | Linearly Mapping from Image to Text Space -
           | https://arxiv.org/abs/2209.15162
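            | 
            | (The mechanism is simple enough to sketch; the sizes and
            | tensors below are placeholders, not the paper's actual
            | models:)
            | 
            |     # One trainable linear map takes frozen image features
            |     # into the frozen LM's input-embedding space, where they
            |     # act as a continuous prompt prepended to the text.
            |     import torch
            |     import torch.nn as nn
            | 
            |     IMG_DIM, LM_DIM, PREFIX_LEN = 768, 1600, 4  # illustrative
            | 
            |     projection = nn.Linear(IMG_DIM, PREFIX_LEN * LM_DIM)
            | 
            |     image_feats = torch.randn(1, IMG_DIM)   # frozen vision model
            |     prefix = projection(image_feats).view(1, PREFIX_LEN, LM_DIM)
            | 
            |     text_embeds = torch.randn(1, 10, LM_DIM)  # frozen LM embedder
            |     lm_input = torch.cat([prefix, text_embeds], dim=1)
            |     # Only `projection` gets gradients; the LM and the vision
            |     # model stay frozen.
            |     print(lm_input.shape)    # torch.Size([1, 14, 1600])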
        
             | ummonk wrote:
             | The claim isn't that you can't learn it from text, but
             | rather that this is why models require so much text to
             | train on - because they're learning the stuff that humans
             | learn from video.
        
           | thrown_22 wrote:
           | > Humans take in a tremendously high bitrate of data via
           | other senses and are able to connect those to the much lower
           | amount of language input such that the language can go much
           | further.
           | 
            | They don't. Human bitrates are quite low, all things
            | considered. The eyes, which by far produce the most
            | information, only have a bitrate equivalent to ~2kbps:
           | 
           | http://www.princeton.edu/~wbialek/our_papers/ruyter+laughlin.
           | ..
           | 
            | The rest of the input nerves don't bring us over 20kbps.
           | 
           | The average image recognition system has access to more data
           | and can tell the difference between a cat and a banana. A
           | human has somewhat more capability than that.
        
         | andreyk wrote:
          | I think comparing to humans is a bit of a distraction, unless
          | what you care about is replicating the way human intelligence
          | works in AI. The mechanisms by which learning is done (in these
          | cases self-supervised and supervised learning) are not at all
          | the same as the ones humans use, so it's unsurprising that the
          | qualitative aspects are different.
         | 
         | It may be argued we need more human-like learning mechanisms.
         | Then again, if we need internet-scale data to achieve human-
         | level general intelligence, so what? If it works it works. Of
         | course, the comparison has some value in terms of knowing what
         | can be improved and so on, especially for RL. But I wouldn't
         | call this a 'troubling problem'.
        
         | MonkeyMalarky wrote:
         | Humans also have millions of years of evolution that have
         | effectively pre-trained the structure and learning ability of
         | the brain. A baby isn't born knowing a language but is born
         | with the ability to efficiently learn them.
        
           | peteradio wrote:
            | Indeed, there is a certain hardcoding that can efficiently
            | synthesize language. Doesn't that raise the question: what is
            | the missing hardcoding for AI that could enable it to
            | synthesize language from much smaller samples?
        
             | myownpetard wrote:
             | There is a great paper, Weight Agnostic Neural Networks
             | [0], that explores this topic. They experiment with using a
             | single shared weight for a network while using an
             | evolutionary algorithm to find architectures that are
             | themselves biased towards being effective on specific
             | problems.
             | 
             | The upshot is that once you've found an architecture that
             | is already biased towards solving a specific problem, then
             | the training of the weights is faster and results in better
             | performance.
             | 
             | From the abstract, "...In this work, we question to what
             | extent neural network architectures alone, without learning
             | any weight parameters, can encode solutions for a given
             | task.... We demonstrate that our method can find minimal
             | neural network architectures that can perform several
             | reinforcement learning tasks without weight training. On a
             | supervised learning domain, we find network architectures
             | that achieve much higher than chance accuracy on MNIST
             | using random weights."
             | 
             | [0] https://arxiv.org/abs/1906.04358
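              | 
              | (The scoring step is easy to illustrate. This toy only
              | shows how a candidate architecture is evaluated with a
              | single shared weight - the topologies are hand-written,
              | and the evolutionary search itself is not shown:)
              | 
              |     # WANN-style scoring: every connection shares ONE
              |     # weight, and an architecture's fitness is its mean
              |     # accuracy over a sweep of that weight (no training).
              |     import numpy as np
              | 
              |     X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
              |     y = np.array([0, 1, 1, 0])          # XOR as a toy task
              | 
              |     def candidate_a(x, w):
              |         # inputs -> one tanh hidden unit -> tanh output
              |         h = np.tanh(w * x.sum(axis=1))
              |         return (np.tanh(w * h) > 0.5).astype(int)
              | 
              |     def candidate_b(x, w):
              |         # a Gaussian hidden unit fed x1 and (via an
              |         # inverting node) x2 - a different topology
              |         h = np.exp(-(w * (x[:, 0] - x[:, 1])) ** 2)
              |         return (h < 0.5).astype(int)
              | 
              |     def wann_score(net):
              |         weights = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
              |         return np.mean([(net(X, w) == y).mean()
              |                         for w in weights])
              | 
              |     # The search would keep whichever topology scores
              |     # better here and keep mutating it.
              |     print("A:", wann_score(candidate_a),
              |           "B:", wann_score(candidate_b))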
        
               | Der_Einzige wrote:
               | This btw is an example of a whole field called "extreme
               | learning"
               | 
               | https://en.m.wikipedia.org/wiki/Extreme_learning_machine
        
         | visarga wrote:
          | The brain has about 1,000T synapses while GPT-3 has 175B
          | parameters, and a parameter is much simpler than a synapse. So
          | the scale of the brain is at least 5,700x that of GPT-3. It
          | seems normal to have to compensate by using ~200x more training
          | data.
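          | 
          | (For concreteness - the synapse count is an order-of-magnitude
          | estimate, and the word count is taken from the top-level
          | comment:)
          | 
          |     synapses, params = 1e15, 175e9   # rough figures
          |     print(f"scale ratio: ~{synapses / params:,.0f}x")  # ~5,714x
          | 
          |     tokens, words_read = 499e9, 2.25e9
          |     print(f"data ratio: ~{tokens / words_read:.0f}x")  # ~222x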
        
         | 6gvONxR4sf7o wrote:
         | What's missing is interaction/causation, and the reason is that
         | we can scale things more easily without interaction in the data
         | gathering loop. Training a model with data gathering in the
         | loop requires gathering more data every time the model takes a
         | learning step. It's slow and expensive. Training a model on
         | pre-existing data is much simpler, and it's unclear whether
         | we've reached the limits of that yet.
         | 
         | My prediction is we'll get 'good enough for prod' without
         | interactive data, which will let us put interactive systems in
         | the real world at scale, at which point the field's focus will
         | be able to shift.
         | 
         | One way to look at it is active learning. We all know the game
         | where I think of a number between 0 and 100 and you have to
         | guess it, and I'll tell you if it's higher or lower. You'll
         | start by guessing 50, then maybe 25, and so on, bisecting the
         | intervals. If you want to get within +/-1 of the number I'm
         | thinking of, you need about six data points. On the other hand,
          | if you don't do this interactively, and just pick a bunch of
          | guesses before seeing any answers, you need about 50 data
          | points to get within +/-1. The interactivity means you can
          | refine your questions in response to whatever you've learned,
          | saving huge amounts of time.
         | 
         | Another way to look at it is like randomized controlled trials.
         | To learn a compact idea (more X means more Y), you can
         | randomize X and gather just enough data on Y to be confident
         | that the relationship isn't a coincidence. The alternative
         | (observational causal inference) is famously harder. You have
         | to look at a bunch of X's and Y's, and also all the Z's that
         | might affect them, and then get enough data to be confident in
         | this entire structure you've put together involving lots of
         | variables.
         | 
         | The way ML has progressed is really a function of what's easy.
         | If you want a model to learn to speak english, do you want it
         | to be embodied in the real world for two years with humans
         | teaching it full time how the world and language relate? Or is
         | it faster to just show it a terabyte of english?
         | 
         | tl;dr observational learning is much much harder than
         | interactive learning, but we can scale observational learning
         | in ways we can't scale interactive learning.
        
         | gxt wrote:
         | Because the whole industry is wrong. ML is incapable of general
         | intelligence, because that's not what intelligence is. ML is
         | the essential component with which one interfaces with the
         | universe, but it's not intelligence, and never will be.
        
         | Symmetry wrote:
         | Humans are using less data but we throw drastically more
         | compute at the problem during learning.
        
         | edf13 wrote:
          | Simple - humans aren't learning by reading and understanding a
          | word (or token) at a time...
          | 
          | They are taking in many thousands (millions?) of inputs every
          | minute from their surroundings.
        
           | edf13 wrote:
            | Just reminded myself of Johnny 5 needing input...
           | 
           | https://youtu.be/Y9lwQKv71FY
        
         | idiotsecant wrote:
         | >deep learning systems are at least an order of magnitude less
         | data efficient than the systems they hope to replicate.
         | 
         | While true on the surface, you have to also consider that there
         | is a _vast_ quantity of training data expressed in our DNA. Our
         | 'self' is a conscious thought, sure, but it's also unconscious
         | action and instinct, all of which is indirect lived experience
         | of our forebear organisms. The ones that had a slightly better
         | twitch response to the feel of an insect crawling on their arm
         | were able to survive the incident, etc. Our 'lizard brains' are
         | the result of the largest set of training data we could
         | possibly imagine - the evolutionary history of life on earth.
        
         | c3534l wrote:
         | Brains do not actually work very similarly to artificial neural
         | networks. The connectionist approach is no longer favored, and
         | human brains are not arranged in regular grids of fully
         | interconnected layers. ANNs were inspired by how people thought
         | the brain worked more than 50 years ago. Of course, ANNs are
         | meant to work and solve practical problems with the technology
         | we have. They're not simulations.
        
         | machina_ex_deus wrote:
          | I agree. If you look at animals it's also clear that the
          | scaling hypothesis breaks down at some point, as all measures
          | of brain size (brain-to-body mass ratio, etc.) fail to capture
          | intelligence. And animals have natural neural networks.
          | 
          | If you think about it, neural networks have roamed the earth
          | for millions of years - including a genetic algorithm for
          | optimizing the hardware. And yet only extremely recently did
          | something like humans happen. Why?
         | 
         | The amount of training and processing power which happened
         | naturally through evolution beats current AI research by
         | several orders of magnitude. Yes, evolution isn't intelligent
         | design. But the current approach to AI isn't intelligent design
         | either.
        
       | godelski wrote:
        | As an ML vision researcher, I find these scaling hypothesis
        | claims quite ridiculous. I understand that the NLP world has made
        | large strides by adding more attention layers, but I'm not an NLP
        | person and I suspect there's more to it than just more layers. We
        | won't even talk about the human brain; let's just address the
        | "scaling is sufficient" hypothesis.
       | 
        | With vision, pointing to Parti and DALL-E as evidence for scaling
        | is quite dumb. They perform similarly but are DRASTICALLY
        | different in size. Parti has configurations with 350M, 750M, 3B,
        | and 20B parameters. DALL-E 2 has 3.5B. Imagen uses T5-XXL, which
        | alone has 11B parameters, just in the text part.
       | 
        | Not only this, there are major architecture changes. If scaling
        | was all you needed then all these networks would still be using
        | CNNs. But we shifted to transformers. THEN we shifted to
        | diffusion-based models. Not to mention that Parti, DALL-E, and
        | Imagen have different architectures. It isn't just about scale.
        | Architecture matters here.
       | 
        | And to address concerns: diffusion (invented decades ago) didn't
        | start working because we just scaled it up. It worked because of
        | engineering. It was largely ignored previously because no one got
        | it to work better than GANs. I think this lesson should really
        | stand out: we need to consider the advantages and disadvantages
        | of different architectures and learn how to make ALL of them work
        | effectively. In that manner we can combine them in ideal ways.
        | Even LeCun is coming around to this point of view despite
        | previously being on the scaling side.
       | 
        | But maybe you NLP folks disagree. The experience in vision, at
        | least, is far richer than just scaling.
        
         | panabee wrote:
          | this is well articulated. another key point: dall-e 2 uses 70%
          | _fewer_ parameters than dall-e 1 while offering far higher
          | quality.
         | 
         | from wikipedia (https://en.wikipedia.org/wiki/DALL-E):
         | 
         | DALL-E's model is a multimodal implementation of GPT-3 with 12
         | billion parameters which "swaps text for pixels", trained on
         | text-image pairs from the Internet. DALL-E 2 uses 3.5 billion
         | parameters, a smaller number than its predecessor.
        
         | andreyk wrote:
          | I agree - personally I think scaling laws and the scaling
          | hypothesis are quite distinct. The scaling hypothesis is "just
          | go bigger with what we have and we'll get AGI", whereas scaling
          | laws are "for these tasks and these model types, these are the
          | empirical trends in performance we see". I think scaling laws
          | are still really valuable for vision research, but as you say
          | we should not just abandon thinking about things beyond scaling
          | even if we observe good scaling trends.
        
           | godelski wrote:
            | Yeah, I agree with this position. It is also what I see in my
            | own research, where I also see the vast importance of
            | architecture search. This may not be what the public sees,
            | but I think it is well known to the research community, or to
            | anyone with hands-on experience with these types of models.
        
       | andreyk wrote:
       | Co-author here, happy to answer any questions/chat about stuff we
       | did not cover in this overview!
        
         | puttycat wrote:
         | Hi! Great post. See my comment below about scaling down.
        
           | andreyk wrote:
           | Thanks! I'll take a look, those do look interesting.
        
         | benlivengood wrote:
         | It would be great to see more focus on Chinchilla's result that
         | most large models were quite undertrained with respect to
         | optimal reduction in test loss.
        
           | andreyk wrote:
           | agreed, we did not discuss that sufficiently
        
         | [deleted]
        
       | 3vidence wrote:
        | Something that has concerned me about scaling to AGI is the
        | problem of "adversarial examples": small tweaks that can cause
        | unpredictable behavior in the system. At a high level these are
        | caused by unexpected paths in high-dimensional weight space that
        | don't align with our intuition. This problem in general seems to
        | get worse as the size of the model grows.
        | 
        | From a value perspective, a very high-fidelity model with
        | extremely unexpected behavior seems really low value, since you
        | need a human there full time to make sure that the model doesn't
        | go haywire that 1-5% of the time.
        
       | mjburgess wrote:
        | "Scaling" means increasing the number of parameters. _Parameters_
        | are just the database of the system. At 300GB of parameters,
        | we're talking about models which remember compressed versions of
        | all books ever written.
       | 
        | This is not a path to "AGI"; this is just building a search
        | engine with a little better querying power.
       | 
       | "AI" systems today are little more than superpositions of google
       | search results, with their parameters being a compression of
       | billions of images/documents.
       | 
        | This isn't even on the road to intelligence, let alone an
        | instance of it. "General intelligence" does not solve problems by
        | induction over billions of examples of their prior solutions.
        | 
        | And exponential scaling in the amount of such remembering
        | required is a fatal trajectory for AI, and likewise an indication
        | that it doesn't deserve the term.
       | 
        | No intelligence is exponential in an answer-space; indeed, I'd
        | say that's *the whole point* of intelligence!
       | 
       | We already know that if you compress all possible {(Question,
       | Answer)} pairs, you can "solve" any problem trivially.
        
         | MathYouF wrote:
          | The tone of this suggests a more argumentative than
          | collaborative conversation style than I may want to engage
          | with further (as seems common, I've noticed, amongst anti-
          | connectionists), but I did find one point interesting for
          | discussion.
         | 
         | > Parameters are just the database of the system.
         | 
          | Would any equation's parameters be considered just a database
          | then? The c in E=mc^2, or the 2 in a^2+b^2=c^2?
         | 
          | I suppose those numbers are basically a database, but the
          | relationships (connections) they have to the other variables
          | (inputs) represent a demonstrable truth about the universe.
          | 
          | To some degree every parameter in an NN also represents some
          | truth about the universe. How general and compact that
          | representation currently is remains unknown (likely less than
          | we'd like on both counts).
        
           | jsharf wrote:
            | I'm not anti-connectionist, but if I were to put myself in
            | their shoes, I'd respond by pointing out that in E=mc^2, c is
            | a value which directly correlates with empirical results. If
            | all of humanity were to suddenly disappear, a future advanced
            | civilization would re-discover the same constant, though
            | maybe with different units. Their neural networks, on the
            | other hand, probably would be meaningfully different.
            | 
            | Also, the c in E=mc^2 has units which define what it means in
            | physical terms. How can you define a "unit" for a neural
            | network's output?
           | 
           | Now, my thoughts on this are contrary to what I've said so
           | far. Even though neural network outputs aren't easily defined
            | currently, there are experimental results showing neurons
           | in neural networks demonstrating symbolic-like higher-level
           | behavior:
           | 
           | https://openai.com/blog/multimodal-neurons/
           | 
           | Part of the confusion likely comes from how neural networks
           | represent information -- often by superimposing multiple
           | different representations. A very nice paper from Anthropic
           | and Harvard delved into this recently:
           | 
           | https://transformer-circuits.pub/2022/toy_model/index.html
        
             | ctoth wrote:
             | Related: Polysemanticity and Capacity in Neural Networks
             | https://arxiv.org/abs/2210.01892
        
           | mjburgess wrote:
            | There's a very literal sense in which NN parameters are just
            | a DB. As in, it's fairly trivial to get copyrighted verbatim
            | output from a trained NN (e.g., Quake source code from GitHub
            | Copilot, etc.).
           | 
            | "Connectionists" always want to reduce everything to formulae
            | with no natural semantics and then equivocate this with
            | science. Science isn't mathematics. Mathematics is just a
            | shorthand for a description of the world _made true_ by the
            | semantics of that description.
           | 
            | E=mc^2 isn't true because it's a polynomial, and it doesn't
            | _mean_ a polynomial, and it doesn't have "polynomial
            | properties", because it isn't _about_ mathematics. It's
            | _about_ the world.
            | 
            | E stands for energy, m for mass, and c for a geometric
            | constant of spacetime. If they were to stand for other
            | properties of the world, the formula would, in general, be
            | false.
           | 
            | I find this "connectionist supernaturalism" about mathematics
            | deeply irritating; it has all the hubris and numerology of
            | religion, but wandering around in a stolen lab coat. Hence
            | the tone.
           | 
           | What can one say or feel in the face of the overtaking of
           | science by pseudoscience? It seems plausible to say now,
           | today, more pseudoscientific papers are written than
           | scientific ones. A generation of researchers are doing little
           | more than analysing ink-blot patterns and calling them
           | "models".
           | 
           | The insistence, without explanation, that this is a
           | reasonable activity pushes one past tolerance on these
           | matters. It's exasperating... from psychometrics to AI, the
           | whole world of intellectual life has been taken over by a
           | pseudoscientific analysis of non-experimental post-hoc
           | datasets.
        
           | politician wrote:
            | This discussion (the GP and your response) perhaps suggests
            | that evaluating the intelligence of an AI may need to involve
            | more than the generation of some content: it should also
            | include citations and supporting work for that content. I
            | guess I'm suggesting that the field could benefit from a
            | shift towards explainability-first models.
        
         | benlivengood wrote:
          | That's why the Chinchilla paper (given a single paragraph in
          | the article) is so important; it gives a scaling equation that
          | puts a limit on the effect of increasing parameters. Generally,
          | for the known transformer models, the reduction in loss from
          | having infinite parameters is significantly less than the
          | reduction in loss from training on infinite data. Most large
          | models are very undertrained.
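          | 
          | (A sketch of that equation, with the constants roughly as
          | fitted in Hoffmann et al. 2022; the GPT-3 figures below are the
          | 175B parameters and ~300B training tokens reported in the GPT-3
          | paper:)
          | 
          |     # Chinchilla parametric loss: L(N, D) = E + A/N^a + B/D^b
          |     E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28
          | 
          |     def loss(n_params, n_tokens):
          |         return E + A / n_params**a + B / n_tokens**b
          | 
          |     print(loss(175e9, 300e9))   # as trained        (~2.00)
          |     print(loss(1e30,  300e9))   # "infinite" params (~1.94)
          |     print(loss(175e9, 1e30))    # "infinite" data   (~1.75)
          |     # The data term dominates at this scale: extra tokens buy
          |     # far more loss reduction than extra parameters.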
        
         | aaroninsf wrote:
         | Everything of interest in ML networks is occurring in the
         | abstractions that emerge in training in deep multi-layer
         | networks.
         | 
         | At the crudest level this immediately provides for more than
         | canned lookup as asserted; analogical reasoning is a much-
         | documented emergent property.
         | 
         | But analogies are merely the simplest, first-order abstraction,
         | which are easy for humans to point at.
         | 
          | Inference and abstraction across multiple levels mean the
          | behavior of these systems is utterly unlike simple stores. One
          | clear demonstration of this is the effective "compression" of
          | image-gen networks. They don't compress images. For lack of any
          | better vocabulary, they understand them well enough to produce
          | them.
         | 
          | The hot topic is precisely whether there are boundaries to what
          | sorts of implicit reasoning can occur through scale, and what
          | other architectures need to be present to effect agency and
          | planning of the kind hacked at in traditional symbolic-systems
          | AI.
         | 
         | It might be worthwhile to read contemporary work to get up to
         | speed. Things are already a lot weirder than we have had time
         | to internalize.
        
           | AstralStorm wrote:
            | Can they be said to understand images if the style transfer
            | they produce is image-dependent, with an unstable threshold
            | boundary? Or when they constantly make errors akin to
            | pareidolia, seeing faces where there are none? Or when they
            | cannot manage to paint even roughly plausible fake text?
        
         | swid wrote:
          | 300 GB is nothing compared to the vastness of information in
          | the universe (hence it fitting on a disk). AI models are
          | approximating a function, and the function they are now
          | learning to approximate is us.
         | 
         | From [1], with my own editing...
         | 
          | Comparing current models with human performance:
          | 
          | > ...[humans] can achieve closer to 0.7 bits per character.
          | What is in that missing >0.4?
         | 
          | > Well--everything! Everything that the model misses. While
          | just babbling random words was good enough at the beginning, at
          | the end, it needs to be able to reason its way through the most
          | difficult textual scenarios requiring causality or commonsense
          | reasoning... every time that it lacks the theory of mind to
          | compress novel scenes describing the Machiavellian scheming of
          | a dozen individuals at dinner jockeying for power as they
          | talk...
         | 
         | > If we trained a model which reached that loss of <0.7, which
         | could predict text indistinguishable from a human, whether in a
         | dialogue ...how could we say that it doesn't truly understand
         | everything?
         | 
         | [1] https://www.gwern.net/Scaling-hypothesis
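          | 
          | (For reference, how a model's per-token cross-entropy loss maps
          | onto those bits-per-character numbers; the loss value and the
          | ~4 characters per BPE token are illustrative assumptions:)
          | 
          |     import math
          | 
          |     loss_nats_per_token = 3.0   # illustrative validation loss
          |     chars_per_token = 4.0       # rough average for English BPE
          | 
          |     bits_per_token = loss_nats_per_token / math.log(2)
          |     bpc = bits_per_token / chars_per_token
          |     print(f"{bpc:.2f} bits/character vs ~0.7 for a human")
          |     # -> 1.08 bits/character vs ~0.7 for a human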
        
       | andreyi wrote:
        
       | puttycat wrote:
       | Good overview.
       | 
       | At the other extreme, some recent works [1,2] show why it's
        | sometimes better to scale down instead of up, especially for some
       | humanlike capabilities like generalization:
       | 
       | [1]
       | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00489...
       | 
       | [2] https://arxiv.org/abs/1906.04358
        
         | MathYouF wrote:
          | If greater parameterization leads to memorization rather than
          | generalization, it's likely a failure of our current
          | architectures and loss formulations rather than evidence that
          | "fewer parameters" inherently improves generalization. Other
          | animals do not generalize better than humans despite having
          | fewer neurons (or their generalizations betray a
          | misunderstanding of the number and depth of subcategories there
          | are for things, like when a dog barks at everything that passes
          | by the window).
        
       ___________________________________________________________________
       (page generated 2022-10-07 23:00 UTC)