[HN Gopher] Pathways Language Model (PaLM): Scaling to 540B para...
       ___________________________________________________________________
        
       Pathways Language Model (PaLM): Scaling to 540B parameters
        
       Author : homarp
       Score  : 115 points
       Date   : 2022-04-04 16:55 UTC (6 hours ago)
        
 (HTM) web link (ai.googleblog.com)
 (TXT) w3m dump (ai.googleblog.com)
        
       | alphabetting wrote:
       | the joke explanations on page 38 of the full paper linked here
       | are blowing my mind. it's crazy how far language models have come
       | 
       | https://storage.googleapis.com/pathways-language-model/PaLM-...
        
         | Xenixo wrote:
         | Wow.
         | 
         | The anti joke explanation was also very impressive.
        
           | modeless wrote:
           | This may be the most impressive thing I've seen a language
           | model do so far. That's incredible. The future is going to be
           | very weird.
        
         | nqzero wrote:
         | this thing is already more human than i am
        
         | WithinReason wrote:
         | I like how it appears that it had to convert from imperial to
         | metric before it could make an inference:
         | 
          |  _300 miles per hour is about 480 km/h. This is about the
          | speed of a commercial airplane. [...]_
        
           | sib wrote:
           | Also, that's a very slow commercial airplane. (Unless talking
           | about an old turboprop?)
        
         | TaylorAlexander wrote:
         | Your comment prompted me to tweet an image of that section,
         | complete with alt text (as much as can fit). If anyone cares to
         | see it in tweet form.
         | 
         | https://twitter.com/tlalexander/status/1511089810752126984
        
         | severine wrote:
          | So... we are training models here on HN, especially if we
          | follow the site's guidelines! Makes you think... which makes
          | _it_ think!
         | 
         | Wow, interesting times, indeed.
        
         | theincredulousk wrote:
         | Haven't looked further, but I'm wondering about that. Is that
         | the result of training to be able to explain that specific
         | joke, or is it generalized?
         | 
         | In the past these things have been misleading. Some impressive
          | capability ends up being far narrower than implied, so it's
         | kind of like just storing information and retrieving it with
         | extra steps.
        
           | whimsicalism wrote:
           | From the example, it seems hard to imagine that it has been
           | trained to explain this specific joke.
           | 
           | I understand language model skepticism is very big on HN, but
           | this is impressive.
        
             | mjburgess wrote:
              | How much of human written history can be compressed and
              | approximately stored in 540bn parameters?
              | 
              | It seems to me basically certain that no compressed
              | representation of text can be an understanding of language,
              | so necessarily, any statistical algorithm here is always
              | using coincidental tricks. That it takes 500bn parameters
              | to do it is, I think, a clue that we don't even really need.
              | 
              | Words mean what we do with them -- you need to be here in
              | the world with us, to understand what we mean. There is
              | nothing in the patterns of our usage of words which
              | provides their semantics, so the whole field of
              | distributional analysis precludes this superstition.
              | 
              | You cannot, by mere statistical analysis of patterns in
              | mere text, understand the nature of the world. But it is
              | precisely this we communicate in text. We succeed because
              | we are both in the world, not because "w" occurring before
              | "d" somehow communicates anything.
              | 
              | Apparent correlations in text are meaningful to us, because
              | we created them, and we _have_ their semantics. The system
              | _must_ by its nature be a mere remembering.
        
               | hackinthebochs wrote:
               | >Words mean what we do with them -- you need to be here
               | in the world with us, to understand what we mean
               | 
               | This is like saying "humans can't fly because flight
                | requires flapping wings under your own power". Sure, it's
               | true given the definition this statement is employing,
               | but so what? Nothing of substance is learned by
               | definition. We certainly are not learning about any
               | fundamental limitations of humans from such a definition.
               | Similarly, defining understanding language as "the
               | association of symbols with things/behaviors in the
               | world" demonstrates nothing of substance about the limits
               | of language models.
               | 
                | But beyond that, it's clear to me the definition itself is
               | highly questionable. There are many fields where the vast
               | majority of uses of language do not directly correspond
               | with things or behaviors in the world. Pure math is an
               | obvious example. The understanding of pure math is a
               | purely abstract enterprise, one constituted by
               | relationships between other abstractions, bottoming out
               | at arbitrary placeholders (e.g. the number one is an
               | arbitrary placeholder situated in a larger arithmetical
               | structure). By your definition, a language model without
               | any contact with the world can understand purely abstract
               | systems as well as any human. But this just implies
               | there's something to understanding beyond merely
               | associations of symbols with things/behaviors in the
               | physical world.
        
               | whimsicalism wrote:
                | > It seems to me basically certain that no compressed
                | representation of text can be an understanding of
                | language, so necessarily, any statistical algorithm here
                | is always using coincidental tricks. That it takes 500bn
                | parameters to do it is, I think, a clue that we don't even
                | really need.
               | 
                | I think your premise contains your conclusion, which,
                | while common, is something you should strive to avoid.
               | 
               | I do think your opinion is a good example of the
               | prevailing sentiment on Hacker News. To me, it seems to
               | come from a discomfort with the fact that even "we"
               | emerge out of the basic interactions of basic building
               | blocks. Our brain has been able to build world knowledge
               | "merely by" analysis of electrical impulses being
               | transmitted to it on wires.
        
               | mjburgess wrote:
                | I have no discomfort with the notion that our bodies,
                | which grow in response to direct causal contact with our
                | environment, contain in-their-structure the generative
                | capability for knowledge, imagination, skill, growth --
                | and so on.
                | 
                | I have no discomfort with the basically schizophrenic
                | notion that the shapes of words have something to do with
                | the nature of the world. I just think it's a kind of
                | insanity which absolutely destroys our ability to reason
                | carefully about the use of these systems.
               | 
               | That "tr" occurs before "ee" says as much about "trees"
               | as "leaves are green" says -- it is only that *we* have
               | the relevant semantics that the latter is meaningful when
               | interpreted in the light of our "environmental history"
               | recorded in our bodies, and given weight and utility by
               | our imaginations.
               | 
               | The structure of text is not the structure of the world.
               | This thesis is mad. Its a scientific thesis. It is
               | trivial to test it. It is trivial to wholey discred it.
               | It's pseudoscience.
               | 
               | No one here is a scientist and no one treats any of this
               | as science. Where's the criteria for the emprical
               | adequecy of NLP systems as models of language? Specifying
               | any, conducting actual hypothesis tests, and establishing
               | a _theory_ of how NLP systems model language -- this
               | would immediately reveal the smoke-and-mirros.
               | 
               | The work to reveal the statistical tricks underneath them
               | takes years, and no one has much motivation to do it. The
               | money lies in this sales pitch, and this is no science.
               | This is no scientific method.
        
               | whimsicalism wrote:
               | Agree to disagree. I think you are opining about things
               | that you are lacking fundamental knowledge on.
               | 
                | > The structure of text is not the structure of the
                | world. The thesis that it is, is mad. It's a scientific
                | thesis. It is trivial to test it. It is trivial to wholly
                | discredit it. It's pseudoscience.
               | 
               | It's unclear what you even mean by that. Are the
               | electrical impulses coming to our brain the "structure of
               | the world"?
        
               | rafaelero wrote:
               | Ok, boomer.
        
               | tux1968 wrote:
               | Then wouldn't you have to believe that people who are
               | born blind and deaf, or unable to walk, do not really
               | "understand", since they're not connected to the world in
               | the same way as those born without those limits?
        
       | NiceElephant wrote:
       | I wonder if AI is a technology that will move from "local
       | producers" to a more centralized setup, where everybody just buys
       | it as a service, because it becomes too complicated to operate it
       | by yourself.
       | 
        | What are examples in history where this has happened before? The
        | production of light, heat and movement comes to mind, which, with
        | the invention of electricity, moved from people's homes and
        | businesses to (nuclear) power plants, which can only be operated
        | by a fairly large team of specialists.
        | 
        | Does anybody have other examples?
        
         | eunos wrote:
          | Hosting: from owning your own servers at home, to localized
          | data centers, to global cloud companies.
        
           | NiceElephant wrote:
           | Yeah, this kinda goes in the same direction, but in this
           | case, as well as for example agriculture, I feel it is mostly
           | for convenience. You could still do it at home if you wanted
           | to, in contrast to operating a nuclear power plant. I thought
           | chip-making might be another example, but I'm not sure that
           | was ever decentralized in its early days.
        
         | napoleon_thepig wrote:
         | This is kind of already happening with services like Google
          | Cloud Translation.
        
         | castratikron wrote:
         | Will we see an intelligence too cheap to meter?
         | 
         | https://www.atlanticcouncil.org/blogs/energysource/is-power-...
        
       | WithinReason wrote:
       | Based on their 3rd figure, it would take an approximately 100x
       | larger model (and more data) to surpass the performance of the
       | best humans
        
         | drusepth wrote:
         | Its performance on answers to chained inference questions (on
         | page 38 of https://storage.googleapis.com/pathways-language-
         | model/PaLM-...) has already surpassed the performance of this
         | human.
        
         | danuker wrote:
         | I placed a transparent plastic ruler on the screen to come to
         | the same conclusion, then I saw your comment.
        
           | WithinReason wrote:
           | Your methodology is much more sophisticated than mine
        
             | sib wrote:
             | My dad, who worked on jet engine production many decades
             | ago, would refer to MIL SPEC EYEBALL Mk I. (I _think_ he
             | was kidding...)
        
       | londons_explore wrote:
        | This huge language model was trained 'from scratch' - i.e. before
        | the first batch of data went into the training process, the model
        | state was simply initialized with random noise.
        | 
        | I believe we are near the end of that. As models get more and
        | more expensive to train, we'll see future huge models being
        | 'seeded' with weights from previous models. Eventually, nation-
        | state levels of effort will be used to further train such
        | networks, with the results then distributed to industry.
       | 
       | A whole industry will be built around licensing 'seeds' to build
       | ML models on - you'll have to pay fees to all the 'ancestors' of
       | models you use.
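        | 
        | Roughly, 'seeding' would just mean loading an ancestor's weights
        | instead of a random init before training continues. A minimal
        | PyTorch-style sketch (the model and the checkpoint name are
        | hypothetical stand-ins):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     # toy stand-in for a big LM; the real thing is a large transformer
        |     class TinyLM(nn.Module):
        |         def __init__(self, vocab=1000, d_model=64):
        |             super().__init__()
        |             self.embed = nn.Embedding(vocab, d_model)
        |             self.out = nn.Linear(d_model, vocab)
        | 
        |         def forward(self, x):
        |             return self.out(self.embed(x))
        | 
        |     model = TinyLM()
        |     # instead of a random init, 'seed' from an ancestor checkpoint
        |     # ("ancestor.pt" is a hypothetical file name)
        |     state = torch.load("ancestor.pt")
        |     model.load_state_dict(state, strict=False)  # unmatched layers stay random
        |     # ...then keep training on new data from here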
        
       | [deleted]
        
       | lukasb wrote:
       | Does anyone know what the units are on the "performance
       | improvement over SOTA" chart?
        
         | lukasb wrote:
         | Turns out it's a composite of "normalized task-specific
         | metrics", details in the paper. Shrug. Numbers go up!
        
         | r-zip wrote:
         | I was wondering the same. Without better y-axis labeling, it's
         | not that informative of a graphic.
        
           | whymauri wrote:
           | Poetic that the top post right now is (partially) about how
           | science communication over-simplifying figures results in a
           | popular misunderstanding of science, leading readers to
           | believe that conducting research is easier than it actually
           | is.
        
       | The_rationalist wrote:
        | Not a single super-large language model has beaten the state of
        | the art in the key NLP tasks (POS tagging, dependency parsing,
        | coreference, WSD, NER, etc.). They are only ever used for
        | higher-level tasks, which is tragic.
        
         | oofbey wrote:
         | Why is that tragic? Classic NLP tasks are IMHO kinda pointless.
         | Nobody _actually_ cares about parse trees, etc. These things
         | were useful when that was the best we could do with ML, because
         | they allowed us to accomplish genuinely-useful NLP tasks by
         | writing code that uses things like parse trees, NER, etc. But
         | why bother with parse trees and junk like that if you can just
         | get the model to answer the question you actually care about?
        
       | dgreensp wrote:
       | When can we try using it?? :)
        
         | ausbah wrote:
         | i wonder if pruning and other methods that reduce size
         | drastically while not compromising on performance would be
         | possible
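          | 
          | magnitude pruning is the simplest version of that idea; whether
          | it survives at this scale is an open question, but mechanically
          | it's just (rough sketch):
          | 
          |     import torch
          | 
          |     def magnitude_prune(weight, sparsity):
          |         """Zero out the smallest-magnitude fraction of weights."""
          |         k = int(weight.numel() * sparsity)
          |         if k == 0:
          |             return weight
          |         threshold = weight.abs().flatten().kthvalue(k).values
          |         return weight * (weight.abs() > threshold)
          | 
          |     w = torch.randn(1024, 1024)
          |     w_pruned = magnitude_prune(w, sparsity=0.9)    # keep ~10% of weights
          |     print((w_pruned == 0).float().mean())          # ~0.9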
        
         | gjstein wrote:
          | Would love an answer on this too. It would be even better not
          | just to _try_ using this, but also to be able to run it
          | locally, something that has been impossible for GPT-3.
        
           | whimsicalism wrote:
           | This is not something that will be possible to run locally.
           | 
            | If you had 1 bit per parameter (not realistic), it would
            | still take ~67 GB of RAM just to load into memory.
        
           | arkano wrote:
           | Does it look like it would be possible to run locally?
        
       | [deleted]
        
       | sidcool wrote:
       | What do 540 billion parameters mean in this case?
        
         | minimaxir wrote:
          | 540B float32 values in the model (although since this model
          | was trained on TPUs, they're likely bfloat16s instead).
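          | 
          | To make "parameters" concrete: they're the entries of the
          | model's weight matrices. A toy count for one transformer-style
          | block, with made-up dimensions (PaLM's actual shapes are in the
          | paper):
          | 
          |     d_model, d_ff, vocab = 4096, 16384, 32000  # illustrative sizes only
          | 
          |     attn = 4 * d_model * d_model    # Q, K, V and output projections
          |     ffn = 2 * d_model * d_ff        # two feed-forward matrices
          |     per_layer = attn + ffn
          |     embeddings = vocab * d_model
          | 
          |     print(f"{per_layer:,} per layer")        # 201,326,592
          |     print(f"{embeddings:,} for embeddings")  # 131,072,000
          |     # multiply by the layer count (plus embeddings) for the total;
          |     # PaLM-scale models push that into the hundreds of billions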
        
       | londons_explore wrote:
        | Please, Google... please include in your papers _non-cherry-
        | picked_ sample outputs! And explicitly say that they aren't
        | cherry-picked.
        | 
        | I understand that there is a chance that the output could be
        | offensive/illegal. If necessary you can censor a few outputs, but
        | make clear in the paper that you've done that. It's better to do
        | that than just show us the best hand-picked outputs and pretend
        | all outputs are as good.
        
       | modeless wrote:
        | This is why Google built TPUs. This alone justifies the whole
        | program. This level of natural language understanding,
       | once it is harnessed for applications and made efficient enough
       | for wide use, is going to revolutionize literally everything
       | Google does. Owning the chips that can do this is incredibly
       | valuable and companies that are stuck purchasing or renting
       | whatever Nvidia makes are going to be at a disadvantage.
        
       | endisneigh wrote:
       | The model is insane, but could this realistically be used in
       | production?
        
         | motoboi wrote:
          | Yes. You don't need the model in RAM; NVMe disks are fine.
        
           | ekelsen wrote:
           | That would have very slow inference latency if you had to
           | read the model off disk for every token.
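              | 
              | Back-of-the-envelope, assuming bfloat16 weights and a fast
              | NVMe drive (both numbers are rough assumptions):
              | 
              |     params = 540e9
              |     bytes_per_param = 2          # bfloat16
              |     nvme_bytes_per_s = 7e9       # ~7 GB/s, a fast PCIe 4.0 SSD
              | 
              |     weight_bytes = params * bytes_per_param   # ~1.08 TB
              |     # dense decoding reads every weight once per generated token
              |     secs_per_token = weight_bytes / nvme_bytes_per_s
              |     print(f"~{secs_per_token:.0f} s per token")   # ~154 s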
        
             | 1024core wrote:
                | 540B parameters means ~1TB of floating-point weights
                | (assuming bfloat16). Quadruple that for other associated
                | stuff, and you'd need a machine with 4TB of RAM.
        
               | endisneigh wrote:
                | Right - and even if you did happen to have a machine
                | with 4TB of RAM - what kind of latency would you have on
                | a single machine running this as a service? How many
                | machines would you need for Google Translate-level
                | performance?
                | 
                | Doesn't seem like you can run this as a service, yet.
        
         | anentropic wrote:
         | I too am curious what kind of hardware resources are needed to
         | run the model once it is trained
        
         | gk1 wrote:
          | Why not? I'm curious whether you have any specific roadblocks
          | in mind. OpenAI makes their large models available
         | through an API, removing any issues with model hosting and
         | operations.
        
           | minimaxir wrote:
           | Latency, mostly.
           | 
           | The GPT-3 APIs were _very_ slow on release, and even with the
           | current APIs it still takes a couple seconds to get results
           | from the 175B model.
        
       | mountainriver wrote:
        | Google, for all their flaws, really is building the future of AI.
        | This is incredibly impressive and makes me think we are
        | relatively close to AGI.
        
       | nickvincent wrote:
        | Crazy impressive! A question about the training data: does anyone
        | familiar with this line of work know what social media platforms
        | the "conversation" data component of the training set came from?
        | There's a datasheet that points to prior work
        | (https://arxiv.org/abs/2001.09977), which sounds like it could be
        | Reddit, HN, or a similar platform?
        
       | gk1 wrote:
       | Is there an equivalent to Moore's Law for language models? It
       | feels like every week an even bigger (and supposedly better)
       | model is announced.
        
         | visarga wrote:
         | Scaling Laws for Neural Language Models -
         | https://arxiv.org/abs/2001.08361
        
           | lucidrains wrote:
           | revised scaling laws https://arxiv.org/abs/2203.15556
           | https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-
           | scalin...
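            | 
            | the rough headline of the revised laws, as I read them:
            | compute-optimal training wants on the order of 20 tokens per
            | parameter, so data should scale with model size. A quick
            | sketch of that rule of thumb (my paraphrase, not the paper's
            | exact fit):
            | 
            |     def optimal_tokens(n_params):
            |         # Hoffmann et al. 2022 rule of thumb: ~20 tokens per parameter
            |         return 20 * n_params
            | 
            |     for n in (70e9, 540e9):
            |         print(f"{n/1e9:.0f}B params -> ~{optimal_tokens(n)/1e12:.1f}T tokens")
            |     # 70B params  -> ~1.4T tokens  (Chinchilla itself)
            |     # 540B params -> ~10.8T tokens (vs. the ~0.78T PaLM was trained on)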
        
       | visarga wrote:
       | Interesting that they used Chain of Thought Prompting[1] for
       | improved reasoning so soon after its publication. Also related to
       | DeepMind AlphaCode which generates code and filters results by
        | unit tests, while Chain of Thought Prompting filters by checking
        | for the correct answer at the end.
       | 
       | Seems like language models can generate more training data for
       | language models in an iterative manner.
       | 
       | [1] https://arxiv.org/abs/2201.11903
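        | 
        | For anyone who hasn't read [1]: the technique is just a prompting
        | pattern - show the model a few worked examples that spell out the
        | intermediate reasoning, then ask the new question. A small
        | illustration (a variant of the arithmetic examples in the paper):
        | 
        |     prompt = """\
        |     Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis
        |        balls each. How many tennis balls does he have now?
        |     A: Roger started with 5 balls. 2 cans of 3 balls each is 6
        |        balls. 5 + 6 = 11. The answer is 11.
        | 
        |     Q: The cafeteria had 23 apples. It used 20 for lunch and
        |        bought 6 more. How many apples does it have?
        |     A:"""
        |     # the model is expected to continue with its own chain of
        |     # thought, e.g. "23 - 20 = 3. 3 + 6 = 9. The answer is 9."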
        
         | nullc wrote:
          | The general technique is pretty obvious; I discussed and
          | demonstrated it in some HN comments with GPT-2 and GPT-3 a
          | couple of times in the last couple of years, and suggested some
          | speculative extensions (which might be totally unworkable;
          | unfortunately these networks are too big for me to attempt to
          | train to try it out): https://news.ycombinator.com/item?id=24005638
        
           | gwern wrote:
           | In fact, people had already shown it working with GPT-3
           | before you wrote your comment:
           | https://twitter.com/kleptid/status/1284069270603866113
           | https://twitter.com/kleptid/status/1284098635689611264 Seeing
           | how much smarter it could be with dialogue was very exciting
           | back then, when people were still super-skeptical.
           | 
           | The followup work has also brought out a lot of interesting
           | points: why didn't anyone get that working with GPT-2, and
           | why wouldn't your GPT-2 suggestion have worked? Because
           | inner-monologue capabilities seem to only emerge at some
           | point past 100b-parameters (and/or equivalent level of
           | compute), furnishing one of the most striking examples of
           | emergent capability-spikes in large NNs. GPT-2 is just _way_
           | too small, and if you had tried, you would 've concluded
           | inner-monologue doesn't work. It doesn't work, and it keeps
           | on not working... until suddenly it does work.
        
         | make3 wrote:
          | The chain-of-thought paper is from Google, so they've
          | potentially known about it internally for a while.
        
       ___________________________________________________________________
       (page generated 2022-04-04 23:00 UTC)