[HN Gopher] Pathways Language Model (PaLM): Scaling to 540B para... ___________________________________________________________________ Pathways Language Model (PaLM): Scaling to 540B parameters Author : homarp Score : 115 points Date : 2022-04-04 16:55 UTC (6 hours ago) (HTM) web link (ai.googleblog.com) (TXT) w3m dump (ai.googleblog.com) | alphabetting wrote: | the joke explanations on page 38 of the full paper linked here | are blowing my mind. it's crazy how far language models have come | | https://storage.googleapis.com/pathways-language-model/PaLM-... | Xenixo wrote: | Wow. | | The anti-joke explanation was also very impressive. | modeless wrote: | This may be the most impressive thing I've seen a language | model do so far. That's incredible. The future is going to be | very weird. | nqzero wrote: | this thing is already more human than i am | WithinReason wrote: | I like how it appears that it had to convert from imperial to | metric before it could make an inference: | | _300 miles per hour is about 480 km/h. This is about the | speed of a commercial airplane. [...]_ | sib wrote: | Also, that's a very slow commercial airplane. (Unless talking | about an old turboprop?) | TaylorAlexander wrote: | Your comment prompted me to tweet an image of that section, | complete with alt text (as much as can fit), in case anyone cares | to see it in tweet form. | | https://twitter.com/tlalexander/status/1511089810752126984 | severine wrote: | So... we are training models here on HN, especially if we follow | the site's guidelines! Makes you think... which makes _it_ | think! | | Wow, interesting times, indeed. | theincredulousk wrote: | Haven't looked further, but I'm wondering about that. Is that | the result of training to be able to explain that specific | joke, or is it generalized? | | In the past these things have been misleading. Some impressive | capability ends up being far more narrow than implied, so it's | kind of like just storing information and retrieving it with | extra steps. | whimsicalism wrote: | From the example, it seems hard to imagine that it has been | trained to explain this specific joke. | | I understand language model skepticism is very big on HN, but | this is impressive. | mjburgess wrote: | How much of human written history can be compressed and | approximately stored in 540Bn parameters? | | It seems to me basically certain that no compressed | representation of text can be an understanding of language, | so necessarily, any statistical algorithm here is always | using coincidental tricks. That it takes 540bn parameters | to do it is, I think, a clue that we don't even really need. | | Words mean what we do with them -- you need to be here in | the world with us, to understand what we mean. There is | nothing in the patterns of our usage of words which | provides their semantics, so the whole field of | distributional analysis precludes this superstition. | | You cannot, by mere statistical analysis of patterns in | mere text, understand the nature of the world. But it is | precisely this we communicate in text. We succeed because | we are both in the world, not because "w" occurring before | "d" somehow communicates anything. | | Apparent correlations in text are meaningful to us, because | we created them, and we _have_ their semantics. The system | _must_ by its nature be a mere remembering.
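A rough back-of-envelope makes the scale under debate concrete. This is a sketch only: the bytes-per-token average below is an assumption, and the token count is the one reported for PaLM's training mix.

    # Back-of-envelope: raw size of the model vs. the text it was trained on.
    params = 540e9
    model_tb = params * 2 / 1e12          # bfloat16: 2 bytes per parameter
    corpus_tokens = 780e9                 # reported size of PaLM's training mix
    corpus_tb = corpus_tokens * 4 / 1e12  # ~4 bytes/token is an assumed average
    print(f"weights: ~{model_tb:.2f} TB, training text: ~{corpus_tb:.2f} TB")
    # ~1.08 TB of weights vs. ~3.1 TB of raw text, so any "remembering"
    # is necessarily a compressed one, whatever that implies for understanding.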
| hackinthebochs wrote: | >Words mean what we do with them -- you need to be here | in the world with us, to understand what we mean | | This is like saying "humans can't fly because flight | requires flapping wings under your own power". Sure, it's | true given the definition this statement is employing, | but so what? Nothing of substance is learned by | definition. We certainly are not learning about any | fundamental limitations of humans from such a definition. | Similarly, defining understanding language as "the | association of symbols with things/behaviors in the | world" demonstrates nothing of substance about the limits | of language models. | | But beyond that, it's clear to me the definition itself is | highly questionable. There are many fields where the vast | majority of uses of language do not directly correspond | with things or behaviors in the world. Pure math is an | obvious example. The understanding of pure math is a | purely abstract enterprise, one constituted by | relationships between other abstractions, bottoming out | at arbitrary placeholders (e.g. the number one is an | arbitrary placeholder situated in a larger arithmetical | structure). By your definition, a language model without | any contact with the world can understand purely abstract | systems as well as any human. But this just implies | there's something to understanding beyond merely | associations of symbols with things/behaviors in the | physical world. | whimsicalism wrote: | > It seems to me basically certain that no compressed | representation of text can be an understanding of | language, so necessarily, any statistical algorithm here | is always using coincidental tricks. That it takes 540bn | parameters to do it is, I think, a clue that we don't even | really need. | | I think your premise contains your conclusion, which, | while common, is something you should strive to avoid. | | I do think your opinion is a good example of the | prevailing sentiment on Hacker News. To me, it seems to | come from a discomfort with the fact that even "we" | emerge out of the basic interactions of basic building | blocks. Our brain has been able to build world knowledge | "merely by" analysis of electrical impulses being | transmitted to it on wires. | mjburgess wrote: | I have no discomfort with the notion that our bodies, | which grow in response to direct causal contact with our | environment, contain in their structure the generative | capability for knowledge, imagination, skill, growth -- | and so on. | | What I do object to is the basically schizophrenic notion | that the shapes of words have something to do with | the nature of the world. I just think it's a kind of | insanity which absolutely destroys our ability to reason | carefully about the use of these systems. | | That "tr" occurs before "ee" says as much about "trees" | as "leaves are green" says -- it is only because *we* have | the relevant semantics that the latter is meaningful when | interpreted in the light of our "environmental history" | recorded in our bodies, and given weight and utility by | our imaginations. | | The structure of text is not the structure of the world. | The contrary thesis is mad. It's a scientific thesis. It is | trivial to test it. It is trivial to wholly discredit it. | It's pseudoscience. | | No one here is a scientist and no one treats any of this | as science. Where are the criteria for the empirical | adequacy of NLP systems as models of language?
Specifying | any, conducting actual hypothesis tests, and establishing | a _theory_ of how NLP systems model language -- this | would immediately reveal the smoke-and-mirrors. | | The work to reveal the statistical tricks underneath them | takes years, and no one has much motivation to do it. The | money lies in this sales pitch, and this is no science. | This is no scientific method. | whimsicalism wrote: | Agree to disagree. I think you are opining about things | that you are lacking fundamental knowledge on. | | > The structure of text is not the structure of the | world. The contrary thesis is mad. It's a scientific thesis. | It is trivial to test it. It is trivial to wholly discredit | it. It's pseudoscience. | | It's unclear what you even mean by that. Are the | electrical impulses coming to our brain the "structure of | the world"? | rafaelero wrote: | Ok, boomer. | tux1968 wrote: | Then wouldn't you have to believe that people who are | born blind and deaf, or unable to walk, do not really | "understand", since they're not connected to the world in | the same way as those born without those limits? | NiceElephant wrote: | I wonder if AI is a technology that will move from "local | producers" to a more centralized setup, where everybody just buys | it as a service, because it becomes too complicated to operate | by yourself. | | What are examples in history where this has happened before? The | production of light, heat and movement comes to mind: with the | invention of electricity, it moved from people's homes and | businesses to (nuclear) power plants, which can only be operated | by a fairly large team of specialists. | | Anybody have other examples? | eunos wrote: | Hosting moved from your own servers at home and localized data | centers to global cloud companies. | NiceElephant wrote: | Yeah, this kinda goes in the same direction, but in this | case, as well as for example agriculture, I feel it is mostly | for convenience. You could still do it at home if you wanted | to, in contrast to operating a nuclear power plant. I thought | chip-making might be another example, but I'm not sure that | was ever decentralized in its early days. | napoleon_thepig wrote: | This is kind of already happening with services like Google | Cloud Translation. | castratikron wrote: | Will we see an intelligence too cheap to meter? | | https://www.atlanticcouncil.org/blogs/energysource/is-power-... | WithinReason wrote: | Based on their 3rd figure, it would take an approximately 100x | larger model (and more data) to surpass the performance of the | best humans | drusepth wrote: | Its performance on answers to chained inference questions (on | page 38 of https://storage.googleapis.com/pathways-language-model/PaLM-...) | has already surpassed the performance of this human. | danuker wrote: | I placed a transparent plastic ruler on the screen to come to | the same conclusion, then I saw your comment. | WithinReason wrote: | Your methodology is much more sophisticated than mine | sib wrote: | My dad, who worked on jet engine production many decades | ago, would refer to MIL SPEC EYEBALL Mk I. (I _think_ he | was kidding...) | londons_explore wrote: | This huge language model was trained 'from scratch' - i.e. before | the first batch of data went into the training process, the model | state was simply initialized using random noise. | | I believe we are near the end of that. As models get more and | more expensive to train, we'll see future huge models being | 'seeded' with weights from previous models. Eventually nation-state | levels of effort will be used to further train such networks, with | the results then distributed to industry for use. | | A whole industry will be built around licensing 'seeds' to build | ML models on - you'll have to pay fees to all the 'ancestors' of | models you use.
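The 'seeding' pattern described above already exists in miniature as warm-starting from published checkpoints. A minimal sketch using the Hugging Face transformers API; the model name is an illustrative stand-in, and PaLM itself is not available this way:

    # Warm-start: load 'ancestor' weights instead of random initialization,
    # then continue training on new data. "gpt2" is an illustrative stand-in.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")  # inherited 'seed' weights
    model.train()  # fine-tuning from here starts from the seed, not from noise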
| [deleted] | lukasb wrote: | Does anyone know what the units are on the "performance | improvement over SOTA" chart? | lukasb wrote: | Turns out it's a composite of "normalized task-specific | metrics", details in the paper. Shrug. Numbers go up! | r-zip wrote: | I was wondering the same. Without better y-axis labeling, it's | not that informative of a graphic. | whymauri wrote: | Poetic that the top post right now is (partially) about how | science communication over-simplifying figures results in a | popular misunderstanding of science, leading readers to | believe that conducting research is easier than it actually | is. | The_rationalist wrote: | not a single super-large language model has beaten the state of | the art on the key NLP tasks (POS tagging, dependency parsing, | coreference, WSD, NER, etc.). They are always only used for | higher-level tasks, which is tragic. | oofbey wrote: | Why is that tragic? Classic NLP tasks are IMHO kinda pointless. | Nobody _actually_ cares about parse trees, etc. These things | were useful when that was the best we could do with ML, because | they allowed us to accomplish genuinely useful NLP tasks by | writing code that uses things like parse trees, NER, etc. But | why bother with parse trees and junk like that if you can just | get the model to answer the question you actually care about? | dgreensp wrote: | When can we try using it?? :) | ausbah wrote: | i wonder if pruning and other methods that reduce size | drastically while not compromising on performance would be | possible | gjstein wrote: | Would love an answer on this too. It would be even better not | just to _try_ using this, but also to be able to run it locally, | something that has been impossible for GPT-3. | whimsicalism wrote: | This is not something that will be possible to run locally. | | Even at 1 bit per parameter (not realistic), it would | still take ~70 GB of RAM just to load into memory. | arkano wrote: | Does it look like it would be possible to run locally? | [deleted] | sidcool wrote: | What do 540 billion parameters mean in this case? | minimaxir wrote: | 540B float32 values in the model (although since this model | was trained via TPUs, likely bfloat16s instead). | londons_explore wrote: | Please, Google... please include in your papers _non-cherry- | picked_ sample outputs! And explicitly say that they aren't | cherry-picked. | | I understand that there is a chance that the output could be | offensive/illegal. If necessary you can censor a few outputs, but | make clear in the paper you've done that. It's better to do that | than to show only the best outputs and pretend all outputs | are as good. | modeless wrote: | This is why Google built TPUs. This alone justifies the whole | program. This level of natural language understanding, | once it is harnessed for applications and made efficient enough | for wide use, is going to revolutionize literally everything | Google does. Owning the chips that can do this is incredibly | valuable, and companies that are stuck purchasing or renting | whatever Nvidia makes are going to be at a disadvantage. | endisneigh wrote: | The model is insane, but could this realistically be used in | production? | motoboi wrote: | Yes. You don't need the model in RAM; NVMe disks are fine. | ekelsen wrote: | That would have very slow inference latency if you had to | read the model off disk for every token. | 1024core wrote: | 540B parameters means ~1 TB of floating-point weights (assuming | bfloat16). Quadruple that for other associated stuff, and | you'd need a machine with 4 TB of RAM. | endisneigh wrote: | right - and even if you did happen to have a machine | with 4 TB of RAM - what kind of latency would you have on | a single machine running this as a service? how many | machines would you need for google translate performance? | | doesn't seem like you can run this as a service, yet.
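The arithmetic behind these estimates, as a sketch; the 4x multiplier is the rule of thumb from the comment above, not a measured figure:

    # Serving-memory estimate for a 540B-parameter model at various precisions.
    n_params = 540e9
    for fmt, bytes_per_param in [("float32", 4), ("bfloat16", 2), ("1-bit", 1 / 8)]:
        gb = n_params * bytes_per_param / 1e9
        print(f"{fmt:>8}: ~{gb:,.0f} GB for the weights alone")
    # float32 ~2,160 GB; bfloat16 ~1,080 GB; even 1 bit/param is ~68 GB.
    # With ~4x headroom for activations and buffers, bfloat16 lands near 4 TB.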
| anentropic wrote: | I too am curious what kind of hardware resources are needed to | run the model once it is trained | gk1 wrote: | Why not? I'm curious whether you have any specific | roadblocks in mind. OpenAI makes their large models available | through an API, removing any issues with model hosting and | operations. | minimaxir wrote: | Latency, mostly. | | The GPT-3 APIs were _very_ slow on release, and even with the | current APIs it still takes a couple of seconds to get results | from the 175B model. | mountainriver wrote: | Google, for all their flaws, really is building the future of AI. | This is incredibly impressive and makes me think we are | relatively close to AGI. | nickvincent wrote: | Crazy impressive! A question about the training data: anyone | familiar with this line of work know what social media platforms | the "conversation" data component of the training set came from? | There's a datasheet that points to prior work | (https://arxiv.org/abs/2001.09977), which sounds like it could be | reddit, HN, or a similar platform? | gk1 wrote: | Is there an equivalent to Moore's Law for language models? It | feels like every week an even bigger (and supposedly better) | model is announced. | visarga wrote: | Scaling Laws for Neural Language Models - | https://arxiv.org/abs/2001.08361 | lucidrains wrote: | revised scaling laws https://arxiv.org/abs/2203.15556 | https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-scalin... | visarga wrote: | Interesting that they used Chain of Thought Prompting[1] for | improved reasoning so soon after its publication. Also related to | DeepMind's AlphaCode, which generates code and filters results by | unit tests, while Chain of Thought Prompting filters by checking | for the correct answer at the end. | | Seems like language models can generate more training data for | language models in an iterative manner. | | [1] https://arxiv.org/abs/2201.11903 | nullc wrote: | The general technique is pretty obvious; I discussed and | demonstrated it in some HN comments with GPT-2 and GPT-3 a | couple of times in the last couple of years, and suggested some | speculative extensions (which might be totally unworkable; | unfortunately these networks are too big for me to attempt to | train to try it out): https://news.ycombinator.com/item?id=24005638 | gwern wrote: | In fact, people had already shown it working with GPT-3 | before you wrote your comment: | https://twitter.com/kleptid/status/1284069270603866113 | https://twitter.com/kleptid/status/1284098635689611264 Seeing | how much smarter it could be with dialogue was very exciting | back then, when people were still super-skeptical. | | The followup work has also brought out a lot of interesting | points: why didn't anyone get that working with GPT-2, and | why wouldn't your GPT-2 suggestion have worked? Because | inner-monologue capabilities seem to only emerge at some | point past ~100B parameters (and/or an equivalent level of | compute), furnishing one of the most striking examples of | emergent capability-spikes in large NNs. GPT-2 is just _way_ | too small, and if you had tried, you would've concluded | inner-monologue doesn't work. It doesn't work, and it keeps | on not working... until suddenly it does work.
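For readers who haven't seen the technique discussed in this thread: a chain-of-thought prompt simply includes worked reasoning in its few-shot examples, so the model imitates the intermediate steps before answering. A sketch, with the exemplar adapted from the paper cited above; PaLM itself is not publicly callable, so the prompt would go to whatever text-completion API one has:

    # Chain-of-thought prompting: the few-shot exemplar demonstrates its
    # reasoning, nudging the model to emit steps before the final answer.
    cot_prompt = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls.\n"
        "How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is\n"
        "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
        "Q: The cafeteria had 23 apples. If they used 20 to make lunch and\n"
        "bought 6 more, how many apples do they have?\n"
        "A:"
    )
    # A completion model given cot_prompt would ideally continue with the
    # reasoning ("23 - 20 = 3; 3 + 6 = 9") before giving "The answer is 9."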
| make3 wrote: | The chain-of-thought paper is from Google, so they've | potentially known about it internally for a while ___________________________________________________________________ (page generated 2022-04-04 23:00 UTC)