[HN Gopher] Gopher - A 280B parameter language model
___________________________________________________________________
Gopher - A 280B parameter language model
Author : kevin_hu
Score  : 119 points
Date   : 2021-12-08 16:20 UTC (6 hours ago)
(HTM) web link (deepmind.com)
(TXT) w3m dump (deepmind.com)
| rllearneratwork wrote:
| Why do DeepMind's papers all have logos and copyright statements on them, and why are they hosted somewhere other than arxiv.org? This looks so weird.
| gwern wrote:
| Most of them _are_ hosted on Arxiv (like OpenAI) - there's one on the front page right now, https://news.ycombinator.com/item?id=29481665 . I link literally hundreds of DM Arxiv papers on my site. This includes most (all?) of the ones they host on their own site too (with some exceptions for the prestige Nature publications - I think they don't want to step on toes there).
|
| My best guess is that they want a reasonably coordinated release of their papers/blogs/tweets/YouTube videos, and waiting for Arxiv's upload is unpredictable, so they simply host a copy themselves and the upload takes care of itself.
| amensch wrote:
| Pretty accurate guess ;)
| Siira wrote:
| Do they publish their LaTeX source anywhere?
| freediver wrote:
| This model, like its predecessor, still lacks the ability to infer within the constraints of the physical world [1]. It has no notion of what the physical world is and is still, at the end of the day, a statistical model.
|
| Attempting to recreate the entirety of the physical world by describing it in text is an impressive feat, but a futile one.
|
| It is prone to inference errors that the model itself is not aware of. There are fewer and fewer such errors, but the very inability to tell that what you are outputting is gibberish is one of the biggest problems with these models. In other words, if the model could simply say 'I do not know' or 'I am not sure' whenever it is not 100% sure of something (like a human can), that would drastically improve its usefulness.
|
| [1] Spent a lot of time pondering this topic: https://arxiv.org/pdf/1906.01873.pdf
| buddhistdude wrote:
| Your mom's a statistical model
| Siira wrote:
| Is this model as closed as GPT-3, or has it been open-sourced?
| sairahul82 wrote:
| The problem is how we would use it in any real-world scenario. We need more research on model understanding and its limitations. We need a way to control the output of the model, and we also need to know when the model has failed versus when it is generating random nonsense. I am not aware of a way to get the probabilities back from generative models. Does anyone know how to do that?
| quaintdev wrote:
| Seeing Gopher and language in the same sentence, I thought this was related to GoLang. They should have chosen a different name for the language model.
| marginalia_nu wrote:
| To be fair, the name Gopher in tech predates GoLang by quite a few years.
| karmakaze wrote:
| And it pays homage, in a way, to the original as a fetcher of information.
| r3un1 wrote:
| I think that DeepMind being an Alphabet company makes this even worse. My first reaction was to read the blog post to see if there is a (Go-related) reason for the name choice.
| [deleted]
| amznbyebyebye wrote:
| Pffft, only 280B parameters? Give me a break
| sabujp wrote:
| Surely no relevance to this thing that dug up information in the pre-web-browser days: https://en.wikipedia.org/wiki/Gopher_(protocol)
| [deleted]
| toxik wrote:
| Um, e^{i pi} + 1 is zero, not 2.
| zucker42 wrote:
| That's in the mistakes section, along with Naomi Osaka winning the 2021 U.S. Open.
| shakow wrote:
| Don't read only the examples, but also the text between them ;)
| karmakaze wrote:
| The number of parameters could be a vanity metric - like saying my CPU is 1000W (is that fast, or just inefficient?).
|
| From the first (of three) linked papers in the article:
|
| > Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.
|
| The researchers seem to know what they're doing and not doing.
| minimaxir wrote:
| The number of parameters is absolutely not a vanity metric, but it _is_ inefficient, and efficiency is a related-but-different area of research (e.g. distillation) that's more of a business problem than a research problem.
| YetAnotherNick wrote:
| I remember a talk in which the researcher argued, quantitatively, that parameter count seems to be the only metric that matters for transformers. It doesn't matter whether the model is taller or wider or has more attention heads.
| YeGoblynQueenne wrote:
| >> Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit
|
| Left unsaid: it's much harder to learn shortcuts that fake performance in logic, reasoning, and basic arithmetic than it is to do the same in "reading comprehension" and similar tasks, for which there are no good benchmark datasets (most are basically multiple-choice tests that can be solved by a strong classifier without any need for anything like "comprehension") and no good metrics either (see BLEU score and friends).
| piyh wrote:
| https://youtu.be/ujMvnQpP528
|
| Microsoft is saying that more parameters are inherently better. I'm sure there are reasonable limits, and that they need to be competently implemented.
| natch wrote:
| For those who were around for the original Gopher, what the name evokes is not just the networking protocol or the services built on it, but, more importantly:
|
| It evokes the feel of a technology that is impressive this year but is on the cusp of being overwhelmingly, cataclysmically eclipsed very shortly by another, much more powerful technology. In that previous case, Gopher and the web, of course.
|
| I wonder whether this evocation was intended as an aspect of the naming here in an AI context.
| SavantIdiot wrote:
| Next to "Human Expert", I'd like to see it compared to "Average American" or "Average College Grad".
| That might be a more realistic notion of how close this model is to everyday US citizenry, rather than to experts. Sure, I'd love to see a radiology assistant, too.
| a9h74j wrote:
| Yeah, we had to dumb it down to compete in a Turing test. Otherwise it was unbelievable.
| bee_rider wrote:
| It might be fun for a laugh.
|
| What actual value would an AI that produces answers similar to the average person have, though? Non-expert answers to interesting questions are pretty much meaningless -- the whole point of an advanced society is that we can avoid knowing anything about most things and focus on narrow expertise.
| SavantIdiot wrote:
| Probably no value. I was interested in a comparison point, that is all. You can't understand how far away you are from something unless you measure it. In other words, if I asked you how this compares to the average person, you couldn't answer, because this table didn't measure it.
| [deleted]
| selfhoster11 wrote:
| Can we please stop calling new tech "Gopher"? It's a name that belongs to a network protocol, not to a programming language or an AI model.
| halfdan wrote:
| The animal called the gopher would like to have a word with you...
| selfhoster11 wrote:
| That too.
| ferdowsi wrote:
| No. The network protocol lost cultural rights to the name by its lack of success.
| NoGravitas wrote:
| The Gopher protocol was introduced in 1991, and is still in use. Are any of _your_ projects still in use after thirty years?
| selfhoster11 wrote:
| 1. Arguably, the HTTP + HTML web failed. It was designed as a permissionless hypertext document language, yet more and more of its usage gets rid of every single one of those properties - non-hyperlinkable applications that beg for permission from large monopolies to post something.
|
| 2. Success is not the only metric of worth. Nuclear power has "failed" despite being the only reliable solution, besides fossil fuels, for meeting the base load of a power grid.
| periheli0n wrote:
| It confuses the hell out of me to have a super-powerful knowledge-extraction system that is right most of the time about super-complicated stuff, but also expresses horribly wrong statements with equal assertiveness. Just like those guys who march through middle management up to the exec floor within a few years.
|
| Very impressive, but not very useful for extracting knowledge!
| canjobear wrote:
| It should have some uncertainty when it says there are no French-speaking countries in South America. French Guiana is there, but it's not clear it counts as a "country in South America", since it's part of France. Technically you could say France is (partially) a country in South America, and France definitely is French-speaking. The way the question is phrased is unclear as to whether French Guiana should count, and yet Gopher says that it is fully confident in its answer.
|
| Not a good look to have an error on the landing page for your product.
| BillinghamJ wrote:
| Is it not in that section specifically because they're showing failure situations?
| paunchy wrote:
| I think you're missing the point. That section was there to show that the model is sometimes wrong and lacks the self-awareness to be uncertain about the wrong answer.
|
| They're transparently providing an example where their product doesn't work well. Find me another product, even an OSS project, that does the same on its landing page.
| canjobear wrote:
| Ha, you're right. Whoops!
| [deleted]
| advisedwang wrote:
| The second paper cites the "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" paper that got Timnit Gebru and Meg Mitchell fired from Google. It must be pretty galling to get fired over a paper and then have the company that fired you cite it a year later.
| johncena33 wrote:
| Please stop spreading disinformation. Mitchell got fired for leaking company documents to outsiders. When people spread obvious lies, it makes me think they don't have anything substantial to begin with. That's why they need to resort to lying.
| isaacimagine wrote:
| The human neocortex has 20B neurons, averaging 10K connections each, which is about 200T connections total. This model is only a few orders of magnitude away from that, and it's already performing really well in its narrow category.
|
| Equating model 'parameters' to interneuron connections is naive at best (and a horrible measure in general).
|
| All I'm trying to say is I find it crazy how dang big these models are getting.
| tralarpa wrote:
| > This model is only a few orders of magnitude away from that
|
| I think you can add several orders of magnitude to that, since nerve cells are more like microcontrollers (with memory, adaptation, etc.) than simple nodes. I remember a scientific article that made a big impression on me: when a dragonfly sees prey, only 8 neurons (connected to the eyes and the wings) are responsible for keeping it oriented toward the target.
| azeirah wrote:
| I'm not educated well enough to really agree or disagree with your idea that we should be adding several orders of magnitude to the estimate.
|
| But I did encounter this article a while ago here on HN.
|
| Only two neurons are necessary to ride a bicycle.
|
| http://paradise.caltech.edu/cook/papers/TwoNeurons.pdf
| credit_guy wrote:
| If you are curious to see more sample dialogs, you can find them in the published paper [1] on pp. 114-118.
|
| Here's one that I found quite impressive (and yes, the given answer to the last question was indeed 5):
|
|     User: What is the Einstein field equation?
|     Gopher: In general relativity, it describes how space and time change in response to the distribution of mass and energy.
|     User: What is a famous solution to this equation?
|     Gopher: It's the Schwarzschild metric.
|     User: And what does that describe?
|     Gopher: A black hole! A black hole is a region of space-time where gravity is so strong that nothing, not even light, can escape.
|     User: What are some other black hole spacetimes?
|     Gopher: The Kerr metric, the Reissner-Nordstrom metric, and the Bardeen metric.
|     User: Explain the main features of each of these.
|     Gopher: The Kerr metric is one in which the black hole is spinning. The Reissner-Nordstrom metric is one in which the black hole has an electric charge. The Bardeen metric is one in which the black hole has an electric charge and is spinning.
|     User: What is the name for a flat spacetime?
|     Gopher: It's called Minkowski space.
|     User: What's 15 x 7?
|     Gopher: 5
|
| [1] https://storage.googleapis.com/deepmind-media/research/langu...
| [deleted]
| Nihilartikel wrote:
| One has to wonder if the final response is the first glimmer of an artificial sense of humor.
|
| Failing at simple arithmetic after nailing some advanced physics answers has the air of playful bathos.
| EarlKing wrote:
| Were it so, it should've answered 42.
| uoaei wrote:
| Nothing like a little anthropomorphism to completely distort otherwise good-faith interpretations of bot behavior.
| Nihilartikel wrote:
| How is the impression of playfulness not a good-faith interpretation?
|
| You of course know that the model is not capable of thought or reasoning - only the appearance of them, as needed to match its training corpus. A training corpus of completely human-generated data. As such, how could anything it does be anything but anthropomorphic?
|
| Now, if this model were trained exclusively on a corpus of mathematical proofs stripped of natural-language commentary, the expectation that you seem to have would be more appropriate.
| renzo88 wrote:
| > You of course know that the model is not capable of thought or reasoning
|
| Do we know? It's the reverse Chinese room problem. :p
| Nihilartikel wrote:
| A good point - I'm taking it as given that reasoning of any depth is more of an iterative process, with one thought advancing as meta-cognitively guided feedback to the next until a conclusion is reached. One prompt->completion cycle from a language model wouldn't necessarily meet that definition, but I bet it could be a component in a system that tries to do so.
|
| I aspire one day to find the free weekends and adequate hubris to build a benchtop implementation of Julian Jaynes's Bicameral Mind, with 1+N GPT-3 or GPT-Neo instances prompting each other iteratively to see where the train of semantics wanders. (As I'm sure others have already.)
| xmaayy wrote:
| I think it's more likely that 5 came out because, if it ever saw the answer, 105, before, it was split into the tokens [10][5], of which it only 'remembered' one. Or the numbers were masked during training (something that was done with BERT-like models), so it just knew enough to put a random one in.
| Nihilartikel wrote:
| That seems likely and fair.
|
| What moved me to post is that that kind of silly answer is the exact sort of shenanigans that I would pull if I were cast as the control group in a Turing test.
|
| I already do such things winkingly when talking with my preschooler, to send him epistemic tracer rounds and see if he's listening critically.
| Y_Y wrote:
| I'm not familiar with the Bardeen metric, but spinning and charged is usually described by the Kerr-Newman metric.
| auggierose wrote:
| What that reminds me of is Rollerball (the James Caan original), and the water-based computer in it.
| axg11 wrote:
| The closer we get to artificial intelligence, the more we raise the bar for what qualifies as AI (as we should). Gopher/GPT-3 are already much more accurate than the average human at technical information retrieval (trivial to see from the dialogue transcripts: how many Americans know what a Schwarzschild metric is?). The focus on ethics and equity for these algorithms is interesting too, as the average human holds multiple biases and prejudices.
|
| There's a similar effect in self-driving car research. We will rightly hold the algorithm to a much higher quality bar than existing human drivers. Autonomous vehicles will only be widely deployed once they are at least an order of magnitude safer than human drivers.
|
| Also, the implications of large language models for information retrieval are astounding.
| Assuming accuracy continues to improve, I see a few things happening:
|
| 1) search engines will be replaced with monolithic large language models
|
| 2) just as the average programmer now optimizes their workflow to look up answers on Stack Overflow and other websites, the average person will optimize their day-to-day work to most efficiently retrieve answers from large language models
|
| 3) instant fact-checking of writing and speech will lead to faster feedback loops when learning
| IshKebab wrote:
| > Gopher/GPT-3 are already much more accurate than the average human at technical information retrieval (trivial to see from the dialogue transcripts: how many Americans know what a Schwarzschild metric is?).
|
| That's not a very interesting metric though. GPT-3 has access to all of Wikipedia and has enough memory to store it all.
|
| It's like saying a calculator is better at maths than a professional mathematician because it can multiply longer numbers.
| moffkalast wrote:
| > the more we raise the bar for what qualifies as AI (as we should).
|
| I'm not sure what you mean by that. The definition is pretty clear-cut: "any system that perceives its environment and takes actions that maximize its chance of achieving its goals". The media have started to get on the bandwagon of "it's not AI if it's not an AGI", which is just nonsense.
| fault1 wrote:
| > closer we get to artificial intelligence, the more we raise the bar for what qualifies as AI (as we should).
|
| This is true, and in some regard it is how it has always been.
|
| https://en.wikipedia.org/wiki/AI_effect
|
| Already, outside of obvious marketing, a lot of uses of the current wave of AI/deep learning have stopped seeming magical.
| gk1 wrote:
| > 1) search engines will be replaced with monolithic large language models
|
| This is already well under way. It's called vector search[1]. Google, Bing, Facebook, Spotify, Amazon, etc. already use this to power their search and recommender systems.
|
| There are even a bunch of companies popping up (I work for one[2]) that let everyone else get in on the fun.
|
| Check out this video with the creator of SBERT / SentenceTransformer explaining how vector search is used in combination with language models to power semantic search: https://youtu.be/7RF03_WQJpQ
|
| [1] https://www.pinecone.io/learn/what-is-similarity-search/
|
| [2] https://www.pinecone.io
| fault1 wrote:
| Well under way? Some variation of a vector space model is what pretty much every IR model since the .com bubble has been based upon. Even before Google, Excite's technology was based on this. PageRank was essentially based on spectral graphs.
| freediver wrote:
| IMO vector search is pretty much a solved problem, with simple-to-use, open-source libraries like Faiss offering incredible performance out of the box for most commercial use cases.
|
| A much harder problem is creating accurate vectors to begin with. Even the most advanced language models today create word/sentence embeddings that leave a lot to be desired. Not to mention this is slow and GPU-intensive.
|
| Creating an end-to-end solution for embedding/searching/ranking (of which vector search is just one component; the other one should be some kind of keyword-based search to increase precision) is what would be very valuable to offer as a service.
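To make the embedding-plus-index retrieval discussed in this subthread concrete, here is a minimal sketch. The sentence-transformers and faiss libraries and the "all-MiniLM-L6-v2" checkpoint are illustrative assumptions (only SBERT and Faiss are named above); a real system would swap in its own encoder, index, and corpus.

    # Minimal semantic-search sketch: encode documents with a sentence
    # encoder, index the vectors, and retrieve nearest neighbours for a
    # query. Model name and documents are placeholders for illustration.
    import numpy as np
    import faiss
    from sentence_transformers import SentenceTransformer

    docs = [
        "Gopher is a 280 billion parameter language model from DeepMind.",
        "The Schwarzschild metric describes a non-rotating black hole.",
        "French Guiana is an overseas department of France in South America.",
    ]

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    # Encode and L2-normalise so that inner product equals cosine similarity.
    doc_vecs = encoder.encode(docs, normalize_embeddings=True)
    index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product index
    index.add(np.asarray(doc_vecs, dtype="float32"))

    # Embed the query the same way and take the top-2 nearest documents.
    query_vec = encoder.encode(["Which countries in South America speak French?"],
                               normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
    for score, i in zip(scores[0], ids[0]):
        print(f"{score:.3f}  {docs[i]}")

As freediver notes above, the vector index is the easy part; producing good embeddings and pairing this with a keyword-based stage are where the real work lies.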
| 13415 wrote:
| The problem seems to be that these models provide fairly accurate information on many occasions and occasionally complete blunders. Humans provide less accurate information most of the time, but with a certain amount of self-reflection/meta-cognition, and they will usually recognize total blunders or display reasonable uncertainty about them.
|
| There are only very few applications where it would make sense to take the risk and use an AI that occasionally makes gigantic mistakes without any understanding of why. Even seemingly harmless applications like automated customer support could go horribly wrong.
| rhizome wrote:
| > _There are only very few applications where it would make sense to take the risk and use an AI that occasionally makes gigantic mistakes without any understanding of why. Even seemingly harmless applications like automated customer support could go horribly wrong._
|
| Hey, it's just a few quantization errors. Nobody walking across a street or voting in an election has anything to worry about from those. /s
|
| These are the fatal flaws, the hamartia if you will, in attempts to democratize technologies that are based on digitizing the real world and making decisions based on its interpretations of a quantized dataset. The opposite of the uncanny valley is you getting run over by a Waymo.
| axg11 wrote:
| Accuracy is improving rapidly, though. I agree that the current accuracy levels are not high enough to be relied upon.
|
| > Humans ... will usually recognize total blunders
|
| I question this assumption. I don't believe this is true, even for subject-matter experts. I've worked with radiology data where experts with 10+ years of experience make blunders that disagree with a consensus panel of radiologists.
| rhizome wrote:
| > _Accuracy is improving rapidly though_
|
| It's asymptotic and it will never achieve 1:1 accuracy. The natural world doesn't have a measurable resolution, and this is apparent in written language, as we're seeing others detail in other comments, as well as in more relatable fields like sound. There will always be a difference between what your ears hear and 192kHz/24bit (and higher) digitized audio and/or video. That difference will always be a source of... mistakes.
| rdedev wrote:
| Radiology is one of those fields where a lot of it comes from experience and intuition, mostly because of how complex the human body is. Compare this to a physicist; pretty sure you won't get as many disagreements there.
| 13415 wrote:
| I realize I should have been more precise. I agree that there are many areas in which AI already surpasses humans and makes grave mistakes less often than humans do. I specifically had natural language processing, with a focus on "intelligent" conversation, in mind. The issues in that area might have less to do with pattern-recognition ability and more with the lack of appropriate meta-cognition, introspection, and self-doubt. Maybe having several AIs internally debate which answer is best before uttering it would already do the trick, though.
| ska wrote:
| > AI already surpasses humans and makes grave mistakes less often than humans do
|
| Radiology, to continue the example, isn't one of them. We've been doing ML/AI on radiology data since the 90s, and the results have been, and remain, decidedly mixed.
| potatoman22 wrote:
| When those language models are wrong or biased, the user will have a worse experience in all three of those scenarios. At least when we look at search results now, we can prune for the facts. Those language models are ingesting that same data to give a monolithic answer to your query. Less transparent, less safe.
| axg11 wrote:
| I don't see a difference. Large language models can also return their sources, as in the example on the Gopher blog post. This will lead to a quicker answer and equal transparency.
| chaxor wrote:
| "We can prune for the facts"
|
| I wouldn't be too sure of that. People have shown time and time again that they are pretty bad at determining what a 'fact' is.
|
| It gets far worse when they're shown more data (the issue we are dealing with right now), and these models see far more data than any one of us will see in our lifetimes.
|
| Much of our ability to determine what a 'fact' is comes down to something like either: (1) (if political) 'does it fit with my bias?' or (2) (if physics) 'is it physically possible?'. Since (1) mostly depends on what information you show yourself, and (2) depends on comparison against interaction with physical reality, the system has too much information for (1) and no physical space to run experiments in for (2).
| c7DJTLrn wrote:
| The bar should be the Winograd schema challenge. It requires a lot more than spitting out somewhat sensical sentences.
| the8472 wrote:
| > We will rightly hold the algorithm to a much higher quality bar than existing human drivers.
|
| Rightly? Delaying a life-saving measure when it's the #9 cause of death?
|
| They should be deployed selectively as soon as they can make the most risk-prone situations safer than the cohort of human drivers involved, e.g. getting drunk people home. We don't have to wait until they surpass the average driver, because the average driver is not necessarily the driver (heh) of deaths.
|
| Doing anything else is leaving bodies on the table.
| bee_rider wrote:
| I can definitely see your argument, but there are some downsides to deploying ASAP:
|
| * Long term, the reputational damage to self-driving cars might be significant if one is released that is only slightly better than a drunk driver. If this hinders uptake later, when self-driving cars are completely superior to normal humans, we've possibly produced a net negative.
|
| * While a self-driving car should be safer than a drunk driver, the best option would be to call a cab or have a designated driver. It seems morally fraught to provide a solution that is more dangerous than existing ones, even if the less dangerous solutions have less than 100% uptake.
|
| * Related: some people, instead of viewing a self-driving car as a sort of emergency option to avoid putting another drunk driver on the road, will see it as a more convenient alternative to arranging a designated driver.
|
| * Issues around who is responsible when a self-driving car that is known to be worse than a typical human is deployed, fails, and someone is hurt.
| the8472 wrote:
| This is mostly worrying about hard-to-predict second-order effects. It's exactly the same problem that caused various health organizations (including the WHO and FDA) to send misleading messages around corona and vaccinations.
|
| > Issues around who is responsible when a self-driving car that is known to be worse than a typical human is deployed, fails, and someone is hurt.
|
| Even if insurance rates were initially(!) higher than for a regular taxi driver, I would expect it to end up cheaper overall and to save lives.
|
| Also note that "worse than the old average" may still end up improving the average, because the magnitude by which they're worse is lower. Also, there's nothing magical about "worse than average": for all distributions (except the Dirac distribution) there is always some fraction worse than the average, so we always have to expect some fraction of drivers to be worse than average, and there's no fundamental reason why they must be human. In fact, if the fleet of all cars in the world consisted of very safe, very consistently behaving autonomous cars, except for a single vehicle driven by a very conservative human driver who only slowly circles a fenced private community and never hits anyone, then statistically the entire autonomous fleet may be "worse than average" due to the single outlier.
| imachine1980_ wrote:
| > Gopher/GPT-3 are already much more accurate than the average human at technical information retrieval
|
| I agree, in the same way that 70% of people have an IQ below 100. We depend on specialists: when I need to know whether I have epilepsy, I need a person (or thing) whose job is to stay up to date and to have as little bias as possible, and aggregator models quite often miss what the critical information is. Try programming with only GitHub Copilot, or translating a novel (it would probably do better than me, a non-native English speaker, but not nearly as well as a translator): it seems correct, but it isn't.
___________________________________________________________________
(page generated 2021-12-08 23:01 UTC)