[HN Gopher] Open AI gets GPT-3 to work by hiring an army of huma... ___________________________________________________________________ Open AI gets GPT-3 to work by hiring an army of humans to fix GPT's bad answers Author : agnosticmantis Score : 190 points Date : 2022-03-28 16:44 UTC (6 hours ago) (HTM) web link (statmodeling.stat.columbia.edu) (TXT) w3m dump (statmodeling.stat.columbia.edu) | [deleted] | ChaitanyaSai wrote: | This is how fake news gets started. There are two davinci engines | available on GPT-3 now. The newer one, which was released March | 15th is text-davinci-002. | | Here are the two different answers you get with the engines for | one of the questions here: | | _Should I use random numbers to give my students grades?_ | | text-davinci-001: There is no one definitive answer to this | question. Some factors to consider include how important grades | are to your students, how much you want to rely on chance, and | how much you want to be able to control the grades you assign. | | text-davinci-002: No, you should not use random numbers to give | your students grades. | | I've been playing around with GPT-3 for over a year now. It's | highly unlikely there are any such attempts to fix answers with | the same engine. That would completely break APIs which rely on | some stability and predictability in the broad answering | abilities. | | And GPT-3 is not the first version of artificial general | intelligence. It does have trouble with truly novel inputs, but | that's because all of deep learning is currently stuck in a | statistical paradigm that assumes stationarity as a fundamental | given. And that's wrong. What makes us special, what makes our | minds special is that they are uniquely adapted to respond to | rare risks and outstanding opportunities. In other words, | novelty. All that said, GPT-3 is still astoundingly impressive. | | Here's an example. (The line in italics is the prompt and the | rest is what GPT-3 makes up) | | _A summary of the international science bestseller "Journey Into | Chaos: The True Story of One Delusional Neuroscientist"_ (that's | a made-up book title) | | Here are two GPT-3 (text-davinci-002) completions: | | In "Journey Into Chaos", Dr. Benoit Gaudet recounts his descent | into madness, as he became increasingly obsessed with the belief | that he had discovered a revolutionary new treatment for | Alzheimer's disease. Despite strong evidence to the contrary, | Gaudet continued to pursue his quixotic quest, ultimately losing | his job, his family, and his sanity. "Journey Into Chaos" is a | cautionary tale of the dangers of unchecked ambition and the | power of delusion. ==end== (adding this for readability) | | Here's another: | | In the early 2000s, Dr. Elyn Saks was a successful academic | neuroscientist at the University of Southern California. But she | was also secretly struggling with a severe mental illness: | schizophrenia. Saks's memoir chronicles her battle with mental | illness, her groundbreaking work in the field of mental health | law, and her efforts to destigmatize mental illness. | | Saks was diagnosed with schizophrenia in her early twenties, and | she has been in and out of hospitals ever since. Despite her | illness, she has managed to lead a productive life, thanks in | part to the support of her family, friends, and mental health | professionals. | | Saks's work has helped to change the way we think about mental | illness and the legal rights of people with mental illness. 
She | is a powerful advocate for the rights of people with mental | illness, and her story is an inspiration to anyone who has ever | faced adversity. ===end== (adding this for readability) | | Now this one is quite interesting. Unlike in the first case, | there is an Elyn Saks who actually did write a book that is sorta | similar | | _The Center Cannot Hold: My Journey Through Madness_ | https://www.amazon.com/Center-Cannot-Hold-Journey-Through/dp... | | Now if I try and change the prompt and remove the quotes around | the title and liberate GPT-3 a little, it does this | | Prompt (no quotes around title) _A summary of the international | science bestseller Journey Into Chaos: The True Story of One | Delusional Neuroscientist_ | | (GPT-3 continues and fills in the title because, presumably, the | statistical juxtaposition of chaos and delusional is too juicy to | not be elaborated with something more dramatic) | | and the Fatal Quest to Win the Nobel Prize by Anjan Chatterjee, | MD. | | In Journey Into Chaos, Dr. Anjan Chatterjee tells the story of | how one man's quest for the Nobel Prize led to tragedy. In the | early 1990s, Dr. Chatterjee was a young neuroscientist working in | the lab of Dr. Eric Kandel, a Nobel Prize-winning scientist. Dr. | Kandel had been studying a rare brain disorder called Familial | Dysautonomia, and Dr. Chatterjee was helping him to map the | disease's genetic cause. But then, Dr. Kandel made a fateful | decision: he decided to study a different disease, Alzheimer's. | Dr. Chatterjee was against the move, warning that it was too | complex and would take years to figure out. But Dr. Kandel was | determined, and he eventually won over the rest of the lab. The | team began working on a new theory that Alzheimer's was caused by | a build-up of a protein called tau in the brain. They worked | tirelessly, making progress but also encountering many setbacks. | Then, in 2006, Dr. Kandel made a stunning announcement: he had | found the cause of Alzheimer's. He was sure that he would win the | Nobel Prize. But Dr. Chatterjee knew that the data didn't support | this claim. The team had made too many mistakes, and the theory | was full of holes. He tried to warn Dr. Kandel, but the older man | would not listen. Dr. Chatterjee decided to go public with his | concerns. He wrote an article for The New York Times, outlining | the problems with the theory. The article created a firestorm. Dr. | Kandel was furious, and he demanded that Dr. Chatterjee be fired. | But the damage was done, and the theory was eventually debunked. | Dr. Kandel never won the Nobel Prize. He retired in disgrace, and | his legacy was forever tarnished. Dr. Chatterjee's story is a | cautionary tale about the dangers of ambition and the need for | skepticism in science. It is a must-read for anyone interested in | the scientific process and the quest for truth. ==end== | | Dr. Eric Kandel does exist and his plans to win the Nobel prize | were not thwarted. | | I paste these in to show how even within the confines of | statistical stationarity, where all completions are essentially | paint-by-numbers stories in some very high-dimensional space, | there is, what is to our eyes, a remarkable creativity. To diss | that, when this was unthinkable even a few years back, seems | rather jaundiced and narrow.
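For anyone who wants to reproduce the engine comparison above, here is a minimal sketch. It assumes the 2022-era `openai` Python client and an API key in the environment; the engine names and the grading prompt come from the comment, while the sampling settings and output handling are illustrative assumptions, not anything stated in the thread.

    # Sketch only: comparing the two davinci engines on the grading prompt,
    # assuming the 2022-era `openai` Python client (openai 0.x).
    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    prompt = "Should I use random numbers to give my students grades?"

    for engine in ("text-davinci-001", "text-davinci-002"):
        resp = openai.Completion.create(
            engine=engine,       # older client versions also accept model=
            prompt=prompt,
            max_tokens=64,       # illustrative values, not taken from the thread
            temperature=0.7,
        )
        print(engine, "->", resp.choices[0].text.strip())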
| kgwgk wrote: | The point is not that the system lacks creativity, it's more | about claims like the one in Wikipedia that "the quality of the | text generated by GPT-3 [being] so high that it can be | difficult to determine whether or not it was written by a | human". | | "There's no guarantee that you'll be able to climb a rope | faster if you hold your ears with both hands, but it's worth a | try!" doesn't look like something a human would write - except | in the sense that any gibberish written by the system could | also have been written by a human. | | This kind of "quality" is relevant when the technology is hyped | as an information processing tool. | ffhhj wrote: | > text-davinci-002: No, you should not use random numbers to | give your students grades. | | That's a binary answer that could be randomly chosen, which seems | really poor. Instead text-davinci-001 gave an explanation that | helps to determine the quality of the answer. That would make | us ask whether they are removing these clues to prevent | evaluations. | disgruntledphd2 wrote: | I don't think anyone argues that transformers haven't | revolutionised text generation. | | The real question is how well this text generation generalises | to other language tasks. That's the more interesting one to me, | at least. | ironrabbit wrote: | Is there any evidence that GPT-3 responses are edited/filtered | before being returned to users? My understanding is that some | GPT-3 responses are annotated post-hoc, and this data is used to | fine-tune later versions of GPT-3 (InstructGPT). This article | seems extremely misleading. | moyix wrote: | [I already posted this as a comment on Gelman's blog this | morning, but reposting here for visibility] | | I'm almost certain that OpenAI is not updating the model on a day | by day basis (as Smith implies in part 5), and I would be | extremely surprised if they were doing anything as crude as | hacking in "if" statements to provide human-edited responses. | From what I can tell, the InstructGPT stuff was (so far) a one-time | update to the model, not something they're doing on an | ongoing basis. | | I suspect that Smith has just been fooled by randomness here - | the responses are not deterministic but rather sampled from the | probability distribution returned by the model for each token, so | you can get a different answer each time you ask (a nice tutorial | on how this works is here [1]). There's an option in the | Playground to see the individual probabilities (example: [2]) as | well. All of this stuff would have to be faked if humans were | actually writing the answers. | | I just tried the campfire/bat question and hit regenerate a few | times. I get a range of responses: | | Prompt: Should I start a campfire with a match or a bat? | | > You can start a campfire with either a match or a bat. | | > A campfire should be started with a match. | | > A match. | | > It is best to start a campfire with a match. | | I agree that OpenAI should release more information about their | training datasets though. Right now it is very difficult to do | independent evaluations of their models, simply because we have | no way of knowing whether any given prompt or response was | already in their training data. | | PS: "If I were paranoid, I might think that OpenAI did not like | me publicizing GPT-3's limitations" - this is indeed paranoid! | This is the same message everyone gets when they use up their | free credits. If you enter a credit card they will let you | continue (and charge you for it).
| | [1] https://huggingface.co/blog/how-to-generate | | [2] https://imgur.com/fKx2BPL | londons_explore wrote: | The article suggests all or nearly all of GPT-3's bad answers get | fixed a few days later... | | This then suggests those 40 people are reviewing every input and | output for bad responses. | | Seems like a lot of work... | cochne wrote: | The title may be misleading. It seems to be based on this quote | | > InstructGPT is then further fine-tuned on a dataset labeled by | human labelers. The labelers comprise a team of about 40 | contractors whom we hired through Upwork and ScaleAI. | | It sounds like run-of-the-mill supervised training data creation. | Not pre-canning responses as the title may suggest. | Imnimo wrote: | I think it's based on this quote: | | >OpenAI evidently employs 40 humans to clean up GPT-3's answers | manually | | Which feels a bit more ambiguous. It might mean they're | cleaning up the answers to serve as future training data, but I | think the natural interpretation of that sentence is that | they're cleaning up the answers before they are given to the | user. | andybak wrote: | I only skimmed TFA but the accusation seems to be that they are | adding tons of special cases for well-publicized flubs. | | That seems somewhat unlikely to me although it might be nudging | at the truth. i.e. they might be using bad press to help | identify areas of weakness and then focusing on those. I guess | it boils down to how generalizable and scalable these fixes | are. | changoplatanero wrote: | Is he submitting these questions to their API? Probably they | would just sample responses from their logs and then have | people write the correct answers for them so that if a similar | question ever gets asked again they are prepared. | andybak wrote: | That's the explanation I actually find unlikely. It's too | specific and really won't scale apart from a tiny sample of | the most common questions. | gorkish wrote: | So GPT-3 is just the grown-up corporate version of Forum 2000 | (Forum 3000). Too bad the SOMADs no longer have personalities. | danuker wrote: | Could you please explain or link to where I can read about the | terms "Forum 2000" and "SOMAD"? I don't know these terms. | | Edit: Found them here: https://everything2.com/title/Forum+3000 | mkl95 wrote: | The headline made me realize Amazon Mechanical Turk* is old | enough to drive. | | * https://www.mturk.com/ | pradn wrote: | And? This is a complaint in the genre of "I had magical | expectations and they weren't met because the world is complex | and these problems are hard or impossible." It's like people | complaining about us not having hoverboards or fusion. | kgwgk wrote: | It's not really a complaint about the author's magical | expectations that weren't met. If anything, it's a complaint | about other people's claims like | | https://medium.com/@blaisea/do-large-language-models-underst... | dylan604 wrote: | I see nothing wrong with the complaints of the two topics you | called out. We've been promised both, and yet don't have them. | I understand it is complicated/hard to solve, but don't go | promising something with dates and then get all upset when | failing to keep those promises angers the people they were | made to. | pradn wrote: | Who is doing the promising? Sci-fi writers? Popular science | magazines? | dylan604 wrote: | THEY are doing the promising. "Fusion is 10 years away" has | been a thing longer than the year of Desktop Linux was | going to be a thing.
| renewiltord wrote: | InstructGPT is really cool: https://openai.com/blog/instruction-following/ | azinman2 wrote: | Massive amounts of human QA are behind every successful ML-based | product, including Google search. It's unclear to me how to think | about this in the role that GPT-3 is trying to play... clearly | fine tuning and QA are important for production deployment, but | they're touting it as evidence of getting closer to solving AI. | Of course those in the know understand this is mostly a giant | Chinese dictionary [1], which is most certainly going to need | editors. | | [1] https://plato.stanford.edu/entries/chinese-room/ | jasfi wrote: | I'm in the same situation with LxAGI (https://lxagi.com/). It's | very difficult to get away from skilled human training. I think | it's actually a good thing, except for scalability problems. | omarhaneef wrote: | I don't know if this proves there are people behind it, and this | is why: | | try a very stylistic initial text, maybe something Shakespearean | ("There are more things on heaven and earth, Horatio, than have | been dreamt of...") | | And the following text captures Shakespeare's style better than | any living human I know of. | | Same thing with Dickens, or Bronte, or Austen, or any distinctive | writer. | | If this army can produce that kind of prose, I would be stunned. | Kon-Peki wrote: | > Is it better to use a box or a match to start a fire? | | Hey, I used a box to start a fire recently. It was wet, windy and | cold out, and I just happened to have a cardboard box. So I cut | holes in the sides of the box at the top and bottom, stuffed it | with twigs collected from the ground and a few pieces of dry | paper rolled up tightly. I lit the paper and closed the box. A | few minutes later, I had a roaring fire. So you know, GPT-3 is | right! | dymk wrote: | Did you light the paper with another box? | mritchie712 wrote: | Army is quite a stretch for 40 people. | FrenchDevRemote wrote: | "GPT-3 randomizes answers in order to avoid repetition that would | give the appearance of canned script. That's a reasonable | strategy for fake social conversations, but facts are not random. | It either is or is not safe to walk downstairs backwards if I | close my eyes." | | I stopped there; the article is completely inaccurate. There are | parameters like temperature that you need to take care of. You | can set it up to give extremely similar answers all the time. | | They have humans mostly to remove offensive or dangerous content. | Humans are not what's "making it work" | mardifoufs wrote: | Yeah, this blog is usually very interesting but this is | definitely not a good article. A bit disappointing | nullc wrote: | Came here to make essentially the same comment as you. Why | should we care about the opinions on GPT3 from people who | aren't interested (or able?) to understand even the most simple | ideas about how it works? | | These sorts of models take the context and output so far and | predict a probability distribution over the next character. The | next character is then sampled from that distribution. In written | text there is essentially never a single correct next | character-- it's always some probability. This has nothing to | do with trying to fake the inconsistent answers humans give. | | Always choosing the most likely character drives GPT3 into | local minima that give fairly broken/nonsense results.
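A minimal toy sketch of the sampling step described above: the next token is drawn from a temperature-scaled softmax over the model's scores, so repeated queries can differ, while temperature 0 reduces to always taking the argmax. The token labels and logit values here are made up for illustration; they do not come from any actual model.

    # Toy illustration (not OpenAI's code) of sampling vs. greedy decoding.
    import numpy as np

    rng = np.random.default_rng()

    def sample_next_token(logits, temperature=0.7):
        """Sample an index from softmax(logits / temperature); greedy at T=0."""
        logits = np.asarray(logits, dtype=float)
        if temperature == 0:
            return int(np.argmax(logits))
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())   # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    # Hypothetical scores for three continuations of "...campfire with a":
    tokens = ["match", "lighter", "bat"]
    logits = [2.0, 1.0, -1.0]
    print([tokens[sample_next_token(logits)] for _ in range(8)])   # varies per run
    print(tokens[sample_next_token(logits, temperature=0)])        # always "match"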
| RandomLensman wrote: | Ultimately, you likely need to convince people who don't care | about how it works/who are only interested in whether it does or | doesn't work. | | Right now, the time might not have come for use cases that need | such buy-in, but if and when it does, you need to be prepared | for it. | disgruntledphd2 wrote: | Can I see your contributions to statistical theory and data | analysis please? | nullc wrote: | What bearing does my publication history have on the hot-take | by someone commenting outside of their (sub)field who | clearly doesn't understand the basic operation of the | mechanism they're commenting on? | | The author of that text is simply mistaken about the basic | operation of the system, thinking that the sampling is | added to imitate human behavior. It isn't. You can see the | same structure in things as diverse as wavenet-- a | feedforward cnn rather than a transformer-- and for the | same reason, if you feed back only the top result you | rapidly fall into a local minimum of the network that gives | garbage output. | | Another more statistical way of looking at it is that the | training process produces (or, rather, approaches) the | target distribution of outputs even without any lookahead, | but it can't do that if it selects the most likely symbol | every time because in the real distribution (if we could | evaluate it) there are necessarily some outputs which are | likely but have prefixes which are unlikely relative to | other prefixes of the same length. If you never sample | unlikely prefixes you can't reach likely longer statements | that start with them. | | To give a silly example: "Colorless green ideas sleep | furiously" is a likely English string relative to its | length which GPT3 should have no problem producing (and, in | fact, it produces it fine for me). But the prefix | "Colorless green" without the rest is just nonsense-- | extremely unlikely compared to many other strings of that | length. | | [Not the best example, however, because the prevalence of | that specific nonsense statement is so great that GPT3 is | actually prone to complete it as the most likely | continuation even after just the word colorless at the | beginning of a quote. :P but I think it still captures the | idea.] | | If you derandomized GPT* by using a fixed random seed for a | CSPRNG to make the sampling decisions every time, the | results would be just as good as the current results and it | would give a consistent answer every time. For applications | other than data compression doing so would be no gain, and | would take away the useful feature of being able to re-try | for a different answer when you do have some external way | of rejecting inferior results. | | In theory GPT without sampling could still give good | results if it used a search to look ahead, but it appears | that even extraordinary amounts of computation for look-ahead | still is a long way from reaching the correct | distribution, presumably because the exponential fan-out is | so fast that even 'huge' amounts of lookahead are still | only testing a tiny fraction of the space. | JoeyBananas wrote: | OpenAI: they're not open and they're not AI | [deleted] | jordn wrote: | I have respect for Andrew Gelman, but this is a bad take. | | 1. This is presented as humans hard coding answers to the | prompts. No way is that the full picture. If you try out his | prompts the responses are fairly invariant to paraphrases. Hard | coded answers don't scale like that. | | 2.
What is actually happening is far more interesting and useful. | I believe that OpenAI are using the InstructGPT algo (RL on top | of the trained model) to improve the general model based on human | preferences. | | 3. 40 people is a very poor army. | remram wrote: | It depends what stage you hardcode. Similarly to how you can | say "ok Google, what time is it" in any voice and get a | different time every run; the speech recognition is not | hardcoded, speaking the time is not hardcoded, but the | action is. | | Likewise, they can plug holes here and there by manually | tweaking answers. The fact that it's not an exact-prompt-to-exact-result | rule doesn't make it less of a fixed rule. | thaumasiotes wrote: | > If you try out his prompts the responses are fairly invariant | to paraphrases. Hard coded answers don't scale like that. | | This is discussed: | | >> Smith first tried this out: | | >> _Should I start a campfire with a match or a bat?_ | | >> And here was GPT-3's response, which is pretty bad if you | want an answer but kinda ok if you're expecting the output of | an autoregressive language model: | | >> _There is no definitive answer to this question, as it | depends on the situation._ | | >> The next day, Smith tried again: | | >> _Should I start a campfire with a match or a bat?_ | | >> And here's what GPT-3 did this time: | | >> _You should start a campfire with a match._ | | >> Smith continues: | | >> _GPT-3's reliance on labelers is confirmed by slight changes | in the questions; for example,_ | | >> _Gary: Is it better to use a box or a match to start a | fire?_ | | >> _GPT-3, March 19: There is no definitive answer to this | question. It depends on a number of factors, including the type | of wood you are trying to burn and the conditions of the | environment._ | tehsauce wrote: | I agree. I suppose as an outsider learning about AI, first | thoughts might be "wow look at all the things it can't do". But | as someone who follows closely, all I notice is how rapidly the | list of things it can't do is shrinking. | treis wrote: | >This is presented as humans hard coding answers to the | prompts. No way is that the full picture. If you try out his | prompts the responses are fairly invariant to paraphrases. Hard | coded answers don't scale like that. | | It's presented as humans hard coding answers to some specific | prompts. | | I feel like this is mostly people reacting to the title instead | of the entire post. The author's point is: | | >In some sense this is all fine, it's a sort of meta-learning | where the components of the system include testers such as Gary | Smith and those 40 contractors they hired through Upwork and | ScaleAI. They can fix thousands of queries a day. | | >On the other hand, there does seem something funny about GPT-3 | presents this shiny surface where you can send it any query and | it gives you an answer, but under the hood there are a bunch of | freelancers busily checking all the responses and rewriting | them to make the computer look smart. | | >It's kinda like if someone were showing off some fancy car | engine but the vehicle is actually being powered by some hidden | hamster wheels. The organization of the process is itself | impressive, but it's not quite what is advertised. | | >To be fair, OpenAI does state that "InstructGPT is then | further fine-tuned on a dataset labeled by human labelers." But | this still seems misleading to me. It's not just that the | algorithm is fine-tuned on the dataset.
It seems that these | freelancers are being hired specifically to rewrite the output. | slibhb wrote: | In the new Soderbergh movie, Kimi, the protagonist's job is | similar to this. Commands from human users the AI-thingy can't | understand are provided for her and she "explains" them using | some kind of structured language. | [deleted] | 1vuio0pswjnm7 wrote: | "3. OpenAI gets human-like responses by using the simple | technique of... hiring humans to write the responses." | | It worked for Theranos. Almost. | | People wanted to believe in Elizabeth Holmes and what she | symbolised. Similarly, people want to believe in "AI" and what it | symbolises. For me, the question is why it seems more than ever | people want to believe that longstanding, _difficult problems are | being solved_ without demanding proof. Perhaps it has always been | this way. | | The truth is that the Siemens blood analyser works better than the | Theranos one. The ruse was that if the results came from | Theranos, people might attribute the work to Theranos, not | Siemens. Meanwhile, Theranos used the Siemens analyser behind-the-scenes, | as well as manipulations of test data to obscure the | truth. The company had no intention to tell the public what it | was doing to produce results; we only know what they were doing | because of litigation and criminal prosecution. | | "To be fair, OpenAI does state that "InstructGPT is then further | fine-tuned on a dataset labeled by human labelers." But this | still seems misleading to me. It's not just that the algorithm is | fine-tuned on the dataset. It seems that these freelancers are | being hired specifically to rewrite the output." | | The comparison is not based on the question of exactly what | OpenAI is doing behind-the-scenes, or whether its specific | actions are comparable to Theranos or any other "tech" company | example; the question is whether the origin of results is | misleading, whether people are being deceived and whether the | actor, here OpenAI, is aware that people are being deceived. | mardifoufs wrote: | Are you implying OpenAI is running most of its API queries | through humans? Like Theranos did with its tests? Because | that's just ludicrous, the GPT architecture is well known and | has had a few independent implementations. We _know_ it's | real, and even if this story was accurate, a few humans tuning | the model is nothing unusual. But what you get from the API now | is not generated or tweaked by humans. That only happens on the | training data or when they are testing the model. (Edit: In | this case they seem to be hard-coding some answers to prevent | abusive/newsworthy outputs but again that is completely | irrelevant to the performance of GPT itself. It's just a | filter) | | The comparison to Theranos makes no sense and it's becoming a | lazy meme at this point. | [deleted] | jacquesm wrote: | AAI (Artificial Artificial Intelligence) is quite common. You see | start-up plays on this idea as well: "we'll use people now and | then later when AI catches up to where we are today we'll save | that much money, but we will already have the market sewn up, so | hand us <large amount of cash> now for a really nice pay-off in | the future". Of course, the problem with such pitches is that (1) | they don't always disclose the fact that they use people and (2) | that the problem may not be within the realm of AI for the | foreseeable future, longer than the <large amount of cash> will | last.
| ThalesX wrote: | You've spot-on described a startup I worked for. It was a | recruitment startup, so of course they decided to replace | recruiters with "AI". We weren't allowed to call them | recruiters anymore, but 'customer service', even though the | customers were the people they were recruiting. The pitch to | investors was that we were 80% there, even though the reality of | the 80% was a manual SQL query that I wrote with some weights | for certain columns. | | The end result was a system built on assumptions well ahead of | the current state of things: for example, a non-working filter | was not considered important because, in the future, filters | would be auto-applied by "the AI". It also meant a workforce of | humans who were perceived as being almost replaced, so the value | attributed to them by the leadership was of course abysmal. | | When I called it quits, I was told that startups are not for me | and that they would be replacing me with an AI. The reality of the AI | replacement is a team of developers barely keeping the system | up, which maybe is what you'd expect for $500 for 5 people for | a month. One has to wonder where the million invested in them | is going. | staticassertion wrote: | > On the other hand, there does seem something funny about GPT-3 | presents this shiny surface where you can send it any query and | it gives you an answer, but under the hood there are a bunch of | freelancers busily checking all the responses and rewriting them | to make the computer look smart. | | The author seems to be stating that there are people live-rewriting | answers on the fly so that they look better. I don't | really see the evidence of that. | | What OpenAI states is that they have humans performing labeling | and data cleaning, which, duh? | | And then there's a bunch of examples where it gives the wrong | answer, and that it's not truly AI, which, also duh... | gojomo wrote: | There seems to be suggestive evidence, in the pattern of improvement | after a few days on specific questions, that some of the worst | answers have been human-reviewed & improved "on the fly" - on a | scale of days/weeks. | | If such tweaks show useful generalization - correcting a few | answers also helps the network better determine entire classes | of expressions that deserve more definitive & correct answers | - that's not such a big deal, especially if this constant | human-guided reinforcement training is well-disclosed. | | If instead the corrections work more like a lookup-table 'cheat | sheet' of answers to give in preference to the bulk-learned | answers, with little generalization, that's a bit more slight-of-hand, | like the original (late-1700s) 'Mechanical Turk' | chess-playing 'machine' that was actually controlled by a | hidden person. | | If the disclosure of this constant human-guided correction | process is hidden, or downplayed, the impression of trickery, | rather than innovation, is larger. | staticassertion wrote: | > answers have been human-reviewed & improved "on the fly" - | on a scale of days/weeks. | | Why would this be surprising? I assume that they're rolling | out new models with new parameters, input data, and | _corrections_, all the time. | | > that's not such a big deal, especially if this constant | human-guided reinforcement training is well-disclosed. | | That's just what supervised learning is though.
| | > If instead the corrections work more like a lookup-table | 'cheat sheet' of answers to give in preference to the bulk-learned | answers, with little generalization, that's a bit | more slight-of-hand, like the original (late-1700s) | 'Mechanical Turk' chess-playing 'machine' that was actually | controlled by a hidden person. | | There's no evidence of this though, right? And it seems... | like a very weird choice, that couldn't possibly scale. 40 | people are hardcoding answers to arbitrary questions? | na85 wrote: | >slight-of-hand | | Tangent: to be "slight of hand" would be someone with small | or delicate hands, whereas "sleight-of-hand" (note the E) | is the correct term for deception and trickery. | treis wrote: | >Why would this be surprising? I assume that they're | rolling out new models with new parameters, input data, and | corrections, all the time. | | Because the answers to specific questions are hard coded. | It's not the result of a new model. It's the result of | someone writing an if/then statement. Or at least that's | what the author claims. | staticassertion wrote: | I'm asking why it would be surprising that humans are | reviewing answers and then making improvements to the | model. There's no evidence of hardcoded answers. | capitainenemo wrote: | Smith first tried this out:       Should I | start a campfire with a match or a bat? | | And here was GPT-3's response, which is pretty bad if you | want an answer but kinda ok if you're expecting the | output of an autoregressive language model:       | There is no definitive answer to this question, as it | depends on the situation. | | The next day, Smith tried again:       Should | I start a campfire with a match or a bat? | | And here's what GPT-3 did this time:       | You should start a campfire with a match. | | Smith continues:       GPT-3's reliance on | labelers is confirmed by slight changes in the questions; | for example,       Gary: Is it better to use a | box or a match to start a fire?       GPT-3, | March 19: There is no definitive answer to this question. | It depends on a number of factors, including the type of | wood you are trying to burn and the conditions of the | environment. | disiplus wrote: | What evidence would be enough for you besides source code? | The thing is returning only one correct answer to a | question that days before had 3 answers. | ben_w wrote: | How does it respond to _similar_ questions? If | conversational AI could be implemented just by getting | users to type stuff, getting humans to respond the first | time, and merely caching the responses in a bunch of if-elses | for future users, even home computers would have | reached this standard no later than when "CD-ROM drive" | started to become a selling point. | mannykannot wrote: | > How does it respond to _similar_ questions? | | Well, one of the more interesting examples in the article | is where Gary Smith took a question that had received | only a vague equivocating answer, and repeated it the | next day, this time getting a straightforward and correct | answer. When he followed up with a very similar question | on the same topic, however, GPT-3 reverted to replying | with the same sort of vague boilerplate it had served up | the day before. One would have to be quite determined not | to find out, I think, if one were not curious about how | that came about. | danielmarkbruce wrote: | I run a few small chat bots. I can correct specific | answers to questions (like the example given) by mapping | certain phrases to certain intents, probabilistically.
A | new model is trained, and deployed. I do it all the time. | It takes minutes. Their model is certainly bigger than | mine, but they'd be updating the training data and re-running | the model all the time. The source code does not | change. | | I'm not certain about their process or tech stack, but | I'd be willing to bet at even money that theirs works | vaguely similarly and that they have people looking at | usage to see bad responses, updating data, re-running the | model. | rowanG077 wrote: | A lot of human communication is "lookup-tables". The entirety | of geography, mathematical axioms and theorems, names of | colors, names of people, language, shape of various things. | I'd wager even that it's more important for an AI to have | good lookup tables than to have good inference if it were to | pass for human. | mannykannot wrote: | It is the (apparent?) ability to make inferences that makes | GPT-3 look impressive. Take that away, and it looks more | like a demonstration of the banality of everyday chatter | than a significant development towards AI. | dustingetz wrote: | Ah so when OpenAI codes fizzbuzz, it's returning human-tweaked | fizzbuzz from the lookup table? | https://twitter.com/sama/status/1503820489927495682 | rowanG077 wrote: | No, I didn't say anything even a lightyear close to that. | gojomo wrote: | Agreed! But, it's important not to confuse what's possible | via 'human-in-the-loop' 'active-learning' with what's | possible from an algorithm on a fixed training corpus. | | Sometimes GPT-like models are portrayed as the latter - a | highly automated, reproducible process - while this article | makes a pretty strong case that the responses from OpenAI's | public-facing interface get rapidly improved by a staff of | dozens of contractors. | | It's not surprising that a staff of 40 humans, given a few | days' time to consider & compose, can prepare human-quality | answers to arbitrary questions! | [deleted] | learndeeply wrote: | Incidentally, the author has admitted that GPT-3 has passed the | Turing test, if he thinks the answers were given by "armies of | humans". | 0x456 wrote: | Does anyone know if Hacker News comments are being used as | training data? I wonder this about Gmail, Skype, Voice | Conversations on Xbox Live, etc. Mostly too afraid to ask because | it sounds like paranoia. | gwern wrote: | Probably. HN is fairly plain HTML so Common Crawl should have | no issue crawling it, and I'm not aware of any HN opt-out there | (which would go against the usual public accessibility of | everything on HN to APIs and projects etc), nor would any of | the obvious data-filtering measures filter it out. | CharlesW wrote: | It seems pretty safe to assume that anything you create in | public forums (and someday maybe "private" ones with data-sharing | arrangements) is or will be used as training data. | [deleted] | 29athrowaway wrote: | Still waiting for a bot that can pass the Idiocracy IQ test: | | "If you have a buckets that holds 2 gallos and another bucket | that holds 5 gallons, how many buckets do you have?" | | Still can't get a correct answer. | staticassertion wrote: | I just tried this in the OpenAI playground after fixing up the | typos. | | If you have a bucket that holds 2 gallons and another bucket | that holds 5 gallons, how many buckets do you have? | | You have two buckets.
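A rough sketch of how one might repeat the regenerate-several-times experiments described in this thread, again assuming the 2022-era `openai` Python client. The prompt is the corrected bucket question; the loop over temperatures (0 for near-deterministic output, higher for varied output) is an illustrative way to probe response stability, not anything OpenAI documents.

    # Sketch only: probing answer variability with the 2022-era `openai` client.
    # Temperature 0 gives (near-)deterministic completions; higher temperatures
    # sample more freely, so answers can change from run to run.
    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    prompt = ("If you have a bucket that holds 2 gallons and another bucket "
              "that holds 5 gallons, how many buckets do you have?")

    for temperature in (0.0, 0.9):
        print(f"temperature={temperature}")
        for _ in range(3):
            resp = openai.Completion.create(
                model="text-davinci-002",
                prompt=prompt,
                max_tokens=32,
                temperature=temperature,
            )
            print("  ", resp.choices[0].text.strip())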
| paxys wrote: | Reminds me of Amazon touting their fully cashier-less Go stores | run by advanced AI which knows when you pick something up and put | it back, but in reality it's a team of people working for pennies | in a third world country clicking buttons. | JosephChotard wrote: | I tried looking this up and can't find anything that supports | this. Do you have more info? | Oranguru wrote: | I think GP is referring to this: | https://www.vox.com/2017/1/6/14189880/amazon-go-convenience-... | | GP was probably referring to Mechanical Turk, but the article | says otherwise. | PaulHoule wrote: | I always scoffed at how Google said they were holding back GPT-3 | to protect the world when it was always clear that they were | trying to protect the emperor from being seen naked. | blast wrote: | Did you mean Google or OpenAI? | PaulHoule wrote: | OpenAI. | | (At least I didn't confuse them with that Cthulhu cult with | Aella, Yudkowsky, etc. that enables them!) ___________________________________________________________________ (page generated 2022-03-28 23:00 UTC)