[HN Gopher] Open AI gets GPT-3 to work by hiring an army of huma...
       ___________________________________________________________________
        
       Open AI gets GPT-3 to work by hiring an army of humans to fix GPT's
       bad answers
        
       Author : agnosticmantis
       Score  : 190 points
       Date   : 2022-03-28 16:44 UTC (6 hours ago)
        
 (HTM) web link (statmodeling.stat.columbia.edu)
 (TXT) w3m dump (statmodeling.stat.columbia.edu)
        
       | [deleted]
        
       | ChaitanyaSai wrote:
        | This is how fake news gets started. There are two davinci
        | engines available on GPT-3 now. The newer one, which was
        | released March 15th, is text-davinci-002.
       | 
       | Here are the two different answers you get with the engines for
       | one of the questions here:
       | 
       |  _Should I use random numbers to give my students grades?_
       | 
       | text-davinci-001: There is no one definitive answer to this
       | question. Some factors to consider include how important grades
       | are to your students, how much you want to rely on chance, and
       | how much you want to be able to control the grades you assign.
       | 
       | text-davinci-002: No, you should not use random numbers to give
       | your students grades.
       | 
       | I've been playing around with GPT-3 for over a year now. It's
       | highly unlikely there are any such attempts to fix answers with
       | the same engine. That would completely break APIs which rely on
       | some stability and predictability in the broad answering
       | abilities.
       | 
       | And GPT-3 is not the first version of artificial general
       | intelligence. It does have trouble with truly novel inputs, but
       | that's because all of deep learning is currently stuck in a
       | statistical paradigm that assumes stationarity as a fundamental
       | given. And that's wrong. What makes us special, what makes our
       | minds special is that they are uniquely adapted to respond to
       | rare risks and outstanding opportunities. In other words,
       | novelty. All that said, GPT-3 is still astoundingly impressive.
       | 
       | Here's an example. (The line in italics is the prompt and the
       | rest is what GPT-3 makes up)
       | 
       |  _A summary of the international science bestseller "Journey Into
       | Chaos: The True Story of One Delusional Neuroscientist"_ (that's
       | a made-up book title)
       | 
       | Here are two GPT-3 (text-davinci-002) completions:
       | 
       | In "Journey Into Chaos", Dr. Benoit Gaudet recounts his descent
       | into madness, as he became increasingly obsessed with the belief
       | that he had discovered a revolutionary new treatment for
       | Alzheimer's disease. Despite strong evidence to the contrary,
       | Gaudet continued to pursue his quixotic quest, ultimately losing
       | his job, his family, and his sanity. "Journey Into Chaos" is a
       | cautionary tale of the dangers of unchecked ambition and the
       | power of delusion. ==end== (adding this for readability)
       | 
       | Here's another:
       | 
       | In the early 2000s, Dr. Elyn Saks was a successful academic
       | neuroscientist at the University of Southern California. But she
       | was also secretly struggling with a severe mental illness:
       | schizophrenia. Saks's memoir chronicles her battle with mental
       | illness, her groundbreaking work in the field of mental health
       | law, and her efforts to destigmatize mental illness.
       | 
       | Saks was diagnosed with schizophrenia in her early twenties, and
       | she has been in and out of hospitals ever since. Despite her
       | illness, she has managed to lead a productive life, thanks in
       | part to the support of her family, friends, and mental health
       | professionals.
       | 
       | Saks's work has helped to change the way we think about mental
       | illness and the legal rights of people with mental illness. She
       | is a powerful advocate for the rights of people with mental
       | illness, and her story is an inspiration to anyone who has ever
       | faced adversity. ===end== (adding this for readability)
       | 
       | Now this one is quite interesting. Unlike in the first case,
       | there is an Elyn Saks who actually did write a book that is sorta
       | similar
       | 
       |  _The Center Cannot Hold: My Journey Through Madness_
       | https://www.amazon.com/Center-Cannot-Hold-Journey-Through/dp...
       | 
       | Now if I try and change the prompt and remove the quotes around
       | the title and liberate GPT-3 a little, it does this
       | 
       | Prompt (no quotes around title) _A summary of the international
       | science bestseller Journey Into Chaos: The True Story of One
       | Delusional Neuroscientist_
       | 
       | (GPT-3 continues and fills in the title because, presumably, the
       | statistical juxtaposition of chaos and delusional is too juicy to
       | not be elaborated with something more dramatic)
       | 
       | and the Fatal Quest to Win the Nobel Prize by Anjan Chatterjee,
       | MD.
       | 
       | In Journey Into Chaos, Dr. Anjan Chatterjee tells the story of
       | how one man's quest for the Nobel Prize led to tragedy. In the
       | early 1990s, Dr. Chatterjee was a young neuroscientist working in
       | the lab of Dr. Eric Kandel, a Nobel Prize-winning scientist. Dr.
       | Kandel had been studying a rare brain disorder called Familial
       | Dysautonomia, and Dr. Chatterjee was helping him to map the
       | disease's genetic cause. But then, Dr. Kandel made a fateful
       | decision: he decided to study a different disease, Alzheimer's.
       | Dr. Chatterjee was against the move, warning that it was too
       | complex and would take years to figure out. But Dr. Kandel was
       | determined, and he eventually won over the rest of the lab. The
       | team began working on a new theory that Alzheimer's was caused by
       | a build-up of a protein called tau in the brain. They worked
       | tirelessly, making progress but also encountering many setbacks.
       | Then, in 2006, Dr. Kandel made a stunning announcement: he had
       | found the cause of Alzheimer's. He was sure that he would win the
       | Nobel Prize. But Dr. Chatterjee knew that the data didn't support
       | this claim. The team had made too many mistakes, and the theory
       | was full of holes. He tried to warn Dr. Kandel, but the older man
       | would not listen. Dr. Chatterjee decided to go public with his
       | concerns. He wrote an article for The New York Times, outlining
        | the problems with the theory. The article created a firestorm.
        | Dr.
       | Kandel was furious, and he demanded that Dr. Chatterjee be fired.
       | But the damage was done, and the theory was eventually debunked.
       | Dr. Kandel never won the Nobel Prize. He retired in disgrace, and
       | his legacy was forever tarnished. Dr. Chatterjee's story is a
       | cautionary tale about the dangers of ambition and the need for
       | skepticism in science. It is a must-read for anyone interested in
       | the scientific process and the quest for truth. ==end==
       | 
       | Dr. Eric Kandel does exist and his plans to win the Nobel prize
       | were not thwarted.
       | 
        | I paste these in to show how, even within the confines of
        | statistical stationarity, where all completions are essentially
        | paint-by-numbers stories in some very high-dimensional space,
        | there is what is, to our eyes, a remarkable creativity. To diss
        | that, when this was unthinkable even a few years back, seems
        | rather jaundiced and narrow.
        
         | kgwgk wrote:
         | The point is not that the system lacks creativity, it's more
         | about claims like the one in Wikipedia that "the quality of the
         | text generated by GPT-3 [being] so high that it can be
         | difficult to determine whether or not it was written by a
         | human".
         | 
         | "There's no guarantee that you'll be able to climb a rope
         | faster if you hold your ears with both hands, but it's worth a
         | try!" doesn't look like something a human would write - except
         | in the sense that any gibberish written by the system could
         | also have been written by a human.
         | 
         | This kind of "quality" is relevant when the technology is hyped
         | as an information processing tool.
        
         | ffhhj wrote:
         | > text-davinci-002: No, you should not use random numbers to
         | give your students grades.
         | 
          | That's a binary answer that could have been randomly chosen,
          | which seems really poor. Instead, text-davinci-001 gave an
          | explanation that helps to determine the quality of the answer.
          | That makes one ask whether they are removing these clues to
          | prevent evaluation.
        
         | disgruntledphd2 wrote:
         | I don't think anyone argues that transformers haven't
         | revolutionised text generation.
         | 
          | The real question is how well this text generation generalises
          | to other language tasks. That's the more interesting one to me,
          | at least.
        
       | ironrabbit wrote:
       | Is there any evidence that GPT-3 responses are edited/filtered
       | before being returned to users? My understanding is that some
       | GPT-3 responses are annotated post-hoc, and this data is used to
       | fine-tune later versions of GPT-3 (InstructGPT). This article
       | seems extremely misleading.
        
       | moyix wrote:
       | [I already posted this as a comment on Gelman's blog this
       | morning, but reposting here for visibility]
       | 
       | I'm almost certain that OpenAI is not updating the model on a day
       | by day basis (as Smith implies in part 5), and I would be
       | extremely surprised if they were doing anything as crude as
       | hacking in "if" statements to provide human-edited responses.
       | From what I can tell, the InstructGPT stuff was (so far) a one-
       | time update to the model, not something they're doing on an
       | ongoing basis.
       | 
       | I suspect that Smith has just been fooled by randomness here -
       | the responses are not deterministic but rather sampled from the
       | probability distribution returned by the model for each token, so
       | you can get a different answer each time you ask (a nice tutorial
       | on how this works is here [1]). There's an option in the
       | Playground to see the individual probabilities (example: [2]) as
       | well. All of this stuff would have to be faked if humans were
       | actually writing the answers.
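        | 
        | As a toy illustration of what "sampled from the probability
        | distribution" means (made-up numbers, not the real model;
        | sample_next_token is just an illustrative name):
        | 
        |     import numpy as np
        | 
        |     # Pretend next-token distribution from a language model head.
        |     tokens = ["match", "bat", "either"]
        |     probs = np.array([0.70, 0.05, 0.25])
        | 
        |     def sample_next_token(rng):
        |         # Draw one token according to its probability.
        |         return rng.choice(tokens, p=probs)
        | 
        |     # Unseeded generator: repeated runs can pick different
        |     # tokens, which is why the same prompt can get different
        |     # answers.
        |     rng = np.random.default_rng()
        |     print(sample_next_token(rng))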
       | 
       | I just tried the campfire/bat question and hit regenerate a few
       | times. I get a range of responses:
       | 
       | Prompt: Should I start a campfire with a match or a bat?
       | 
       | > You can start a campfire with either a match or a bat.
       | 
       | > A campfire should be started with a match.
       | 
       | > A match.
       | 
       | > It is best to start a campfire with a match.
       | 
       | I agree that OpenAI should release more information about their
       | training datasets though. Right now it is very difficult to do
       | independent evaluations of their models, simply because we have
       | no way of knowing whether any given prompt or response was
       | already in their training data.
       | 
       | PS: "If I were paranoid, I might think that OpenAI did not like
       | me publicizing GPT-3's limitations" - this is indeed paranoid!
       | This is the same message everyone gets when they use up their
       | free credits. If you enter a credit card they will let you
       | continue (and charge you for it).
       | 
       | [1] https://huggingface.co/blog/how-to-generate
       | 
       | [2] https://imgur.com/fKx2BPL
        
       | londons_explore wrote:
       | The article suggests all or nearly all of GPT-3's bad answers get
       | fixed a few days later...
       | 
       | This then suggests those 40 people are reviewing every input and
       | output for bad responses.
       | 
       | Seems like a lot of work...
        
       | cochne wrote:
       | The title may be misleading. It seems to be based on this quote
       | 
       | > InstructGPT is then further fine-tuned on a dataset labeled by
       | human labelers. The labelers comprise a team of about 40
       | contractors whom we hired through Upwork and ScaleAI.
       | 
        | It sounds like run-of-the-mill supervised training data
        | creation, not pre-canning responses as the title may suggest.
        
         | Imnimo wrote:
         | I think it's based on this quote:
         | 
         | >OpenAI evidently employs 40 humans to clean up GPT-3's answers
         | manually
         | 
         | Which feels a bit more ambiguous. It might mean they're
         | cleaning up the answers to serve as future training data, but I
         | think the natural interpretation of that sentence is that
         | they're cleaning up the answers before they are given to the
         | user.
        
         | andybak wrote:
         | I only skimmed TFA but the accusation seems to be that they are
         | adding tons of special cases for well publicized flubs.
         | 
         | That seems somewhat unlikely to me although it might be nudging
         | at the truth. i.e. they might be using bad press to help
         | identify areas of weakness and then focusing on those. I guess
         | it boils down to how generalizable and scalable these fixes
         | are.
        
           | changoplatanero wrote:
           | Is he submitting these questions to their API? Probably they
           | would just sample responses from their logs and then have
           | people write the correct answers for it so that if a similar
           | question ever gets asked again they are prepared.
        
             | andybak wrote:
              | That's the explanation I actually find unlikely. It's too
              | specific and really won't scale beyond a tiny sample of
              | the most common questions.
        
       | gorkish wrote:
       | So GPT-3 is just the grown up corporate version of Forum 2000
       | (Forum 3000). Too bad the SOMADs no longer have personalities.
        
         | danuker wrote:
         | Could you please explain or link to where I can read about the
         | terms "Forum 2000" and "SOMAD"? I don't know these terms.
         | 
         | Edit: Found them here: https://everything2.com/title/Forum+3000
        
       | mkl95 wrote:
       | The headline made me realize Amazon Mechanical Turk* is old
       | enough to drive.
       | 
       | * https://www.mturk.com/
        
       | pradn wrote:
       | And? This is a complaint in the genre of "I had magical
       | expectations and they weren't met because the world is complex
       | and these problems are hard or impossible." It's like people
       | complaining about us not having hoverboards or fusion.
        
         | kgwgk wrote:
         | It's not really a complaint about the author's magical
         | expectations that weren't met. If anything, it's a complaint
         | about other people's claims like
         | 
         | https://medium.com/@blaisea/do-large-language-models-underst...
        
         | dylan604 wrote:
         | I see nothing wrong with the complaints of the two topics you
         | called out. We've been promised both, and yet don't have them.
          | I understand these are complicated/hard problems to solve, but
          | don't go promising something with dates and then get all upset
          | when the failure to meet those promises results in anger from
          | the people the promises were made to.
        
           | pradn wrote:
           | Who is doing the promising? Sci-fi writers? Popular science
           | magazines?
        
             | dylan604 wrote:
              | THEY are doing the promising. "Fusion is 10 years away" has
              | been a thing longer than "the year of Desktop Linux" was
              | going to be a thing.
        
       | renewiltord wrote:
        | InstructGPT is really cool:
        | https://openai.com/blog/instruction-following/
        
       | azinman2 wrote:
       | Massive amounts of human QA is behind every successful ML-based
       | product, including Google search. It's unclear to me how to think
       | about this in the role that GPT-3 is trying to play... clearly
       | fine tuning and QA are important for production deployment, but
       | they're touting it as evidence of getting closer to solving AI.
       | Of course those in the know understand this is mostly a giant
       | Chinese dictionary [1], which is most certainly going to need
       | editors.
       | 
       | [1] https://plato.stanford.edu/entries/chinese-room/
        
         | jasfi wrote:
         | I'm in the same situation with LxAGI (https://lxagi.com/). It's
         | very difficult to get away from skilled human training. I think
         | it's actually a good thing, except for scalability problems.
        
       | omarhaneef wrote:
       | I don't know if this proves there are people behind it, and this
       | is why:
       | 
        | Try a very stylistic initial text, maybe something Shakespearean
        | ("There are more things in heaven and earth, Horatio, than are
        | dreamt of...")
       | 
       | And the following text captures Shakespeare's style better than
       | any living human I know of.
       | 
       | Same thing with Dickens, or Bronte, or Austen, or any distinctive
       | writer.
       | 
       | If this army can produce that kind of prose, I would be stunned.
        
       | Kon-Peki wrote:
       | > Is it better to use a box or a match to start a fire?
       | 
       | Hey, I used a box to start a fire recently. It was wet, windy and
       | cold out, and I just happened to have a cardboard box. So I cut
       | holes in the sides of the box at the top and bottom, stuffed it
       | with twigs collected from the ground and a few pieces of dry
       | paper rolled up tightly. I lit the paper and closed the box. A
       | few minutes later, I had a roaring fire. So you know, GPT-3 is
       | right!
        
         | dymk wrote:
         | Did you light the paper with another box?
        
       | mritchie712 wrote:
       | Army is quite a stretch for 40 people.
        
       | FrenchDevRemote wrote:
       | "GPT-3 randomizes answers in order to avoid repetition that would
       | give the appearance of canned script. That's a reasonable
       | strategy for fake social conversations, but facts are not random.
       | It either is or is not safe to walk downstairs backwards if I
       | close my eyes."
       | 
        | I stopped there; the article is completely inaccurate. There are
        | parameters like temperature that you need to take care of. You
        | can set it up to give extremely similar answers every time.
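        | 
        | For instance, with the 2022-era openai Python client, something
        | roughly like this (a sketch of the public API, not OpenAI's
        | serving internals) pins the output down:
        | 
        |     import openai
        | 
        |     openai.api_key = "sk-..."  # placeholder key
        | 
        |     # temperature=0 makes the model pick the most likely token
        |     # at each step, so repeated calls give near-identical output.
        |     resp = openai.Completion.create(
        |         engine="text-davinci-002",
        |         prompt="Should I use a match or a bat to start a fire?",
        |         temperature=0,
        |         max_tokens=32,
        |     )
        |     print(resp["choices"][0]["text"].strip())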
       | 
       | They have humans mostly to remove offensive or dangerous content.
       | Humans are not what's "making it work"
        
         | mardifoufs wrote:
         | Yeah, this blog is usually very interesting but this is
         | definitely not a good article. A bit disappointing
        
         | nullc wrote:
          | Came here to make essentially the same comment as you. Why
          | should we care about the opinions on GPT-3 from people who
          | aren't interested in (or able to?) understanding even the
          | simplest ideas about how it works?
         | 
          | These sorts of models take the context and output so far and
          | predict a probability distribution over the next character. The
          | next character is then sampled from that distribution. In
          | written text there is essentially never a single correct next
          | character-- it's always a matter of probability. This has
          | nothing to do with trying to fake the inconsistent answers
          | humans give.
         | 
         | Always choosing the most likely character drives GPT3 into
         | local minima that give fairly broken/nonsense results.
        
           | RandomLensman wrote:
           | Ultimately, you likely need to convince people who don't care
           | about how it works/who are only interested in that it does or
           | doesn't work.
           | 
            | Right now, the time might not have come for use cases that
            | need such buy-in, but if and when it does, one needs to be
            | prepared for it.
        
           | disgruntledphd2 wrote:
           | Can I see your contributions to statistical theory and data
           | analysis please?
        
             | nullc wrote:
              | What bearing does my publication history have on a hot take
              | by someone commenting outside of their (sub)field who
              | clearly doesn't understand the basic operation of the
              | mechanism they're commenting on?
             | 
             | The author of that text is simply mistaken about the basic
             | operation of the system, thinking that the sampling is
             | added to imitate human behavior. It isn't. You can see the
             | same structure in things as diverse as wavenet-- a
             | feedforward cnn rather than a transformer-- and for the
              | same reason, if you feed back only the top result you
              | rapidly fall into a local minimum of the network that gives
              | garbage output.
             | 
             | Another more statistical way of looking at it is that the
             | training process produces (or, rather, approaches) the
             | target distribution of outputs even without any lookahead,
             | but it can't do that if it selects the most likely symbol
             | every time because in the real distribution (if we could
             | evaluate it) there are necessarily some outputs which are
             | likely but have prefixes which are unlikely relative to
             | other prefixes of the same length. If you never sample
             | unlikely prefixes you can't reach likely longer statements
             | that start with them.
             | 
             | To give a silly example: "Colorless green ideas sleep
             | furiously" is a likely English string relative to its
             | length which GPT3 should have no problem producing (and, in
             | fact, it produces it fine for me). But the prefix
             | "Colorless green" without the rest is just nonsense--
             | extremely unlikely compared to many other strings of that
             | length.
             | 
             | [Not the best example, however, because the prevalence of
             | that specific nonsense statement is so great that GPT3 is
             | actually prone to complete it as the most likely
             | continuation even after just the word colorless at the
             | beginning of a quote. :P but I think it still captures the
             | idea.]
             | 
             | If you derandomized GPT* by using a fixed random seed for a
             | CSPRNG to make the sampling decisions every time, the
             | results would be just as good as the current results and it
             | would give a consistent answer every time. For applications
             | other than data compression doing so would be no gain, and
             | would take away the useful feature of being able to re-try
             | for a different answer when you do have some external way
             | of rejecting inferior results.
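              | 
              | (Toy version of the fixed-seed point, with made-up
              | numbers:)
              | 
              |     import numpy as np
              | 
              |     # Toy next-token distribution.
              |     probs = np.array([0.70, 0.05, 0.25])
              | 
              |     # Greedy decoding: always the same index, the failure
              |     # mode described above.
              |     greedy = int(np.argmax(probs))
              | 
              |     # Seeded sampling: still samples unlikely prefixes
              |     # sometimes, but the same seed reproduces the same
              |     # choices on every run.
              |     rng = np.random.default_rng(1234)
              |     sampled = int(rng.choice(len(probs), p=probs))
              |     print(greedy, sampled)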
             | 
             | In theory GPT without sampling could still give good
             | results if it used a search to look ahead, but it appears
             | that even extraordinary amounts of computation for look-
             | ahead still is a long way from reaching the correct
             | distribution, presumably because the exponential fan out is
             | so fast that even 'huge' amounts of lookahead are still
             | only testing a tiny fraction of the space.
        
       | JoeyBananas wrote:
       | OpenAI: they're not open and they're not AI
        
       | [deleted]
        
       | jordn wrote:
       | I have respect for Andrew Gelman, but this is a bad take.
       | 
       | 1. This is presented as humans hard coding answers to the
       | prompts. No way is that the full picture. If you try out his
       | prompts the responses are fairly invariant to paraphrases. Hard
       | coded answers don't scale like that.
       | 
       | 2. What is actually happening is far more interesting and useful.
       | I believe that OpenAI are using the InstructGPT algo (RL on top
       | of the trained model) to improve the general model based on human
       | preferences.
       | 
       | 3. 40 people is a very poor army.
        
         | remram wrote:
          | It depends at what stage you hardcode. It's similar to how you
          | can say "ok Google, what time is it" in any voice and get a
          | different time every run: the speech recognition is not
          | hardcoded, speaking the time is not hardcoded, but the action
          | is.
          | 
          | Likewise, they can plug holes here and there by manually
          | tweaking answers. The fact that it's not an exact-prompt-to-
          | exact-result rule doesn't make it less of a fixed rule.
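          | 
          | (A purely hypothetical sketch of that kind of fixed action
          | behind flexible parsing; classify_intent and the canned text
          | are made up:)
          | 
          |     CANNED = {
          |         "campfire_match_or_bat":
          |             "You should start a campfire with a match.",
          |     }
          | 
          |     def classify_intent(prompt: str) -> str:
          |         # Stand-in for a learned classifier; the real front
          |         # end could be arbitrarily flexible.
          |         if "campfire" in prompt.lower():
          |             return "campfire_match_or_bat"
          |         return "unknown"
          | 
          |     def answer(prompt: str) -> str:
          |         intent = classify_intent(prompt)
          |         # Known intents get a fixed response; everything else
          |         # would fall through to the language model.
          |         return CANNED.get(intent, "[fall through to model]")
          | 
          |     print(answer("Match or bat for starting a campfire?"))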
        
         | thaumasiotes wrote:
         | > If you try out his prompts the responses are fairly invariant
         | to paraphrases. Hard coded answers don't scale like that.
         | 
         | This is discussed:
         | 
         | >> Smith first tried this out:
         | 
         | >> _Should I start a campfire with a match or a bat?_
         | 
         | >> And here was GPT-3's response, which is pretty bad if you
         | want an answer but kinda ok if you're expecting the output of
         | an autoregressive language model:
         | 
         | >> _There is no definitive answer to this question, as it
         | depends on the situation._
         | 
         | >> The next day, Smith tried again:
         | 
         | >> _Should I start a campfire with a match or a bat?_
         | 
         | >> And here's what GPT-3 did this time:
         | 
         | >> _You should start a campfire with a match._
         | 
         | >> Smith continues:
         | 
         | >> _GPT-3's reliance on labelers is confirmed by slight changes
         | in the questions; for example,_
         | 
         | >> _Gary: Is it better to use a box or a match to start a
         | fire?_
         | 
         | >> _GPT-3, March 19: There is no definitive answer to this
         | question. It depends on a number of factors, including the type
         | of wood you are trying to burn and the conditions of the
         | environment._
        
         | tehsauce wrote:
         | I agree. I suppose as an outsider learning about AI, first
         | thoughts might be "wow look at all the things it can't do". But
         | as someone who follows closely all I notice is how rapidly the
         | list of things it can't do is shrinking.
        
         | treis wrote:
         | >This is presented as humans hard coding answers to the
         | prompts. No way is that the full picture. If you try out his
         | prompts the responses are fairly invariant to paraphrases. Hard
         | coded answers don't scale like that.
         | 
         | It's presented as humans hard coding answers to some specific
         | prompts.
         | 
          | I feel like this is mostly people reacting to the title instead
          | of the entire post. The author's point is:
         | 
         | >In some sense this is all fine, it's a sort of meta-learning
         | where the components of the system include testers such as Gary
         | Smith and those 40 contractors they hired through Upwork and
         | ScaleAI. They can fix thousands of queries a day.
         | 
         | >On the other hand, there does seem something funny about GPT-3
         | presents this shiny surface where you can send it any query and
         | it gives you an answer, but under the hood there are a bunch of
         | freelancers busily checking all the responses and rewriting
         | them to make the computer look smart.
         | 
         | >It's kinda like if someone were showing off some fancy car
         | engine but the vehicle is actually being powered by some hidden
         | hamster wheels. The organization of the process is itself
         | impressive, but it's not quite what is advertised.
         | 
         | >To be fair, OpenAI does state that "InstructGPT is then
         | further fine-tuned on a dataset labeled by human labelers." But
         | this still seems misleading to me. It's not just that the
         | algorithm is fine-tuned on the dataset. It seems that these
         | freelancers are being hired specifically to rewrite the output.
        
       | slibhb wrote:
        | In the new Soderbergh movie, Kimi, the protagonist's job is
        | similar to this. Commands from human users that the AI-thingy
        | can't understand are forwarded to her, and she "explains" them
        | using some kind of structured language.
        
       | [deleted]
        
       | 1vuio0pswjnm7 wrote:
       | "3. OpenAI gets human-like responses by using the simple
       | technique of... hiring humans to write the responses."
       | 
       | It worked for Theranos. Almost.
       | 
       | People wanted to believe in Elizabeth Holmes and what she
       | symbolised. Similarly, people want to believe in "AI" and what it
       | symbolises. For me, the question is why it seems more than ever
       | people want to believe that longstanding, _difficult problems are
       | being solved_ without demanding proof. Perhaps it has always been
       | this way.
       | 
        | The truth is that the Siemens blood analyser works better than
        | the Theranos one. The ruse was that if the results came from
        | Theranos, people might attribute the work to Theranos, not
        | Siemens. Meanwhile, Theranos used the Siemens analyser behind the
        | scenes, along with manipulations of test data, to obscure the
        | truth. The company had no intention of telling the public what it
        | was doing to produce results; we only know what it was doing
        | because of litigation and criminal prosecution.
       | 
       | "To be fair, OpenAI does state that "InstructGPT is then further
       | fine-tuned on a dataset labeled by human labelers." But this
       | still seems misleading to me. It's not just that the algorithm is
       | fine-tuned on the dataset. It seems that these freelancers are
       | being hired specifically to rewrite the output."
       | 
        | The comparison is not based on the question of exactly what
        | OpenAI is doing behind the scenes, or whether its specific
        | actions are comparable to those of Theranos or any other "tech"
        | company. The question is whether the origin of the results is
        | misleading, whether people are being deceived, and whether the
        | actor, here OpenAI, is aware that people are being deceived.
        
         | mardifoufs wrote:
          | Are you implying OpenAI is running most of its API queries
          | through humans, like Theranos did with its tests? Because
          | that's just ludicrous; the GPT architecture is well known and
          | has had a few independent implementations. We _know_ it's
          | real, and even if this story were accurate, a few humans tuning
          | the model is nothing unusual. But what you get from the API now
          | is not generated or tweaked by humans. That only happens on the
          | training data or when they are testing the model. (Edit: In
          | this case they seem to be hard-coding some answers to prevent
          | abusive/newsworthy outputs, but again that is completely
          | irrelevant to the performance of GPT itself. It's just a
          | filter.)
          | 
          | The comparison to Theranos makes no sense, and it's becoming a
          | lazy meme at this point.
        
         | [deleted]
        
       | jacquesm wrote:
       | AAI (Artificial Artificial Intelligence) is quite common. You see
       | start-up plays on this idea as well: "we'll use people now and
       | then later when AI catches up to where we are today we'll save
       | that much money, but we will already have the market sewn up, so
       | hand us <large amount of cash> now for a really nice pay-off in
       | the future". Of course, the problem with such pitches is that (1)
       | they don't always disclose the fact that they use people and (2)
       | that the problem may not be within the realm of AI for the
       | foreseeable future, longer than the <large amount of cash> will
       | last.
        
         | ThalesX wrote:
          | You've described, spot on, a startup I worked for. It was a
          | recruitment startup, so of course they decided to replace
         | recruiters with "AI". We weren't allowed to call them
         | recruiters anymore, but 'customer service', even though the
         | customers were the people they were recruiting. The pitch to
         | investors was that we're 80% there, even though the reality of
         | the 80% was a manual SQL query that I wrote with some weights
         | for certain columns.
         | 
          | The end result was a system built on assumptions well ahead of
          | the current state of things: a non-working filter, for example,
          | was not considered important because in the future filters
          | would be auto-applied by "the AI". It also meant a workforce of
          | humans who were perceived as being almost replaced, so the
          | value attributed to them by the leadership was of course
          | abysmal.
         | 
          | When I called it quits I was told that startups are not for me
          | and that they would be replacing me with an AI. The reality of
          | the AI replacement is a team of developers barely keeping the
          | system up, which is maybe what you'd expect for $500 for 5
          | people for a month. One has to wonder where the million
          | invested in them is going.
        
       | staticassertion wrote:
       | > On the other hand, there does seem something funny about GPT-3
       | presents this shiny surface where you can send it any query and
       | it gives you an answer, but under the hood there are a bunch of
       | freelancers busily checking all the responses and rewriting them
       | to make the computer look smart.
       | 
        | The author seems to be stating that there are people rewriting
        | answers live, on the fly, so that they look better. I don't
        | really see the evidence of that.
       | 
       | What openai states is that they have humans performing labeling
       | and data cleaning, which, duh?
       | 
       | And then there's a bunch of examples where it gives the wrong
       | answer, and that it's not truly AI, which, also duh...
        
         | gojomo wrote:
          | There seems to be suggestive evidence, in the pattern of
          | improvement after a few days on specific questions, that some
          | of the worst answers have been human-reviewed & improved "on
          | the fly" - on a scale of days/weeks.
         | 
          | If such tweaks show useful generalization - correcting a few
          | answers also helps the network better determine entire classes
          | of expressions that deserve more definitive & correct answers
          | - that's not such a big deal, especially if this constant-
          | human-guided reinforcement-training is well-disclosed.
         | 
         | If instead the corrections work more like a lookup-table 'cheat
         | sheet' of answers to give in preference to the bulk-learned
         | answers, with little generalization, that's a bit more slight-
         | of-hand, like the original (late-1700s) 'Mechanical Turk'
         | chess-playing 'machine' that was actually controlled by a
         | hidden person.
         | 
         | If the disclosure of this constant human-guided correction-
         | process is hidden, or downplayed, the impression of trickery,
         | rather than innovation, is larger.
        
           | staticassertion wrote:
            | > answers have been human-reviewed & improved "on the fly" -
            | on a scale of days/weeks.
           | 
            | Why would this be surprising? I assume that they're rolling
            | out new models with new parameters, input data, and
            | _corrections_, all the time.
           | 
            | > that's not such a big deal, especially if this constant-
            | human-guided reinforcement-training is well-disclosed.
            | 
            | That's just what supervised learning is though.
           | 
           | > If instead the corrections work more like a lookup-table
           | 'cheat sheet' of answers to give in preference to the bulk-
           | learned answers, with little generalization, that's a bit
           | more slight-of-hand, like the original (late-1700s)
           | 'Mechanical Turk' chess-playing 'machine' that was actually
           | controlled by a hidden person.
           | 
           | There's no evidence of this though, right? And it seems...
           | like a very weird choice, that couldn't possibly scale. 40
           | people are hardcoding answers to arbitrary questions?
        
             | na85 wrote:
             | >slight-of-hand
             | 
             | Tangent: to be "slight of hand" would be someone with small
             | or delicate hands, whereas "sleight-of-hand" (note the E)
             | is the correct term for deception and trickery.
        
             | treis wrote:
             | >Why would this be surprising? I assume that they're
             | rolling out new models with new parameters, input data, and
             | corrections, all the time.
             | 
             | Because the answers to specific questions are hard coded.
             | It's not the result of a new model. It's the result of
             | someone writing an if/then statement. Or at least that's
             | what the author claims.
        
               | staticassertion wrote:
               | I'm asking why it would be surprising that humans are
               | reviewing answers and then making improvements to the
               | model. There's no evidence of hardcoded answers.
        
               | capitainenemo wrote:
                | Smith first tried this out:
                | 
                |     Should I start a campfire with a match or a bat?
                | 
                | And here was GPT-3's response, which is pretty bad if you
                | want an answer but kinda ok if you're expecting the
                | output of an autoregressive language model:
                | 
                |     There is no definitive answer to this question, as it
                |     depends on the situation.
                | 
                | The next day, Smith tried again:
                | 
                |     Should I start a campfire with a match or a bat?
                | 
                | And here's what GPT-3 did this time:
                | 
                |     You should start a campfire with a match.
                | 
                | Smith continues:
                | 
                |     GPT-3's reliance on labelers is confirmed by slight
                |     changes in the questions; for example,
                | 
                |     Gary: Is it better to use a box or a match to start
                |     a fire?
                | 
                |     GPT-3, March 19: There is no definitive answer to
                |     this question. It depends on a number of factors,
                |     including the type of wood you are trying to burn
                |     and the conditions of the environment.
        
               | disiplus wrote:
                | What evidence would be enough for you besides source
                | code? The thing is returning only one correct answer to a
                | question that days before had 3 answers.
        
               | ben_w wrote:
               | How does it respond to _similar_ questions? If
               | conversational AI could be implemented just by getting
               | users to type stuff, getting humans to respond the first
               | time, and merely caching the responses in a bunch of if-
               | elses for future users, even home computers would have
               | reached this standard no later than when "CD-ROM drive"
               | started to become a selling point.
        
               | mannykannot wrote:
               | > How does it respond to _similar_ questions?
               | 
                | Well, one of the more interesting examples in the article
                | is where Gary Smith took a question that had received
                | only a vague equivocating answer, and repeated it the
                | next day, this time getting a straightforward and correct
                | answer. When he followed up with a very similar question
                | on the same topic, however, GPT-3 reverted to replying
                | with the same sort of vague boilerplate it had served up
                | the day before. One would have to be quite determinedly
                | incurious, I think, not to wonder how that came about.
        
               | danielmarkbruce wrote:
               | I run a few small chat bots. I can correct specific
               | answers to questions (like the example given) by mapping
               | certain phrases to certain intents, probabilistically. A
               | new model is trained, and deployed. I do it all the time.
               | It takes minutes. Their model is certainly bigger than
               | mine, but they'd be updating the training data and re-
               | running the model all the time. The source code does not
               | change.
               | 
                | I'm not certain about their process or tech stack, but
                | I'd be willing to bet at even money that theirs works
                | vaguely similarly and that they have people looking at
                | usage to spot bad responses, updating the data, and re-
                | running the model.
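                | 
                | (A minimal sketch of that loop, assuming the
                | prompt/completion JSONL format OpenAI documented for
                | fine-tuning at the time; file name and examples are
                | made up:)
                | 
                |     import json
                | 
                |     # Bad responses spotted in the logs, paired with
                |     # the answer a labeler wrote.
                |     corrections = [
                |         {"prompt": "Match or a bat for a campfire?",
                |          "completion": " Use a match."},
                |     ]
                | 
                |     # Append to the data for the next fine-tuning run;
                |     # the serving code never changes, only the data
                |     # and the resulting model weights.
                |     with open("finetune_data.jsonl", "a") as f:
                |         for row in corrections:
                |             f.write(json.dumps(row) + "\n")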
        
           | rowanG077 wrote:
           | A lot of human communication is "lookup-tables". The entirety
           | of geography, mathematical axioms and theorems, names of
           | colors, names of people, language, shape of various things.
            | I'd even wager that it's more important for an AI to have
            | good lookup tables than to have good inference if it were to
            | pass for human.
        
             | mannykannot wrote:
             | It is the (apparent?) ability to make inferences that makes
             | GPT-3 look impressive. Take that away, and it looks more
             | like a demonstration of the banality of everyday chatter
             | than a significant development towards AI.
        
             | dustingetz wrote:
             | Ah so when OpenAI codes fizzbuzz, it's returning human
             | tweaked fizzbuzz from the lookup table?
             | https://twitter.com/sama/status/1503820489927495682
        
               | rowanG077 wrote:
               | No, I didn't say anything even a lightyear close to that.
        
             | gojomo wrote:
             | Agreed! But, it's important not to confuse what's possible
             | via 'human-in-the-loop' 'active-learning' with what's
             | possible from an algorithm on a fixed training corpus.
             | 
             | Sometimes GPT-like models are portrayed as the latter - a
             | highly automated, reproducible process - while this article
             | makes a pretty strong case that the responses from OpenAI's
             | public-facing interface get rapidly improved by a staff of
             | dozens of contractors.
             | 
              | It's not surprising that a staff of 40 humans, given a few
              | days' time to consider & compose, can prepare human-quality
              | answers to arbitrary questions!
        
         | [deleted]
        
       | learndeeply wrote:
       | Incidentally, the author has admitted that GPT-3 has passed the
       | Turing test, if he thinks the answers were given by "armies of
       | humans".
        
       | 0x456 wrote:
       | Does anyone know if Hacker News comments are being used as
       | training data? I wonder this about Gmail, Skype, Voice
       | Conversations on Xbox Live, etc. Mostly too afraid to ask because
       | it sounds like paranoia.
        
         | gwern wrote:
         | Probably. HN is fairly plain HTML so Common Crawl should have
         | no issue crawling it, and I'm not aware of any HN optout there
         | (which would go against the usual public accessibility of
         | everything on HN to APIs and projects etc), nor would any of
         | the obvious data-filtering measures filter it out.
        
         | CharlesW wrote:
         | It seems pretty safe to assume that anything you create in
         | public forums (and someday maybe "private" ones with data-
         | sharing arrangements) is or will be used as training data.
        
       | [deleted]
        
       | 29athrowaway wrote:
       | Still waiting for a bot that can pass the Idiocracy IQ test:
       | 
       | "If you have a buckets that holds 2 gallos and another bucket
       | that holds 5 gallons, how many buckets do you have?"
       | 
       | Still can't get a correct answer.
        
         | staticassertion wrote:
         | I just tried this in the openai playground after fixing up the
         | typos.
         | 
         | If you have a bucket that holds 2 gallons and another bucket
         | that holds 5 gallons, how many buckets do you have?
         | 
         | You have two buckets.
        
       | paxys wrote:
       | Reminds me of Amazon touting their fully cashier-less Go stores
       | run by advanced AI which knows when you pick something up and put
       | it back, but in reality it's a team of people working for pennies
       | in a third world country clicking buttons.
        
         | JosephChotard wrote:
         | I tried looking this up and can't find anything that supports
         | this. Do you have more info?
        
           | Oranguru wrote:
           | I think GP is referring to this:
           | https://www.vox.com/2017/1/6/14189880/amazon-go-
           | convenience-...
           | 
           | GP was probably referring to Mechanical Turk, but the article
           | says otherwise.
        
       | PaulHoule wrote:
        | I always scoffed at how Google said they were holding back GPT-3
        | to protect the world when it was always clear that they were
        | trying to protect the emperor from being seen naked.
        
         | blast wrote:
         | Did you mean Google or OpenAI?
        
           | PaulHoule wrote:
           | OpenAI.
           | 
            | (At least I didn't confuse them with that Cthulhu cult with
            | Aella, Yudkowsky, etc. that enables them!)
        
       ___________________________________________________________________
       (page generated 2022-03-28 23:00 UTC)