[HN Gopher] Teaching GPT-3 to reverse words
       ___________________________________________________________________
        
       Teaching GPT-3 to reverse words
        
       Author : ascertain
       Score  : 69 points
        Date   : 2022-05-15 19:42 UTC (1 day ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | jameshart wrote:
       | Oh, I'm so looking forward to my next coding interview.
       | 
       | "Okay, could you show me on the whiteboard how you might go about
       | writing a program that can reverse a string?"
       | 
       | "Great, so I'm going to start by initializing a simple
       | transformer-based neural network with 175 billion parameters and
       | 96 attention layers, and I'm going to train it on a corpus of 45
       | terabytes of data tokenized into about 500 billion tokens..."
        
         | oneepic wrote:
         | "Cool, so what do you think would be the time complexity of
         | that? Do you think we can maybe do better than that?"
        
           | visarga wrote:
           | Not if you also want a short poem where each word starts with
           | a letter from the original word, and then a short literary
           | commentary on it.
        
           | jameshart wrote:
            | Actually it turns out it's O(n), which goes to show that
            | constant factors can be more important than you'd think
            | when you only look at raw big-O time complexity.
        
         | [deleted]
        
         | [deleted]
        
       | agluszak wrote:
       | https://nitter.net/npew/status/1525900849888866307
        
       | swid wrote:
       | It's funny to me that this kind of usage of GPT is just
       | programming with a lot of extra steps.
        
         | convolvatron wrote:
          | I was just thinking the opposite - that by choosing such a
          | tiny problem one might actually be able to develop some
          | intuition about what's going on inside that very black box.
        
           | swid wrote:
           | I meant it mostly as a joke, but there is a certain amount of
           | irony to it. This goes way beyond prompt engineering - he
           | wrote an algorithm to run on GPT in a way you would not
           | expect a non-programmer to write. I think the idea is cool
           | and the process to write it was revealing.
        
             | mberning wrote:
              | Right. What non-programmer is going to think to turn a
              | word into a character list with positional metadata
              | sprinkled in?
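              | 
              | Roughly, the idea looks something like the sketch below
              | (my own guess at the shape of it, not the exact prompt
              | from the tweet): spell the word out as an indexed
              | character list, reverse that list, then join it back
              | together.
              | 
              |     # Hypothetical helper illustrating the "character
              |     # list with positional metadata" representation.
              |     def reverse_via_char_list(word):
              |         # "abet" -> "1:a 2:b 3:e 4:t"
              |         def indexed(chars):
              |             return " ".join(
              |                 f"{i + 1}:{c}"
              |                 for i, c in enumerate(chars))
              |         answer = "".join(reversed(word))
              |         return (f"Word: {word}\n"
              |                 f"Letters: {indexed(word)}\n"
              |                 f"Reversed: {indexed(answer)}\n"
              |                 f"Answer: {answer}")
              | 
              |     print(reverse_via_char_list("alphabet"))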
        
               | visarga wrote:
               | I used a similar technique for a completely unrelated
               | task. My "original" idea.
        
               | jameshart wrote:
               | It's actually weirdly similar to the kind of tricks
               | people use for mental feats like memorizing the order of
               | a complete deck of cards or repeating back a long list of
               | words in reverse order.
               | 
               | When you think about every mental task GPT3 is being
               | asked to do as being something it is being asked to
               | perform _immediately_ and _without having prepared_ and
               | _as fast as possible_ this makes a lot more sense.
               | 
               | Like, a reasonable human response to "quick! What's
                | encyclopedia backwards?!" would be more like
               | 
               | "Er.. right. A. I. D. E. O? Oh wait is it one of those OE
               | ligature things? P. A. No, O. P. Hang on did I already
               | say P?"
        
         | jameshart wrote:
          | If you just ask GPT-3 text-davinci-002 to complete:
          | 
          |     Create a Python program to reverse a string:
          | 
          | It produces:
          | 
          |     def reverse(s):
          |         return s[::-1]
         | 
         | And that isn't even the code-specific model.
        
       | mrfusion wrote:
       | > Tokens are chunks of characters. For example, the word
       | "alphabet" gets broken up into the tokens "alph" and "abet".
       | 
       | I didn't know that. Seems like it would confuse it during
       | training. Anyone able to explain?
        
         | aeternum wrote:
         | Humans also think about words in terms of subcomponents,
         | languages make heavy use of prefixes and suffixes for example.
        
           | SemanticStrengh wrote:
            | This is not the same. The masks are randomized and lossy.
            | That said, a transformer specially trained to segment
            | prefixes/affixes/suffixes might gain some additional
            | encoding ability; see e.g. SpanBERT for a related example
            | of the opportunity.
        
             | MauranKilom wrote:
             | What do you mean with "lossy"? What information is being
             | lost? Or do you just mean that there isn't necessarily a
             | unique way to encode a given string?
        
         | SemanticStrengh wrote:
          | This is masked token learning, which is used e.g. by BERT.
          | It is obsolete and alternatives such as XLNet are much
          | superior, but there is too much inertia in the industry and
          | newer large models are still built with the same lossy
          | encoding.
        
         | gattilorenz wrote:
          | If I recall correctly, it's similar to how fastText vectors
          | work. For fastText, this means the representation of a word
          | depends to a certain extent on its morphemes (not really,
          | but bear with me), so rare/inflected words can get a better
          | representation thanks to their similarity with words that
          | look alike and are more frequent. E.g. "unconstitutional"
          | might never appear in the training data, but the system can
          | approximate its meaning by composing that of "un", which it
          | has seen in words such as "unbelievable", with the remaining
          | subtokens, which come from the word "constitutional" that
          | was present in the training set.
          | 
          | Not sure if the same thing happens here, though.
        
         | 6gvONxR4sf7o wrote:
         | The alternatives are learning at the character level (way more
         | complex, and scales badly in memory/compute), or learning at
         | the whole word level (needs absurdly massive dictionary of
         | words, and still can't handle really rare/novel words).
         | Breaking things into a set of subwords that allows you to
         | encode any string solves lots of problems and is the relatively
         | standard way to do things these days.
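          | 
          | For intuition, the subword approach GPT-3 uses (byte pair
          | encoding) can be sketched in a few lines: start from
          | characters and repeatedly merge the most frequent adjacent
          | pair into a new vocabulary entry. A toy illustration, not
          | OpenAI's actual tokenizer:
          | 
          |     from collections import Counter
          | 
          |     def learn_merges(words, num_merges):
          |         # start from characters; repeatedly fuse the most
          |         # frequent adjacent pair into a new subword symbol
          |         vocab = Counter(tuple(w) for w in words)
          |         merges = []
          |         for _ in range(num_merges):
          |             pairs = Counter()
          |             for syms, n in vocab.items():
          |                 for pair in zip(syms, syms[1:]):
          |                     pairs[pair] += n
          |             if not pairs:
          |                 break
          |             (a, b), _count = pairs.most_common(1)[0]
          |             merges.append(a + b)
          |             new_vocab = Counter()
          |             for syms, n in vocab.items():
          |                 out, i = [], 0
          |                 while i < len(syms):
          |                     if syms[i:i + 2] == (a, b):
          |                         out.append(a + b)
          |                         i += 2
          |                     else:
          |                         out.append(syms[i])
          |                         i += 1
          |                 new_vocab[tuple(out)] += n
          |             vocab = new_vocab
          |         return merges
          | 
          |     words = ["low", "lower", "lowest", "newest", "widest"]
          |     print(learn_merges(words, 5))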
        
           | gwern wrote:
           | > The alternatives are learning at the character level (way
           | more complex
           | 
           | No, BPEs are more complex: you have a whole additional layer
           | of preprocessing, with all sorts of strange and
           | counterintuitive downstream effects and brand new ways to
           | screw up (fun quiz question: everyone knows that BPEs use
           | '<|endoftext|>' tokens to denote document breaks; what does
           | the string '<|endoftext|>' encode to?). BPEs are reliably one
           | of the ways that OA API users screw up, especially when
           | trying to work with longer completions or context windows.
           | 
           | But a character is a character.
           | 
           | > and scales badly in memory/compute)
           | 
           | Actually very competitive:
           | https://arxiv.org/abs/2105.13626#google (Especially if you
           | account for all the time and effort and subtle bugs caused by
           | BPEs.)
        
         | andrewmutz wrote:
         | I believe GPT-3 uses byte pair encoding, which allows it to do
         | tokenization in a language-neutral manner:
         | 
         | https://en.wikipedia.org/wiki/Byte_pair_encoding
        
           | axiom92 wrote:
           | Yeah it's BPE. OpenAI has a nice tool that allows you to play
           | with the tokenizer https://beta.openai.com/tokenizer.
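            | 
            | If you'd rather poke at it offline, the GPT-2 tokenizer in
            | HuggingFace transformers uses the 50k BPE vocabulary that
            | GPT-3 reportedly reuses; a small sketch (exact splits are
            | illustrative):
            | 
            |     from transformers import GPT2TokenizerFast
            | 
            |     tok = GPT2TokenizerFast.from_pretrained("gpt2")
            |     # a rarer word splits into subword pieces,
            |     # e.g. ['alph', 'abet']
            |     print(tok.tokenize("alphabet"))
            |     # a common word is a single token
            |     print(tok.tokenize("the"))
            |     # the corresponding integer token ids
            |     print(tok.encode("alphabet"))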
        
           | mrfusion wrote:
           | I thought I read it uses word2vec?
        
       | a65cec93b wrote:
       | > GPT-3 correctly reverses long words! But to get there, we had
       | to teach GPT-3 the algorithm to use to get around its
       | limitations.
       | 
       | Has GPT-3 really been "taught" anything here? If you don't
       | provide an explicit example as the context of your input, GPT-3
       | does not retain the ability to reverse words.
        
         | f38zf5vdt wrote:
         | No, it isn't taught anything. GPT3 text generation is
         | effectively a really fancy autocompletion algorithm based on
         | the n-many previous tokens in a rolling window. You can only
         | "teach" GPT3 something within that window, and it doesn't
         | "learn" there, it just tries its best to generate content based
         | on what is stored in its massive n-dimension table of graph
         | edges for tokens.
         | 
         | That is also why it has such a strong propensity to lose the
         | plot once you are outside of that window size and it's
         | generating new content based on self-generated content.
        
           | yunyu wrote:
           | You can update the "graph edges" with content longer than the
           | window by fine tuning:
           | https://beta.openai.com/docs/guides/fine-tuning
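            | 
            | The training data for that endpoint is just JSONL with
            | prompt/completion pairs; a minimal sketch (the separator
            | and stop conventions here are arbitrary choices of mine):
            | 
            |     import json
            | 
            |     examples = [
            |         {"prompt": "Reverse: alphabet ->",
            |          "completion": " tebahpla\n"},
            |         {"prompt": "Reverse: openai ->",
            |          "completion": " ianepo\n"},
            |     ]
            |     with open("reverse_words.jsonl", "w") as f:
            |         for ex in examples:
            |             f.write(json.dumps(ex) + "\n")
            |     # the file is then uploaded and a fine-tune job is
            |     # started via the API/CLI described in the docs above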
        
             | f38zf5vdt wrote:
             | Yes, training the model is where it learns, not in prompts.
             | Prompting might be considered meta-learning but it will
             | always need a reference point given to it from its training
             | data, and beyond the prompt the original model is never
             | altered.
        
         | skybrian wrote:
          | You're right for GPT-3, but it's an example of chain-of-thought
          | reasoning, which seems to be a new area of research [1] and
          | might get integrated into newer versions:
         | 
         | [1] https://arxiv.org/abs/2201.11903
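          | 
          | Applied to this problem, the flavor is a few-shot prompt
          | whose worked example spells out intermediate steps before
          | the answer (my own toy phrasing, not taken from the paper
          | or the tweet):
          | 
          |     prompt = (
          |         "Q: Reverse the word 'cab'.\n"
          |         "A: The letters are c, a, b. Backwards they are "
          |         "b, a, c. The answer is bac.\n"
          |         "Q: Reverse the word 'alphabet'.\n"
          |         "A:"
          |     )
          |     # `prompt` would then be sent to the completions endpoint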
        
         | tiborsaas wrote:
         | I'm not sure how you define teaching, but for me getting shown
         | an example and then repeating it successfully with another
         | input does mean teaching/learning. I know the model doesn't
         | update though, let's not focus on that now.
         | 
         | If anthropomorphizing bothers you, then we could just use
         | "prompting", but I feel teaching is a good enough approximation
         | here.
        
           | f38zf5vdt wrote:
            | It's repeating what the trained model picked up from
            | situations where instructions similar to the ones given
            | here were specified, and which involved reversing strings
            | in general.
           | 
           | If the author messed with temperature and retried their
           | failing prompt enough times, or simply reworded it a little
           | differently, they might also get the correct answer.
        
         | jxy wrote:
         | That's easy to solve. Prepare all K-12 text books as prompts,
         | and train another GPT-N to go from input to those prompts, then
         | feed these prompts to the current GPT-3.
         | 
         | Can we get a GPT-N-3 this way to do SAT?
        
       | Der_Einzige wrote:
       | Part of the problem here is that GPT-3 has such a small
       | vocabulary. It's 50K tokens, and many of those are either
        | garbage, punctuation, or full words (rather than subwords).
       | 
       | I'd be curious to see what scaling up the size of the vocabulary
       | would do to improve these results in a model like GPT-3...
        
         | axiom92 wrote:
          | 50k is not the number of unique words that GPT-3 supports;
          | you're perhaps referring to the BPE tokens. The input to
          | GPT-3 is not tokenized by splitting on spaces; it is based
          | on byte-pair encoding tokens. You can play with it here:
          | https://beta.openai.com/tokenizer.
         | 
         | A rare word like _blithe_ is tokenized into two BPE tokens: bl
         | and ithe, whereas common words like _the_ get their own token.
        
         | rprenger wrote:
         | I don't think a larger vocab would help. All the individual
         | letters are in the ~50k token vocab already, but the word
         | "alphabet" will still not get tokenized to [a, l, p, h, a, b,
         | e, t]. Using a larger vocab like PaLM's 256k vocab would have
         | the same issue.
        
       ___________________________________________________________________
       (page generated 2022-05-16 23:00 UTC)