[HN Gopher] Teaching GPT-3 to reverse words
___________________________________________________________________
 
Teaching GPT-3 to reverse words
 
Author : ascertain
Score  : 69 points
Date   : 2022-05-15 19:42 UTC (1 days ago)
 
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
 
| jameshart wrote:
| Oh, I'm so looking forward to my next coding interview.
|
| "Okay, could you show me on the whiteboard how you might go about
| writing a program that can reverse a string?"
|
| "Great, so I'm going to start by initializing a simple
| transformer-based neural network with 175 billion parameters and
| 96 attention layers, and I'm going to train it on a corpus of 45
| terabytes of data tokenized into about 500 billion tokens..."
| oneepic wrote:
| "Cool, so what do you think would be the time complexity of that?
| Do you think we can maybe do better than that?"
| visarga wrote:
| Not if you also want a short poem where each word starts with a
| letter from the original word, and then a short literary
| commentary on it.
| jameshart wrote:
| Actually it turns out it's O(n). Which goes to show that constant
| factors can be more important than you think when looking at raw
| time complexity big-O.
| [deleted]
| [deleted]
| agluszak wrote:
| https://nitter.net/npew/status/1525900849888866307
| swid wrote:
| It's funny to me that this kind of usage of GPT is just
| programming with a lot of extra steps.
| convolvatron wrote:
| I was just thinking the opposite - that by choosing such a tiny
| problem one might be able to actually develop some intuition
| about what's going on inside that very black box.
| swid wrote:
| I meant it mostly as a joke, but there is a certain amount of
| irony to it. This goes way beyond prompt engineering - he wrote
| an algorithm to run on GPT in a way you would not expect a
| non-programmer to write. I think the idea is cool and the process
| of writing it was revealing.
| mberning wrote:
| Right. What non-programmer is going to think to turn a word into
| a character list with positional metadata sprinkled in.
| visarga wrote:
| I used a similar technique for a completely unrelated task. My
| "original" idea.
| jameshart wrote:
| It's actually weirdly similar to the kind of tricks people use
| for mental feats like memorizing the order of a complete deck of
| cards or repeating back a long list of words in reverse order.
|
| When you think about every mental task GPT-3 is being asked to do
| as something it is being asked to perform _immediately_ and
| _without having prepared_ and _as fast as possible_, this makes a
| lot more sense.
|
| Like, a reasonable human response to "quick! What's encyclopedia
| backwards?!" would be more like
|
| "Er.. right. A. I. D. E. O? Oh wait, is it one of those OE
| ligature things? P. A. No, O. P. Hang on, did I already say P?"
| jameshart wrote:
| If you just ask GPT-3 text-davinci-002 to complete
|
|     Create a Python program to reverse a string:
|
| It produces
|
|     def reverse(s):
|         return s[::-1]
|
| And that isn't even the code-specific model.
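 
The "character list with positional metadata" trick discussed up-thread is,
in effect, a tiny algorithm. Here is a plain-Python sketch of that idea;
the function name and the exact intermediate representation are
illustrative, and the tweet's actual prompt wording is not reproduced in
this thread:
 
    # Sketch of the decomposition described above: attach an index to each
    # character, walk the indices from last to first, then reassemble.
    # Illustrative only - not the prompt format used in the tweet.
    def reverse_via_positions(word):
        indexed = list(enumerate(word))             # [(0, 'e'), (1, 'n'), ...]
        indexed.sort(key=lambda pair: pair[0], reverse=True)
        return "".join(ch for _, ch in indexed)
 
    print(reverse_via_positions("encyclopedia"))    # -> aidepolcycne
 
The apparent point of spelling the steps out like this in a prompt is that
each intermediate line gives the model something concrete to condition on,
instead of asking it to reverse the whole word in one jump.
 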
| mrfusion wrote:
| > Tokens are chunks of characters. For example, the word
| "alphabet" gets broken up into the tokens "alph" and "abet".
|
| I didn't know that. Seems like it would confuse it during
| training. Anyone able to explain?
| aeternum wrote:
| Humans also think about words in terms of subcomponents;
| languages make heavy use of prefixes and suffixes, for example.
| SemanticStrengh wrote:
| This is not the same... The masks are randomized and lossy.
| Although yes, there is potential for a transformer specially
| trained to segment prefixes/affixes/suffixes; it might augment
| some of its encoding abilities. See e.g. SpanBERT for a related
| example of the opportunity.
| MauranKilom wrote:
| What do you mean by "lossy"? What information is being lost? Or
| do you just mean that there isn't necessarily a unique way to
| encode a given string?
| SemanticStrengh wrote:
| This is masked token learning, which is used e.g. by BERT. It is
| obsolete and alternatives such as XLNet are much superior, but
| there is too much inertia in the industry and newer large models
| are still built with the same lossy encoding.
| gattilorenz wrote:
| If I recall correctly, it's similar to how fasttext vectors work.
| For fasttext, this means that the representation of a word
| depends to a certain extent on its morphemes (not really, but
| bear with me), so rare/inflected words can have a better
| representation due to the similarity with words that are
| similar-looking and more frequent (e.g. "unconstitutional" might
| never appear in the training data, but the system can approximate
| its meaning by composing that of "un", which it has seen in words
| such as "unbelievable", and the remaining subtokens, which come
| from the word "constitutional" that was present in the training
| set).
|
| Not sure if the same thing happens here, tho.
| 6gvONxR4sf7o wrote:
| The alternatives are learning at the character level (way more
| complex, and scales badly in memory/compute), or learning at the
| whole-word level (needs an absurdly massive dictionary of words,
| and still can't handle really rare/novel words). Breaking things
| into a set of subwords that allows you to encode any string
| solves lots of problems and is the relatively standard way to do
| things these days.
| gwern wrote:
| > The alternatives are learning at the character level (way more
| complex
|
| No, BPEs are more complex: you have a whole additional layer of
| preprocessing, with all sorts of strange and counterintuitive
| downstream effects and brand new ways to screw up (fun quiz
| question: everyone knows that BPEs use '<|endoftext|>' tokens to
| denote document breaks; what does the string '<|endoftext|>'
| encode to?). BPEs are reliably one of the ways that OA API users
| screw up, especially when trying to work with longer completions
| or context windows.
|
| But a character is a character.
|
| > and scales badly in memory/compute)
|
| Actually very competitive:
| https://arxiv.org/abs/2105.13626#google (Especially if you
| account for all the time and effort and subtle bugs caused by
| BPEs.)
| andrewmutz wrote:
| I believe GPT-3 uses byte pair encoding, which allows it to do
| tokenization in a language-neutral manner:
|
| https://en.wikipedia.org/wiki/Byte_pair_encoding
| axiom92 wrote:
| Yeah, it's BPE. OpenAI has a nice tool that allows you to play
| with the tokenizer: https://beta.openai.com/tokenizer
| mrfusion wrote:
| I thought I read it uses word2vec?
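 
To see the subword splitting discussed in this subthread concretely, here
is a small sketch using the Hugging Face GPT-2 tokenizer from the
transformers library. GPT-3 reuses the same ~50k BPE vocabulary as GPT-2,
but the exact splits depend on the vocabulary and on leading whitespace,
so treat the commented outputs as indicative rather than guaranteed:
 
    # Inspect BPE subword splits with the GPT-2 tokenizer (GPT-3 uses the
    # same 50k BPE vocabulary). Exact splits may vary with whitespace.
    from transformers import GPT2Tokenizer
 
    tok = GPT2Tokenizer.from_pretrained("gpt2")
 
    for word in ["the", "alphabet", "blithe"]:
        print(word, "->", tok.tokenize(word))
 
    # Common words come back as a single token; rarer words are split into
    # subword pieces (the thread reports splits like "alph"/"abet" and
    # "bl"/"ithe" from OpenAI's own tokenizer tool).
 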
| a65cec93b wrote:
| > GPT-3 correctly reverses long words! But to get there, we had
| to teach GPT-3 the algorithm to use to get around its
| limitations.
|
| Has GPT-3 really been "taught" anything here? If you don't
| provide an explicit example as the context of your input, GPT-3
| does not retain the ability to reverse words.
| f38zf5vdt wrote:
| No, it isn't taught anything. GPT-3 text generation is
| effectively a really fancy autocompletion algorithm based on the
| n-many previous tokens in a rolling window. You can only "teach"
| GPT-3 something within that window, and it doesn't "learn" there;
| it just tries its best to generate content based on what is
| stored in its massive n-dimensional table of graph edges for
| tokens.
|
| That is also why it has such a strong propensity to lose the plot
| once you are outside of that window size and it's generating new
| content based on self-generated content.
| yunyu wrote:
| You can update the "graph edges" with content longer than the
| window by fine-tuning:
| https://beta.openai.com/docs/guides/fine-tuning
| f38zf5vdt wrote:
| Yes, training the model is where it learns, not in prompts.
| Prompting might be considered meta-learning, but it will always
| need a reference point given to it from its training data, and
| beyond the prompt the original model is never altered.
| skybrian wrote:
| You're right for GPT-3, but it's an example of chain-of-thought
| reasoning, which seems to be a new area of research [1] and might
| get integrated into newer versions:
|
| [1] https://arxiv.org/abs/2201.11903
| tiborsaas wrote:
| I'm not sure how you define teaching, but for me, getting shown
| an example and then repeating it successfully with another input
| does mean teaching/learning. I know the model doesn't update
| though; let's not focus on that now.
|
| If anthropomorphizing bothers you, then we could just use
| "prompting", but I feel teaching is a good enough approximation
| here.
| f38zf5vdt wrote:
| It's repeating based on what the trained model picked up from
| situations where instructions similar to the ones given here,
| about reversing strings in general, were specified.
|
| If the author messed with temperature and retried their failing
| prompt enough times, or simply reworded it a little differently,
| they might also get the correct answer.
| jxy wrote:
| That's easy to solve. Prepare all K-12 textbooks as prompts, and
| train another GPT-N to go from input to those prompts, then feed
| these prompts to the current GPT-3.
|
| Can we get a GPT-N-3 this way to do the SAT?
| Der_Einzige wrote:
| Part of the problem here is that GPT-3 has such a small
| vocabulary. It's 50K tokens, and many of those are either
| garbage, punctuation, or full words (rather than subwords).
|
| I'd be curious to see what scaling up the size of the vocabulary
| would do to improve these results in a model like GPT-3...
| axiom92 wrote:
| 50k is not the number of unique words that GPT-3 supports;
| perhaps you're referring to the BPE tokens. The input to GPT-3 is
| not tokenized by splitting on spaces; it is based on byte-pair
| encoding tokens. You can play with it here:
| https://beta.openai.com/tokenizer
|
| A rare word like _blithe_ is tokenized into two BPE tokens: bl
| and ithe, whereas common words like _the_ get their own token.
| rprenger wrote:
| I don't think a larger vocab would help. All the individual
| letters are in the ~50k token vocab already, but the word
| "alphabet" will still not get tokenized to [a, l, p, h, a, b, e,
| t]. Using a larger vocab like PaLM's 256k vocab would have the
| same issue.
___________________________________________________________________
(page generated 2022-05-16 23:00 UTC)
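 
For readers who want to poke at the in-context "teaching" debated in the
thread, below is a minimal few-shot sketch against the 2022-era OpenAI
Python library. The prompt, example words, and parameter values are
illustrative rather than the ones used in the tweet, and an API key is
required:
 
    # Few-shot prompting sketch (openai-python 0.x era). The "teaching"
    # lives entirely in the prompt; nothing persists to the next call.
    import openai
 
    openai.api_key = "sk-..."  # placeholder key
 
    prompt = (
        "Reverse each word letter by letter.\n"
        "Word: cat\nReversed: t-a-c\n"
        "Word: hello\nReversed: o-l-l-e-h\n"
        "Word: alphabet\nReversed:"
    )
 
    resp = openai.Completion.create(
        model="text-davinci-002",   # the model named in the thread
        prompt=prompt,
        max_tokens=20,
        temperature=0,
    )
    print(resp["choices"][0]["text"].strip())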