[HN Gopher] Understanding GPT Tokenizers
       ___________________________________________________________________
        
       Understanding GPT Tokenizers
        
       Author : simonw
       Score  : 97 points
       Date   : 2023-06-08 20:40 UTC (2 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | mike_hearn wrote:
       | A few extra notes on tokens.
       | 
        | You don't have to use tiktoken if you aren't actually tokenizing
        | things. The token lists are just text files in which each line is
        | a token's characters base64-encoded, followed by its numeric ID.
        | If you want to explore the list you can just download the files
        | and decode them yourself.
       | 
       | I find that sorting tokens by length makes it a bit easier to get
       | a feel for what's in there.
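        | 
        | As a minimal Python sketch (assuming a local copy of a .tiktoken
        | vocabulary file such as cl100k_base.tiktoken, where each line is
        | a base64-encoded token followed by its numeric ID):
        | 
        |     import base64
        | 
        |     # Parse the vocabulary file: "<base64 token> <id>" per line.
        |     tokens = []
        |     with open("cl100k_base.tiktoken", "rb") as f:
        |         for line in f:
        |             if not line.strip():
        |                 continue
        |             b64_token, token_id = line.split()
        |             tokens.append(
        |                 (base64.b64decode(b64_token), int(token_id)))
        | 
        |     # Sort by token length to get a feel for what's in there.
        |     tokens.sort(key=lambda t: len(t[0]), reverse=True)
        |     for token_bytes, token_id in tokens[:20]:
        |         print(token_id, token_bytes)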
       | 
       | GPT-4 has a token vocabulary about twice the size of GPT-3.5.
       | 
       | The most interesting thing to me about the GPT-4 token list is
       | how dominated it is by non-natural languages. It's not as simple
       | as English tokenizing more efficiently than Spanish because of
       | frequency. The most common language after English is code. A huge
       | number of tokens are allocated to even not very common things
       | found in code, like "ValidateAntiForgeryToken" or
       | "_InternalArray". From eyeballing the list I'd guess about half
       | the tokens seem to be from source code.
       | 
       | My guess is that it's not a coincidence that GPT-4 both trained
       | on a lot of code and is also the leading model. I suspect we're
       | going to discover at some point, or maybe OpenAI already did,
       | that training on code isn't just a neat trick to get an LLM that
       | can knock out scripts. Maybe it's fundamentally useful to train
       | the model to reason logically and think clearly. The highly
       | structured and unambiguous yet also complex thought that code
       | represents is probably a great way for the model to really level
       | up its thought processes. Ilya Sutskever mentioned in an
       | interview that one of the bottlenecks they face on training
       | something smarter than GPT-4 is getting access to "more complex
       | thought". If this is true then it's possible the Microsoft
       | collaboration will prove an enduring competitive advantage for
       | OpenAI, as it gives them access to the bulk GitHub corpus which
       | is probably quite hard to scrape otherwise.
        
       | gwern wrote:
       | Worth mentioning the many other consequences of BPE tokenization:
       | gwern.net/gpt-3#bpes
       | https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-...
        
         | pmoriarty wrote:
         | In the article on your blog, you wrote:
         | 
         |  _" GPT-3 rhymes reasonably well and often when appropriate,
         | but the improvement is much smaller on rhyming than it is on
         | pretty much everything else. Apparently it is easier for GPT-3
         | to learn things like arithmetic and spreadsheets than it is to
         | learn how to rhyme."_
         | 
         | I've experimented extensively with Claude, and a bit with
         | Claude+, ChatGPT (GPT 3.5) and GPT4 on poe.com, and I've had
         | not the slightest problem in getting them to rhyme. However,
         | once they've started writing rhyming poetry it's hard to get
         | them to stop rhyming. They seem to have formed a strong
         | association between rhyming and poetry. I've also been unable
         | to get them to obey a specific rhyming scheme like ABBAB.
        
           | burnished wrote:
            | That seems incredibly challenging. I'd expect some
            | fundamental difficulty, since rhyming is determined by how a
            | word sounds rather than by what it means.
        
       | bluepoint wrote:
       | So the space character is part of the token?
        
         | minimaxir wrote:
          | This can vary by BPE tokenizer. The original GPT-2/GPT-3
          | tokenizer was weirder about it.
        
         | simonw wrote:
         | Yup. Most common words have several tokens - the word, the word
         | with a capital letter, the word with a leading space and
         | sometimes the word all in caps too.
         | 
         | Try searching for different words using the search box here:
         | https://observablehq.com/@simonw/gpt-tokenizer#cell-135
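          | 
          | You can see the same thing from Python with the tiktoken
          | library (a quick sketch using the cl100k_base encoding):
          | 
          |     import tiktoken
          | 
          |     # Compare token IDs for surface variants of the same word.
          |     enc = tiktoken.get_encoding("cl100k_base")
          |     variants = ["hello", " hello", "Hello", " Hello", "HELLO"]
          |     for v in variants:
          |         print(repr(v), enc.encode(v))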
        
           | jiggawatts wrote:
           | I wonder if the embeddings could be explicitly configured to
           | account for these "symmetries". E.g.: instead of storing
            | separate full copies of the "variants", maybe keep a reduced
           | representation with a common prefix and only a small subset
           | of the embedding vector that is allowed to be learned?
           | 
           | This could force the model to correctly learn how to
           | capitalise, make all-caps, etc...
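            | 
            | A toy sketch of one way that idea could look (all names and
            | sizes here are hypothetical, not anything GPT actually
            | does):
            | 
            |     import numpy as np
            | 
            |     # Variants share a base embedding; only a small
            |     # slice is learned per variant.
            |     embed_dim, variant_dim = 64, 8
            |     base = np.random.randn(embed_dim)
            |     deltas = {
            |         "lower": np.random.randn(variant_dim),
            |         "Capitalised": np.random.randn(variant_dim),
            |         "ALLCAPS": np.random.randn(variant_dim),
            |     }
            | 
            |     def variant_vector(kind):
            |         vec = base.copy()
            |         vec[:variant_dim] += deltas[kind]
            |         return vec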
        
       | ywmario wrote:
        | I have been under the impression that the embedding vector is the
        | one that actually matters. The token is just another format.
        
       | [deleted]
        
       | hsjqllzlfkf wrote:
        | Could anyone who's an expert comment on why there seems to be
        | such a focus on discussing tokenizers? It seems every other day
        | there's a new article or implementation of a tokenizer on HN. But
        | downstream from that, rarely anything. As a non-expert I would
        | have thought that tokenizing is just one step.
        
         | SkyPuncher wrote:
          | Tokens are the primitives that most LLMs (and, broadly, a lot
          | of NLP) work with. While you and I would expect whole words to
          | be tokens, many tokens are shorter - 3 to 4 characters - and
          | don't always match the sentence structure you and I expect.
         | 
         | This can create some interesting challenges and unexpected
         | behavior. It also makes certain things, like vectorization, a
         | challenge since tokens may not map 1:1 with the words you
         | intend to weight them against.
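          | 
          | For example, with the tiktoken library (a rough sketch using
          | the cl100k_base encoding) you can decode each token ID on its
          | own to see how a word gets split into pieces:
          | 
          |     import tiktoken
          | 
          |     # Decode each token ID individually to see the pieces.
          |     enc = tiktoken.get_encoding("cl100k_base")
          |     for word in ["hello", "tokenization", "unbelievably"]:
          |         ids = enc.encode(word)
          |         print(word, "->", [enc.decode([i]) for i in ids])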
        
           | hsjqllzlfkf wrote:
           | Your answer explains what tokenizers are, which isn't what I
           | asked. You also told me something interesting about
           | tokenizers, which is also not what I asked. Can you tell me
            | anything NOT about tokenizers? This is my point.
        
             | mike_hearn wrote:
              | The reason it's not discussed much is that what goes on
              | downstream of tokenization is extremely opaque. It's lots
              | of layers of the transformer network, so the overall
              | structure is documented, but what exactly those numbers
              | mean is hard to figure out.
             | 
             | There's an article here where the structure of an image
             | generation network is explored a bit:
             | 
             | https://openai.com/research/sparse-transformer
             | 
             | They have a visualization of what the different layers are
             | paying attention to.
             | 
             | There are also some good explanations of transformers
             | elsewhere online. This one is old but I found it helpful:
             | 
             | http://jalammar.github.io/illustrated-transformer/
        
               | hsjqllzlfkf wrote:
               | This was my suspicion, thank you.
        
           | thaumasiotes wrote:
           | > While, you and I would expect whole-words to be tokens,
           | many tokens are shorter - 3 to 4 characters - and don't
           | always match the sentence structure you and I expect.
           | 
           | There is a phenomenon called Broca's Aphasia which is,
           | essentially, the inability to connect words into sentences.
           | This mostly prevents the patient from communicating via
           | language. But patients with this condition can reveal quite a
           | bit about the structure of the language they can no longer
           | speak.
           | 
           | One example discussed in _The Language Instinct_ is someone
           | who works at (and was injured at) a mill. He is unable to
           | produce utterances that are more than one word long, though
           | he seems to do well at understanding what people say to him.
           | One of his single-word utterances, describing the mill where
           | he works, is  "Four hundred tons a day!".
           | 
           | This is the opposite of what you describe, a single token
           | that is longer than one word in the base language instead of
           | being shorter. But it appears to be the same kind of thing.
           | 
           | By the way, if you study a highly inflectional language such
           | as Latin or Russian, you will lose the assumption that
           | interpretive tokens should be whole words. You'd still expect
           | them to align closely with sentence structure, though.
        
           | whimsicalism wrote:
            | You are using the word vectorization in an idiosyncratic way
            | - are you referring to the process of embedding words?
        
         | simonw wrote:
         | I just think they're interesting.
         | 
         | From a practical point of view they only really matter in that
         | we have to think carefully about how to use our token budget.
        
         | ftxbro wrote:
          | The reason it's trending today is the phenomenon of Glitch
          | Tokens. They thought all Glitch Tokens had been removed in
          | GPT-4, but apparently one is still left. If you go down the
          | rabbit hole on Glitch Tokens it gets ... really, really weird.
        
         | [deleted]
        
       | ftxbro wrote:
        | I just want to say I love your pet pelican names: Pelly, Beaky,
        | SkyDancer, Scoop, and Captain Gulliver.
        
         | simonw wrote:
         | Captain Gulliver is genuinely an excellent name for a pelican!
        
       | api wrote:
        | Has anyone ever tried a GPT trained on, say, 256 tokens
        | representing bytes in a byte stream, or, even more simply, binary
        | digits?
       | 
       | I imagine there are efficiency trade-offs but I just wonder if it
       | works at all.
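        | 
        | The byte-level "tokenization" itself is trivial - the vocabulary
        | is just the 256 possible byte values, with no merge rules needed
        | (whether a GPT trains well on it is the open question):
        | 
        |     # Encode text as byte tokens (vocabulary size 256).
        |     text = "Hello, world!"
        |     byte_tokens = list(text.encode("utf-8"))
        |     print(byte_tokens)
        | 
        |     # Round-trips losslessly back to the original text.
        |     print(bytes(byte_tokens).decode("utf-8"))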
        
         | sandinmyjoints wrote:
         | Not a GPT, but I think Megabyte does that.
        
       | simonw wrote:
       | Here's the Observable notebook I built to explore how the
       | tokenizers work: https://observablehq.com/@simonw/gpt-tokenizer
        
       | jcims wrote:
        | I really, really wish someone would try tokenizing off of a
        | phonetic representation rather than a textual one. I think it
        | would be interesting to compare the output.
        
         | bpiche wrote:
          | spaCy's sense2vec gets pretty close to that:
         | 
         | https://spacy.io/universe/project/sense2vec/
         | 
          | Granted, it is 8 years old, but it's still interesting.
        
         | ftxbro wrote:
          | It doesn't matter. The 'bitter lesson', as coined by Rich
          | Sutton, is that stacking more layers with more parameters,
          | compute, and dataset size is going to swamp any kind of clever
          | 'feature engineering', like trying to be clever about phonetic
          | tokens. Karpathy, for example, just wants to go back to byte
          | tokens.
        
           | nsinreal wrote:
            | Yes, but how many extra layers and how much computing power
            | do you need? Of course, phonetic tokens are an awkward idea,
            | but there is a reason why the word "human" is encoded as only
            | one token.
        
           | spywaregorilla wrote:
           | I don't think that is intuitive at all. "Clever feature
           | engineering" like trying to create columns from calculations
           | of tabular data, sure. You're not going to move the needle.
           | But the basic representation of unstructured data like text
           | could very believably alter the need for parameters, layers,
           | and calculation speed by orders of magnitude.
        
             | whimsicalism wrote:
             | You would be wrong at the scales we are talking about.
             | 
             | The whole point is that it is unintuitive.
        
             | ftxbro wrote:
             | > "I don't think that is intuitive at all."
             | 
             | That's exactly the point. Every intuition is always on the
             | side of feature engineering.
        
           | sp332 wrote:
           | Most current implementations can't count syllables at all, so
           | it would get you at least that far.
        
       | TechBro8615 wrote:
       | Kudos to simonw for all the LLM content you've been publishing. I
       | like reading your perspective and notes on your own learning
       | experiences.
        
       | throwaway2016a wrote:
       | Pardon the n00b question, but...
       | 
        | How does this relate to vectors? It was my understanding that
        | the tokens were vectors, and this seems to show them as integers.
        | 
        | It's probably a really obvious question to anyone who knows AI,
        | but I figured if I have it, someone else does too.
        
         | binarymax wrote:
          | Very basic overview: a token is assigned a number; that number
          | gets passed into the encoder model along with other token
          | numbers; and the encoder model transforms those number
          | sequences into embeddings (vectors).
        
         | nighthawk454 wrote:
          | Each token is an integer. The first layer of the model is an
         | 'embedding', which is essentially a giant lookup table. So if a
         | string gets tokenized to Token #3, that means get the vector in
         | row 3 of the embedding table. (Those vectors are learned during
         | model training.)
         | 
         | More completely, you can think of the integers as being
         | implicitly a one-hot vector encoding. So say you have a vocab
         | size of 20,000 and you want Token #3. The one-hot vector would
         | be a 20,000 length vector of zeros with a one in position 3.
         | This vector is then multiplied against the embedding
         | table/matrix. Although in practice this is equivalent to just
         | selecting one row directly, so it's implemented as such and
         | there's no reason to explicitly make the large one-hot vectors.
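          | 
          | A quick numpy sketch of that equivalence (toy sizes):
          | 
          |     import numpy as np
          | 
          |     vocab_size, embed_dim = 20_000, 8
          |     embedding_table = np.random.randn(vocab_size, embed_dim)
          | 
          |     token_id = 3
          |     one_hot = np.zeros(vocab_size)
          |     one_hot[token_id] = 1.0
          | 
          |     # Multiplying the one-hot vector by the table picks out
          |     # exactly the same row as a direct lookup.
          |     via_matmul = one_hot @ embedding_table
          |     via_lookup = embedding_table[token_id]
          |     assert np.allclose(via_matmul, via_lookup)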
        
         | z3c0 wrote:
         | The integers represent a position within a vector of "all known
         | tokens". Typically, following a simple bag-of-words approach,
         | each position in the vector would be toggled to 1 or 0 based on
         | the presence of a token in a given document. Since most vectors
         | would be almost completely zeroed, the simpler way to represent
         | these vectors is through a list of positions in the now
         | abstracted vector, aka a sparse vector, ie a list of integers.
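          | 
          | A toy sketch of that bag-of-words idea (made-up vocabulary):
          | 
          |     # Dense presence vector over the vocabulary, vs. the
          |     # sparse form that just lists the non-zero positions.
          |     vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3,
          |              "mat": 4, "dog": 5, "ran": 6}
          |     doc = "the cat sat on the mat".split()
          | 
          |     dense = [0] * len(vocab)
          |     for word in doc:
          |         dense[vocab[word]] = 1
          | 
          |     sparse = sorted({vocab[w] for w in doc})
          |     print(dense)   # [1, 1, 1, 1, 1, 0, 0]
          |     print(sparse)  # [0, 1, 2, 3, 4]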
         | 
         | In the case of more advanced language models like LLMs, a given
         | token can be paired with many other features of the token (such
         | as dependencies or parts-of-speech) to make an integer
         | represent one of many permutations on the same word based on
         | its usage.
        
       ___________________________________________________________________
       (page generated 2023-06-08 23:00 UTC)