[HN Gopher] Understanding GPT Tokenizers
___________________________________________________________________

Understanding GPT Tokenizers

Author : simonw
Score  : 97 points
Date   : 2023-06-08 20:40 UTC (2 hours ago)

(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)

| mike_hearn wrote:
| A few extra notes on tokens.
|
| You don't have to use tiktoken if you aren't actually tokenizing
| things. The token lists are just text files in which each entry is
| the token's characters, base64 encoded, followed by its numeric
| ID. If you want to explore the list you can download the files
| and decode them yourself.
|
| I find that sorting tokens by length makes it a bit easier to get
| a feel for what's in there.
|
| GPT-4 has a token vocabulary about twice the size of GPT-3.5's.
|
| The most interesting thing to me about the GPT-4 token list is
| how dominated it is by non-natural languages. It's not as simple
| as English tokenizing more efficiently than Spanish because of
| frequency: the most common language after English is code. A huge
| number of tokens are allocated to things found in code that
| aren't even very common, like "ValidateAntiForgeryToken" or
| "_InternalArray". From eyeballing the list I'd guess about half
| the tokens come from source code.
|
| My guess is that it's not a coincidence that GPT-4 both trained
| on a lot of code and is also the leading model. I suspect we're
| going to discover at some point, or maybe OpenAI already did,
| that training on code isn't just a neat trick to get an LLM that
| can knock out scripts. Maybe it's fundamentally useful for
| training the model to reason logically and think clearly. The
| highly structured and unambiguous yet also complex thought that
| code represents is probably a great way for the model to really
| level up its thought processes. Ilya Sutskever mentioned in an
| interview that one of the bottlenecks they face in training
| something smarter than GPT-4 is getting access to "more complex
| thought". If this is true then it's possible the Microsoft
| collaboration will prove an enduring competitive advantage for
| OpenAI, as it gives them access to the bulk GitHub corpus, which
| is probably quite hard to scrape otherwise.
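A minimal sketch of decoding one of those vocabulary files directly,
assuming the cl100k_base format of one base64-encoded token plus its
numeric rank per line (the URL below is the file the tiktoken library
currently downloads for cl100k_base, and may change):

    import base64
    import urllib.request

    # cl100k_base vocabulary file as fetched by tiktoken (URL may change)
    URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"

    with urllib.request.urlopen(URL) as resp:
        lines = resp.read().decode("utf-8").splitlines()

    tokens = {}
    for line in lines:
        if not line:
            continue
        b64_token, rank = line.split()
        tokens[int(rank)] = base64.b64decode(b64_token)

    # Sort by decoded length to eyeball the longest entries
    longest = sorted(tokens.items(), key=lambda kv: len(kv[1]), reverse=True)
    for rank, raw in longest[:20]:
        print(rank, raw.decode("utf-8", errors="replace"))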
| gwern wrote:
| Worth mentioning the many other consequences of BPE tokenization:
| gwern.net/gpt-3#bpes
| https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-...
| pmoriarty wrote:
| In the article on your blog, you wrote:
|
| _"GPT-3 rhymes reasonably well and often when appropriate, but
| the improvement is much smaller on rhyming than it is on pretty
| much everything else. Apparently it is easier for GPT-3 to learn
| things like arithmetic and spreadsheets than it is to learn how
| to rhyme."_
|
| I've experimented extensively with Claude, and a bit with Claude+,
| ChatGPT (GPT-3.5) and GPT-4 on poe.com, and I've had not the
| slightest problem getting them to rhyme. However, once they've
| started writing rhyming poetry it's hard to get them to stop
| rhyming. They seem to have formed a strong association between
| rhyming and poetry. I've also been unable to get them to obey a
| specific rhyme scheme like ABBAB.
| burnished wrote:
| That seems incredibly challenging; I'd expect some fundamental
| difficulty, since rhyming is determined by how a word sounds and
| not by what it means.
| bluepoint wrote:
| So the space character is part of the token?
| minimaxir wrote:
| This can vary by BPE tokenizer. The original GPT-2/GPT-3
| tokenizer was weirder about it.
| simonw wrote:
| Yup. Most common words have several tokens: the word, the word
| with a capital letter, the word with a leading space, and
| sometimes the word in all caps too.
|
| Try searching for different words using the search box here:
| https://observablehq.com/@simonw/gpt-tokenizer#cell-135
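A quick way to see those variants, sketched with the tiktoken library
(assuming the cl100k_base encoding used by GPT-3.5-turbo and GPT-4;
requires "pip install tiktoken"):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # The same word maps to different token IDs depending on
    # capitalization and whether it carries a leading space.
    for text in ["hello", " hello", "Hello", " Hello", "HELLO"]:
        print(repr(text), "->", enc.encode(text))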
| jiggawatts wrote:
| I wonder if the embeddings could be explicitly configured to
| account for these "symmetries". E.g. instead of storing separate
| full copies of the "variants", maybe keep a reduced
| representation with a common prefix and only a small subset of
| the embedding vector that is allowed to be learned?
|
| This could force the model to correctly learn how to capitalise,
| make all-caps, etc.
| ywmario wrote:
| I have been under the impression that the embedding vector is the
| one that actually matters. The token is just another format.
| [deleted]
| hsjqllzlfkf wrote:
| Could anyone who's an expert comment on why there seems to be
| such a focus on discussing tokenizers? It seems every other day
| there's a new article or implementation of a tokenizer on HN.
| But downstream from that, rarely anything. As a non-expert I
| would have thought tokenizing is just one step.
| SkyPuncher wrote:
| Tokens are the primitives that most LLMs (and broadly a lot of
| NLP) work with. While you and I would expect whole words to be
| tokens, many tokens are shorter - 3 to 4 characters - and don't
| always match the sentence structure you and I expect.
|
| This can create some interesting challenges and unexpected
| behavior. It also makes certain things, like vectorization, a
| challenge, since tokens may not map 1:1 with the words you intend
| to weight them against.
| hsjqllzlfkf wrote:
| Your answer explains what tokenizers are, which isn't what I
| asked. You also told me something interesting about tokenizers,
| which is also not what I asked. Can you tell me anything NOT
| about tokenizers? This is my point.
| mike_hearn wrote:
| The reason it's not discussed much is that what goes on
| downstream of tokenization is extremely opaque. It's lots of
| layers of the transformer network, so the overall structure is
| documented, but what exactly those numbers mean is hard to
| figure out.
|
| There's an article here where the structure of an image
| generation network is explored a bit:
|
| https://openai.com/research/sparse-transformer
|
| They have a visualization of what the different layers are paying
| attention to.
|
| There are also some good explanations of transformers elsewhere
| online. This one is old but I found it helpful:
|
| http://jalammar.github.io/illustrated-transformer/
| hsjqllzlfkf wrote:
| This was my suspicion, thank you.
| thaumasiotes wrote:
| > While you and I would expect whole words to be tokens, many
| tokens are shorter - 3 to 4 characters - and don't always match
| the sentence structure you and I expect.
|
| There is a phenomenon called Broca's aphasia which is,
| essentially, the inability to connect words into sentences. This
| mostly prevents the patient from communicating via language. But
| patients with this condition can reveal quite a bit about the
| structure of the language they can no longer speak.
|
| One example discussed in _The Language Instinct_ is someone who
| works at (and was injured at) a mill. He is unable to produce
| utterances that are more than one word long, though he seems to
| do well at understanding what people say to him. One of his
| single-word utterances, describing the mill where he works, is
| "Four hundred tons a day!"
|
| This is the opposite of what you describe: a single token that is
| longer than one word in the base language instead of being
| shorter. But it appears to be the same kind of thing.
|
| By the way, if you study a highly inflectional language such as
| Latin or Russian, you will lose the assumption that interpretive
| tokens should be whole words. You'd still expect them to align
| closely with sentence structure, though.
| whimsicalism wrote:
| You're using the word "vectorization" in an idiosyncratic way;
| are you referring to the process of embedding words?
| simonw wrote:
| I just think they're interesting.
|
| From a practical point of view they only really matter in that we
| have to think carefully about how to use our token budget.
| ftxbro wrote:
| The reason it's trending today is the phenomenon of Glitch
| Tokens. They thought all Glitch Tokens had been removed in GPT-4,
| but apparently one is still left. If you go down the rabbit hole
| on Glitch Tokens it gets... really, really weird.
| [deleted]
| ftxbro wrote:
| I just want to say I love your pet pelican names: Pelly, Beaky,
| SkyDancer, Scoop, and Captain Gulliver.
| simonw wrote:
| Captain Gulliver is genuinely an excellent name for a pelican!
| api wrote:
| Has anyone ever tried a GPT trained on, say, 256 tokens
| representing bytes in a byte stream, or, even more simply, binary
| digits?
|
| I imagine there are efficiency trade-offs but I just wonder if it
| works at all.
| sandinmyjoints wrote:
| Not a GPT, but I think Megabyte does that.
| simonw wrote:
| Here's the Observable notebook I built to explore how the
| tokenizers work: https://observablehq.com/@simonw/gpt-tokenizer
| jcims wrote:
| I really, really wish someone would try tokenizing off of a
| phonetic representation rather than a textual one. I think it
| would be interesting to compare the output.
| bpiche wrote:
| spaCy's sense2vec gets pretty close to that:
|
| https://spacy.io/universe/project/sense2vec/
|
| Granted, it is 8 years old, but it's still interesting.
| ftxbro wrote:
| It doesn't matter. The 'bitter lesson', as coined by Rich Sutton,
| is that stacking more layers with more parameters, compute, and
| dataset size is going to swamp any kind of clever 'feature
| engineering', like trying to be clever about phonetic tokens.
| Karpathy, for example, just wants to go back to byte tokens.
| nsinreal wrote:
| Yes, but how many extra layers and how much computing power do
| you need? Of course, phonetic tokens are an awkward idea, but
| there is a reason the word "human" is encoded as only one token.
| spywaregorilla wrote:
| I don't think that is intuitive at all. "Clever feature
| engineering" like trying to create columns from calculations of
| tabular data, sure. You're not going to move the needle. But the
| basic representation of unstructured data like text could very
| believably alter the need for parameters, layers, and calculation
| speed by orders of magnitude.
| whimsicalism wrote:
| You would be wrong at the scales we are talking about.
|
| The whole point is that it is unintuitive.
| ftxbro wrote:
| > "I don't think that is intuitive at all."
|
| That's exactly the point. Every intuition is always on the side
| of feature engineering.
| sp332 wrote:
| Most current implementations can't count syllables at all, so it
| would get you at least that far.
| TechBro8615 wrote:
| Kudos to simonw for all the LLM content you've been publishing. I
| like reading your perspective and notes on your own learning
| experiences.
| throwaway2016a wrote:
| Pardon the n00b question, but...
|
| How does this relate to vectors? It was my understanding that the
| tokens were vectors, and this seems to show them as integers.
|
| It's probably a really obvious question to anyone who knows AI,
| but I figured if I have it someone else does too.
| binarymax wrote:
| Very basic overview: a token is assigned a number, that number
| gets passed into the encoder model with other token numbers, and
| the encoder model transforms those number sequences into
| embeddings (vectors).
| nighthawk454 wrote:
| Each token is an integer. The first layer of the model is an
| 'embedding', which is essentially a giant lookup table. So if a
| string gets tokenized to Token #3, that means: get the vector in
| row 3 of the embedding table. (Those vectors are learned during
| model training.)
|
| More completely, you can think of the integers as implicitly
| being a one-hot vector encoding. Say you have a vocab size of
| 20,000 and you want Token #3. The one-hot vector would be a
| 20,000-length vector of zeros with a one in position 3. This
| vector is then multiplied against the embedding table/matrix. In
| practice this is equivalent to just selecting one row directly,
| so it's implemented that way and there's no reason to explicitly
| build the large one-hot vectors.
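A small sketch of that one-hot-versus-lookup equivalence using NumPy
(toy sizes; real vocabularies and embedding dimensions are much
larger):

    import numpy as np

    vocab_size, embed_dim = 20_000, 8
    rng = np.random.default_rng(0)
    embedding_table = rng.normal(size=(vocab_size, embed_dim))

    token_id = 3

    # One-hot vector multiplied against the embedding matrix...
    one_hot = np.zeros(vocab_size)
    one_hot[token_id] = 1.0
    via_matmul = one_hot @ embedding_table

    # ...selects exactly the same values as indexing the row directly.
    via_lookup = embedding_table[token_id]

    assert np.allclose(via_matmul, via_lookup)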
| z3c0 wrote:
| The integers represent a position within a vector of "all known
| tokens". Typically, following a simple bag-of-words approach,
| each position in the vector would be toggled to 1 or 0 based on
| the presence of a token in a given document. Since most vectors
| would be almost completely zeroed, the simpler way to represent
| them is through a list of positions in the now-abstracted vector,
| aka a sparse vector, i.e. a list of integers.
|
| In the case of more advanced language models like LLMs, a given
| token can be paired with many other features of the token (such
| as dependencies or parts of speech) to make an integer represent
| one of many permutations of the same word based on its usage.
___________________________________________________________________
(page generated 2023-06-08 23:00 UTC)