[HN Gopher] Show HN: LLaMA tokenizer that runs in browser
___________________________________________________________________

Show HN: LLaMA tokenizer that runs in browser

Author : belladoreai
Score  : 26 points
Date   : 2023-06-13 20:22 UTC (2 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| belladoreai wrote:
| Hi HN! I was looking for a tokenizer that would accurately(!)
| count tokens in the browser, and I couldn't find one. So I
| thought "how hard can it be?", and here we are two weeks
| later...
|
| zerojames wrote:
| I love this sentiment! Amazing work!
|
| Solvency wrote:
| For those who would also think the same thing, what're some of
| the TL;DR bullet points on why this is more complicated than
| it'd seem?
|
| belladoreai wrote:
| I'll answer with an example.
|
| Consider the input string " grabbed".
|
| If we wanted to map this string to tokens by greedily going
| from left to right, choosing tokens from the vocabulary with
| the strategy of minimizing the number of tokens, our algorithm
| would be very simple. We would end up with the following
| tokenization: [17229, 2580] == [" grab", "bed"]
|
| Surprisingly, the LLaMA tokenizer does not work this way. It
| actually finds a "worse" tokenization for this input string:
| [2646, 1327, 287] == [" gra", "bb", "ed"]
|
| The tokenizer arrives at this 3-token output by applying
| "merges" in a priority order. For example, this is a merge:
| [" g", "r"] -> " gr". The trained data contains tens of
| thousands of these merges. When we apply the merges in
| priority order, we end up with 3 tokens.
|
| Now you might be thinking: that's easy, we'll just iterate the
| list of merges and see if any of them apply. The only problem
| with that approach is that _applying a merge can open up a new
| opportunity to merge something else that wasn't possible
| before_. This right here is the key thing that makes this
| problem complicated.
| We can solve this problem by iterating all possible merges
| from the beginning every time we apply a merge. This would
| produce the correct solution. The only problem is that our
| algorithm is now very slow and takes minutes to run...
___________________________________________________________________
(page generated 2023-06-13 23:01 UTC)
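[Editor's note] The priority-ordered merging described in the thread can be sketched in JavaScript. This is an illustrative sketch only: the function name, the merge table, and its priority order are made up for the example (they are not the real LLaMA merge data, and this is not the linked library's implementation). It does, however, reproduce the ["gra", "bb", "ed"]-style splitting of "grabbed" and shows how one merge unlocks the next:

```javascript
// Sketch of priority-ordered BPE merging (illustrative, not the
// linked library's actual algorithm or merge data).
function bpeMerge(symbols, merges) {
  // merges: array of [left, right] pairs, highest priority first.
  // Build a rank map: lower index = higher priority.
  const rank = new Map(merges.map(([a, b], i) => [a + "\u0000" + b, i]));

  while (true) {
    // Find the highest-priority merge applicable anywhere in the sequence.
    let best = -1, bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const r = rank.get(symbols[i] + "\u0000" + symbols[i + 1]);
      if (r !== undefined && r < bestRank) { bestRank = r; best = i; }
    }
    if (best === -1) break; // no merge applies anywhere: done
    // Apply it. The new, longer symbol may enable merges that were
    // impossible before -- which is why we rescan on the next pass.
    symbols.splice(best, 2, symbols[best] + symbols[best + 1]);
  }
  return symbols;
}

// Hypothetical merges in priority order (NOT the real LLaMA merge table):
const merges = [
  ["b", "b"],   // highest priority: "b" + "b" -> "bb"
  ["e", "d"],
  ["g", "r"],
  ["gr", "a"],  // only possible after "g" + "r" has merged
];
console.log(bpeMerge([..."grabbed"], merges)); // -> [ 'gra', 'bb', 'ed' ]
```

Note the cascade: merging "g"+"r" into "gr" is what makes the lower-priority "gr"+"a" merge applicable at all, which is exactly the "applying a merge can open up a new opportunity" problem described above. Rescanning all adjacent pairs after every merge is correct but naive; a fast implementation would track candidate pairs incrementally instead.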