[HN Gopher] Show HN: LLaMA tokenizer that runs in browser
       ___________________________________________________________________
        
       Show HN: LLaMA tokenizer that runs in browser
        
       Author : belladoreai
       Score  : 26 points
       Date   : 2023-06-13 20:22 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | belladoreai wrote:
       | Hi HN! I was looking for a tokenizer that would accurately(!)
       | count tokens in browser, and I couldn't find one. So I thought
       | "how hard can it be", and here we are 2 weeks later...
        
         | zerojames wrote:
         | I love this sentiment! Amazing work!
        
         | Solvency wrote:
          | For those who would also think the same thing, what're some
          | of the tl;dr bullet points on why this is more complicated
          | than it'd seem?
        
           | belladoreai wrote:
           | I'll answer with an example.
           | 
           | Consider the input string " grabbed".
           | 
           | If we wanted to map this string to tokens by greedily going
           | from left to right and choosing tokens from the vocabulary
           | with the strategy of minimizing the number of tokens, our
           | algorithm would be very simple. We would end up with the
           | following tokenization: [17229, 2580] == [" grab", "bed"]
           | 
           | Surprisingly, the LLaMA tokenizer does not work this way. It
           | actually finds a "worse" tokenization for this input string:
           | [2646, 1327, 287] == [" gra", "bb", "ed"]
           | 
           | The tokenizer arrives at this 3 token output by applying
           | "merges" in a priority order. For example, this is a merge:
           | [" g", "r"] -> " gr". The trained data contains tens of
           | thousands of these merges. When we apply the merges in the
           | priority order, we end up with 3 tokens.
           | 
            | Now you might be thinking: that's easy, we'll just
            | iterate the list of merges and see if any of them apply.
            | The only problem with that approach is that _applying a
            | merge can open up a new opportunity to merge something
            | else that wasn't possible before_. This right here is the
            | key thing that makes the problem complicated. We can
            | solve it by re-scanning all possible merges from the
            | beginning every time we apply a merge (see the sketch
            | below). That produces the correct tokenization, but the
            | algorithm is now very slow and takes minutes to run...
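            | 
            | To make that concrete, here is a minimal sketch of the
            | naive re-scan approach in TypeScript. It is illustrative
            | only, not the library's actual code: the tiny merge list
            | below is made up so that the " grabbed" example works
            | out, whereas the real tokenizer loads tens of thousands
            | of merges from the LLaMA tokenizer model.
            | 
            |   // A merge is a pair of adjacent pieces that may be
            |   // joined, e.g. [" g", "r"] -> " gr".
            |   type Merge = [string, string];
            | 
            |   // Made-up merges in priority order (highest first),
            |   // chosen only to reproduce the " grabbed" example.
            |   const merges: Merge[] = [
            |     [" ", "g"],
            |     [" g", "r"],
            |     [" gr", "a"],
            |     ["b", "b"],
            |     ["e", "d"],
            |   ];
            | 
            |   function tokenize(input: string): string[] {
            |     // Start from single characters (the real tokenizer
            |     // starts from byte-level pieces).
            |     const parts: string[] = Array.from(input);
            | 
            |     // After every successful merge, restart the scan
            |     // from the highest-priority merge, because applying
            |     // a merge can enable merges that were not possible
            |     // before.
            |     outer: while (true) {
            |       for (const [left, right] of merges) {
            |         for (let i = 0; i + 1 < parts.length; i++) {
            |           if (parts[i] === left && parts[i + 1] === right) {
            |             parts.splice(i, 2, left + right);
            |             continue outer; // re-scan from the beginning
            |           }
            |         }
            |       }
            |       break; // no merge applied anywhere: we are done
            |     }
            |     return parts;
            |   }
            | 
            |   console.log(tokenize(" grabbed"));
            |   // -> [" gra", "bb", "ed"]
            | 
            | The repeated full re-scan is what makes this quadratic-ish
            | and slow on long inputs; a fast implementation has to be
            | cleverer about which merges to re-check after each one is
            | applied.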
        
       ___________________________________________________________________
       (page generated 2023-06-13 23:01 UTC)