[HN Gopher] Show HN: HuggingFace - Fast tokenization library for...
       ___________________________________________________________________
        
       Show HN: HuggingFace - Fast tokenization library for deep-learning
       NLP pipelines
        
       Author : julien_c
       Score  : 118 points
       Date   : 2020-01-13 16:40 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | hnaccy wrote:
       | Great! Just did a quick test and got a 6-7x speedup on
       | tokenization.
        
         | clmnt wrote:
          | Mind sharing what tests you ran & with which setup? Thanks!
        
       | ZeroCool2u wrote:
       | We use both SpaCy and HuggingFace at work. Is there a comparison
       | of this vs SpaCy's tokenizer[1]?
       | 
       | 1. https://spacy.io/usage/linguistic-features#tokenization
        
       | orestis wrote:
       | Are there examples on how this can be used for topic modeling,
       | document similarity etc? All the examples I've seen (gensim) use
       | bag-of-words which seems to be outdated.
        
         | rococode wrote:
         | They don't use huggingface, but some of the modern approaches
         | for topic modeling use variational auto-encoders, see:
         | 
         | Open-SESAME (2017): https://arxiv.org/abs/1706.09528 /
         | https://github.com/swabhs/open-sesame
         | 
         | VAMPIRE (2019): https://arxiv.org/abs/1906.02242 /
         | https://github.com/allenai/vampire
        
         | ogrisel wrote:
          | Big transformer neural networks are probably overkill for
          | topic modeling. More traditional methods implemented in Gensim
          | or scikit-learn, such as TF-IDF vectors followed by SVD (aka
          | LSI), LDA, or NMF, are probably just fine for extracting
          | topics (soft clustering).
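          | 
          | A rough sketch of that TF-IDF + NMF route, assuming
          | scikit-learn; `docs` and everything below are illustrative,
          | not from the thread:
          | 
          |   import numpy as np
          |   from sklearn.decomposition import NMF
          |   from sklearn.feature_extraction.text import TfidfVectorizer
          | 
          |   docs = ["..."]  # your corpus of raw text documents
          | 
          |   # Bag-of-words -> TF-IDF weighting
          |   vec = TfidfVectorizer(max_df=0.95, min_df=2,
          |                         stop_words="english")
          |   X = vec.fit_transform(docs)
          | 
          |   # NMF gives soft topic memberships per document
          |   nmf = NMF(n_components=10, random_state=0)
          |   doc_topics = nmf.fit_transform(X)   # (n_docs, n_topics)
          | 
          |   # Top terms per topic
          |   # (newer scikit-learn: get_feature_names_out())
          |   terms = np.array(vec.get_feature_names())
          |   for k, comp in enumerate(nmf.components_):
          |       top = terms[np.argsort(comp)[::-1][:8]]
          |       print(f"topic {k}:", " ".join(top))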
        
           | ogrisel wrote:
           | The reason is that you do not need to finely understand the
           | structure of individual sentences to group documents by
           | similar topics. Word order does not matter much for this
            | task. Hence the success of methods that use bag-of-words
            | (e.g. TF-IDF) as their input representation.
        
             | orestis wrote:
              | It might be that the corpus I was trying to cluster needs
              | better preprocessing, or perhaps better n-grams. Using
              | bigrams only, I saw a lot of common words that were
              | meaningless, but adding them as stop words made the
              | results worse. Hence my wondering whether some other
              | vectorization would produce better results.
              | 
              | On a related note, as a newcomer just trying to get things
              | done (i.e. applied NLP) I find the whole ecosystem great
              | but frustrating: so many frameworks and libraries, but no
              | clear ways to compose them together. Any resources out
              | there that help make sense of things?
        
               | nestorD wrote:
                | If I understand your problem correctly, you can use
                | TF-IDF to reduce the weight of meaningless words.
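                | 
                | For instance (a sketch assuming
                | scikit-learn; `docs` is a placeholder
                | for your list of strings):
                | 
                |   from sklearn.feature_extraction \
                |       import text
                | 
                |   vec = text.TfidfVectorizer(
                |       # unigrams + bigrams
                |       ngram_range=(1, 2),
                |       # drop near-ubiquitous terms
                |       max_df=0.8,
                |       # drop very rare noise
                |       min_df=5,
                |       sublinear_tf=True)
                |   X = vec.fit_transform(docs)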
        
       | [deleted]
        
       | virtuous_signal wrote:
       | I didn't realize that particular emoji had a name. I thought it
       | was a play on this:
       | https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...
        
       | julien_c wrote:
        | TL;DR: Hugging Face, the NLP research company known for its
        | transformers library (DISCLAIMER: I work at Hugging Face), has
        | just released a new open-source library for ultra-fast &
        | versatile tokenization for NLP neural net models (i.e.
        | converting strings into model input tensors).
        | 
        | Main features:
        | 
        |   - Encode 1GB of text in ~20 seconds
        |   - Provides BPE / Byte-Level BPE / WordPiece / SentencePiece...
        |   - Computes an exhaustive set of outputs (offset mappings,
        |     attention masks, special token masks...)
        |   - Written in Rust, with bindings for Python and Node.js
        | 
        | Github repository and doc:
        | https://github.com/huggingface/tokenizers/tree/master/tokeni...
        | 
        | To install:
        | 
        |   - Rust: https://crates.io/crates/tokenizers
        |   - Python: pip install tokenizers
        |   - Node: npm install tokenizers
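        | 
        | For a feel of the Python API, here's a minimal sketch based on
        | the README example quoted further down in the thread (the vocab
        | file path is a placeholder, and helper names may differ between
        | versions):
        | 
        |   from tokenizers import BertWordPieceTokenizer
        | 
        |   # Load a WordPiece vocab (e.g. the bert-base-uncased one)
        |   tokenizer = BertWordPieceTokenizer(
        |       "bert-base-uncased-vocab.txt")
        | 
        |   output = tokenizer.encode("Hello, y'all! How are you ?")
        |   print(output.ids)      # token ids for the model
        |   print(output.tokens)   # the actual sub-word strings
        |   print(output.offsets)  # character offsets into the input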
        
       | LunaSea wrote:
        | It used to be that pre-deep-learning tokenizers would extract
        | ngrams (n-token sized chunks), but this doesn't seem to exist
        | anymore in the word embedding tokenizers I've come across.
       | 
       | Is this possible using HuggingFace (or another word embedding
       | based library)?
       | 
        | I know that there are some simple heuristics, like merging noun
        | token sequences together to extract ngrams, but they are too
        | simplistic and very error-prone.
        
         | brockf wrote:
         | Most implementations are actually moving in the opposite
         | direction. Previously, there was a tendency to look to
         | aggregate words into phrases to better capture the "context" of
         | a word. Now, most approaches are splitting words into sub-word
         | parts or even characters. With networks that capture temporal
         | relationships across tokens (as opposed to older, "bag of
         | words" models), multi-word patterns can effectively be captured
         | by attending to the temporal order of sub-word parts.
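          | 
          | As a rough illustration, assuming the transformers library
          | (the exact sub-word pieces shown are only indicative):
          | 
          |   from transformers import BertTokenizer
          | 
          |   tok = BertTokenizer.from_pretrained("bert-base-uncased")
          |   # Frequent words stay whole, rare words get split into
          |   # sub-word pieces ("##" marks a continuation piece).
          |   print(tok.tokenize("tokenization is unbelievably fast"))
          |   # -> something like
          |   # ['token', '##ization', 'is', 'un', '##believ', '##ably',
          |   #  'fast']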
        
           | LunaSea wrote:
           | > multi-word patterns can effectively be captured by
           | attending to the temporal order of sub-word parts
           | 
           | Indeed. Do you have an example of a library or snippet that
           | demonstrates this?
           | 
            | My limited understanding of BERT (and other) word embeddings
            | was that they only contain the word's position in the 768 (I
            | believe) dimensional space, but don't contain queryable
            | temporal information, no?
           | 
           | I like ngrams as a sort of untagged / unlabelled entity.
        
             | PeterisP wrote:
              | When using BERT (and all the many things like it, such as
              | the earlier ELMo and ULMFiT, and the later
              | RoBERTa/ERNIE/ALBERT/etc) as the 'embeddings', you provide
              | all the tokens in a sequence as input. You don't get an
              | "embedding for word foobar in position 123", you get an
              | embedding for the whole sequence at once, so whatever
              | corresponds to that token is a 768-dimensional "embedding
              | for word foobar in position 123 conditional on _all the
              | particular other words that were before and after it_".
              | Including very long-distance relations.
             | 
             | One of the simpler ways to try that out in your code seems
             | to be running BERT-as-a-service
             | https://github.com/hanxiao/bert-as-service , or
             | alternatively the huggingface libraries that are discussed
             | in the original article.
             | 
             | It's kind of the other way around compared to word2vec-
             | style systems; before that you used to have a 'thin'
             | embedding layer that's essentially just a lookup table
             | followed by a bunch of complex layers of neural networks
              | (e.g. multiple Bi-LSTMs followed by CRF); in the 'current
              | style' you have "thick embeddings", which means running
              | through all the many transformer layers of a pretrained
              | BERT-like system, followed by a thin custom layer that's
              | often just glorified linear regression.
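              | 
              | A small sketch of the "thick embeddings"
              | idea with the transformers library
              | (assuming the API of that era; newer
              | versions differ slightly):
              | 
              |   import torch
              |   from transformers import (BertModel,
              |                             BertTokenizer)
              | 
              |   name = "bert-base-uncased"
              |   tok = BertTokenizer.from_pretrained(name)
              |   model = BertModel.from_pretrained(name)
              | 
              |   ids = tok.encode(
              |       "The bank raised interest rates",
              |       return_tensors="pt")
              |   with torch.no_grad():
              |       out = model(ids)
              |   # (1, seq_len, 768): one contextual
              |   # vector per token, conditioned on
              |   # the whole sentence
              |   hidden = out[0]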
        
             | visarga wrote:
             | > Do you have an example of a library or snippet that
             | demonstrates this?
             | 
             | All NLP neural nets (based on LSTM or Transformer) do this.
             | It's their main function - to create contextual
             | representations of the input tokens.
             | 
              | The word's 'position' in the 768-dimensional space is an
              | embedding, and it can be compared with other words by dot
              | product. There are libraries that can do dot-product
              | ranking fast (such as Annoy).
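              | 
              | E.g. with Annoy (a sketch; the random
              | vectors stand in for the 768-d token
              | embeddings you'd get from a model):
              | 
              |   import numpy as np
              |   from annoy import AnnoyIndex
              | 
              |   dim = 768
              |   index = AnnoyIndex(dim, "angular")
              |   vecs = np.random.rand(1000, dim)
              |   for i, v in enumerate(vecs):
              |       index.add_item(i, v.tolist())
              |   index.build(10)  # 10 trees
              | 
              |   # 5 nearest neighbours of item 0 by
              |   # angular (cosine-like) distance
              |   print(index.get_nns_by_item(0, 5))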
        
       | echelon wrote:
       | I'm very familiar with the TTS, VC, and other "audio-shaped"
       | spaces, but I've never delved into NLP.
       | 
       | What problems can you solve with NLP? Sentiment analysis?
       | Semantic analysis? Translation?
       | 
       | What cool problems are there?
        
         | visarga wrote:
         | > What problems can you solve with NLP?
         | 
          | It's mostly understanding text and generating text. You can do
          | named entity extraction, question answering, summarisation,
          | dialogue bots, information extraction from semi-structured
          | documents such as tables and invoices, spelling correction,
          | typing auto-suggestions, document classification and
          | clustering, topic discovery, part-of-speech tagging, syntactic
          | trees, language modelling, image description and image
          | question answering, entailment detection (whether two
          | statements support one another), coreference resolution,
          | entity linking, intent detection and slot filling, building
          | large knowledge bases (databases of subject-relation-object
          | triples), spam detection, toxic message detection, ranking
          | search results in search engines, and many, many more.
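          | 
          | For a taste, a couple of those tasks are near one-liners with
          | the transformers `pipeline` helper (a sketch; it downloads a
          | pretrained model on first use):
          | 
          |   from transformers import pipeline
          | 
          |   classifier = pipeline("sentiment-analysis")
          |   print(classifier("This tokenizer library is really fast."))
          | 
          |   qa = pipeline("question-answering")
          |   print(qa(question="What is the library written in?",
          |            context="The tokenizers library is written in "
          |                    "Rust with bindings for Python and "
          |                    "Node.js."))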
        
         | mraison wrote:
         | I believe many folks are particularly attracted to NLP because
         | the Turing test [1] is an NLP problem.
         | 
         | [1] https://en.m.wikipedia.org/wiki/Turing_test
        
           | brokensegue wrote:
            | Disagree. I think its value is mostly unrelated to that.
        
         | crawdog wrote:
          | There's a lot! Sentence detection and part-of-speech (POS)
          | tagging, to name a couple. These can be used to determine key
          | concepts in documents that lack metadata. For example: you
          | could cluster on common phrases to identify relationships in
          | the data.
        
         | starpilot wrote:
          | All of the above; it's like asking what problems you can solve
          | with math. HuggingFace's transformers are said to be a Swiss
          | Army knife for NLP. I haven't worked with them yet, but the
          | main fundamental utility seems to be generating fixed-length
          | vector representations of words. Word2vec started this, but
          | the vectors have gotten much better with stuff like BERT.
        
           | Isn0gud wrote:
           | I thought transformers are mainly used for multi-word
           | embeddings?!
        
       | rsp1984 wrote:
       | What does tokenization (of strings, I guess) do?
        
         | wyldfire wrote:
         | The README [1] shows a great example:
         | 
         | The sentence "Hello, y'all! How are you ?" is tokenized into
         | words. Those words are then encoded into integers
         | representative of the words' identity in the model's
          | dictionary.
          | 
          |   >>> output = tokenizer.encode("Hello, y'all! How are you  ?")
          |   Encoding(num_tokens=13, attributes=[ids, type_ids, tokens,
          |       offsets, attention_mask, special_tokens_mask,
          |       overflowing, original_str, normalized_str])
          |   >>> print(output.ids, output.tokens, output.offsets)
          |   [101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017,
          |       100, 1029, 102]
          |   ['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are',
          |       'you', '[UNK]', '?', '[SEP]']
          |   [(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13),
          |       (14, 17), (18, 21), (22, 25), (26, 27), (28, 29), (0, 0)]
         | 
         | But there's also good detail in the source [2] which says, "A
         | Tokenizer works as a pipeline, it processes some raw text as
         | input and outputs an Encoding. The various steps of the
         | pipeline are: ...."
         | 
         | [1] https://github.com/huggingface/tokenizers#quick-examples-
         | usi...
         | 
         | [2]
         | https://github.com/huggingface/tokenizers/tree/master/tokeni...
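          | 
          | To make the pipeline idea concrete, here's a sketch of
          | assembling and training one with the Python bindings (API as
          | in later versions of the library, so details may differ from
          | the release being discussed):
          | 
          |   from tokenizers import Tokenizer
          |   from tokenizers.models import BPE
          |   from tokenizers.pre_tokenizers import Whitespace
          |   from tokenizers.trainers import BpeTrainer
          | 
          |   # model = sub-word algorithm; pre-tokenizer = how raw
          |   # text is split before the model sees it
          |   tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
          |   tokenizer.pre_tokenizer = Whitespace()
          | 
          |   trainer = BpeTrainer(
          |       special_tokens=["[UNK]", "[CLS]", "[SEP]"])
          |   # "my_corpus.txt" is a placeholder training file
          |   tokenizer.train(["my_corpus.txt"], trainer)
          | 
          |   print(tokenizer.encode("Hello, y'all!").tokens)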
        
       | mark_l_watson wrote:
       | I love the work done and made freely available by both spaCy and
       | HuggingFace.
       | 
        | I had my own NLP libraries for about 20 years; the simple ones
        | were examples in my books, and the more complex (and not so
        | understandable) ones I sold as products and used to pull in
        | lots of consulting work.
       | 
       | I have completely given up my own work developing NLP tools, and
       | generally I use the Python bindings (via the Hy language (hylang)
       | which is a Lisp that sits on top of Python) for spaCy,
       | huggingface, TensorFlow, and Keras. I am retired now but my
       | personal research is in hybrid symbolic and deep learning AI.
        
         | dunefox wrote:
          | Hybrid symbolic and NN approaches will be my next area of
          | hobby research; I'm currently getting my master's degree in
          | NLP. Do you have a few good resources to get started with or
          | read about?
        
       | tarr11 wrote:
       | Why is this company called HuggingFace?
        
         | itronitron wrote:
         | I assume it is a reference to the movie Alien.
        
       | manojlds wrote:
       | Title is off? Should mention Tokenizers as the project.
        
       | screye wrote:
        | I can't believe the level of productivity this Hugging Face team
        | has.
        | 
        | They seem to have found the ideal balance of software
        | engineering capability and neural network knowledge, in a team
        | of highly effective and efficient employees.
        | 
        | Idk what their monetization plan is as a startup, but it is 100%
        | undervalued at 20 million, and that's just on the quality of the
        | team. Now, if only I could figure out how to put a few thousand
        | $ into a series-A startup as just some guy.
        
         | manojlds wrote:
         | > Idk what their monetization plan is as a startup
         | 
         | > put a few thousand $ in a series-A
         | 
         | Not a good idea.
        
           | screye wrote:
            | I see them as an acqui-hire target. Especially for Facebook,
            | since they are so geographically close to the FAIR lab in
            | NY, or for Google, getting integrated into Google AI like
            | DeepMind did (esp. since Google uses a ton of Transformers
            | anyway).
            | 
            | I can't think of many small teams that can be acquired and
            | can build a company's ML infrastructure as fast as this
            | team.
            | 
            | If they have the money for it, OCI and Azure may also be
            | keeping an eye out for them.
        
       | useful wrote:
        | Somewhat related: if someone wants to build something awesome, I
        | haven't seen anything that merges Lucene with BPE/SentencePiece.
        | 
        | SentencePiece ought to make it possible to shrink the memory
        | requirements of your indexes for search and typeahead stuff.
        
       ___________________________________________________________________
       (page generated 2020-01-13 23:00 UTC)