[HN Gopher] Show HN: HuggingFace - Fast tokenization library for...
___________________________________________________________________

Show HN: HuggingFace - Fast tokenization library for deep-learning NLP pipelines

Author : julien_c
Score  : 118 points
Date   : 2020-01-13 16:40 UTC (6 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| hnaccy wrote:
| Great! Just did a quick test and got a 6-7x speedup on tokenization.

| clmnt wrote:
| Mind sharing what tests you ran & with which setup? Thanks!

| ZeroCool2u wrote:
| We use both SpaCy and HuggingFace at work. Is there a comparison of this vs SpaCy's tokenizer[1]?
|
| 1. https://spacy.io/usage/linguistic-features#tokenization

| orestis wrote:
| Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I've seen (gensim) use bag-of-words, which seems to be outdated.

| rococode wrote:
| They don't use huggingface, but some of the modern approaches for topic modeling use variational auto-encoders, see:
|
| Open-SESAME (2017): https://arxiv.org/abs/1706.09528 / https://github.com/swabhs/open-sesame
|
| VAMPIRE (2019): https://arxiv.org/abs/1906.02242 / https://github.com/allenai/vampire

| ogrisel wrote:
| Big transformer neural networks are probably overkill for topic modeling. More traditional methods implemented in Gensim or scikit-learn, such as TF-IDF vectors followed by SVD (aka LSI), LDA, or NMF, are probably just fine for extracting topics (soft clustering).

| ogrisel wrote:
| The reason is that you do not need to finely understand the structure of individual sentences to group documents by similar topics. Word order does not matter much for this task. Hence the success of methods that use Bag of Words (e.g. TF-IDF) as their input representation.

| orestis wrote:
| It might be that the corpus I was trying to cluster needs better preprocessing, or perhaps better n-grams. Using bigrams only, I saw a lot of common words that were meaningless, but adding them as stop words made the results worse. Hence my wondering whether some other vectorization would produce better results.
|
| On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: so many frameworks and libraries, but no clear ways to compose them together. Any resources out there that help make sense of things?

| nestorD wrote:
| If I understand your problem correctly, you can use TF-IDF to reduce the weight of meaningless words.

| [deleted]

| virtuous_signal wrote:
| I didn't realize that particular emoji had a name. I thought it was a play on this: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...

| julien_c wrote:
| TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).
|
| Main features:
| - Encodes 1GB of text in 20 seconds
| - Provides BPE/Byte-Level-BPE/WordPiece/SentencePiece...
| - Computes an exhaustive set of outputs (offset mappings, attention masks, special token masks...)
| - Written in Rust with bindings for Python and Node.js
|
| Github repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokeni...
|
| To install:
| - Rust: https://crates.io/crates/tokenizers
| - Python: pip install tokenizers
| - Node: npm install tokenizers
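For the topic-modeling subthread above, a minimal sketch of the traditional route ogrisel describes (TF-IDF vectors followed by NMF for soft topic clustering), using scikit-learn; the toy corpus and hyperparameters are placeholders, not anything from the thread:

    # TF-IDF + NMF topic extraction, per ogrisel's suggestion above.
    # Assumes `pip install scikit-learn`; documents and parameters are toy values.
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the tokenizer encodes raw text into integer ids",
        "transformer models attend over sub-word tokens",
        "tf-idf with svd or nmf is a strong topic-modeling baseline",
    ]  # a real corpus would have many more documents

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)          # sparse (n_docs, n_terms) matrix

    nmf = NMF(n_components=2, random_state=0)   # n_components = number of topics
    doc_topics = nmf.fit_transform(X)           # per-document topic weights

    terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
    for k, component in enumerate(nmf.components_):
        top_terms = [terms[i] for i in component.argsort()[-5:][::-1]]
        print("topic", k, top_terms)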
| LunaSea wrote:
| It used to be that pre-deep-learning tokenizers would extract n-grams (n-token-sized chunks), but this doesn't seem to exist anymore in the word-embedding tokenizers I've come across.
|
| Is this possible using HuggingFace (or another word-embedding-based library)?
|
| I know that there are some simple heuristics like merging noun token sequences together to extract n-grams, but they are too simplistic and very error-prone.

| brockf wrote:
| Most implementations are actually moving in the opposite direction. Previously, there was a tendency to aggregate words into phrases to better capture the "context" of a word. Now, most approaches are splitting words into sub-word parts or even characters. With networks that capture temporal relationships across tokens (as opposed to older, "bag of words" models), multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts.

| LunaSea wrote:
| > multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts
|
| Indeed. Do you have an example of a library or snippet that demonstrates this?
|
| My limited understanding of BERT (and other) word embeddings was that they only contain the word's position in the 768 (I believe) dimensional space but don't contain queryable temporal information, no?
|
| I like n-grams as a sort of untagged / unlabelled entity.

| PeterisP wrote:
| When using BERT (and all the many things like it, such as the earlier ELMo and ULMFiT and the later RoBERTa/ERNIE/ALBERT/etc.) as the 'embeddings', you provide as input all the tokens in a sequence. You don't get an "embedding for word foobar in position 123"; you get an embedding for the whole sequence at once, so whatever corresponds to that token is a 768-dimensional "embedding for word foobar in position 123 conditional on _all the particular other words that were before and after it_". Including very long-distance relations.
|
| One of the simpler ways to try that out in your code seems to be running BERT-as-a-service https://github.com/hanxiao/bert-as-service , or alternatively the huggingface libraries that are discussed in the original article.
|
| It's kind of the other way around compared to word2vec-style systems: before, you used to have a 'thin' embedding layer that's essentially just a lookup table, followed by a bunch of complex layers of neural networks (e.g. multiple Bi-LSTMs followed by a CRF); in the 'current style' you have "thick embeddings", which means running through all the many transformer layers of a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.

| visarga wrote:
| > Do you have an example of a library or snippet that demonstrates this?
|
| All NLP neural nets (based on LSTMs or Transformers) do this. It's their main function - to create contextual representations of the input tokens.
|
| The word's 'position' in the 768-dimensional space is an embedding, and it can be compared with other words by dot product. There are libraries that can do dot-product ranking fast (such as annoy).

| echelon wrote:
| I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.
|
| What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation? What cool problems are there?
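A minimal sketch of the "thick embeddings" PeterisP and visarga describe, using the transformers library (the model name and example sentence are only illustrative): every token comes back as a 768-dimensional vector conditioned on the whole sequence, and those vectors can then be compared by dot product or indexed with something like annoy.

    # Contextual token embeddings from a pretrained BERT, as discussed above.
    # Assumes `pip install torch transformers`.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    # The vector for each token depends on every other token in the sentence.
    input_ids = tokenizer.encode("The bank raised interest rates.", return_tensors="pt")
    with torch.no_grad():
        last_hidden_state = model(input_ids)[0]  # shape: (1, num_tokens, 768)

    print(last_hidden_state.shape)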
| visarga wrote:
| > What problems can you solve with NLP?
|
| It's mostly understanding text and generating text. You can do named entity extraction, question answering, summarisation, dialogue bots, information extraction from semi-structured documents such as tables and invoices, spelling correction, typing auto-suggestions, document classification and clustering, topic discovery, part-of-speech tagging, syntactic trees, language modelling, image description and image question answering, entailment detection (whether two statements support one another), coreference resolution, entity linking, intent detection and slot filling, building large knowledge bases (databases of subject-relation-object triplets), spam detection, toxic message detection, ranking search results in search engines, and many many more.

| mraison wrote:
| I believe many folks are particularly attracted to NLP because the Turing test [1] is an NLP problem.
|
| [1] https://en.m.wikipedia.org/wiki/Turing_test

| brokensegue wrote:
| Disagree. I think its value is mostly unrelated to that.

| crawdog wrote:
| There's a lot! Sentence detection and parts-of-speech (POS) detection, to name a couple. These can be used to determine key concepts in documents that lack metadata. For example: you could cluster on common phrases to identify relationships in data.

| starpilot wrote:
| All of the above; it's like asking what problems you can solve with math. HuggingFace's transformers are said to be a Swiss Army knife for NLP. I haven't worked with them yet, but the main fundamental utility seems to be generating fixed-length vector representations of words. Word2vec started this, but the vectors have gotten much better with stuff like BERT.

| Isn0gud wrote:
| I thought transformers are mainly used for multi-word embeddings?!

| rsp1984 wrote:
| What does tokenization (of strings, I guess) do?

| wyldfire wrote:
| The README [1] shows a great example:
|
| The sentence "Hello, y'all! How are you ?" is tokenized into words. Those words are then encoded into integers representative of the words' identity in the model's dictionary.
|
|     >>> output = tokenizer.encode("Hello, y'all! How are you ?")
|     Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])
|     >>> print(output.ids, output.tokens, output.offsets)
|     [101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 102]
|     ['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
|     [(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27), (28, 29), (0, 0)]
|
| But there's also good detail in the source [2], which says, "A Tokenizer works as a pipeline, it processes some raw text as input and outputs an Encoding. The various steps of the pipeline are: ...."
|
| [1] https://github.com/huggingface/tokenizers#quick-examples-usi...
|
| [2] https://github.com/huggingface/tokenizers/tree/master/tokeni...
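The README snippet quoted above doesn't show how the `tokenizer` object was constructed; a minimal sketch assuming the library's BertWordPieceTokenizer helper and a separately downloaded BERT WordPiece vocabulary file (the file name here is an assumption, not from the thread):

    # Building a tokenizer that produces an Encoding like the one shown above.
    # Assumes `pip install tokenizers` and a vocabulary file downloaded locally.
    from tokenizers import BertWordPieceTokenizer

    tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

    output = tokenizer.encode("Hello, y'all! How are you ?")
    print(output.ids)      # e.g. [101, 7592, 1010, ...] - ids in the model's vocabulary
    print(output.tokens)   # ['[CLS]', 'hello', ',', ...] - the wordpiece strings
    print(output.offsets)  # character spans pointing back into the original string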
| mark_l_watson wrote:
| I love the work done and made freely available by both spaCy and HuggingFace.
|
| I had my own NLP libraries for about 20 years; the simple ones were examples in my books, and the more complex (and not so understandable) ones I sold as products and pulled in lots of consulting work with.
|
| I have completely given up my own work developing NLP tools, and generally I use the Python bindings (via the Hy language (hylang), which is a Lisp that sits on top of Python) for spaCy, huggingface, TensorFlow, and Keras. I am retired now, but my personal research is in hybrid symbolic and deep learning AI.

| dunefox wrote:
| Hybrid symbolic and NN will be my next area of hobby research; I'm currently getting my master's degree in NLP. Do you have a few good resources to get started with or read about?

| tarr11 wrote:
| Why is this company called HuggingFace?

| itronitron wrote:
| I assume it is a reference to the movie Alien.

| manojlds wrote:
| Title is off? Should mention Tokenizers as the project.

| screye wrote:
| I can't believe the level of productivity this Hugging Face team has.
|
| They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.
|
| Idk what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just the quality of that team. Now, if only I could figure out how to put a few thousand $ in a series-A startup as just some guy.

| manojlds wrote:
| > Idk what their monetization plan is as a startup
|
| > put a few thousand $ in a series-A
|
| Not a good idea.

| screye wrote:
| I see them as an acqui-hire target. Especially from Facebook, since they are so geographically close to the FAIR labs in NY, or from Google, getting integrated into Google AI the way DeepMind did (esp. since Google uses a ton of Transformers anyway).
|
| I can't think of many small teams that can be acquired and can build a company's ML infrastructure as fast as this team.
|
| If they have the money for it, OCI and Azure may also be keeping a lookout for them.

| useful wrote:
| Somewhat related: if someone wants to build something awesome, I haven't seen anything that merges Lucene with BPE/SentencePiece.
|
| SentencePiece ought to make it possible to shrink the memory requirements of your indexes for search and typeahead stuff.
___________________________________________________________________
(page generated 2020-01-13 23:00 UTC)