[HN Gopher] Advanced NLP with spaCy v3
___________________________________________________________________
Advanced NLP with spaCy v3
Author : philipvollet
Score : 146 points
Date : 2021-12-10 16:07 UTC (6 hours ago)
(HTM) web link (course.spacy.io)
(TXT) w3m dump (course.spacy.io)
| minimaxir wrote:
| A relatively underdiscussed quirk of the rise of superlarge language models like GPT-3 for certain NLP tasks is that, since those models have incorporated so much real-world grammar, there's no need to do advanced preprocessing: you can just YOLO and work with generated embeddings without going into spaCy's (excellent) parsing/NER features.
|
| OpenAI recently released an Embeddings API for GPT-3 with good demos and explanations: https://beta.openai.com/docs/guides/embeddings
|
| Hugging Face Transformers makes this easier (and for free), as most models can be configured to return a "last_hidden_state", whose token vectors can be pooled into an aggregate embedding. Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs) and you're probably good to go.
| mtqwerty wrote:
| Readjusting expectations for pre-processing was one of the biggest differences I noticed going from NLP courses to working on NLP in production. Given the amount of pre-processing learning material there is, I expected it to be much more important in practice.
|
| I feel lucky to have gotten into NLP when I did (learning in 2017/2018 and working from the beginning of 2020). Changing our system from GloVe to BERT was super exciting and a great way to learn about the drawbacks and benefits of each.
| Vetch wrote:
| While you make sensible points, in the case of GPT-3, not everyone will be willing to route their data through OpenAI's servers.
|
| > Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs)
|
| This can still be impractical, at least in my case of regularly needing to process hundreds of pages of text.
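[Editor's note: minimaxir's suggestion above — pooling a model's `last_hidden_state` into a single document vector — can be sketched without the transformers dependency. The random token vectors below are stand-ins for DistilBERT's real hidden states; only the pooling and similarity logic is the point.]

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the per-token vectors, ignoring padding positions."""
    mask = attention_mask[:, None]                 # (seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=0)
    return summed / mask.sum()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two pooled embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in for model(**tokens).last_hidden_state: 5 tokens, 768 dims.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 768))
mask = np.array([1, 1, 1, 1, 0])  # last position is padding

doc_vec = mean_pool(hidden, mask)
print(doc_vec.shape)              # (768,)
print(cosine(doc_vec, doc_vec))   # ~1.0: identical vectors are maximally similar
```

With the real library the only change is where `hidden` and `mask` come from (the tokenizer and model outputs); the pooled vectors can then be compared with the same cosine similarity.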
Simpler systems can be much faster for an acceptable loss, and you can get more robustness by working with label distributions instead of just picking the argmax.
|
| Fast, simpler classifiers can also help decide where the more resource-intensive models should focus attention.
|
| Another reason for preprocessing is rule systems. Even if they're not glamorous to talk about, they still see heavy use in practical settings. While dependency parses are hard to make use of, shallow parses (chunking) and part-of-speech data can be usefully fed into rule systems.
| new_stranger wrote:
| I imagine it being very useful to understand what you just said
| hooande wrote:
| lol. a rough translation is that the new super language models are good enough that you don't have to keep track of specific parts of speech in your programming. if you look at the arrays of floating point weights that underlie gpt-3 etc, you can use them to match present participle phrases with other present participle phrases and so forth
|
| this is of course a correct and prescient observation. minimaxir is kind of an NLP final boss, so I wouldn't expect most people to be able to follow everything he says
| minimaxir wrote:
| I don't think it's a final boss thing: IMO working with embeddings/word vectors, even in the basest case such as word2vec/GloVe, is easier to understand than some of the more conventional NLP techniques (e.g. bag of words/TF-IDF).
|
| The spaCy tutorials in the submission also have a section on word vectors.
| Vetch wrote:
| Ah, although TF-IDF is still good to know. Semantic search hasn't eliminated the need for classical retrieval techniques. It can also be used to select a subset of words whose word vectors are averaged into a document signature, a quick and dirty method for document embeddings.
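[Editor's note: Vetch's point that TF-IDF is still worth knowing is easy to demonstrate from scratch. A minimal sketch, using the plain term-frequency and log-scaled IDF variant (one of several common weightings):]

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each tokenized document by tf * idf."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
scores = tf_idf(docs)
# "the" appears in two of three documents while "cat" appears in one,
# so "cat" outscores "the" in the first document despite lower raw count.
print(scores[0]["cat"] > scores[0]["the"])  # True
```

The per-document score dictionaries are exactly the "subset of words" selector Vetch describes: keep the top-scoring terms and average their word vectors for a cheap document signature.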
|
| Bag-of-word co-occurrence matrices are also nice to know; factorizing such matrices was the original vector space model for distributional semantics, and it provides historical context for GloVe and the like.
| master_yoda_1 wrote:
| I am not able to see what is advanced here. spaCy just wraps all the open source code/models into a python api; they just want to sell the hype.
| dang wrote:
| " _Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something._"
|
| https://news.ycombinator.com/newsguidelines.html
| Der_Einzige wrote:
| As usual, dang is wrong and not moderating effectively. This is not a shallow comment but a legitimate concern about spaCy, and to a lesser extent other NLP tools such as NLTK. Most of the tooling around them that people end up using really is nothing more than wrappers around other tools. See the default tokenizers or models utilized by these tools.
|
| And yes, even if spaCy is not making money itself, you can bet that the other paid-for tools that they sell are.
| dang wrote:
| Actually if the GP had posted this critique instead of a shallow, reductionist internet dismissal ("just want to sell the hype"), that would have been fine. Thoughtful critique is welcome--it just requires higher-quality comments than that.
| Ldorigo wrote:
| Ah, yes. The tried-and-true method of "just selling the hype" with an open source library that everyone can use for free.
| coding123 wrote:
| That's a huge part of software development: wrapping things to be more concise and use-case driven. I mean, most software developers are just placing a veneer over something more complex. That's pretty much all we do.
| 41209 wrote:
| I really love spaCy; it's trivial to throw up a server which handles basic NLP. No complaints here, very happy to see it still being updated
| artembugara wrote:
| We've been using spaCy a lot for the past few months.
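[Editor's note: Vetch's remark above about factorizing co-occurrence matrices — the pre-GloVe route to word vectors — can be sketched in a few lines. The tiny corpus and the +/-1 token window are illustrative assumptions; real systems use large corpora, wider windows, and reweighting such as PMI before factorizing.]

```python
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-1 token window.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                cooc[index[w], index[sent[j]]] += 1

# Truncated SVD: keep the top-2 singular directions as dense word vectors.
u, s, _ = np.linalg.svd(cooc)
vectors = u[:, :2] * s[:2]
print(vectors.shape)  # (7, 2): one 2-d vector per vocabulary word
```

Words with similar co-occurrence rows (here "cat"/"dog" and "mat"/"log") end up with similar low-rank vectors, which is the distributional-semantics idea GloVe later refined.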
|
| Mostly for non-production use cases; even so, I can say that it is the most robust framework for NLP at the moment.
|
| V3 added support for transformers: that's a killer feature, as many models from https://huggingface.co/docs/transformers/index work great out of the box.
|
| At the same time, I found the NER models provided by spaCy to have low accuracy when working with real data: we deal with news articles https://demo.newscatcherapi.com/
|
| Also, while I see how much attention ML models get from the crowd, I think that many problems can be solved with a rule-based approach, and spaCy is just amazing for these.
|
| Btw, we recently wrote a blog post comparing spaCy to NLTK for a text normalization task: https://newscatcherapi.com/blog/spacy-vs-nltk-text-normaliza...
| brd wrote:
| I really appreciate how accessible spaCy has made NLP work, but their NER is definitely low accuracy.
|
| Where stemming/lemmatization felt critical to successful NLP processing a few years ago, we've found stem/lem work to be much less important for downstream tasks when transformer-based models are involved.
|
| For topic extraction, stem/lem still seems to do a lot to improve accuracy, and for rule-based approaches I can still see how it would facilitate more efficient processing at scale. I'd be curious to hear about your experience fine-tuning and/or training new models after stem/lem processing with transformers; we've admittedly done little testing to see how transformers actually perform if properly tuned to post-processed data.
| artembugara wrote:
| Did you try something like AutoNLP by Hugging Face?
| brd wrote:
| No, we've got our own fine-tuning pipeline, and initial tests showed better performance without traditional stem/lem processing, so we dropped it from our classification pipelines and haven't seen a need to revisit.
| pantsforbirds wrote:
| We use spaCy at work for (mostly) news articles as well.
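[Editor's note: the rule-based approach artembugara praises above maps to spaCy's token-pattern Matcher. A minimal sketch using a blank English pipeline (tokenizer only, so no trained model download is needed); the "founded in <year>" pattern is an illustrative assumption, not a shipped spaCy rule.]

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline provides tokenization only -- enough for lexical patterns.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Toy rule: the word "founded", then "in", then a four-digit token.
pattern = [{"LOWER": "founded"}, {"LOWER": "in"}, {"SHAPE": "dddd"}]
matcher.add("FOUNDED_YEAR", [pattern])

doc = nlp("Explosion was founded in 2016 in Berlin.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # founded in 2016
```

Patterns can also reference part-of-speech or entity attributes when a trained pipeline is loaded, which is where the shallow-parse-plus-rules workflow described earlier in the thread comes in.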
We've been pretty impressed with it overall for detecting larger trends using the NER models. I've been contemplating whether it might be useful to make a spaCy module that uses a Count-Min Sketch to track the top N of each of the NER categories, partitioned into daily (or weekly, etc.) time windows.
|
| I think it could be an interesting way to get results sort of similar to Google's search trends.
| artembugara wrote:
| I'd really love to chat about that. Any chance to connect? Email in bio.
| Eridrus wrote:
| I feel like NER is a poorly designed task in general. You're eventually trying to link the entities to some kind of KB, so you should be injecting that entity information into your system for detecting mentions.
| kulikalov wrote:
| Are you using the high-accuracy English model for NER? I've been very happy with its org recognition; it actually did way better than any other open source model in my case.
| artembugara wrote:
| Try it on a sentence where all tokens are lower/upper case. It just doesn't really work.
| Xenoamorphous wrote:
| I don't know how it compares with other paid alternatives (like Google's or Amazon's), but spaCy's NER was pretty close to the (paid) service we were using (IBM), to the point that we ditched IBM. Also for news articles.
|
| But yeah, disambiguation/entity linking would be nice.
| artembugara wrote:
| I'd be happy to chat more if you want.
| artembugara wrote:
| Also, I have an article about spaCy NER: https://newscatcherapi.com/blog/named-entity-recognition-wit...
|
| The conclusions I came to:
|
| "A few notes on my spaCy NER accuracy with "real world" data
|
| Low accuracy with sentences without proper casing
|
| 1. Low accuracy overall, even with a large model
|
| 2. You'd need to fine-tune your model if you want to use it in production
|
| 3.
Overall, there's no open-source high-accuracy NER model that you can use out of the box"
| wyldfire wrote:
| I assume your product does some kind of entity disambiguation and/or linking to an ontology? spaCy doesn't provide this out of the box either, AFAICT. Can you share more info about how you do it?
| artembugara wrote:
| We don't provide entity disambiguation out of the box. It's more of an on-request feature for Enterprise clients.
|
| But overall, entity disambiguation is one of the most useful and difficult tasks in NLP.
|
| spaCy supports entity linking via a knowledge base: https://spacy.io/api/entitylinker
| nefitty wrote:
| That might be the killer feature from what I've heard.
| Tarq0n wrote:
| NER good enough to anonymise free text would be the absolute dream for many governments.
| Vetch wrote:
| > Overall, there's no open-source high-accuracy NER model that you can use out of the box
|
| Part of it is that most underestimate the complexity of NER; the rest of it, in my opinion, is that NER is not well-defined as a classification problem.
|
| At least in my experience, having a specific battery of questions to query documents, first by transformer-based semantic search and then narrowed by Q/A models, removed the need for explicit NER, entity linking or relation extraction. For the case of entities as features for rule systems, shallow models using all label predictions instead of just selecting the argmax have been sufficiently robust. Using big transformers for classification doesn't pay enough to be worth it there.
___________________________________________________________________
(page generated 2021-12-10 23:00 UTC)