[HN Gopher] Advanced NLP with spaCy v3
       ___________________________________________________________________
        
       Advanced NLP with spaCy v3
        
       Author : philipvollet
       Score  : 146 points
       Date   : 2021-12-10 16:07 UTC (6 hours ago)
        
 (HTM) web link (course.spacy.io)
 (TXT) w3m dump (course.spacy.io)
        
       | minimaxir wrote:
        | A relatively underdiscussed quirk of the rise of superlarge
        | language models like GPT-3 for certain NLP tasks is that since
        | those models have absorbed so much real-world grammar, there's
        | no need for advanced preprocessing: you can just YOLO and work
        | with the generated embeddings instead, without going into
        | spaCy's (excellent) parsing/NER features.
       | 
       | OpenAI recently released an Embeddings API for GPT-3 with good
       | demos and explanations:
       | https://beta.openai.com/docs/guides/embeddings
       | 
        | Hugging Face Transformers makes this easier (and for free), as
        | most models return a "last_hidden_state" of per-token vectors
        | that you can pool (e.g. average) into a single embedding. Just
        | use DistilBERT uncased/cased (which is fast enough to run on
        | consumer CPUs) and you're probably good to go.
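A minimal sketch of the pooling idea above, assuming the `transformers` and `torch` packages; the checkpoint name is the standard `distilbert-base-uncased`, and mean-pooling is one common choice among several:

```python
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # last_hidden_state has shape (batch, tokens, 768); mask out any
    # padding and average the token vectors into one 768-dim embedding.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Cosine similarity between two such embeddings then serves as a rough semantic-similarity score, with no parsing or NER in the loop.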
        
         | mtqwerty wrote:
         | Readjusting expectations for pre-processing was one of the
         | biggest differences I noticed going from NLP courses to working
         | on NLP in production. For the amount of pre-processing learning
         | material there is, I expected it to be much more important in
         | practice.
         | 
          | I feel lucky to have gotten into NLP when I did (learning in
          | 2017/2018 and working from the beginning of 2020). Changing our
          | system from GloVe to BERT was super exciting and a great way to
          | learn about the drawbacks and benefits of each.
        
         | Vetch wrote:
         | While you make sensible points, in the case of GPT-3, not
         | everyone will be willing to route their data through OpenAI's
         | servers.
         | 
         | > Just use DistilBERT uncased/cased (which is fast enough to
         | run on consumer CPUs)
         | 
         | This can still be impractical, at least in my case of regularly
         | needing to process hundreds of pages of text. Simpler systems
         | can be much faster for an acceptable loss and you can get more
         | robustness by working with label distributions instead of just
         | picking argmax.
         | 
         | Fast simpler classifiers can also help decide where the more
         | resource intensive models should focus attention.
         | 
         | Another reason for preprocessing is rule systems. Even if not
         | glamorous to talk about, they still see heavy use in practical
         | settings. While dependency parses are hard to make use of,
         | shallow parses (chunking) and parts of speech data can be
         | usefully fed into rule systems.
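The routing idea above — a fast, simple classifier deciding where the expensive model should focus — can be sketched using the full label distribution rather than just the argmax; the labels and margin below are made-up illustrations:

```python
def needs_big_model(label_probs: dict, margin: float = 0.2) -> bool:
    """Escalate a document to the expensive transformer only when the
    cheap classifier's label distribution is ambiguous, i.e. the gap
    between its top two probabilities is small."""
    top, second = sorted(label_probs.values(), reverse=True)[:2]
    return top - second < margin

# A confident prediction stays with the fast classifier.
print(needs_big_model({"sports": 0.90, "politics": 0.07, "tech": 0.03}))  # False
# An ambiguous distribution is routed to the heavy model.
print(needs_big_model({"sports": 0.45, "politics": 0.40, "tech": 0.15}))  # True
```

Keeping the whole distribution (rather than discarding everything but the argmax) is also what makes the robustness trick above possible: downstream rules can weigh second-place labels instead of treating every prediction as certain.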
        
         | new_stranger wrote:
         | I imagine it being very useful to understand what you just said
        
           | hooande wrote:
           | lol. a rough translation is that the new super language
           | models are good enough that you don't have to keep track of
           | specific parts of speech in your programming. if you look at
           | the arrays of floating point weights that underlie gpt-3 etc,
           | you can use them to match present participle phrases with
           | other present participle phrases and so forth
           | 
           | this is of course a correct and prescient observation.
           | minimaxir is kind of an NLP final boss, so I wouldn't expect
           | most people to be able to follow everything he says
        
             | minimaxir wrote:
              | I don't think it's a final boss thing: IMO working with
              | embeddings/word vectors, even in the simplest case such as
              | word2vec/GloVe, is easier to understand than some of the
              | more conventional NLP techniques (e.g. bag of words/TF-
              | IDF).
             | 
             | The spaCy tutorials in the submission also have a section
             | on word vectors.
        
               | Vetch wrote:
                | Ah, although TF-IDF is still good to know. Semantic
               | search hasn't eliminated the need for classical retrieval
               | techniques. It can also be used to select a subset of
               | words to use to create an average of word vectors for a
               | document signature, a quick and dirty method for document
               | embeddings.
               | 
                | Bag-of-words co-occurrence matrices are also nice to
                | know: factorizing such matrices was the original vector
                | space model for distributional semantics, and it provides
                | historical context for GloVe and the like.
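The quick-and-dirty document signature described above — a TF-IDF-weighted average of word vectors — can be sketched in plain Python; the tiny corpus and 2-d vectors below are made up for illustration:

```python
import math
from collections import Counter

def doc_signature(doc, corpus, vectors):
    """TF-IDF-weighted average of word vectors as a quick-and-dirty
    document embedding. `doc` and each entry of `corpus` are lists of
    tokens; `vectors` maps words to fixed-length vectors."""
    df = Counter(word for d in corpus for word in set(d))
    dim = len(next(iter(vectors.values())))
    acc, total = [0.0] * dim, 0.0
    for word, count in Counter(doc).items():
        if word not in vectors:
            continue  # out-of-vocabulary words are skipped
        # Smoothed IDF (as in scikit-learn) keeps all weights positive.
        weight = count * (math.log((1 + len(corpus)) / (1 + df[word])) + 1)
        for i, value in enumerate(vectors[word]):
            acc[i] += weight * value
        total += weight
    return [value / total for value in acc] if total else acc

# Toy example: rare words ("cat") pull the signature harder than
# words that appear in every document ("sat").
corpus = [["cat", "sat"], ["dog", "sat"]]
vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "sat": [1.0, 1.0]}
print(doc_signature(["cat", "sat"], corpus, vectors))
```

The TF-IDF weighting here is exactly the subset-selection idea from the comment: common, low-information words contribute little to the averaged signature.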
        
       | master_yoda_1 wrote:
        | I am not able to see what is advanced here. spaCy just wraps
        | all the open source code/models into a Python API and they just
        | want to sell the hype.
        
         | dang wrote:
          | " _Please don't post shallow dismissals, especially of other
         | people's work. A good critical comment teaches us something._"
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
           | Der_Einzige wrote:
           | As usual, dang is wrong and not moderating effectively. This
           | is not a shallow comment but a legitimate concern about
           | spaCy, and to a lesser extent other NLP tools such as NLTK.
           | Most of the tooling around them that people end up using
           | really is nothing more than wrappers around other tools. See
           | the default tokenizers or models utilized by these tools.
           | 
           | And yes, even if spaCy is not making money itself, you can
            | bet that the other paid-for tools that they sell are.
        
             | dang wrote:
             | Actually if the GP had posted this critique instead of a
             | shallow, reductionist internet dismissal ("just want to
             | sell the hype"), that would have been fine. Thoughtful
             | critique is welcome--it just requires higher-quality
             | comments than that.
        
         | Ldorigo wrote:
         | Ah, yes. The tried-and-true method of "just selling the hype"
         | with an open source library that everyone can use for free.
        
         | coding123 wrote:
         | That's a huge part of software development. Wrapping things to
         | be more concise, use-case driven. I mean most software
         | developers are just placing a veneer over something more
         | complex. That's pretty much all we do.
        
       | 41209 wrote:
       | I really love spaCy, it's trivial to throw up a server which
       | handles basic NLP. No complaints here, very happy to see it still
       | being updated
        
       | artembugara wrote:
       | We've been using spaCy a lot for the past few months.
       | 
        | Mostly for non-production use cases; still, I can say that it
        | is the most robust framework for NLP at the moment.
       | 
       | V3 added support for transformers: that's a killer feature as
       | many models from https://huggingface.co/docs/transformers/index
       | work great out of the box.
       | 
        | At the same time, I found the NER models provided by spaCy to have
        | low accuracy when working with real data: we deal with news
       | articles https://demo.newscatcherapi.com/
       | 
        | Also, while I see how much attention ML models get from the
        | crowd, I think that many problems can be solved with a rule-based
        | approach, and spaCy is just amazing for these.
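A minimal sketch of such a rule, assuming spaCy is installed (a blank pipeline suffices for token-level patterns, so no pretrained model download is needed; the "acquired by" rule itself is a made-up example):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; enough for token-level rules
matcher = Matcher(nlp.vocab)

# Hypothetical rule: "acquired by" followed by a capitalized token.
pattern = [{"LOWER": "acquired"}, {"LOWER": "by"}, {"IS_TITLE": True}]
matcher.add("ACQUISITION", [pattern])

doc = nlp("The startup was acquired by Google last year.")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)  # ['acquired by Google']
```

Patterns can also key on POS tags or entity labels once the corresponding pipeline components are loaded, which is where the preprocessing discussed upthread feeds back into rule systems.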
       | 
       | Btw, we recently wrote a blog post comparing spaCy to NLTK for
       | text normalization task: https://newscatcherapi.com/blog/spacy-
       | vs-nltk-text-normaliza...
        
         | brd wrote:
          | I really appreciate how accessible spaCy has made NLP work, but
          | their NER accuracy is definitely low.
         | 
         | Where stem/lem felt critical to successful NLP processing a few
         | years ago, we've found stem/lem work to be much less important
         | for downstream tasks when transformer based models are
         | involved.
         | 
          | For topic extraction, stem/lem still seems to do a lot to
          | improve accuracy, and for rule-based approaches I can still see
          | how it would facilitate more efficient processing at scale. I'd
          | be curious to hear your experience fine-tuning and/or training
          | new models after stem/lem processing with transformers; we've
          | admittedly done little testing to see how transformers actually
          | perform if properly tuned to post-processed data.
        
           | artembugara wrote:
           | Did you try something like autoNLP by huggingface?
        
             | brd wrote:
             | No, we've got our own fine tuning pipeline and initial
             | tests showed better performance without traditional
             | stem/lem processing so we dropped it from our
             | classification pipelines and haven't seen a need to
             | revisit.
        
         | pantsforbirds wrote:
         | We use spaCy at work for (mostly) news articles as well. We've
         | been pretty impressed with it overall for detecting larger
         | trends using the NER models. I've been contemplating whether it
         | might be useful to make a spaCy module that uses a Count-Min
         | Sketch to track the top N of each of the NER categories
          | partitioned on a daily (or weekly, etc.) basis.
         | 
         | Think it could be an interesting use case to get sort of
         | similar results to Google's search trends.
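The idea above can be sketched in plain Python: a Count-Min Sketch keeps approximate mention counts in bounded memory, and estimates only ever over-count, never under-count. The width/depth values and the stream of ORG mentions are illustrative:

```python
import hashlib

class CountMinSketch:
    """Approximate counter for streaming entity mentions: fixed
    memory (width * depth cells) regardless of vocabulary size."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item: str):
        # One deterministic hash per row, derived by salting blake2b.
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield int.from_bytes(digest.digest(), "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, bucket in enumerate(self._buckets(item)):
            self.rows[row][bucket] += count

    def estimate(self, item: str) -> int:
        # Taking the minimum across rows bounds collision error above.
        return min(row[bucket]
                   for row, bucket in zip(self.rows, self._buckets(item)))

# Hypothetical daily stream of ORG mentions coming out of NER.
daily_orgs = CountMinSketch()
for org in ["Google", "Google", "Tesla", "Google", "Tesla", "OpenAI"]:
    daily_orgs.add(org)
print(daily_orgs.estimate("Google"))  # expected 3 (over-count is possible but unlikely at this width)
```

In practice you would keep one sketch per day (or week) and pair it with a small heap of candidate entities to read off an approximate top N per NER category.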
        
           | artembugara wrote:
           | I'd really love to chat about that. Any chance to connect?
           | email in bio
        
         | Eridrus wrote:
         | I feel like NER is a poorly designed task in general. You're
         | eventually trying to link the entities to some kind of KB, so
         | you should be injecting that entity information into your
         | system for detecting mentions.
        
         | kulikalov wrote:
          | Are you using the high-accuracy English model for NER? I've
          | been very happy with its org recognition; it actually did way
          | better than any other open source model in my case.
        
           | artembugara wrote:
           | Try it on a sentence where all tokens are lower/upper case.
           | It just doesn't really work.
        
         | Xenoamorphous wrote:
         | I don't know how it compares with other paid alternatives (like
         | Google's or Amazon's) but spaCy's NER was pretty close to the
         | (paid) service we were using (IBM) to the point we ditched IBM.
         | Also for news articles.
         | 
         | But yeah disambiguation/entity linking would be nice.
        
           | artembugara wrote:
           | I'd be happy to chat more if you want.
        
         | artembugara wrote:
         | Also I have an article about spaCy NER:
         | https://newscatcherapi.com/blog/named-entity-recognition-wit...
         | 
         | The conclusion I came up with:
         | 
          | "A few notes on my spaCy NER accuracy with 'real world' data:
          | 
          | 1. Low accuracy with sentences without proper casing
          | 
          | 2. Low accuracy overall, even with a large model
          | 
          | 3. You'd need to fine-tune your model if you want to use it in
          | production
          | 
          | 4. Overall, there's no open-source high accuracy NER model that
          | you can use out-of-a-box"
        
           | wyldfire wrote:
           | I assume your product does some kind of entity disambiguation
           | and/or link to an ontology? Spacy doesn't provide this out of
           | the box either, AFAICT. Can you share more info about how you
           | do it?
        
             | artembugara wrote:
              | We don't provide entity disambiguation out of the box. It's
              | more of an on-request feature for enterprise clients.
              | 
              | But overall, entity disambiguation is one of the most
              | useful and difficult tasks in NLP.
             | 
             | SpaCy supports entity linking via knowledge base:
             | https://spacy.io/api/entitylinker
        
               | nefitty wrote:
               | That might be the killer feature from what I've heard.
        
               | Tarq0n wrote:
               | NER good enough to anonymise free text would be the
               | absolute dream for many governments.
        
           | Vetch wrote:
           | > Overall, there's no open-source high accuracy NER model
           | that you can use out-of-a-box"
           | 
            | Part of it is that most people underestimate the complexity
            | of NER, and the rest of it, in my opinion, is that NER is not
            | well-defined as a classification problem.
           | 
           | At least in my experience, having a specific battery of
           | questions to query documents, first by transformer based
           | semantic search and narrowed by Q/A models, removed the need
           | for explicit NER, entity linking or relation extraction. For
            | the case of entities as features for rule systems, shallow
            | models using the full label distribution instead of just
            | the argmax have been sufficiently robust. Using big
           | transformers for classification doesn't pay enough to be
           | worth it there.
        
       ___________________________________________________________________
       (page generated 2021-12-10 23:00 UTC)