[HN Gopher] Show HN: A labelling tool to easily extract and labe...
       ___________________________________________________________________
        
       Show HN: A labelling tool to easily extract and label Wikipedia
       data
        
       Hi HN! I am Maria, solo founder of DataQA (https://dataqa.ai/), a
       tool to search and label documents for various NLP tasks (e.g.
       entity extraction, entity linking, etc.).

       I have worked as a data scientist and ML engineer for the better
       part of a decade, and over that time have specialised mainly in
       applications involving natural language processing (NLP). One of
       the key questions I have always had at the back of my mind is
       whether my time was well spent. Whenever I spent more time on
       feature engineering or trying different models, I always wondered
       whether I would get a better return on investment by simply
       labelling more data. I have created DataQA to enhance exploration
       & labelling of documents. It is open-source and ships with the
       Elasticsearch text search engine, which I have packaged as a
       Python package (might be the topic of a future technical post), as
       well as a rules-based engine to do pre-labelling of documents
       using NLP rules. It is very easy to install with a single pip
       command.

       One of the key things I wanted to add to DataQA is an integration
       with Wikipedia. Even though Wikipedia is the largest living
       repository of human knowledge in the world, I still always found
       it difficult to process it and create structured datasets for my
       specific applications. Since wiki pages are long-form articles, it
       is important to divide the text into smaller text chunks. A lot of
       the interesting data is also sometimes displayed in tables. With
       DataQA you can now upload a list of Wikipedia page URLs and the
       tool will extract the articles, process them and even parse the
       tables, so you can then label any entities you want. You can find
       a tutorial here:
       https://towardsdatascience.com/a-labelling-tool-to-easily-ex....
       The open-source version of DataQA currently only supports CSV, but
       I have an enterprise version with premium features such as
       labelling of PDFs (with understanding of tables). If you're
       interested in a free trial, please contact me at contact@dataqa.ai
       :-).
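        
       The chunking step mentioned in the post (dividing a long-form wiki
       article into smaller text chunks) could be sketched roughly as
       below. This is only an illustration in Python and not DataQA's
       actual implementation; the function name chunk_text and the
       max_chars parameter are hypothetical, and splitting on blank lines
       is an assumption about how paragraphs are separated.

```python
def chunk_text(text, max_chars=500):
    """Greedily group paragraphs into chunks of at most max_chars.

    Hypothetical helper for illustration only; not part of DataQA.
    A single paragraph longer than max_chars is kept whole as its
    own (oversized) chunk rather than being cut mid-sentence.
    """
    chunks = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        if not current:
            current = para
        elif len(current) + 2 + len(para) <= max_chars:
            # Paragraph still fits: append it to the current chunk.
            current = current + "\n\n" + para
        else:
            # Paragraph does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    article = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for i, chunk in enumerate(chunk_text(article, max_chars=40)):
        print(i, repr(chunk))
```

       Grouping whole paragraphs keeps each chunk readable for a human
       labeller; a real pipeline might instead split on sentences or
       section headings.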
        
       Author : mariarmestre
       Score  : 75 points
       Date   : 2021-12-17 17:42 UTC (5 hours ago)
        
       | jph wrote:
       | This is a great project and tutorial. IMHO you have a value prop
       | that's much larger/better than what you're describing here.
       | 
       | To me it sounds like you're creating a data mining annotation
       | tool that can work on any large corpus of free-form documents
       | that have discoverable labels, such as medical records, legal
       | cases, press releases, SEC filings, customer reviews, etc.
       | 
       | Can you speak to any of these? And do you have a pitch deck or
       | similar ask for funding/help/advisors?
        
         | hbcondo714 wrote:
         | Similar question: the CSV contains a collection of Wiki URLs,
         | so why not pass URLs of any website that has "free-form
         | documents"?
        
       | jonas_kgomo wrote:
       | This seems to coincide with the new update to OpenAI's GPT-3,
       | the ability to reference links from searching: a new version of
       | GPT-3 that can use a web browser to more accurately answer
       | questions:
       | https://t.co/bzaaP9XnZm
        
       | dpifke wrote:
       | How do the results from your tool compare to Wikidata?
        
         | mariarmestre wrote:
         | This is to build your own knowledge base. In many cases,
         | Wikidata might not have the data you're looking for. For
         | example, in the tutorial I have linked, the task is to come up
         | with all the products released by a list of companies. Toutiao
         | would be a product of Bytedance. This is a relation that might
         | not exist on Wikidata (I tried to search for it but could not
         | find it https://www.wikidata.org/wiki/Q24835387).
        
           | yorwba wrote:
           | I added ByteDance as the creator and owner of Toutiao. (It
           | was already listed as a "product or material produced" on the
           | ByteDance page https://wikidata.org/wiki/Q55606242 )
        
           | trystero wrote:
           | Why not use that to enrich Wikidata?
        
           | smsm42 wrote:
           | You can add relationships to Wikidata. Something like "is a
           | product of" probably already has a property, and would be
           | well within the scope of Wikidata.
        
       | noajshu wrote:
       | This is really cool and reminds me of the Microsoft tool PICL
       | (https://www.microsoft.com/en-us/research/video/machine-teach...).
       | I would love to see a video demo of the product.
        
         | mariarmestre wrote:
         | Hi! Thanks for your feedback. I had not come across this, but
         | it looks quite similar :) (at least conceptually). I don't have
         | a video, but there is a short gif in the repository. I am
         | planning to make a video at some point though!
        
       ___________________________________________________________________
       (page generated 2021-12-17 23:00 UTC)