       [HN Gopher] Launch HN: Datasaur (YC W20) - data labeling interfa...
       ___________________________________________________________________
        
       Launch HN: Datasaur (YC W20) - data labeling interface for NLP
        
       Hey HN community - I'm Ivan from Datasaur (https://datasaur.ai/) -
       we build software that lets humans label data more efficiently
       for training natural language processing (NLP) models.
        
       NLP algorithms are being trained in a wide variety of industries -
       from customer service to legal contracts, forum moderation to
       restaurant reviews. All these algorithms benefit from recent
       breakthroughs in academia and a generous open-source community.
       However, in order to be deployed to the real world, they require a
       custom set of training data to learn and understand the language
       unique to each industry. Therefore, people around the world are
       meticulously labeling data samples.
        
       Example sentence: _London is the capital and largest city of
       England and of the United Kingdom._
        
       Labels: "London" --> "capital", "United Kingdom"
       Labels: "London" --> "largest city", "England"
        
       In the last few years I've worked at companies such as Apple and
       Yahoo and noticed that many organizations tend to reinvent the
       wheel when creating labeling interfaces for their labelers. Some
       companies still do this work in Excel. We saw an opportunity to
       create a "single interface to rule them all" - to handle all sorts
       of text labeling tasks.
        
       We leverage existing NLP capabilities to intelligently validate
       the quality of labels in a document and complement human judgment.
       Furthermore, we already understand terms like "Starbucks" and "New
       York" - why spend time labeling these terms from scratch every
       time? We created an API so you can plug in existing models to
       apply a first pass on labeling the document. We also built many
       other extensions to help labelers optimize their time - a "find
       and label" extension for labeling repetitive terms, a dictionary
       extension for quickly looking up unfamiliar terms. We spent the
       past year building out the labeling solution I wish I could have
       used.
        
       We now handle named entity recognition, parts of speech, document
       labeling, coreference resolution (multiple words referring to the
       same object/person) and dependency parsing (drawing relationships
       between words). A case study with one of our clients shows 70%
       improved labeling efficiency upon adopting the Datasaur platform,
       and we have much more room to improve.
        
       We have also spoken with 100+ AI teams globally and identified the
       best practices in labeling. In addition to providing an enhanced
       interface, we can help track labeler performance and peer
       disagreement scores, and detect/remove labeler bias. By
       incorporating and encoding these features into our software, we
       can not only help improve labeling efficiency but also improve the
       quality of the data and therefore the resulting AI model.
        
       We believe that as AI becomes ever more prevalent and ubiquitous,
       labeling will become an increasingly important task. AI is a
       garbage-in, garbage-out technology, and the quantity and quality
       of data can often make a critical difference in the resulting AI
       model. We're really excited to open Datasaur up to the world today
       and hear your feedback. Have you run into similar labeling issues?
       What tips and tricks have you employed to keep up with AI's
       voracious appetite for data? We'd love to hear how you've tackled
       data labeling at your own companies. Thanks so much in advance!
        
       Ivan
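        
       The post does not say how a peer disagreement score is computed.
       As a minimal sketch of one possible reading, the Python below
       averages each labeler's pairwise disagreement rate against every
       other labeler on the same items. The function names, the labeler
       names, and the label set are all hypothetical and are not taken
       from Datasaur.

```python
from itertools import combinations


def disagreement_rate(labels_a, labels_b):
    """Fraction of items on which two labelers chose different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelers must label the same items")
    if not labels_a:
        return 0.0
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


def peer_disagreement(annotations):
    """Average each labeler's disagreement rate against all other labelers."""
    rates = {name: [] for name in annotations}
    for (name_a, a), (name_b, b) in combinations(annotations.items(), 2):
        rate = disagreement_rate(a, b)
        rates[name_a].append(rate)
        rates[name_b].append(rate)
    return {name: sum(r) / len(r) if r else 0.0 for name, r in rates.items()}


# Hypothetical data: three labelers, four items each.
annotations = {
    "alice": ["ORG", "LOC", "PER", "LOC"],
    "bob": ["ORG", "LOC", "PER", "LOC"],
    "carol": ["PER", "LOC", "PER", "ORG"],
}
scores = peer_disagreement(annotations)
```

       A labeler whose score sits well above the others is a candidate
       for review. A chance-corrected statistic such as Cohen's kappa
       would be a natural refinement of this raw rate.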
        
       Author : flyx
       Score  : 81 points
       Date   : 2020-03-06 19:25 UTC (3 hours ago)
        
       | milani wrote:
       | Congratulations on the launch!
       | 
       | To understand the scope of your work a little bit, if I have
       | Prodigy with custom labeling needs set up for me, do I still
       | benefit from switching to datasaur?
        
         | pouta wrote:
         | If anyone on the team is reading this post, please answer this
         | question.
        
         | flyx wrote:
         | Apologies for the delay! There is some overlap with what
         | Prodigy works on and I'm a big fan of what they're working on.
         | We cover some additional use cases (like coreference parsing)
         | and additionally help with managing teams of labelers. We're
         | complementary in many regards. Happy to discuss further, based
         | on your labeling needs.
        
       | braindead_in wrote:
       | Congrats. Do you guys use AllenNLP, by any chance?
        
         | flyx wrote:
         | We've looked into it! So far we've chosen to integrate with
         | spaCy. Can I ask what you like about AllenNLP?
        
       | [deleted]
        
       | hbcondo714 wrote:
       | On the pricing page, the Growth box shows a checkmark for
       | "Unlimited labels" but right below in the "Choose the right plan
       | for you", the Growth plan says the number of labels is
       | 10,000,000.
        
         | flyx wrote:
         | Great catch! We'll correct it asap. Since you caught it, we'll
         | give you unlimited labels :)
        
       | Shenglong wrote:
       | This is awesome--really excited to see this need being solved.
        
       | WFHRenaissance wrote:
       | Very cool logo. Just signed up.
        
         | WFHRenaissance wrote:
         | Following up. Your confirmation email gets flagged as leading
         | to an untrusted site in Gmail. Might be worth figuring out.
        
           | flyx wrote:
           | Yikes, will look into it. Thanks for the heads up!
        
       | chownation wrote:
       | Roarrsome, congrats!
        
       | crimsalis wrote:
       | Congrats on the launch! I spend more than 50% of my time labeling
       | data and this will make life much easier.
        
       | [deleted]
        
       | _prometheus wrote:
       | Datasaur looks awesome! Can't wait to try it out. Congrats on the
       | launch!
       | 
       | Curious about data security and privacy? How do you guarantee
       | privacy? Is there some cryptography or secure enclaves used? Some
       | sets of documents (and email) are super high trust.
       | 
       | Guessing the on-prem version is probably safest route
        
         | flyx wrote:
         | Thanks - good to see so many people concerned about privacy
         | here. I consider privacy a top-level priority at Datasaur -
         | all data is fully encrypted. Our employees will never be able
         | to see or access any customer data. We already work with a bank
         | and cleared their security bar :)
        
       | comet_trail wrote:
       | Interesting product. Could have used this at previous companies.
       | How is this different from FigureEight or Scale?
        
         | veeralpatel979 wrote:
         | Scale offers labeling as a service. Datasaur is an interface
         | that companies can buy for their own labeling personnel, if I
         | understand correctly.
        
           | flyx wrote:
           | That's right! Scale probably has some awesome internal tools
           | that help them label faster. Datasaur wants to make those
           | same optimizations available to anyone with their own
           | labelers.
        
       | sailfast wrote:
       | This looks awesome! Waiting for my email confirmation.
       | 
       | I was looking for information about where my data has to be
       | hosted to use this service and could not find it. Will there be
       | some more information about how this data is handled once I get
       | past the login? Thanks!
        
         | flyx wrote:
         | We offer both a hosted service on AWS as well as an on-prem
         | solution if needed. We can even choose to host on the cloud
         | provider of your choice - happy to work with you on this!
        
       | inerte wrote:
       | LinkedIn suggested a post from you a couple weeks ago and I
       | remember thinking "what's Ivan up to?" and I saw Datasaur.
       | Congrats on YC! I know that our time at Yahoo was a brief overlap
       | but I remember the swirl of ML, Knowledge Graph and labelling our
       | org was in 5 years ago.
       | 
       | Good luck with Datasaur!
       | 
       | - Julio Nobrega
        
         | flyx wrote:
         | Julio - great to hear from you, and thanks for the kind words
         | :) In many ways, my journey to Datasaur began with that
         | team/project 5 years ago.
        
       | zhangwins wrote:
       | there are lots of solutions for images/video already but great to
       | see someone tackling text.. signing up now
        
       | hbcondo714 wrote:
       | Any chance you could support HTML files? We've been using
       | https://www.tagtog.net/ for some of our data labeling /
       | annotations needs but their tool for these file types is still
       | "experimental".
        
         | flyx wrote:
         | Sure can! Would love to hear more - what do you want to extract
         | from the HTML files?
        
           | hbcondo714 wrote:
           | Thanks! We actually just need to be able to upload HTML files
           | and have it rendered as a web page (and not just display the
           | HTML code) so our team can data label / annotate certain
           | sentences throughout the document.
        
             | flyx wrote:
             | Yea, we can 100% handle this. If you sign up for a demo,
             | happy to discuss further!
        
       | mroll wrote:
       | Hey Ivan, this looks great! What are the privacy implications for
       | my data that I want to label with your tool? I'm assuming I
       | upload it to your servers?
        
         | flyx wrote:
         | Great question! Data privacy is a top-level priority for us. We
         | actually offer both a cloud-based and on-prem solution. One of
         | our clients needed a fully on-prem, air-gapped (no connection
         | to internet) option. Many are choosing to use us _because_ they
         | can't send their data to outsourced, external parties.
        
       | andrewnc wrote:
       | This is very cool! I especially love the logo. Congrats on the
       | launch and best of luck.
        
       | seaturtles wrote:
       | Awesome! Congrats, excited for this!
        
       | aliakhtar wrote:
       | Cool project, what would be cooler is if you had an API to
       | retrieve the labels for a given word. Maybe that's in the works?
        
         | flyx wrote:
         | Done and shipped! :D One of our extensions allows you to plug
         | in an API - either use your own model, or an integration with
         | an open-source project like spaCy to apply labels.
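        
         The thread does not document the extension API itself. As a
         rough sketch of the "plug in a model for a first pass" idea,
         the Python below treats a model as any callable that maps text
         to labeled character spans. The Span and Model types, the
         gazetteer_model stand-in, and the prelabel helper are all
         hypothetical; a real integration could wrap spaCy or a custom
         model behind the same callable shape.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)
Model = Callable[[str], List[Span]]


def gazetteer_model(text: str) -> List[Span]:
    """Stand-in model: label known terms by exact string match."""
    known = {"Starbucks": "ORG", "New York": "LOC"}
    spans = []
    for term, label in known.items():
        for match in re.finditer(re.escape(term), text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def prelabel(text: str, model: Model) -> List[Dict]:
    """Run a model over the text and return suggestions for human review."""
    return [
        {"text": text[start:end], "start": start, "end": end,
         "label": label, "status": "suggested"}
        for start, end, label in model(text)
    ]


suggestions = prelabel("I met her at Starbucks in New York.", gazetteer_model)
for s in suggestions:
    print(s["text"], s["label"], s["status"])
```

         Marking each span as "suggested" keeps the human labeler in
         charge: the model only proposes, and a person accepts or
         corrects each label.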
        
       | makrmark wrote:
       | Really cool UI, looks like it's super easy to tag stuff.
       | Requested a demo to see more of it in action!
        
       | staticautomatic wrote:
       | Could you please elaborate on what you mean by "intelligently
       | validate the quality of labels in a document and complement human
       | judgment", and discuss your methodology?
       | 
       | This seems to operate under the assumption that human labels are
       | not actually the ground truth. I understand that they can be
       | dirty, but most unsupervised approaches aren't producing a ground
       | truth, either. So, are you saying it's better to have multiple
       | pretty good sources of truth instead? Because depending on the
       | application, that might make sense or it might be like trying to
       | start a farm with a dead horse and a dead cow.
        
         | flyx wrote:
         | Certainly. Our philosophy is to complement human wisdom with
         | computer precision. Humans may often be labeling for 8 hours a
         | day and may get fatigued. So if Starbucks has been labeled as a
         | cafe 35x in a document and as a person 2x, we can flag this and
         | ask "hey, are you sure you wanted to label this as a person?".
         | Or if we know for a fact Canada is a country, but it's labeled
         | as an animal in a document, we can raise a flag as well. This
         | won't work for everything, but we think it can help with
         | quality assurance.
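        
         The Starbucks example above can be sketched as a simple
         majority check. This is only an illustration under stated
         assumptions, not Datasaur's actual method: the function name
         flag_inconsistent_labels and the 10% min_share threshold are
         hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def flag_inconsistent_labels(labeled_spans, min_share=0.1):
    """Flag labels that a term received far less often than its majority label.

    labeled_spans is an iterable of (term, label) pairs from one document.
    Returns (term, rare_label, rare_count, majority_label) tuples.
    """
    counts = defaultdict(Counter)
    for term, label in labeled_spans:
        counts[term][label] += 1

    flags = []
    for term, by_label in counts.items():
        total = sum(by_label.values())
        majority_label, _ = by_label.most_common(1)[0]
        for label, n in by_label.items():
            # A minority label below the threshold is worth a second look.
            if label != majority_label and n / total < min_share:
                flags.append((term, label, n, majority_label))
    return flags


# "Starbucks" labeled as a cafe 35 times and as a person twice.
spans = (
    [("Starbucks", "cafe")] * 35
    + [("Starbucks", "person")] * 2
    + [("Canada", "country")] * 3
)
flags = flag_inconsistent_labels(spans)
for term, label, n, majority in flags:
    print(f"{term!r} labeled {label!r} {n}x but {majority!r} elsewhere: sure?")
```

         The check only raises a question for the labeler. It does not
         overwrite anything, since the rare label may be the correct
         one in context.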
        
       | narrationbox wrote:
       | This looks wonderful, will definitely try it out. We ran into the
       | labeling issue when doing NER a couple years ago on Reddit books
       | dataset. If only this existed then.
        
         | flyx wrote:
         | Thanks for the kind words! Yea we're building out what I wish
         | we had at my last few companies. Looking forward to your
         | feedback.
        
       | dunky11 wrote:
       | Good luck! The website looks clean and the product idea is
       | good :) However, the page requests an image that is 3000+ pixels
       | wide:
       | https://s.datasaur.ai/static/media/homepage-hero.4917b8af.pn...
       | A width of 1200px should be enough. I would resize the image,
       | since it slows down the page.
        
         | flyx wrote:
         | Yikes - good point. We'll optimize.
        
       ___________________________________________________________________
       (page generated 2020-03-06 23:00 UTC)