[HN Gopher] Launch HN: Aquarium (YC S20) - Improve Your ML Datas...
       ___________________________________________________________________
        
       Launch HN: Aquarium (YC S20) - Improve Your ML Dataset Quality
        
       Hi everyone! I'm Peter from Aquarium
       (https://www.aquariumlearning.com/). We help deep learning
       developers find problems in their datasets and models, then help
       fix them by smartly curating their datasets. We want to build the
       same high-power tooling for data curation that sophisticated ML
       companies like Cruise, Waymo, and Tesla have and bring it to the
       masses.

       ML models are defined by a combination of code and the data that
       the code trains on. A programmer must think hard about what
       behavior they want from their model, assemble a dataset of labeled
       examples of what they want their model to do, and then train their
       model on that dataset. As they encounter errors in production,
       they must collect and label data for the model to train on to fix
       these errors, and verify they're fixed by monitoring the model's
       performance on a test set with previous failure cases. See Andrej
       Karpathy's Software 2.0 article
       (https://medium.com/@karpathy/software-2-0-a64152b37c35) for a
       great description of this workflow.

       My cofounder Quinn and I were early engineers at Cruise Automation
       (YC W14), where we built the perception stack + ML infrastructure
       for self-driving cars. Quinn was tech lead of the ML
       infrastructure team and I was tech lead for the Perception team.
       We frequently ran into problems with our dataset that we needed to
       fix, and we found that most model improvement came from
       improvements to a dataset's variety and quality. Basically, ML
       models are only as good as the datasets they're trained on.

       ML datasets need variety so the model can train on the types of
       data that it will see in production environments. In one case, a
       safety driver noticed that our car was not detecting green
       construction cones. Why? When we looked into our dataset, it
       turned out that almost all of the cones we had labeled were
       orange. Our model had not seen many examples of green cones at
       training time, so it was performing quite badly on this object in
       production. We found and labeled more green cones into our
       training dataset, retrained the model, and it detected green cones
       just fine.

       ML datasets need clean and consistent data so the model does not
       learn the wrong behavior. In another case, we retrained our model
       on a new batch of data that came from our labelers and it was
       performing much worse on detecting "slow signs" in our test
       dataset. After days of careful investigation, we realized it was
       due to a change to our labeling process that caused our labelers
       to label many "speed limit signs" as "slow signs," which was
       confusing the model and causing it to perform badly on detecting
       "slow signs." We fixed our labeling process, did an additional QA
       pass over our dataset to fix the bad labels, retrained our model
       on the clean data, and the problems went away.

       While there's a lot of tooling out there to debug and improve
       code, there's not a lot of tooling to debug and improve datasets.
       As a result, it's extremely painful to identify issues with
       variety and quality and appropriately modify datasets to fix them.
       ML engineers often encounter scenarios like:

       Your model's accuracy measured on the test set is at 80%. You
       abstractly understand that the model is failing on the remaining
       20% and you have no idea why.

       Your model does great on your test set but performs disastrously
       when you deploy it to production and you have no idea why.

       You retrain your model on some new data that came in, it's worse,
       and you have no idea why.

       ML teams want to understand what's in their datasets, find
       problems in their dataset and model performance, and then edit /
       sample data to fix these problems. Most teams end up building
       their own one-off tooling in-house that isn't very good. This
       tooling typically relies on naive methods of data curation that
       are highly manual and involve "eyeballing" many examples in your
       dataset to discover labeling errors / failure patterns. This works
       well for small datasets but starts to fail as your dataset size
       grows above a few thousand examples.

       Aquarium's technology relies on letting your trained ML model do
       the work of guiding what parts of the dataset to pay attention to.
       Users can get started by submitting their labels and corresponding
       model predictions through our API. Then Aquarium lets users drill
       into their model performance - for example, visualize all examples
       where we confused a labeled car for a pedestrian from this date
       range - so users can understand the different failure modes of a
       model. Aquarium also finds examples where your model has the
       highest loss / disagreement with your labeled dataset, which tends
       to surface many labeling errors (i.e., the model is right and the
       label is wrong!).
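
       As a rough illustration of that ranking (a minimal sketch, not our
       actual implementation), sorting labeled examples by the model's
       loss on them is often enough to surface suspicious labels:

           import numpy as np

           # probs: (N, C) predicted class probabilities from the model
           # labels: (N,) integer class indices from the labeling step
           def rank_by_disagreement(probs, labels, eps=1e-9):
               # Cross-entropy of each label under the model's
               # prediction; a high value means the model strongly
               # disagrees with the label it was given.
               nll = -np.log(probs[np.arange(len(labels)), labels] + eps)
               return np.argsort(-nll)  # most suspicious first

       Reviewing the top of that list is usually a fast way to find
       mislabeled examples.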

       Users can also provide their model's embeddings for each entry,
       which are an anonymized representation of what their model
       "thought" about the data. The neural network embeddings for a
       datapoint (generated by either our users' neural networks or by
       our stable of pretrained nets) encode the input data into a
       relatively short vector of floats. We can then identify outliers
       and group together examples in a dataset by analyzing the
       distances between these embeddings. We also provide a nice
       thousand-foot-view visualization of embeddings that allows users
       to zoom into interesting parts of their dataset.
       (https://youtu.be/DHABgXXe-Fs?t=139)

       Since embeddings can be extracted from most neural networks, our
       platform is very general. We have successfully analyzed datasets +
       models operating on images, 3D point clouds from depth sensors,
       and audio.
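
       To give a feel for the kind of analysis the embeddings enable (a
       minimal sketch, not our production code), outliers can be flagged
       by looking at each embedding's distance to its nearest neighbors:

           import numpy as np
           from sklearn.neighbors import NearestNeighbors

           # embeddings: (N, D) array, one row per datapoint
           def outlier_scores(embeddings, k=10):
               nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
               dists, _ = nn.kneighbors(embeddings)  # self is included
               scores = dists[:, 1:].mean(axis=1)    # mean dist to k NNs
               return np.argsort(-scores)            # most isolated first

       Grouping works off the same distances: cluster them instead of
       ranking them.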

       After finding problems, Aquarium helps users solve them by editing
       or adding data. After finding bad data, Aquarium integrates into
       our users' labeling platforms to automatically correct labeling
       errors. After finding patterns of model failures, Aquarium samples
       similar examples from users' unlabeled datasets (green cones) and
       sends those to labeling.

       Think about this as a platform for interactive learning. By
       focusing on the most "important" areas of the dataset that the
       model is consistently getting wrong, we increase the leverage of
       ML teams to sift through massive datasets and decide on the proper
       corrective action to improve their model performance.

       Our goal is to build tools to reduce or eliminate the need for ML
       engineers to handhold the process of improving model performance
       through data curation - basically, Andrej Karpathy's Operation
       Vacation concept (https://youtu.be/g2R2T631x7k?t=820) as a
       service.

       If any of those experiences speak to you, we'd love to hear your
       thoughts and feedback. We'll be here to answer any questions you
       might have!
        
       Author : pgao
       Score  : 117 points
       Date   : 2020-07-13 15:05 UTC (7 hours ago)
        
       | fractionalhare wrote:
       | If I understand correctly, it sounds like your platform is
       | primarily intended for improving awareness and understanding of
       | the data a team has, so they know which features to focus on and
       | emphasize.
       | 
       | Do you think you'll get into synthetic data generation as well?
       | In other words, improving dataset quality additively, not just
       | curatively.
        
         | pgao wrote:
         | Yes, your interpretation is correct. I don't think we're going
         | to get into synthetic data generation in the near term, mainly
         | due to the amount of effort required + questions about domain
         | transfer. However, we do improve dataset quality additively by
         | sampling the best data to label + retrain on to get the best
         | performance.
         | 
         | Said another way: once you've found "I do badly on green
         | cones," we use similarity search on the embeddings of known
         | green cone examples to find more instances of green cones in
         | the wild. We pick the right examples from streams of unlabeled
         | data, then send them to labeling + add them to your dataset so
         | your model does better the next time you retrain.
        
           | mlthoughts2018 wrote:
           | I like this much better than synthetic data augmentation
           | actually. I think synthetic augmentation, like with GANs, is
           | actually a failed concept.
           | 
           | There have long been theoretical limits around how much you
           | can gain by ensembling with a model of known limitations, and
           | this is all that synthetic training data is at root.
           | 
           | You can't "make up" training data that allows you to escape
           | the ceiling of performance implied by whatever generator
           | process you use for the synthetic data, just as you can't
           | learn a better regression by bootstrapping a large sample of
           | data from your existing training set.
           | 
           | Algorithmic synthetic data is a big type of fool's gold.
        
       | TuringNYC wrote:
       | Dear @pgao thank you for the long intro with references and
       | explanations. I went to your website and noticed the "getting
       | started" is a contact form. Curious -- are you making a product
       | to do this, or is it more consulting/advisory? I'm currently
       | creating some fun datasets for public usage and I'd love to be a
       | test rat for your software.
        
         | pgao wrote:
         | Hey there, it's a product right now! Our goal is to make it
         | self-serve, but we're currently onboarding people one-by-one
         | manually until we can streamline the onboarding flow and build
         | out a self-serve process. Feel free to DM me or fill out the
         | form and I can send you our public demo!
        
       | stev3 wrote:
       | Thanks for all the hard work and congrats on your launch!
       | 
       | I will definitely check this service out for a side project I'm
       | working on that combines basketball and AI
       | (https://www.myshotcount.com/)
        
       | ishcheklein wrote:
       | Hey! DVC maintainer and co-founder here. First of all, congrats
       | and let me know if we can help you or you have some collaboration
       | in mind! A few questions - what does the workflow look like - do
       | you expect users to upload all data to your service? How can
       | data then be consumed from the platform?
        
         | pgao wrote:
         | Thanks!
         | 
         | We don't expect users to upload all data to our service - the
         | type of data we're interested in is "metadata." URLs to the raw
         | data, labels, inferences, embeddings, and any additional
         | attributes for their dataset. Users can POST this to our API
         | and we'll ingest it that way.
         | 
         | If users don't provide their own embeddings, we need access to
         | the raw data so we can run our pretrained models on the data to
         | generate embeddings.
         | 
         | However, if users do provide their own embeddings, we would
         | never need access to the raw data - Aquarium operates on
         | embeddings, so the raw data URLs would be purely for
         | visualization within the UI. This is really nice because it
         | means we can access-restrict the URLs so only customers can
         | visualize them (via URL signing endpoints, only authorizing IP
         | addresses within customer VPNs, Okta integration) and Aquarium
         | would operate on relatively anonymized embeddings and metadata.
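         | 
         | To make "metadata" concrete, a record is conceptually something
         | like the sketch below. This is purely illustrative - the
         | endpoint and field names here are made up, not our actual API:
         | 
         |     import requests
         | 
         |     # Hypothetical record shape and endpoint, for illustration
         |     record = {
         |         "data_url": "https://example.com/frames/000123.jpg",
         |         "label": {"class": "cone", "bbox": [104, 220, 40, 65]},
         |         "inference": {"class": "cone", "confidence": 0.62},
         |         "embedding": [0.12, -0.48, 0.33],  # truncated
         |     }
         |     requests.post("https://api.example.com/v1/records",
         |                   json=record,
         |                   headers={"Authorization": "Bearer <token>"})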
        
       | hughpeters wrote:
       | Thanks for sharing @pgao! This tool looks really valuable.
       | 
       | > Since embeddings can be extracted from most neural networks,
       | this makes our platform very general. We have successfully
       | analyzed dataset + models operating on images, 3D point clouds
       | from depth sensors, and audio.
       | 
       | Are there any types of datasets/models that this tool would not
       | work well with that you're aware of?
        
         | pgao wrote:
         | Thanks a bunch!
         | 
         | I think the biggest issue with this approach is the
         | requirement for embeddings. It's sometimes hard for a customer
         | to understand what layer to pull out of their net to send to
         | us, so sometimes we just use a pretrained net to generate
         | embeddings. One net for audio, one net for imagery, one net for
         | pointclouds, etc.
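         | 
         | For imagery, a rough sketch of that pretrained-embedding trick
         | (using an off-the-shelf torchvision backbone, nothing specific
         | to us) is to drop the classification head and keep the
         | penultimate activations as the embedding:
         | 
         |     import torch
         |     import torchvision
         | 
         |     # images: (N, 3, 224, 224) tensor, ImageNet-normalized
         |     model = torchvision.models.resnet50(pretrained=True)
         |     model.fc = torch.nn.Identity()  # remove the classifier
         |     model.eval()
         |     with torch.no_grad():
         |         embeddings = model(images)  # (N, 2048) float vectors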
         | 
         | I'd say that it's harder for this tool to work with
         | structured/tabular data for a few reasons.
         | 
         | One, most structured datasets are domain-specific, so it's not
         | easy to pull a pretrained model off the shelf to generate
         | embeddings - typically we would need a customer to give us the
         | embeddings from their own model in these cases.
         | 
         | Two, neural nets actually aren't the best for certain
         | structured data tasks. Tree-based techniques often get better
         | performance on simpler tasks, which means there's no obvious
         | embedding to pull from the model.
         | 
         | Three, an alternate interpretation is that a feature vector
         | input for structured data tasks is already an embedding! When
         | the input data is low dimensional, you can do anomaly detection
         | and clustering just by histogramming and other basic population
         | statistics on your data, so it's a lot easier than dealing with
         | unstructured data like imagery.
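         | 
         | As a trivial example of what I mean by population statistics
         | (nothing Aquarium-specific, just per-column z-scores on a small
         | feature matrix):
         | 
         |     import numpy as np
         | 
         |     # X: (N, D) tabular feature matrix, D small
         |     def zscore_outliers(X, threshold=4.0):
         |         mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-9
         |         z = np.abs((X - mu) / sigma)
         |         # Rows where any column is far outside its population
         |         return np.where((z > threshold).any(axis=1))[0]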
         | 
         | So I wouldn't say that our tooling wouldn't work for structured
         | data, but more that in those types of cases, maybe there's
         | something simpler that works just as well.
        
       | tmshapland wrote:
       | I'm an Aquarium user. There are two ways Aquarium provides value
       | to my company. First, we improved our model performance. Second,
       | I spent less time and made fewer clicks curating my dataset.
       | 
       | Regarding model performance, I used Aquarium to improve the AUC
       | for my model by 18 percentage points (i.e., comparing the AUC for
       | the first model trained on my new dataset to the AUC for my
       | production model).
       | 
       | Regarding dataset curation efficiency, I spent much less time
       | curating my dataset using Aquarium than I would have spent using
       | our own in-house tooling. For example, the embedding-based point
       | cloud allowed me to identify lots of images with an issue at
       | once, rather than image by image, click by click.
       | 
       | This thread has been mostly focused on improving model
       | performance (i.e., my first point), but Aquarium is also valuable
       | for improving model curation labor efficiency (i.e., my second
       | point). For the business owner, dataset curation labor efficiency
       | means less money wasted on having some of your most expensive
       | employees, ML data scientists, clicking around and writing ad-hoc
       | scripts. For the ML practitioner, dataset curation labor
       | efficiency means fewer clicks and less wear on your carpal
       | tunnels.
       | 
       | The founders, Peter and Quinn, didn't ask me to write this. I
       | chose to write it because it's a great product that I think can
       | help a lot of businesses and people.
        
         | __sy__ wrote:
         | To second your comment, I think non-ML folks don't understand
         | how much of an impact dataset curation can have on model
         | performance. More high-quality data will outshine clever
         | network architectures with less data. I've seen it again and
         | again. But the thing is, it's so hard to really curate your
         | data once the dataset has a lot of "dimensionality" to it
         | (sorry couldn't think of a better word...). To be honest, if I
         | were to pick an area of dev-tool I'm most excited about over
         | the next 5 years, this area is probably it.
        
           | __sy__ wrote:
           | Btw, for anyone interested, here's a good/quick talk by
           | Andrej Karpathy on what it will take to build the next
           | software stack.
           | https://www.youtube.com/watch?v=y57wwucbXR8&t=3s
        
       | masio12 wrote:
       | I think this is a great idea because, as you mentioned, quality
       | datasets can make or break your model. However, this is not
       | addressing the elephant in the room, which is: no matter how
       | much you curate or clean the data, you are limited to the
       | dataset that you have. The big question is how to get more and
       | better datasets. I think tooling is super important, but the
       | big differentiator will be how to collect/generate/capture
       | reliable, defensible datasets moving forward. I think your idea
       | is complementary to this other project: https://delegate.dev
        
         | pgao wrote:
         | I absolutely, 120% agree on the importance of adding the right
         | data. Aquarium helps you with: "what data should I be
         | collecting to improve my model" and "where do I find that
         | data?"
         | 
         | For the latter, Aquarium treats the problem of smart data
         | sampling as a search and retrieval problem. You want to find
         | more examples of a "target" from a large stream of unlabeled
         | data. Aquarium does this by comparing embeddings of the
         | unlabeled data to your "target set" and then sending examples
         | to labeling if they're within a defined distance threshold in
         | embedding space. We don't actually do the labeling, but we wrap
         | around common labeling providers and can integrate into in-
         | house flows with our API.
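         | 
         | In toy numpy terms (a sketch of the idea, not our actual
         | pipeline), that selection step looks roughly like:
         | 
         |     import numpy as np
         | 
         |     # unlabeled_emb: (N, D), target_emb: (M, D)
         |     def select_for_labeling(unlabeled_emb, target_emb, thresh):
         |         # Distance from each unlabeled example to its closest
         |         # example in the target set.
         |         d = np.linalg.norm(
         |             unlabeled_emb[:, None, :] - target_emb[None, :, :],
         |             axis=-1).min(axis=1)
         |         return np.where(d < thresh)[0]  # queue these for labels
         | 
         | In practice you'd use an approximate nearest neighbor index
         | rather than brute force, but the idea is the same.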
        
           | quinnhj wrote:
           | Other founder here! For a high level overview of this framing
           | of the problem, I recommend reading this Waymo blog post [1].
           | 
           | One nice feature is that by using embeddings produced by a
           | user's model, which has been trained in the context of their
           | domain, we can do this sort of smart sampling in domains
           | we've never seen before. Embeddings are also naturally
           | anonymized, so we can do this without access to a user's
           | potentially private raw data streams.
           | 
           | [1] https://blog.waymo.com/2020/02/content-search.html
        
       | jononor wrote:
       | I have tested the tool a little bit for audio and see potential
       | here. It's especially useful for anyone who has a relatively
       | large amount of unlabeled data and wants to be efficient about
       | which samples to spend labeling resources on.
        
         | pgao wrote:
         | Thanks for the shoutout! We got connected to jononor through
         | our previous r/machinelearning launch:
         | https://www.reddit.com/r/MachineLearning/comments/hjbl4h/p_l...
        
       ___________________________________________________________________
       (page generated 2020-07-13 23:00 UTC)