[HN Gopher] Cleaning algorithm finds 20% of errors in major imag...
       ___________________________________________________________________
        
       Cleaning algorithm finds 20% of errors in major image recognition
       datasets
        
       Author : groar
       Score  : 150 points
       Date   : 2020-04-16 16:08 UTC (6 hours ago)
        
 (HTM) web link (deepomatic.com)
 (TXT) w3m dump (deepomatic.com)
        
       | kent17 wrote:
       | > We then used the error spotting tool on the Deepomatic platform
       | to detect errors and to correct them.
       | 
        | I'm wondering if those errors are selected based on how much
        | they impact performance?
       | 
       | Anyway, this is probably a much better way of gaining accuracy on
       | the cheap than launching 100+ models for hyperparameter tuning.
        
       | frenchie4111 wrote:
       | Best I can tell, they are using the ML model to detect the
       | errors. Isn't this a bit of an ouroboros? The model will
       | naturally get better, because you are only correcting problems
       | where it was right but the label was wrong.
       | 
        | It's not necessarily evidence of a better model, just of a
        | better test set.
        
         | groar wrote:
         | If I understand correctly they actually did not change the test
         | set.
        
           | frenchie4111 wrote:
           | Ah, I guess I missed that
        
       | kent17 wrote:
       | 20% annotation error is huge, especially since those datasets
        | (COCO, VOC) are used for basically every benchmark and piece of
        | state-of-the-art research.
        
         | rndgermandude wrote:
         | And people wonder why I am still a bit skeptical of self-
         | driving cars....
        
           | s1t5 wrote:
           | In one of his fastai videos Jeremy Howard makes the point
           | that wrong labels can act as regularization and you shouldn't
           | worry too much about them. I'm a bit skeptical as to how far
           | you can push this but you certainly don't need _perfect_
           | labelling.
        
             | kingvash wrote:
             | We did some interesting experiments with Go where we
             | inverted the label of who won and measured what impact that
             | had on the final model. This is a binary label so it's
             | probably more impactful (it's the only signal we are
             | measuring)
             | 
              | From memory, flipping ~7% of results had only a small
              | impact (about 2% in strength); at 4% flipped, the impact
              | was hard to measure (<1%).
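              | 
              | (The corruption itself is trivial; an illustrative sketch,
              | not the actual code, is just flipping a fraction of the
              | binary labels before training:)
              | 
              |     import random
              | 
              |     def flip_labels(labels, flip_rate, seed=0):
              |         # Invert a random fraction of the binary win/loss
              |         # labels (+1 = win, -1 = loss).
              |         rng = random.Random(seed)
              |         return [-y if rng.random() < flip_rate else y
              |                 for y in labels]
              | 
              |     outcomes = [1, -1, 1, 1, -1] * 200  # stand-in results
              |     noisy_7 = flip_labels(outcomes, 0.07)  # ~7% flipped
              |     noisy_4 = flip_labels(outcomes, 0.04)  # ~4% flipped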
        
             | groar wrote:
             | That is true up to a certain point (for instance, in my
             | experience, having bounding boxes that are not pixel-
             | perfect acts as a regularizer), but there is also a good
             | chance that you are mislabelling edge cases, situations
              | that happen rarely, and that definitely hurts the neural
              | network's ability to make correct predictions in these
              | difficult / uncommon scenarios.
        
             | strbean wrote:
              | Also, this applies to mislabeled data in your training
             | set, right? Not a good thing if it is in your test set.
        
           | rumanator wrote:
           | What sparks your skepticism of self-driving cars? Although
           | some companies use stereo vision to generate point clouds,
           | others use lidar.
        
             | rndgermandude wrote:
              | A lot of things. One is the "AI", which isn't so much "I",
              | is quite error prone, and is hard or impossible to analyze
              | in detail and/or debug. The idea that bad people (be it
             | trolls, criminals or spooks) could force deliberate
             | malfunctioning of/misclassifications in AIs and thus cause
             | crashes is off-putting, on top of the general "normal"
             | errors you can expect.
             | 
             | Then the business/political aspects of it, like Tesla
             | demanding somebody who bought a used car pay again for
             | Autopilot.
             | 
             | We already saw crashes by Autopilot users not paying any
             | attention whatsoever (granted AP isn't fully "self-
             | driving", but still).
             | 
              | On top of that, just as with better car safety and even
              | with the introduction of seat belt laws, we saw a stark
              | uptick in accidents that mostly affected people outside
              | the car, such as pedestrians and cyclists. Since I'm quite
              | often a pedestrian myself, I particularly dread semi-self-
              | driving/assisted-driving car tech like Autopilot, and I'm
              | quite skeptical when people tell me that (almost) perfect
              | fully self-driving cars are just around the corner. If my
              | skepticism turns out to be unwarranted, great.
             | 
              | And this tech will keep many consumer cars around longer,
              | at the expense of public transportation. The one good-ish
             | thing that came out of SARS-CoV-2 is the reduction in air
             | pollution (I am not saying it is a net positive because of
             | that, far from it). The air smells noticeably nicer around
             | here and the noise is also down.
        
               | ebg13 wrote:
               | > _The idea that bad people (be it trolls, criminals or
               | spooks) could force deliberate malfunctioning of
               | /misclassifications in AIs and thus cause crashes_
               | 
               | I wish people would stop trotting this one out. Bad
               | actors can deliberately cause humans to crash just as
                | easily, if not more so. If they don't, it's only because
               | such behavior is punishable.
        
               | rndgermandude wrote:
               | Making somebody crash in a dumb car is pretty hard if you
               | want to do it in an undetectable manner with minor to no
               | risk to yourself or anybody else.
               | 
                | Glitching an AI, on the other hand, e.g. by holding up a
                | sign, is less risky for yourself and less detectable.
        
               | ebg13 wrote:
               | > _Making somebody crash in a dumb car is pretty hard_
               | 
               | That's not true even allowing for your next constraints,
               | one of which I find to be quite absurd.
               | 
               | In the advanced technological case, you have
               | https://www.theverge.com/2015/7/21/9009213/chrysler-
               | uconnect...
               | 
               | In the non-advanced technological case, you can drop
               | caltrops behind your vehicle as you drive and no one
               | would know it was you.
               | 
               | "But that only happens in cartoons" - Yes, because most
               | people are not cartoon villains. And yet, look, kids
               | throwing rocks, no AI necessary:
               | https://en.wikipedia.org/wiki/2017_Interstate_75_rock-
               | throwi...
               | 
               | > _with minor to no risk to ... anybody else_
               | 
               | Ah, yes, the ethical murderer who only wants to fuck up
               | just that one car but who sincerely worries about the
               | other drivers on the road. That's the demographic you're
               | concerned about? So how does indiscriminately trying to
               | trick generally available systems specifically target
               | only one person without risking other drivers?
        
               | rndgermandude wrote:
               | If you're interested in replying in a condescending
               | manner and attacking strawmen arguments I never made, be
               | my guest, but I have no desire to further discuss this
               | with you.
        
               | ebg13 wrote:
               | > _I have no desire to further discuss this with you_
               | 
               | I'll just talk to myself then, because, while I
               | understand you feeling hurt by my comment, I did not
               | attack a strawman.
               | 
               | > _Making somebody crash in a dumb car is pretty hard..._
               | 
               | Not true. (I gave examples.)
               | 
               | > _...if you want to do it in an undetectable manner..._
               | 
               | Still not true. (Same examples.)
               | 
               | > _...with minor to no risk to yourself..._
               | 
               | Still not true. (Same examples.)
               | 
               | > _...or anybody else._
               | 
               | Still not true. (This is absurd. Also the same examples
               | still apply.)
        
         | peteradio wrote:
         | Is it really 20% annotation error? I read it as 20% of the
          | errors were detected. The errors could be a very small
          | percentage overall, and of those, only 20% were detected.
        
           | ebg13 wrote:
           | I think the submitted headline is wrong. The article says "we
           | found annotation errors on more than 20% of images". Maybe
           | dang could fix it.
        
             | groar wrote:
             | Agreed, my initial title is not accurate. It should say
             | "finds errors in 20% of annotations".
        
       | CydeWeys wrote:
       | Why aren't these data sets editable instead of static? Treat them
       | like a collaborative wiki or something (OpenStreetMap being the
       | closest fit) and allow everyone to submit improvements so that
       | all may benefit.
       | 
       | I hope the people in this article had a way to contribute back
       | their improvements, and did so.
        
         | lmkg wrote:
         | One major use of the public datasets in the academic community
         | is to serve as a common reference when comparing new techniques
         | against the existing standard. A static baseline is desirable
         | for this task.
         | 
         | You could maybe split the difference by having an "original" or
         | "reference" version, and a separate moving target that
         | incorporates crowdsourced improvements.
        
           | CydeWeys wrote:
           | This sounds like a revisioning system would help a lot. Have
           | a quarterly or annual release cycle or something, so that
           | when you want to compare performance across techniques, you
           | just train both of them to the same target (and ideally all
           | the papers coming out at roughly the same time would already
           | be using the same revision anyway).
           | 
           | You'd always work with a versioned release when training
           | models, and you'd only typically work with HEAD when you were
           | specifically looking to correct flaws in the data (as the
           | authors in the linked article are).
        
         | 6gvONxR4sf7o wrote:
         | The datasets serve as benchmarks. You get an idea for a new
          | model that solves a problem current models have. These ideas
          | often don't pan out, so you need empirical evidence that yours
          | works. To
         | show that your model does better than previous models, you need
         | some task that your model and previous models can share for
         | both training and evaluation. It's more complicated than that,
         | but that's the gist.
         | 
         | It would be so wasteful to have to retrain a dozen models that
          | require a month of GPU time each just to serve as baselines for
         | your new model...
        
           | hatmatrix wrote:
           | But you can have version numbers like with code and models.
        
           | barkingcat wrote:
           | That's not wasteful. That's correction.
           | 
           | Is it wasteful to throw away a batch of food when 20% of it
            | has been found to contain the wrong substance, which ends
           | up causing disease?
           | 
           | Isn't it even more wasteful to continue using unedited and
           | unverified data sets just because all the previous models
            | were trained on them, and thus we can no longer advance the
           | state of the research? It's a case of garbage in garbage out.
        
             | 6gvONxR4sf7o wrote:
             | >By one estimate, the training time for AlphaGo cost $35
             | million [0]
             | 
             | How about XLNet which cost something like $30k-60k to train
              | [1]? GPT-2 is estimated to have cost around the same [2],
              | while thankfully BERT only costs about $7k [3], unless of
              | course you're going to do any new hyperparameter tuning on
              | their models, which you of course will do on your own
              | model. Who cares about apples-to-apples
             | comparisons?
             | 
             | We're not talking about spending an extra couple hours and
             | a little money on updated replication. We're talking about
             | an immediate overhead of tens to hundreds of thousands of
             | dollars per new paper.
             | 
             | Tasks are updated over time already to take issues into
             | account, but not continuously as far as I know.
             | 
             | [0] https://www.wired.com/story/deepminds-losses-future-
             | artifici...
             | 
             | [1]
             | https://twitter.com/jekbradbury/status/1143397614093651969
             | 
             | [2] https://news.ycombinator.com/item?id=19402666
             | 
             | [3] https://syncedreview.com/2019/06/27/the-staggering-
             | cost-of-t...
        
               | barkingcat wrote:
                | Yeah, it is by no means wasteful for AlphaGo to throw
                | away all its training data and then re-train itself!
               | 
               | That kind of ruthless experimentation is how AlphaGo was
               | able to exceed even itself. The willingness to say - all
               | these human games we've fed the computer? All these
               | terabytes of data? It's all meaningless! We're going to
               | throw it all away! We will have AlphaGo determine what is
               | good by playing games against itself!
               | 
               | And I bet you that for the next iteration of AlphaGo, the
               | creators of this system will again, delete their own data
               | and retrain when they have a better approach.
               | 
               | If you don't "waste" your existing datasets (once you
                | realize the flaws in your data sets), you are being held
                | back by the sunk cost fallacy. You only have yourself
               | to blame when someone does train for the exact same
               | purposes, but with cleaner data.
               | 
               | The person who has the cleanest source of training data
               | will win in deep learning.
               | 
                | You're sabotaging yourself, in my opinion. $30k is
                | nothing compared to sabotaging the training with faulty
                | data.
        
               | [deleted]
        
               | third_I wrote:
               | As an investor, $35m to train just about the pinnacle of
               | AI seems like a cheap, oh so cheap cost. I can't even buy
               | 1 freaking continental jet for that ticket, and there are
               | thousands of these babies flying (not as we speak, but
               | generally).
               | 
                | I don't think you are fully cognizant yet of the
                | formidable scale of AI in the grander scheme of things;
                | as an industry it is nowadays comparable to transistors
                | circa 1972 in terms of maturity. There is a long, long
                | way to go before we settle on "reference" anything.
                | Whether architectures, protocols, models, or test
                | standards, it's the Wild West as we speak.
               | 
               | You make excellent points in principle, which are
               | important to keep in mind in guiding us all along the
               | way, but now is not the time to set things in stone. More
               | like the opposite.
               | 
                | The fact of the matter is that someone will eventually
               | grab the old and new benchmarks, prove superiority in
               | both, and by that point the new is the one to beat since
               | it would be presumably error-free this time.
        
               | visarga wrote:
                | BERT is trained on unlabelled data. It's not the same
               | kind of model the article talks about.
        
               | [deleted]
        
               | p1esk wrote:
               | I'm actually glad it costs so much to train these models.
               | Great incentive to find more efficient algorithms. That's
               | how biological brains evolved.
        
             | lopmotr wrote:
             | The dataset is a controlled variable in an experiment so it
             | has to be held constant. If you update your model and the
             | dataset for every trial (eg new hyperparameters or new
             | architecture), and find it performs better, you won't know
             | if the model is really better or just the dataset.
        
             | lmkg wrote:
             | The thing is, the value as a baseline doesn't actually
             | change that much for being 20% garbage. A bit counter-
             | intuitive, but basically accepted as true in several
             | fields.
             | 
             | The comparisons are all relative accuracy, not absolute
             | accuracy. And the comparison is _fair_. The new technique
              | is receiving the same part-garbage input that the old
              | techniques were trained on. For the most part, the better
              | technique will still tend to do better unless there's
             | specifically something about it that makes it more
             | sensitive to labeling errors.
             | 
             | And frankly, a percentage of junk has some advantages.
              | Real-world data is a pile of ass, so it's useful for
             | academic models to require robustness.
        
               | ethbro wrote:
                | I thought SOTA was still only a few % difference?
                | 
                | It seems worrisome that those few percent might come from
                | making a coin flip right-randomly instead of wrong-
                | randomly on a mislabelled subgroup of the data...
        
           | roosterdawn wrote:
           | What you're saying is that it's worth it to lie because it's
           | too expensive to give a truthful answer. That is something
           | that your customers likely would not agree with.
        
           | [deleted]
        
           | sdenton4 wrote:
           | It also potentially gives every paper N replication problems
            | to solve, in addition to just the GPU time. I would have to
           | figure out HOW to retrain all of these models on the current
           | form of the dataset... Which is fine for an occasional
           | explicit replication study, but terrible if everyone has to
           | do it.
           | 
           | I think it's probably better to have a (say) yearly release
           | of the dataset, with results of some benchmark models
           | released alongside the new version.
           | 
           | This is similar to how Common Voice is handling the problem:
            | it's a crowdsourced, constantly growing dataset, which is
            | awesome if you want to train on as much data as possible for
           | production models. You can get the whole current version any
           | time, but they also have releases with a static fileset and
           | train/test split, which should be better for research.
        
         | polm23 wrote:
         | Multiple reasons, but to name a few:
         | 
         | - Don't want to deal with vandalism
         | 
         | - Hosting static data is dramatically easier than making a
         | public editing interface
         | 
         | - You want reference versions of the dataset for papers to
         | refer to so that results are comparable. Sometimes this is used
         | as a justification for not fixing completely broken data, like
         | with Fasttext.
         | 
         | https://github.com/facebookresearch/fastText/issues/710
         | 
         | - Building on the previous point, large datasets like this
         | don't play nice with Git. There are lots of "git for data"
         | things but none of them are very mature, and most people don't
         | spend time trying to figure something out.
        
         | [deleted]
        
         | xiphias2 wrote:
         | One problem with correcting the benchmark datasets is that it's
         | important for the algorithms to be robust to labelling errors
         | as well. But having multiple versions sounds important anyways.
        
         | seveibar wrote:
          | I'm working on this [1]. My theory is that the lack of a good
          | IDE (rather than a simple crowdsourcing interface) is the
          | reason it hasn't been done.
          | 
          | Imagine if GitHub had an integrated IDE for editing large
          | datasets. Also see Dolt, which is doing good work here.
         | 
         | [1] https://github.com/UniversalDataTool/universal-data-tool
        
       | groar wrote:
        | Using simple techniques, they found that popular open source
        | datasets like VOC or COCO contain annotation errors in up to 20%
        | of their images.
       | By manually correcting those errors, they got an average error
       | reduction of 5% for state-of-the-art computer vision models.
        
         | jessermeyer wrote:
         | Garbage in garbage out.
        
       | m0zg wrote:
       | An idea on how this could work: repeatedly re-split the dataset
       | (to cover all of it), and re-train a detector on the splits, then
       | at the end of each training cycle surface validation frames with
       | the highest computed loss (or some other metric more directly
        | derived from bounding boxes, such as the number of high-
        | confidence "false" positives, which could be instances of
        | under-labeling). That's what I do on noisy, non-academic
        | datasets, anyway.
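        | 
        | Roughly, in code (just a sketch; train_detector and frame_loss
        | are placeholders for whatever detector and per-frame loss you
        | actually use):
        | 
        |     from sklearn.model_selection import KFold
        | 
        |     def surface_suspect_frames(frames, n_splits=5, top_k=100):
        |         suspects = []
        |         kf = KFold(n_splits, shuffle=True, random_state=0)
        |         for train_idx, val_idx in kf.split(frames):
        |             model = train_detector([frames[i] for i in train_idx])
        |             # High loss on a held-out frame means predictions
        |             # and annotations disagree the most there.
        |             losses = [(frame_loss(model, frames[i]), i)
        |                       for i in val_idx]
        |             suspects += sorted(losses, reverse=True)[:top_k]
        |         # Every frame lands in exactly one validation fold, so
        |         # the whole dataset is covered; review the worst by hand.
        |         return [i for _, i in sorted(suspects, reverse=True)]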
        
       | jontro wrote:
        | Weird behaviour on pinch-to-zoom (MacBook). It scrolls instead of
        | zooming, and when swiping back nothing happens.
       | 
       | Another example of why you should never mess with the defaults
       | unless strictly necessary.
        
       | rathel wrote:
        | However, nothing is said about _how_ the errors are detected. Can
        | an ML expert chime in?
        
         | ArnoVW wrote:
          | My guess would be some sort of active learning. In other words:
          | 
          | 1) build a model using the data set
          | 
          | 2) make predictions on the training data
          | 
          | 3) find the cases where the model is most confused (the
          | difference in probability between classes is low)
          | 
          | 4) raise those cases to humans
         | 
         | https://en.wikipedia.org/wiki/Active_learning_(machine_learn...
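          | 
          | Step 3 would be classic margin-based uncertainty sampling,
          | something like this (a sketch, assuming you already have the
          | per-class probabilities as a numpy array):
          | 
          |     import numpy as np
          | 
          |     def most_confused(probs, k=100):
          |         # probs: (n_samples, n_classes) predicted
          |         # probabilities on the training set itself
          |         p = np.sort(probs, axis=1)
          |         margin = p[:, -1] - p[:, -2]  # top-1 minus top-2
          |         # smallest margin = most confused -> send to humans
          |         return np.argsort(margin)[:k]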
        
         | thibaut-duguet wrote:
         | Hi rathel, I'm a Product Manager at Deepomatic and I have been
         | leading the study in question here. To detect the errors, we
         | trained a model (with a different neural network architecture
         | than the 6 listed in the post), and we then have a matching
         | algorithm that highlights all bounding boxes that were either
         | annotated but not predicted (False Negative), or predicted but
          | not annotated (False Positive). Those potential errors are also
          | sorted by an error score so that the most obvious errors come
          | first. Happy to answer any other questions you may have!
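          | 
          | Conceptually, the matching step looks something like this (a
          | simplified sketch, not our actual code; the box format, the
          | IoU threshold and the confidence-based ordering are
          | illustrative):
          | 
          |     def iou(a, b):
          |         # boxes as (x1, y1, x2, y2)
          |         ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
          |         ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
          |         inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
          |         area_a = (a[2] - a[0]) * (a[3] - a[1])
          |         area_b = (b[2] - b[0]) * (b[3] - b[1])
          |         return inter / (area_a + area_b - inter + 1e-9)
          | 
          |     def spot_errors(annotations, predictions, iou_thr=0.5):
          |         # predictions: list of (box, confidence) for one image
          |         # annotated but never predicted -> False Negative
          |         fn = [a for a in annotations
          |               if all(iou(a, p) < iou_thr
          |                      for p, _ in predictions)]
          |         # predicted but never annotated -> False Positive
          |         fp = [(p, c) for p, c in predictions
          |               if all(iou(a, p) < iou_thr
          |                      for a in annotations)]
          |         fp.sort(key=lambda pc: -pc[1])  # most obvious first
          |         return fn, fp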
        
           | rathel wrote:
           | Thank you for the explanation.
        
           | [deleted]
        
           | liquidify wrote:
           | Curious if you could find errors by comparing the results
           | from the different models. Places where models disagree with
           | each other more often would be areas that I would want to
           | target for error checking.
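            | 
            | E.g. rank images by how much the per-image detection counts
            | vary across models (a crude sketch; a real version would
            | match boxes between models rather than just count them):
            | 
            |     import statistics
            | 
            |     def disagreement_ranking(per_model_counts):
            |         # per_model_counts[m][img] = number of boxes that
            |         # model m predicts for image img
            |         images = per_model_counts[0].keys()
            |         score = {img: statistics.pstdev(
            |                      [c[img] for c in per_model_counts])
            |                  for img in images}
            |         # highest spread first: review those images
            |         return sorted(score, key=score.get, reverse=True)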
        
             | thaumasiotes wrote:
             | > Places where models disagree with each other more often
             | would be areas that I would want to target for error
             | checking.
             | 
             | This is a great idea if your goal is to maximize the rate
             | at which things you look at turn out to be errors. (On at
             | least one side.)
             | 
             | But it's guaranteed to miss cases where every model makes
             | the same inexplicable-to-the-human-eye mistake, and those
             | cases would appear to be especially interesting.
        
           | Zenst wrote:
            | Were the corrected datasets larger or smaller than the
            | originals?
            | 
            | It would also be interesting to run these improved datasets
            | through crash simulations alongside the existing datasets
            | and see how they compare. Though I'm not sure how you would
            | go about that beyond approaching current providers of such
            | cars for data to work through, and I suspect they may be
            | less open to admitting flaws, which may be a stumbling
            | block.
            | 
            | It certainly makes you wonder how far we can optimise such
            | datasets to get better results. I know some ML datasets are
            | a case of humans fine-tuning, going through examples and
            | classifying them, and I wonder how much that skews or
            | affects error rates, as we all know humans err.
        
             | thibaut-duguet wrote:
              | Hi Zenst. To answer your first question, we had bounding
              | boxes both added and removed, and the main type of error
              | differed depending on the dataset (I'd say overall it was
              | mostly objects that were forgotten, especially small
              | objects).
             | 
             | It would indeed be very interesting to see the impact of
             | those improved datasets on driving, which is ultimately the
             | task that is automated for cars. We've been working on many
              | projects at Deepomatic, not only related to autonomous
              | cars, and we did see some concrete impact from cleaning
              | the datasets beyond performance metrics.
        
           | alexchamberlain wrote:
            | i.e. you get someone to check where the model and the
            | annotations disagree.
        
         | captain_price7 wrote:
          | Plus we'll have to register simply to see a few examples of
          | mislabeling... that was disappointing.
        
           | thibaut-duguet wrote:
           | Hi captain_price7, I've added screenshots of errors in the
           | blogpost so that you have an idea of the errors we spotted.
           | Let me know what you think of them.
        
             | thaumasiotes wrote:
             | A couple notes on those screenshots:
             | 
             | - In the cars-on-the-bridge image, the red bounding box for
             | the semitruck in the oncoming lanes is too small, with its
             | upper bound just above the top of the semi's windshield,
             | ignoring the much taller roof and towed container.
             | 
             | - In the same image, there are red bounding boxes around
             | cars that exist, and also red bounding boxes around non-
             | cars that don't exist. If false positives and false
             | negatives are going to be represented in the same picture,
             | it'd be nice to use different colors for them, so the
             | viewer can tell whether the error was identified correctly
             | or spuriously.
             | 
             | - I have trouble understanding the "bus" screenshot. The
             | caption says "(green pictures are valid errors) - The pink
             | dotted boxes are objects that have not been labelled but
             | that our error spotting algorithm highlighted." In other
             | words, the green-highlighted pictures are false negatives
             | considered from the perspective of the original data set,
             | and the red-highlighted pictures are true negatives. Or
             | alternatively, the green-highlighted pictures are true
             | positives from the perspective of the error-spotting
             | algorithm, and the red-highlighted pictures are false
             | positives. What confuses me is that all 9 pictures are
             | labeled "false positive" by the tabbing at the top of the
             | screenshot.
        
       | benibela wrote:
       | These things are why I stopped doing computer vision after my
        | master's thesis.
        
       | fwip wrote:
       | The title here seems wrong. Suggested change:
       | 
       | "Cleaning algorithm finds 20% of errors in major image
       | recognition datasets" -> "Cleaning algorithm finds errors in 20%
       | of annotations in major image recognitions."
       | 
       | We don't know if the found errors represent 20%, 90% or 2% of the
       | total errors in the dataset.
        
         | [deleted]
        
         | groar wrote:
          | Yes, agreed with that! I can't change the title, unfortunately.
        
       | magicalhippo wrote:
       | > Create an account on the Deepomatic platform with the voucher
       | code "SPOT ERRORS" to visualize the detected errors.
       | 
       | Nice ad.
        
         | thibaut-duguet wrote:
         | Our platform is actually designed for enterprise companies, so
         | we don't provide open access unfortunately.
        
           | scribu wrote:
           | I signed up and still couldn't see the errors.
           | 
           | I just see 3 datasets with generic annotations.
        
             | thibaut-duguet wrote:
              | Hi scribu. The process is actually a bit complicated, but
              | let me explain it. Once you are on a dataset, click
             | on the label that you want and use the slider at the top
             | right corner of the page to switch modes (we call it smart
             | detection). You should then be able to access three tabs
             | and the errors are listed in the False Positive and False
             | Negative tabs (I've added a screenshot in the blogpost so
              | that you can make sure you're in the right place). Let me
             | know if you have any problem, thanks!
        
               | scribu wrote:
               | Thanks, I can see them now.
        
           | magicalhippo wrote:
           | Still, couldn't you have included an example or two in the
            | article to illustrate the kind of errors we're talking
           | about?
        
       ___________________________________________________________________
       (page generated 2020-04-16 23:00 UTC)