[HN Gopher] Cleaning algorithm finds 20% of errors in major imag...
___________________________________________________________________
 
Cleaning algorithm finds 20% of errors in major image recognition
datasets
 
Author : groar
Score  : 150 points
Date   : 2020-04-16 16:08 UTC (6 hours ago)
 
(HTM) web link (deepomatic.com)
(TXT) w3m dump (deepomatic.com)
 
  | kent17 wrote:
  | > We then used the error spotting tool on the Deepomatic platform
  | to detect errors and to correct them.
  |
  | I'm wondering if those errors are selected based on how much they
  | impact the performance?
  |
  | Anyway, this is probably a much better way of gaining accuracy on
  | the cheap than launching 100+ models for hyperparameter tuning.
  | frenchie4111 wrote:
  | Best I can tell, they are using the ML model to detect the
  | errors. Isn't this a bit of an ouroboros? The model will
  | naturally get better, because you are only correcting problems
  | where it was right but the label was wrong.
  |
  | It's not necessarily a representation of a better model, but just
  | of a better testing set.
  | groar wrote:
  | If I understand correctly, they actually did not change the test
  | set.
  | frenchie4111 wrote:
  | Ah, I guess I missed that
  | kent17 wrote:
  | 20% annotation error is huge, especially since those datasets
  | (COCO, VOC) are used for basically every benchmark and state of
  | the art research.
  | rndgermandude wrote:
  | And people wonder why I am still a bit skeptical of
  | self-driving cars....
  | s1t5 wrote:
  | In one of his fastai videos Jeremy Howard makes the point
  | that wrong labels can act as regularization and you shouldn't
  | worry too much about them. I'm a bit skeptical as to how far
  | you can push this but you certainly don't need _perfect_
  | labelling.
  | kingvash wrote:
  | We did some interesting experiments with Go where we
  | inverted the label of who won and measured what impact that
  | had on the final model. This is a binary label so it's
  | probably more impactful (it's the only signal we are
  | measuring).
  |
  | From memory it had only a small impact (2% strength) with
  | ~7% of results flipped; at 4% it was hard to measure the
  | impact (<1%).
  | groar wrote:
  | That is true up to a certain point (for instance, in my
  | experience, having bounding boxes that are not pixel-perfect
  | acts as a regularizer), but there is also a good
  | chance that you are mislabelling edge cases, situations
  | that happen rarely, and that definitely hurts the neural
  | network's ability to make correct predictions on these
  | difficult / uncommon scenarios.
  | strbean wrote:
  | Also, this applies to mislabeled data in your training
  | set, right? Not a good thing if it is in your test set.
  | rumanator wrote:
  | What sparks your skepticism of self-driving cars? Although
  | some companies use stereo vision to generate point clouds,
  | others use lidar.
  | rndgermandude wrote:
  | A lot of things. One is the "AI" which isn't so much "I",
  | is quite error prone, and is hard to impossible to analyze
  | in detail and/or debug. The idea that bad people (be it
  | trolls, criminals or spooks) could force deliberate
  | malfunctioning of/misclassifications in AIs and thus cause
  | crashes is off-putting, on top of the general "normal"
  | errors you can expect.
  |
  | Then the business/political aspects of it, like Tesla
  | demanding somebody who bought a used car pay again for
  | Autopilot.
  |
  | We already saw crashes by Autopilot users not paying any
  | attention whatsoever (granted AP isn't fully
  | "self-driving", but still).
  |
  | On top of that, just like with better car safety and even
  | with the introduction of safety belt laws, we saw a stark
  | uptick in accidents that usually affected people outside
  | the car the most, such as pedestrians and bikers. So, being
  | a pedestrian quite often, I dread in particular the
  | semi-self-driving/assisted driving car tech like Autopilot,
  | and keep a healthy skepticism when people tell me that the
  | (almost) perfect fully self-driving cars are just around
  | the corner. If my skepticism turns out to be unwarranted,
  | great.
  |
  | And this tech will keep many consumer cars around longer,
  | in disfavor of public transportation. The one good-ish
  | thing that came out of SARS-CoV-2 is the reduction in air
  | pollution (I am not saying it is a net positive because of
  | that, far from it). The air smells noticeably nicer around
  | here and the noise is also down.
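Stepping back to kingvash's label-flipping experiment a few comments up: below is a minimal sketch of that kind of robustness check, assuming a generic scikit-learn binary classifier on synthetic data rather than the Go training pipeline described in the comment. The flip rates, dataset and model are placeholders for illustration only.

    # Sketch: flip a fraction of binary training labels and see how much
    # the score on a *clean* test set moves. Dataset and model are
    # stand-ins, not the Go setup from the comment above.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    for flip_rate in (0.0, 0.04, 0.07, 0.15):
        y_noisy = y_tr.copy()
        n_flip = int(flip_rate * len(y_noisy))
        idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
        y_noisy[idx] = 1 - y_noisy[idx]  # invert the "who won" label
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
        acc = model.score(X_te, y_te)  # clean test set
        print(f"flip rate {flip_rate:.0%}: clean test accuracy {acc:.3f}")

On toy data like this, a few percent of flipped labels usually moves the score only slightly, in line with kingvash's observation, though as groar notes the flips that land on rare edge cases are the ones that hurt.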
  | ebg13 wrote:
  | > _The idea that bad people (be it trolls, criminals or
  | spooks) could force deliberate malfunctioning
  | of/misclassifications in AIs and thus cause crashes_
  |
  | I wish people would stop trotting this one out. Bad
  | actors can deliberately cause humans to crash just as
  | easily, if not more so. If they don't, it's only because
  | such behavior is punishable.
  | rndgermandude wrote:
  | Making somebody crash in a dumb car is pretty hard if you
  | want to do it in an undetectable manner with minor to no
  | risk to yourself or anybody else.
  |
  | Glitching an AI, on the other hand, e.g. by holding up a
  | sign, is less risky for yourself and less detectable.
  | ebg13 wrote:
  | > _Making somebody crash in a dumb car is pretty hard_
  |
  | That's not true even allowing for your next constraints,
  | one of which I find to be quite absurd.
  |
  | In the advanced technological case, you have
  | https://www.theverge.com/2015/7/21/9009213/chrysler-uconnect...
  |
  | In the non-advanced technological case, you can drop
  | caltrops behind your vehicle as you drive and no one
  | would know it was you.
  |
  | "But that only happens in cartoons" - Yes, because most
  | people are not cartoon villains. And yet, look, kids
  | throwing rocks, no AI necessary:
  | https://en.wikipedia.org/wiki/2017_Interstate_75_rock-throwi...
  |
  | > _with minor to no risk to ... anybody else_
  |
  | Ah, yes, the ethical murderer who only wants to fuck up
  | just that one car but who sincerely worries about the
  | other drivers on the road. That's the demographic you're
  | concerned about? So how does indiscriminately trying to
  | trick generally available systems specifically target
  | only one person without risking other drivers?
  | rndgermandude wrote:
  | If you're interested in replying in a condescending
  | manner and attacking strawman arguments I never made, be
  | my guest, but I have no desire to further discuss this
  | with you.
  | ebg13 wrote:
  | > _I have no desire to further discuss this with you_
  |
  | I'll just talk to myself then, because, while I
  | understand you feeling hurt by my comment, I did not
  | attack a strawman.
  |
  | > _Making somebody crash in a dumb car is pretty hard..._
  |
  | Not true. (I gave examples.)
  |
  | > _...if you want to do it in an undetectable manner..._
  |
  | Still not true. (Same examples.)
  |
  | > _...with minor to no risk to yourself..._
  |
  | Still not true. (Same examples.)
  |
  | > _...or anybody else._
  |
  | Still not true. (This is absurd. Also the same examples
  | still apply.)
  | peteradio wrote:
  | Is it really 20% annotation error? I read it as 20% of the
  | errors were detected.
  | Errors could be some very small percent, and of those that had
  | errors, 20% were detected.
  | ebg13 wrote:
  | I think the submitted headline is wrong. The article says "we
  | found annotation errors on more than 20% of images". Maybe
  | dang could fix it.
  | groar wrote:
  | Agreed, my initial title is not accurate. It should say
  | "finds errors in 20% of annotations".
  | CydeWeys wrote:
  | Why aren't these data sets editable instead of static? Treat them
  | like a collaborative wiki or something (OpenStreetMap being the
  | closest fit) and allow everyone to submit improvements so that
  | all may benefit.
  |
  | I hope the people in this article had a way to contribute back
  | their improvements, and did so.
  | lmkg wrote:
  | One major use of the public datasets in the academic community
  | is to serve as a common reference when comparing new techniques
  | against the existing standard. A static baseline is desirable
  | for this task.
  |
  | You could maybe split the difference by having an "original" or
  | "reference" version, and a separate moving target that
  | incorporates crowdsourced improvements.
  | CydeWeys wrote:
  | It sounds like a revisioning system would help a lot. Have
  | a quarterly or annual release cycle or something, so that
  | when you want to compare performance across techniques, you
  | just train both of them to the same target (and ideally all
  | the papers coming out at roughly the same time would already
  | be using the same revision anyway).
  |
  | You'd always work with a versioned release when training
  | models, and you'd only typically work with HEAD when you were
  | specifically looking to correct flaws in the data (as the
  | authors in the linked article are).
  | 6gvONxR4sf7o wrote:
  | The datasets serve as benchmarks. You get an idea for a new
  | model that solves a problem current models have. These ideas
  | often don't pan out, so you need empirical evidence that it
  | works. To show that your model does better than previous
  | models, you need some task that your model and previous models
  | can share for both training and evaluation. It's more
  | complicated than that, but that's the gist.
  |
  | It would be so wasteful to have to retrain a dozen models that
  | require a month of GPU time each just to serve as baselines for
  | your new model...
  | hatmatrix wrote:
  | But you can have version numbers like with code and models.
  | barkingcat wrote:
  | That's not wasteful. That's correction.
  |
  | Is it wasteful to throw away a batch of food when 20% of it
  | has been found to contain the wrong substance, which ends
  | up causing disease?
  |
  | Isn't it even more wasteful to continue using unedited and
  | unverified data sets just because all the previous models
  | were trained on them, and thus we can no longer advance the
  | state of the research? It's a case of garbage in, garbage out.
  | 6gvONxR4sf7o wrote:
  | > By one estimate, the training time for AlphaGo cost $35
  | million [0]
  |
  | How about XLNet, which cost something like $30k-60k to train
  | [1]? GPT-2 is estimated to have been around the same [2],
  | while thankfully BERT only costs about $7k [3], unless of
  | course you're going to do any new hyperparameter tuning on
  | their models, which you of course will do on your own model.
  | Who cares about apples-to-apples comparisons?
  |
  | We're not talking about spending an extra couple of hours and
  | a little money on updated replication. We're talking about
  | an immediate overhead of tens to hundreds of thousands of
  | dollars per new paper.
  |
  | Tasks are updated over time already to take issues into
  | account, but not continuously as far as I know.
  |
  | [0] https://www.wired.com/story/deepminds-losses-future-artifici...
  |
  | [1] https://twitter.com/jekbradbury/status/1143397614093651969
  |
  | [2] https://news.ycombinator.com/item?id=19402666
  |
  | [3] https://syncedreview.com/2019/06/27/the-staggering-cost-of-t...
  | barkingcat wrote:
  | Yah, it is by no means wasteful for AlphaGo to throw away
  | all their training data and then re-train itself!
  |
  | That kind of ruthless experimentation is how AlphaGo was
  | able to exceed even itself. The willingness to say - all
  | these human games we've fed the computer? All these
  | terabytes of data? It's all meaningless! We're going to
  | throw it all away! We will have AlphaGo determine what is
  | good by playing games against itself!
  |
  | And I bet you that for the next iteration of AlphaGo, the
  | creators of this system will again delete their own data
  | and retrain when they have a better approach.
  |
  | If you don't "waste" your existing datasets (once you
  | realize the flaws in your data sets), you are being held
  | back by the sunk cost principle. You only have yourself
  | to blame when someone does train for the exact same
  | purposes, but with cleaner data.
  |
  | The person who has the cleanest source of training data
  | will win in deep learning.
  |
  | You're sabotaging yourself, in my opinion. $30k is nothing
  | when you're just sabotaging the training with faulty
  | data.
  | [deleted]
  | third_I wrote:
  | As an investor, $35m to train just about the pinnacle of
  | AI seems like a cheap, oh so cheap cost. I can't even buy
  | 1 freaking continental jet for that ticket, and there are
  | thousands of these babies flying (not as we speak, but
  | generally).
  |
  | I don't think you are fully cognizant yet of the
  | formidable scale of AI in the grander scheme of things,
  | as an industry, which is nowadays comparable to
  | transistors circa 1972 in terms of maturity. Long, long
  | ways to go before we sit on "reference" anything. Whether
  | architectures, protocols, models, test standards, it's a
  | Wild West as we speak.
  |
  | You make excellent points in principle, which are
  | important to keep in mind in guiding us all along the
  | way, but now is not the time to set things in stone. More
  | like the opposite.
  |
  | The fact of the matter is that someone will eventually
  | grab the old and new benchmarks, prove superiority in
  | both, and by that point the new is the one to beat since
  | it would be presumably error-free this time.
  | visarga wrote:
  | BERT is trained on unsupervised data. It's not the same
  | kind of model the article talks about.
  | [deleted]
  | p1esk wrote:
  | I'm actually glad it costs so much to train these models.
  | Great incentive to find more efficient algorithms. That's
  | how biological brains evolved.
  | lopmotr wrote:
  | The dataset is a controlled variable in an experiment, so
  | it has to be held constant. If you update your model and
  | the dataset for every trial (e.g. new hyperparameters or
  | new architecture), and find it performs better, you won't
  | know if the model is really better or just the dataset.
  | lmkg wrote:
  | The thing is, the value as a baseline doesn't actually
  | change that much for being 20% garbage. A bit
  | counter-intuitive, but basically accepted as true in
  | several fields.
  |
  | The comparisons are all relative accuracy, not absolute
  | accuracy. And the comparison is _fair_. The new technique
  | is receiving the same part-garbage input that the old
  | techniques were trained on. For the most part, the better
  | technique will still tend to do better unless there's
  | specifically something about it that makes it more
  | sensitive to labeling errors.
  |
  | And frankly, a percentage of junk has some advantages.
  | Real-world data is a pile of ass, so it's useful for
  | academic models to require robustness.
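A back-of-the-envelope sketch of lmkg's point, assuming the test labels are wrong uniformly at random across classes. The accuracies, noise rate and class count below are made-up numbers, not figures from the article; the only claim illustrated is that uniform label noise lowers both models' measured scores without changing which one ranks higher.

    # Sketch: expected accuracy *as measured against noisy labels*.
    # A prediction agrees with the recorded label either when both the
    # model and the label are right, or when the label is wrong and the
    # model happens to make the same (assumed uniformly random) mistake.
    def measured_accuracy(true_acc, noise_rate, num_classes):
        agree_on_correct_label = (1 - noise_rate) * true_acc
        agree_on_wrong_label = noise_rate * (1 - true_acc) / (num_classes - 1)
        return agree_on_correct_label + agree_on_wrong_label

    for name, true_acc in (("old technique", 0.75), ("new technique", 0.80)):
        score = measured_accuracy(true_acc, noise_rate=0.20, num_classes=20)
        print(f"{name}: true {true_acc:.2f} -> measured {score:.3f}")

Both measured scores drop, but the ordering survives, which is why a part-garbage benchmark can still act as a fair referee. ethbro's worry below is the caveat: when the gap between techniques is only a couple of percent and the noise is concentrated on particular classes rather than uniform, the noise can start to decide the ranking.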
  | ethbro wrote:
  | I thought SOTA was still a few % difference?
  |
  | It seems worrisome that the few percent might be making a
  | coin flip right-randomly instead of wrong-randomly on a
  | mislabelled subgroup of data...
  | roosterdawn wrote:
  | What you're saying is that it's worth it to lie because it's
  | too expensive to give a truthful answer. That is something
  | that your customers likely would not agree with.
  | [deleted]
  | sdenton4 wrote:
  | It also potentially gives every paper N replication problems
  | to solve, in addition to just the GPU time. I would have to
  | figure out HOW to retrain all of these models on the current
  | form of the dataset... Which is fine for an occasional
  | explicit replication study, but terrible if everyone has to
  | do it.
  |
  | I think it's probably better to have a (say) yearly release
  | of the dataset, with results of some benchmark models
  | released alongside the new version.
  |
  | This is similar to how Common Voice is handling the problem:
  | it's a crowd-sourced, constantly growing dataset, which is
  | awesome if you want to train on as much as possible for
  | production models. You can get the whole current version any
  | time, but they also have releases with a static fileset and
  | train/test split, which should be better for research.
  | polm23 wrote:
  | Multiple reasons, but to name a few:
  |
  | - Don't want to deal with vandalism
  |
  | - Hosting static data is dramatically easier than making a
  | public editing interface
  |
  | - You want reference versions of the dataset for papers to
  | refer to so that results are comparable. Sometimes this is used
  | as a justification for not fixing completely broken data, like
  | with Fasttext.
  |
  | https://github.com/facebookresearch/fastText/issues/710
  |
  | - Building on the previous point, large datasets like this
  | don't play nice with Git. There are lots of "git for data"
  | things but none of them are very mature, and most people don't
  | spend time trying to figure something out.
  | [deleted]
  | xiphias2 wrote:
  | One problem with correcting the benchmark datasets is that it's
  | important for the algorithms to be robust to labelling errors
  | as well. But having multiple versions sounds important anyway.
  | seveibar wrote:
  | I'm working on this [1]; my theory is that the lack of a good
  | IDE (rather than a simple crowdsourcing interface) is the
  | reason why it hasn't been done.
  |
  | Imagine if GitHub had an integrated IDE for editing large
  | datasets. Also see Dolt, which is doing good work here.
  |
  | [1] https://github.com/UniversalDataTool/universal-data-tool
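A small sketch of the versioned-release idea CydeWeys and sdenton4 describe above: freeze a dataset revision by hashing its files into a manifest that papers can cite, while the live copy keeps absorbing label fixes. The directory name, version string and file layout are placeholders, not any real COCO/VOC distribution mechanism.

    # Sketch: pin an exact dataset revision with a content manifest.
    import hashlib
    import json
    import pathlib

    def build_manifest(dataset_dir, version):
        """Hash every file under dataset_dir so the revision is reproducible."""
        dataset_dir = pathlib.Path(dataset_dir)
        files = {}
        for path in sorted(p for p in dataset_dir.rglob("*") if p.is_file()):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            files[str(path.relative_to(dataset_dir))] = digest
        return {"version": version, "files": files}

    # Training code would record the manifest version next to its results,
    # so later label corrections don't silently change the benchmark.
    dataset_dir = pathlib.Path("coco_cleaned")  # placeholder path
    if dataset_dir.exists():
        manifest = build_manifest(dataset_dir, version="v2020.04")
        out = pathlib.Path("manifest_v2020.04.json")
        out.write_text(json.dumps(manifest, indent=2))

Tools like DVC, or the Dolt project seveibar mentions above, aim at essentially this plus storage and diffing; the core requirement for comparable papers is simply a citable, immutable revision.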
  | groar wrote:
  | Using simple techniques, they found out that popular open source
  | datasets like VOC or COCO contain up to 20% annotation errors.
  | By manually correcting those errors, they got an average error
  | reduction of 5% for state-of-the-art computer vision models.
  | jessermeyer wrote:
  | Garbage in, garbage out.
  | m0zg wrote:
  | An idea on how this could work: repeatedly re-split the dataset
  | (to cover all of it), and re-train a detector on the splits,
  | then at the end of each training cycle surface validation
  | frames with the highest computed loss (or some other metric
  | more directly derived from bounding boxes, such as the number
  | of high-confidence "false" positives, which could be instances
  | of under-labeling). That's what I do on noisy, non-academic
  | datasets, anyway.
  | jontro wrote:
  | Weird behaviour on pinch to zoom (MacBook). It scrolls instead
  | of zooming and when swiping back nothing happens.
  |
  | Another example of why you should never mess with the defaults
  | unless strictly necessary.
  | rathel wrote:
  | Nothing is said, however, about _how_ the errors are detected.
  | Can an ML expert chime in?
  | ArnoVW wrote:
  | My guess would be using some sort of active learning. In other
  | words: 1) building a model using the data set, 2) making
  | predictions using the training data, 3) finding the cases where
  | the model is the most confused (difference in probability
  | between classes is low), 4) raising those cases to humans.
  |
  | https://en.wikipedia.org/wiki/Active_learning_(machine_learn...
  | thibaut-duguet wrote:
  | Hi rathel, I'm a Product Manager at Deepomatic and I have been
  | leading the study in question here. To detect the errors, we
  | trained a model (with a different neural network architecture
  | than the 6 listed in the post), and we then have a matching
  | algorithm that highlights all bounding boxes that were either
  | annotated but not predicted (False Negative), or predicted but
  | not annotated (False Positive). Those potential errors are also
  | sorted based on an error score so that the most obvious errors
  | come first. Happy to answer any other question you may have!
  | rathel wrote:
  | Thank you for the explanation.
  | [deleted]
  | liquidify wrote:
  | Curious if you could find errors by comparing the results
  | from the different models. Places where models disagree with
  | each other more often would be areas that I would want to
  | target for error checking.
  | thaumasiotes wrote:
  | > Places where models disagree with each other more often
  | would be areas that I would want to target for error
  | checking.
  |
  | This is a great idea if your goal is to maximize the rate
  | at which things you look at turn out to be errors. (On at
  | least one side.)
  |
  | But it's guaranteed to miss cases where every model makes
  | the same inexplicable-to-the-human-eye mistake, and those
  | cases would appear to be especially interesting.
  | Zenst wrote:
  | Were the corrected datasets larger or smaller than the
  | originals?
  |
  | It would also be interesting to see these improved datasets
  | run through crash simulations alongside the existing datasets
  | to see how they handle. Though I'm not sure how you would go
  | about that beyond approaching current providers of such cars
  | for data to work through, and I suspect they may be less open
  | to admitting flaws, which may be a stumbling block.
  |
  | Certainly makes you wonder how far we can optimise such
  | datasets to get better results. I know some ML datasets are a
  | case of humans fine tuning and going through examples and
  | classifying them, and I wonder how much that skews or affects
  | error rates, as we all know humans err.
  | thibaut-duguet wrote:
  | Hi Zenst, to answer your first question, we had both
  | bounding boxes added and removed, and depending on the
  | dataset, the main type of error was different (I'd say that
  | overall it was more often objects that were forgotten,
  | especially small objects).
  |
  | It would indeed be very interesting to see the impact of
  | those improved datasets on driving, which is ultimately the
  | task that is automated for cars. We've been working on many
  | projects at Deepomatic, not only related to autonomous cars,
  | and we did see some concrete impact of cleaning the
  | datasets beyond performance metrics.
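A minimal sketch of the matching step thibaut-duguet describes above, not Deepomatic's actual implementation: run a trained detector over the images, match its predicted boxes against the dataset's annotations with an IoU threshold, and flag whatever fails to match. The Box structure, the greedy matching and the 0.5 threshold are assumptions made for illustration; the error score used to rank candidates is not described in detail in the post.

    # Sketch: flag annotation errors by matching predictions to annotations.
    from dataclasses import dataclass

    @dataclass
    class Box:
        x1: float
        y1: float
        x2: float
        y2: float
        label: str
        score: float = 1.0  # annotations get a dummy confidence

    def iou(a, b):
        ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
        ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda box: (box.x2 - box.x1) * (box.y2 - box.y1)
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    def spot_errors(annotations, predictions, iou_thr=0.5):
        """Greedy matching; anything left unmatched is a candidate error."""
        matched_ann, matched_pred = set(), set()
        for pi, pred in sorted(enumerate(predictions), key=lambda t: -t[1].score):
            for ai, ann in enumerate(annotations):
                if ai in matched_ann or ann.label != pred.label:
                    continue
                if iou(ann, pred) >= iou_thr:
                    matched_ann.add(ai)
                    matched_pred.add(pi)
                    break
        # Predicted but not annotated ("false positives" in the post's sense):
        # often an object the annotators simply missed.
        false_positives = sorted(
            (p for i, p in enumerate(predictions) if i not in matched_pred),
            key=lambda p: -p.score)  # most confident, i.e. most obvious, first
        # Annotated but not predicted ("false negatives"): either the model
        # missed it or the annotation itself is wrong or badly drawn.
        false_negatives = [a for i, a in enumerate(annotations) if i not in matched_ann]
        return false_positives, false_negatives

m0zg's comment above is essentially the same idea wrapped in a cross-validation loop, so the detector never scores images it was trained on, and ArnoVW's active-learning guess corresponds to the ranking step: show the highest-scoring disagreements to a human first.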
  | alexchamberlain wrote:
  | I.e. you get someone to check where the model and the
  | annotations disagree.
  | captain_price7 wrote:
  | Plus we'll have to register simply to see a few examples of
  | mislabeling... that was disappointing
  | thibaut-duguet wrote:
  | Hi captain_price7, I've added screenshots of errors in the
  | blog post so that you have an idea of the errors we spotted.
  | Let me know what you think of them.
  | thaumasiotes wrote:
  | A couple of notes on those screenshots:
  |
  | - In the cars-on-the-bridge image, the red bounding box for
  | the semitruck in the oncoming lanes is too small, with its
  | upper bound just above the top of the semi's windshield,
  | ignoring the much taller roof and towed container.
  |
  | - In the same image, there are red bounding boxes around
  | cars that exist, and also red bounding boxes around non-cars
  | that don't exist. If false positives and false
  | negatives are going to be represented in the same picture,
  | it'd be nice to use different colors for them, so the
  | viewer can tell whether the error was identified correctly
  | or spuriously.
  |
  | - I have trouble understanding the "bus" screenshot. The
  | caption says "(green pictures are valid errors) - The pink
  | dotted boxes are objects that have not been labelled but
  | that our error spotting algorithm highlighted." In other
  | words, the green-highlighted pictures are false negatives
  | considered from the perspective of the original data set,
  | and the red-highlighted pictures are true negatives. Or
  | alternatively, the green-highlighted pictures are true
  | positives from the perspective of the error-spotting
  | algorithm, and the red-highlighted pictures are false
  | positives. What confuses me is that all 9 pictures are
  | labeled "false positive" by the tabbing at the top of the
  | screenshot.
  | benibela wrote:
  | These things are why I stopped doing computer vision after my
  | master's thesis.
  | fwip wrote:
  | The title here seems wrong. Suggested change:
  |
  | "Cleaning algorithm finds 20% of errors in major image
  | recognition datasets" -> "Cleaning algorithm finds errors in
  | 20% of annotations in major image recognition datasets."
  |
  | We don't know if the found errors represent 20%, 90% or 2% of
  | the total errors in the dataset.
  | [deleted]
  | groar wrote:
  | Yes, agreed with that! I can't change the title, unfortunately.
  | magicalhippo wrote:
  | > Create an account on the Deepomatic platform with the voucher
  | code "SPOT ERRORS" to visualize the detected errors.
  |
  | Nice ad.
  | thibaut-duguet wrote:
  | Our platform is actually designed for enterprise companies, so
  | we don't provide open access, unfortunately.
  | scribu wrote:
  | I signed up and still couldn't see the errors.
  |
  | I just see 3 datasets with generic annotations.
  | thibaut-duguet wrote:
  | Hi scribu, the process is actually a bit complicated, but let
  | me explain it to you. Once you are on a dataset, click
  | on the label that you want and use the slider at the top
  | right corner of the page to switch modes (we call it smart
  | detection). You should then be able to access three tabs,
  | and the errors are listed in the False Positive and False
  | Negative tabs (I've added a screenshot to the blog post so
  | that you can make sure you are in the right place). Let me
  | know if you have any problem, thanks!
  | scribu wrote:
  | Thanks, I can see them now.
  | magicalhippo wrote:
  | Still, couldn't you have included an example or two in the
  | article to illustrate the kind of errors we're talking
  | about?
___________________________________________________________________
(page generated 2020-04-16 23:00 UTC)