[HN Gopher] Label a Dataset with a Few Lines of Code
___________________________________________________________________
 
Label a Dataset with a Few Lines of Code
 
Author : ulrikhansen54
Score  : 17 points
Date   : 2021-01-18 21:21 UTC (1 hour ago)
 
(HTM) web link (eric-landau.medium.com)
(TXT) w3m dump (eric-landau.medium.com)
 
| Imnimo wrote:
| I'm not really convinced this would work in practice. The trick
| seems to depend on the fact that the dataset is a sequence of
| frames of the same object shot from slightly different angles.
| But that's a terrible dataset - it might work for training a toy
| proof-of-concept, but if you actually wanted to do calorie
| estimation in the wild, you'd need a much more varied (and
| larger) training set. And once you have that, you lose the
| properties that made this labelling approach viable in the first
| place.
 
| gharman wrote:
| This reminds me of Snorkel (though it's unclear from the article
| whether they're using Snorkel's trick of aggregating many weak
| heuristics). It can be made to work even in the real world. The
| rub is that coming up with these programmatic labelers is easier
| said than done, especially for complex data.
| 
| It works well if a domain expert can say something like "put a
| box around round red objects, because those are always apples"
| without "cheating" by looking at the data first. But in practice
| people tend to cheat and look at the data first, and you end up
| with humans trying to emulate ML, poorly.
 
| eric_landau wrote:
| Definitely easier said than done, but the process at least makes
| labelling interesting. Sometimes you run into roadblocks where
| you can't get past having a human do some element of the
| labelling. But once you have a few algorithmic strategies that
| work reasonably well on a representative sample of your data,
| you can usually scale them pretty effectively to the rest of it.
 
| eric_landau wrote:
| Hi Imnimo, I wrote the article and definitely understand your
| concerns. The point is not that the specific steps I took will
| generalise to most datasets, but rather the overall idea of
| taking a more data-science-y approach to labelling instead of
| blindly throwing your data at a workforce.
| 
| A more varied dataset will require additional strategies. We
| have done this type of thing with various datasets, and what
| normally works is a combination of some vertical models,
| heuristics specific to the dataset, classical computer vision
| techniques, and some human label seeding/correction.
 
| Q6T46nT668w6i3m wrote:
| A common mistake in applied computer vision is to use a
| classical method (e.g. distance-based watershed) to buoy your
| training set. You'll end up with a computationally expensive
| method (e.g. a region-based convolutional neural network)
| that's a poor replication of the classical method. The major
| advantage of learning-based methods is to go _beyond_ classical
| performance and make inferences comparable to the manually
| annotated image. There's no shortcut.
 
| florin4- wrote:
| If you could just do "algorithmic labelling", why would you need
| to go to all the trouble of making a dataset and training a
| model in the first place? Why not just use the "algorithmic
| labelling" thing on its own?
| 
| Because that's not how this works; that's not how any of this
| works.
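___________________________________________________________________
 
As a concrete illustration of the kind of programmatic labeler
gharman describes ("put a box around round red objects"), here is
a minimal sketch. It assumes OpenCV is installed; the function
name, colour thresholds, and circularity cutoff are illustrative
guesses, not anything taken from the article or the thread.
 
    import cv2
    import math
 
    def label_round_red_objects(image_bgr, min_area=500,
                                min_circularity=0.7):
        """Return bounding boxes (x, y, w, h) around roundish red
        regions, for use as weak labels. Hypothetical heuristic."""
        hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
        # Red wraps around the hue axis in HSV, so combine two
        # ranges on either side of the wrap-around point.
        lower = cv2.inRange(hsv, (0, 100, 80), (10, 255, 255))
        upper = cv2.inRange(hsv, (170, 100, 80), (180, 255, 255))
        mask = cv2.bitwise_or(lower, upper)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = []
        for contour in contours:
            area = cv2.contourArea(contour)
            perimeter = cv2.arcLength(contour, closed=True)
            if area < min_area or perimeter == 0:
                continue
            # Circularity = 4*pi*area / perimeter^2; it is 1.0 for
            # a perfect circle and falls off for other shapes.
            circularity = 4 * math.pi * area / perimeter ** 2
            if circularity >= min_circularity:
                boxes.append(cv2.boundingRect(contour))
        return boxes
 
In a Snorkel-style setup, several such noisy heuristics would be
aggregated rather than any single one being trusted on its own.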
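The article's own trick, as Imnimo summarises it, leans on a
sequence of frames of the same object. Under that assumption, a
label-propagation sketch might look like the following; it uses
plain template matching from core OpenCV, and the helper name and
template-refresh strategy are hypothetical, not the article's
actual method.
 
    import cv2
 
    def propagate_box(frames, seed_box):
        """frames: list of grayscale images; seed_box is a
        hand-drawn (x, y, w, h) on frames[0]. Returns one box
        per frame."""
        x, y, w, h = seed_box
        template = frames[0][y:y + h, x:x + w]
        boxes = [seed_box]
        for frame in frames[1:]:
            scores = cv2.matchTemplate(frame, template,
                                       cv2.TM_CCOEFF_NORMED)
            # maxLoc is the top-left corner of the best match.
            _, _, _, (best_x, best_y) = cv2.minMaxLoc(scores)
            boxes.append((best_x, best_y, w, h))
            # Refresh the template from the new frame so slow
            # appearance drift between frames is tolerated.
            template = frame[best_y:best_y + h,
                             best_x:best_x + w]
        return boxes
 
In practice you would pair this with periodic human spot checks
and correction, as eric_landau notes, since the template drifts
as the viewpoint changes.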