[HN Gopher] Label a Dataset with a Few Lines of Code
       ___________________________________________________________________
        
       Label a Dataset with a Few Lines of Code
        
       Author : ulrikhansen54
       Score  : 17 points
        Date   : 2021-01-18 21:21 UTC (1 hour ago)
        
 (HTM) web link (eric-landau.medium.com)
 (TXT) w3m dump (eric-landau.medium.com)
        
       | Imnimo wrote:
       | I'm not really convinced this would work in practice. The trick
       | seems to depend on the fact that the dataset is a sequence of
       | frames of the same object shot from slightly different angles.
       | But that's a terrible dataset - it might work for training a toy
       | proof-of-concept, but if you actually wanted to do calorie
       | estimation in the wild, you'd need a much more varied (and
       | larger) training set. And once you have that, you lose the
       | properties that made this labelling approach viable in the first
       | place.
        
         | gharman wrote:
         | This reminds me of Snorkel (though unclear from the article if
         | they're using Snorkel's trick of aggregating many weak
         | heuristics). It can be made to work even in the real world. The
         | rub is that coming up with these programmatic labelers is
         | easier said than done especially for complex data.
         | 
         | It works well if a domain expert can say something without
         | "cheating" and looking at the data like "put a box around round
         | red objects because those are always apples". But in practice
         | people tend to cheat and look at the data first, and you end up
         | with humans trying to emulate ML, poorly.
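A minimal sketch of the Snorkel-style trick mentioned above: several weak heuristic labelling functions each vote (or abstain) on an example, and the votes are aggregated, here by simple majority vote rather than Snorkel's learned label model. The features, labellers, and example data are hypothetical illustrations, not from the article.

```python
# Snorkel-style weak supervision sketch: heuristic labellers vote or abstain,
# and a majority vote over non-abstaining labellers produces the weak label.
from collections import Counter

ABSTAIN = None  # a labeller may decline to vote on an example

def lf_round_and_red(example):
    # "put a box around round red objects because those are always apples"
    if example["colour"] == "red" and example["roundness"] > 0.8:
        return "apple"
    return ABSTAIN

def lf_small_and_yellow(example):
    if example["colour"] == "yellow" and example["size"] < 5:
        return "lemon"
    return ABSTAIN

def lf_elongated(example):
    if example["aspect_ratio"] > 2.0:
        return "banana"
    return ABSTAIN

LABELLERS = [lf_round_and_red, lf_small_and_yellow, lf_elongated]

def weak_label(example):
    """Majority vote over non-abstaining labellers; None if all abstain."""
    votes = [lf(example) for lf in LABELLERS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

example = {"colour": "red", "roundness": 0.9, "size": 7, "aspect_ratio": 1.0}
print(weak_label(example))  # -> apple
```

Snorkel proper replaces the majority vote with a generative model that weights labellers by their estimated accuracies and correlations, which is what makes aggregating many *weak* heuristics work better than any single one.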
        
           | eric_landau wrote:
            | Definitely easier said than done, but the process at least
            | makes labelling interesting. Sometimes you run into
            | roadblocks where you can't get past having a human do some
            | element of the labelling, but once you do have a few
            | algorithmic strategies that work reasonably well on a
            | representative sample of your data, you can usually scale
            | them pretty effectively to the rest of it.
        
         | eric_landau wrote:
          | Hi Imnimo, I wrote the article and definitely understand your
          | concerns. The point is not that the specific steps I took will
          | work in general for most datasets, but the overall idea of
          | using a more data-science-y approach to labelling rather than
          | just blindly throwing your data at a workforce.
         | 
         | A more varied dataset will require additional strategies. We
         | have done this type of thing with various datasets and what
         | normally works is a combination of some vertical models,
         | heuristics specific to the dataset, classical computer vision
         | techniques, and some human label seeding/correction.
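The "human label seeding/correction" part of the mix above can be sketched as a simple triage loop: an algorithmic labeller handles the examples it is confident about, and routes the rest to a human review queue. The labeller, confidence scores, and threshold below are hypothetical stand-ins, not the author's actual pipeline.

```python
# Triage sketch: confident algorithmic labels are accepted automatically,
# low-confidence examples are sent to human annotators.
def algorithmic_label(example):
    """Return (label, confidence); stands in for any heuristic or model."""
    if example["colour"] == "red" and example["roundness"] > 0.8:
        return "apple", 0.95
    return "unknown", 0.3

def triage(dataset, threshold=0.9):
    auto, needs_human = [], []
    for ex in dataset:
        label, conf = algorithmic_label(ex)
        if conf >= threshold:
            auto.append((ex, label))   # accepted without human effort
        else:
            needs_human.append(ex)     # queued for manual annotation
    return auto, needs_human

data = [
    {"colour": "red", "roundness": 0.9},
    {"colour": "green", "roundness": 0.4},
]
auto, manual = triage(data)
print(len(auto), len(manual))  # -> 1 1
```

Human corrections on the `manual` queue can then seed better heuristics or models, shrinking the queue over successive passes.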
        
         | Q6T46nT668w6i3m wrote:
         | A common mistake in applied computer vision is to use a
          | classical method (e.g. distance-based watershed) to buoy your
         | training set. You'll end up with a computationally expensive
         | method (e.g. a region-based convolutional neural network)
         | that's a poor replication of the classical method. The major
         | advantage of learning-based methods is to go _beyond_ classical
         | performance and make inferences comparable to the manually
         | annotated image. There's no shortcut.
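For concreteness, the classical step named above (distance-based watershed) starts from a distance transform of a binary mask, whose local maxima seed the segmentation. A minimal pure-Python two-pass Manhattan distance transform on a toy mask, as an illustration only (real pipelines would use an exact Euclidean transform from an image library):

```python
# Two-pass (chamfer-style) Manhattan distance transform on a binary mask:
# the distance from each foreground cell (1) to the nearest background cell (0).
def distance_transform(mask):
    INF = 10**9
    h, w = len(mask), len(mask[0])
    d = [[0 if mask[y][x] == 0 else INF for x in range(w)] for y in range(h)]
    for y in range(h):                 # forward pass: top-left neighbours
        for x in range(w):
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    for y in range(h - 1, -1, -1):     # backward pass: bottom-right neighbours
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d

mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
d = distance_transform(mask)
print(d[1][2])  # -> 1
```

The comment's point stands regardless of the implementation: labels produced by a rule like this cap what the trained model can learn, so the expensive network ends up imitating the cheap heuristic.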
        
        | florin4- wrote:
        | If you could just do "algorithmic labelling", why do you need
        | to go to all that trouble of making a dataset and training a
        | model in the first place? Why not just use this "algorithmic
        | labelling" thing?
        | 
        | Because that's not how this works; that's not how any of this
        | works.
        
       ___________________________________________________________________
       (page generated 2021-01-18 23:00 UTC)