[HN Gopher] Illustrated FixMatch for semi-supervised learning
       ___________________________________________________________________
        
       Illustrated FixMatch for semi-supervised learning
        
       Author : amitness
       Score  : 203 points
       Date   : 2020-04-03 14:16 UTC (8 hours ago)
        
 (HTM) web link (amitness.com)
 (TXT) w3m dump (amitness.com)
        
       | hadsed wrote:
       | The cold hard reality of machine learning is that most useful
       | data isn't readily available to just be collected. Semi-
       | supervised and weakly supervised learning, data augmentation,
       | multi-task learning, these are the things that will enable
       | machine learning for the majority of companies out there who need
       | to build datasets and potentially leverage domain expertise
       | somehow to bootstrap intelligent features in their apps. This is
       | great work in that direction for computer vision.
       | 
       | Even the giants are recognizing this fact and are leveraging it
       | to great effect. Some keywords to search for good papers and
       | projects: Overton, Snorkel, Snorkel Metal
        
         | najarvg wrote:
         | Also Flying Squid, another interesting project from Stanford -
         | http://hazyresearch.stanford.edu/flyingsquid
        
       | starpilot wrote:
       | I wish there was a way to augment data as easily for free text,
       | and other business data. I always see these few-shot learning
       | papers for images, I suspect because it's easy to augment image
       | datasets and because image-recognition is interesting to
       | laypeople. The vast majority of data we deal with in business is
       | text/numerical which is much harder to use in these approaches.
        
         | amitness wrote:
          | Agree with you on this. For text data, there was a paper called
          | "UDA" (https://arxiv.org/abs/1904.12848) that did some work in
          | this direction.
         | 
          | They augmented text by using backtranslation. The basic idea
          | is that you take text in English, translate it to some other
          | language, say French, and then translate the French text back
          | to English. Usually, you get back an English sentence that is
          | different from the original English sentence but has the same
          | meaning. Another approach they use to augment is to randomly
          | replace stopwords / low TF-IDF words (intuitively, very
          | frequent words like a, an, the) with random words.
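          | 
          | For a concrete feel, here is a rough backtranslation sketch
          | using the Helsinki-NLP Marian translation models on Hugging
          | Face (the model names and example sentence are my own
          | assumptions, not from the UDA paper):
          | 
          |     from transformers import MarianMTModel, MarianTokenizer
          | 
          |     en_fr = "Helsinki-NLP/opus-mt-en-fr"
          |     fr_en = "Helsinki-NLP/opus-mt-fr-en"
          |     tok_a = MarianTokenizer.from_pretrained(en_fr)
          |     mod_a = MarianMTModel.from_pretrained(en_fr)
          |     tok_b = MarianTokenizer.from_pretrained(fr_en)
          |     mod_b = MarianMTModel.from_pretrained(fr_en)
          | 
          |     def backtranslate(sentence):
          |         # English -> French
          |         fr_ids = mod_a.generate(
          |             **tok_a(sentence, return_tensors="pt"))
          |         french = tok_a.decode(fr_ids[0],
          |                               skip_special_tokens=True)
          |         # French -> English (a paraphrase of the input)
          |         en_ids = mod_b.generate(
          |             **tok_b(french, return_tensors="pt"))
          |         return tok_b.decode(en_ids[0],
          |                             skip_special_tokens=True)
          | 
          |     print(backtranslate("The movie was surprisingly good."))
          | 
          | Sampling the translations with a temperature, as UDA does,
          | gives more varied paraphrases.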
         | 
          | You will find implementations of UDA on GitHub if you want to
          | try them out.
         | 
          | I am learning these existing image semi-supervised techniques
          | right now, and the plan is to do research on how we can
          | transfer those ideas to text data. Let's see how it goes.
        
           | codegladiator wrote:
           | Haha that's how we used to generate blog spam content and
           | comments :p
        
             | [deleted]
        
       | edsykes wrote:
       | I had a read through this and I couldn't really tell if there was
       | something novel here?
       | 
        | I understand that perturbations and generating new examples from
        | labelled examples is a pretty normal part of the process when you
        | only have a limited number of examples available.
        
         | amitness wrote:
          | The novelty is in applying two perturbations to available
          | _unlabeled images_ and using them as part of training. This is
          | different from what you are describing, which is applying
          | augmentations to labeled images to increase the dataset size.
        
           | daenz wrote:
           | My immediate question was "how do you use unlabeled images
           | for training?" But then I decided to read the paper :) The
           | answer is:
           | 
            | Two different perturbations of the same image should get the
            | same predicted label from the model, even if it doesn't know
            | what the correct label is. That constraint can be used in
            | training.
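            | 
            | A minimal consistency-regularization sketch of that idea
            | in PyTorch (closer to the older Pi-model than to FixMatch
            | itself; model and augment are placeholders):
            | 
            |     import torch.nn.functional as F
            | 
            |     def consistency_loss(model, u_images, augment):
            |         # two independent random perturbations of
            |         # the same unlabeled images
            |         p1 = F.softmax(model(augment(u_images)), -1)
            |         p2 = F.softmax(model(augment(u_images)), -1)
            |         # penalize disagreement between predictions
            |         return F.mse_loss(p1, p2)
            | 
            | FixMatch replaces the squared error with a pseudo-label
            | cross-entropy plus a confidence threshold, as discussed
            | below.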
        
             | computerex wrote:
             | What if the model's prediction is wrong with high
             | confidence? What if the cat is labeled as a dog for both
             | perturbations? Then wouldn't the system train against the
             | wrong label?
        
               | amitness wrote:
                | Nope, because of the way it works. In the beginning,
                | when the model is being trained on the labeled data, it
                | will make many mistakes, so its confidence for either
                | cat or dog will be low. In that case, unlabeled data
                | are not used at all.
                | 
                | As training progresses, the model gets better on the
                | labeled data, and so it can start predicting with high
                | confidence on unlabeled images that are trivial,
                | similar-looking, or from the same distribution as the
                | labeled data. So unlabeled images gradually start
                | being used as part of training, and more and more
                | unlabeled data are added.
                | 
                | The combined loss function and the curriculum-learning
                | part of the post cover this in more detail.
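                | 
                | A rough PyTorch sketch of how the threshold gates the
                | unlabeled loss (0.95 is the threshold value from the
                | paper; model, weak and strong are placeholders):
                | 
                |     import torch
                |     import torch.nn.functional as F
                | 
                |     def unlabeled_loss(model, u, weak, strong,
                |                        tau=0.95):
                |         # pseudo-label from the weak view
                |         with torch.no_grad():
                |             p = F.softmax(model(weak(u)), -1)
                |             conf, pseudo = p.max(dim=-1)
                |             # low confidence -> masked out
                |             mask = (conf >= tau).float()
                |         # strong view trains against it
                |         logits = model(strong(u))
                |         ce = F.cross_entropy(logits, pseudo,
                |                              reduction="none")
                |         return (ce * mask).mean()
                | 
                | Early in training almost nothing clears the threshold,
                | so this term contributes little; as the model improves,
                | more unlabeled images pass the mask and effectively
                | join the training set.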
        
       | sireat wrote:
        | It is not the same thing, but it kind of reminds me of my naive
        | and obvious (meaning something that came up while drinking
        | beer) idea of generating a bunch of variations of your labeled
        | data in cases when you do not have enough.
        | 
        | Let's say you only have one image of a dog; you generate a
        | bunch of color variations, sharpness adjustments, flips,
        | transforms, etc. Voila, you have 256 images of the same dog.
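        | 
        | Something like that is easy to do with torchvision, for
        | instance (a quick sketch; "dog.jpg" stands in for your one
        | labeled photo):
        | 
        |     from PIL import Image
        |     from torchvision import transforms
        | 
        |     dog = Image.open("dog.jpg")
        | 
        |     # random flip, rotation, color and crop variations
        |     augment = transforms.Compose([
        |         transforms.RandomHorizontalFlip(),
        |         transforms.RandomRotation(degrees=15),
        |         transforms.ColorJitter(brightness=0.4,
        |                                contrast=0.4,
        |                                saturation=0.4),
        |         transforms.RandomResizedCrop(224,
        |                                      scale=(0.8, 1.0)),
        |     ])
        | 
        |     variants = [augment(dog) for _ in range(256)]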
       | 
       | EDIT: I noticed that this is definitely a common idea as others
       | have already pointed out.
        
       | manthideaal wrote:
        | I wonder if a two-step process could work better than this:
        | first train a variational autoencoder (or simply an autoencoder)
        | on all the images, then use it to train a classifier on the
        | labeled samples.
       | 
        | In (1) there is a full example of the two-step strategy, but it
        | uses more labeled data to obtain 92% accuracy. Could someone try
        | changing the second part to use only ten labels for the
        | classification step and share the results?
       | 
       | (1) https://www.datacamp.com/community/tutorials/autoencoder-
       | cla...
       | 
        | Edited: I found a deep analysis in (2); in short, for CIFAR-10
        | the VAE semi-supervised learning approach gives poor results,
        | but the author did not use augmentation!
       | 
       | (2) http://bjlkeng.github.io/posts/semi-supervised-learning-
       | with...
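        | 
        | For concreteness, the two-step idea might look roughly like this
        | in PyTorch (a bare-bones sketch with CIFAR-10 sizes; the
        | training loops are left out):
        | 
        |     import torch
        |     import torch.nn as nn
        |     import torch.nn.functional as F
        | 
        |     # step 1: autoencoder trained on ALL images
        |     # (labeled + unlabeled), reconstruction loss only
        |     encoder = nn.Sequential(
        |         nn.Flatten(),
        |         nn.Linear(32 * 32 * 3, 256), nn.ReLU(),
        |         nn.Linear(256, 64))
        |     decoder = nn.Sequential(
        |         nn.Linear(64, 256), nn.ReLU(),
        |         nn.Linear(256, 32 * 32 * 3))
        | 
        |     def recon_loss(x):
        |         return F.mse_loss(decoder(encoder(x)),
        |                           x.flatten(1))
        | 
        |     # step 2: freeze the encoder, fit a small head
        |     # on only the handful of labeled examples
        |     head = nn.Linear(64, 10)
        | 
        |     def classifier_loss(x_lab, y_lab):
        |         with torch.no_grad():
        |             z = encoder(x_lab)
        |         return F.cross_entropy(head(z), y_lab)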
        
       | jonpon wrote:
        | Great summary! Reminds me a lot of Leon Bottou's work on using
        | deep learning to learn causal invariant representations. (Video:
        | https://www.youtube.com/watch?v=lbZNQt0Q5HA)
       | 
       | We can view the augmentations of the image as "interventions"
       | forcing the model to learn an invariant representation of the
       | image.
       | 
       | Although the blog post did not frame it as this type of problem
       | (not sure if the paper did), I think it can definitely be seen as
       | such and is really promising.
        
         | amitness wrote:
          | Interesting, thank you for sharing that. It reminds me of an
          | approach called "PIRL" by Facebook AI. They framed the problem
          | as learning invariant representations. You might find it
          | interesting.
         | 
         | https://amitness.com/2020/03/illustrated-pirl/
        
       | fermienrico wrote:
       | I don't know much about ML/Deep-Learning and I have a burning
       | question:
       | 
        | Say we have 10 images as a starting point. Then we create 10,000
        | images from those 10 by adding noise and filters, flipping them,
        | skewing them, distorting them, etc. Isn't the underlying data
        | the same (by some formal definition, say Shannon information
        | entropy)? Would that actually improve neural networks?
       | 
       | I've always wondered. Is it possible to generate infinite data
       | and get almost perfect neural network accuracy?
        
         | arketyp wrote:
          | For a standard convolutional net, the low-entropy formulation
          | of, for instance, rotation is not immediately accessible, which
          | makes rotation a viable data augmentation and regularization
          | strategy. Some designs try to account for natural symmetries by
          | incorporating the related transformations as priors in the
          | architecture.
        
         | ogrisel wrote:
         | Data augmentation will help prevent your model from overfitting
         | a bit but the amount of useful information you get from naively
         | augmented data will reach diminishing returns at some point.
         | 
          | Data augmentation alone (e.g. rotations / shifts / crops /
          | color perturbations / cutout... of a single photo of a husky
          | dog) will never yield the added information that is contained
          | in new pictures showing subtle variations of the phenomenon you
          | are trying to model (e.g. a new photo of a Dalmatian dog if you
          | have no Dalmatian dogs in your original training set).
        
         | ska wrote:
         | Short answer is no, certainly to the "perfect" part.
         | 
          | The core problem in ML is generalization; simply put - how well
          | does your approach work with new data it hasn't seen before?
          | Think of it this way: there is a large set of all the potential
          | inputs you could see, and you only get to see a small subset
          | when you are training; what do you do so your general
          | performance is best? Which of course you can't actually know,
          | but you can try to estimate.
         | 
          | There are two issues that can give you a lot of trouble here.
          | The first is overfitting (you'll do much better on the training
          | set than "in real life"); the second is bias in your training
          | samples. Data augmentation (what you are talking about) is one
          | approach to reduce parts of the former effect, and done
          | correctly it can help.
         | 
          | Take a simple example: imagine we were trying to recognize
          | simple geometric shapes on images of a page - you want to find
          | triangles, rectangles, ellipses, etc. I only give you a small
          | set of images, say 10s of shapes total.
         | 
         | Now you suspect that "in the wild" you can have triangles at
         | all sorts of rotations, and sizes, but I've only given you a
         | few examples. So you want your algorithm to learn the shapes,
         | but not the sizes or orientations. If you just train on these,
         | it may not recognize a triangle that is just 2x as big as any
         | it has seen, or rotated 20 degrees left from one it has seen,
         | etc.
         | 
         | One approach would be to try and find a rotation and/or scale
         | invariant representation for your inputs - if you "know" that
         | shouldn't matter you've now removed it from the problem. This
         | can be hard or even mathematically impossible to do, depending
         | on the problem space (e.g. there is no rotation and scale
         | invariant manifold for photographic images). So another way you
         | can approach it is empirically; to take the examples I gave
         | you, and generate new examples in different poses and scales.
         | You feed this into your training and should get a much more
         | robust result, one that doesn't hew too closely to the training
          | set (i.e. less overtraining).
         | 
         | So this sounds great, right? What could go wrong? There are a
         | few issues. One is you are now enforcing things outside what
         | you learn from the data, so if you are wrong you will make
         | things worse.
         | 
         | More subtly when you do this you can tend to amplify any of the
         | sampling biases you had originally. Imagine that I never gave
         | you an equilateral triangle in the training set. It's quite
         | plausible that by generating millions of inputs from a few
         | examples, this category gets pushed closer to something
          | symmetric, like a sphere, say.
         | 
          | Another issue that can be subtle is that the manipulations you
          | are doing for data augmentation can easily introduce new things
          | to the data that you don't see, and your training can pick that
          | up. Consider, for example, rotations of these shapes. I told
          | you we were doing this from images, i.e. discretely sampled
          | grids. This means that other than certain symmetric rotations
          | and flips, you can't do this without resampling. And you can't
          | resample without smoothing. So if you take a dozen or so
          | "crisp" examples and turn them into 10s of thousands of
          | "smoothed" examples, what exactly are you teaching your model?
          | I'm also waving my hands here about how you are extracting
          | "shapes" from "background" and, in a NN context, what your
          | inputs actually look like... but you can introduce issues here
          | also.
         | 
         | There are lots of trade offs here. It's a useful technique, but
         | unsurprisingly isn't a silver bullet.
        
         | 6gvONxR4sf7o wrote:
         | Your big problem with 10 images is going to be overfitting. By
         | modifying an image and training on that too, you're effectively
         | teaching the network that that sort of modification shouldn't
         | change the label. It learns a kind of invariant. That invariant
         | isn't the same as actually seeing the dog from another angle,
         | but it's better than nothing.
        
         | Der_Einzige wrote:
          | This is already done. It's called data augmentation and is
          | extremely helpful in computer vision.
        
           | fermienrico wrote:
           | How do we generate more "information" from a limited given
           | information? Doesn't that break some law of information
           | theory?
        
             | claytonjy wrote:
              | It's less "generating more information" and more
              | "presenting the same information in new ways". A more ideal
             | model wouldn't need augmented data, but this is what works
             | well with current architectures. It may be that practical
             | constraints mean we never move away from augmentation, just
             | as we'll never move towards single-layer neural nets, even
             | though theoretically they can fit any model.
        
             | psb217 wrote:
             | With data augmentation, we're effectively injecting
             | additional information about what sorts of transformations
             | of the data the model should be insensitive to. The
             | additional information comes from our (hopefully) well-
             | informed human decisions about how to augment the data. By
             | doing this, we can reduce the tendency for the model to
             | pick up dependencies on patterns that are useful in the
             | context of the (very small) training dataset, but which
             | don't work well on new data that isn't in the training set.
        
         | jmalicki wrote:
         | This is quite common, it's often called data augmentation.
         | 
         | For example, most CNNs aren't invariant to skew, distortions,
         | rotations, or even zoom level. So to train a neural net to
         | recognize both 8x8 pixel birds and 10x10 pixel birds, you need
         | to add images of both zoom levels.
         | 
         | Of course, this is a weakness, and there is a lot of research
         | to try to rectify this, like Hinton's capsule networks.
         | 
          | For things like adding noise, in some cases it is used as
          | regularization to make the model robust to noise; in others,
          | such as GANs, models are trained to learn the difference
          | between the generatively created images (which are higher
          | entropy) and the true images, to refine the model.
         | 
         | But as the sibling noted, and you mention, the underlying data
         | is somewhat the same, so yes, you do need a lot of diversity...
         | but in practice these techniques can be helpful.
        
         | rocauc wrote:
         | As others have noted, this is data augmentation, and it's
         | incredibly useful to increase variation in training data to
         | help decrease overfitting.
         | 
         | It's not a silver bullet. It won't capture the natural
         | variations that happen in the real world.
         | 
         | But new forms of augmentation (like OP) are helping us get
         | closer.
         | 
          | For example, MixMatch creates blended images by mixing pairs
          | of images (and their labels) from across the training set [1].
          | In object detection, bounding-box-only augmentations are
          | improving models by introducing variation [2].
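          | 
          | A tiny sketch of that MixUp-style blending (alpha = 0.75 and
          | the max() trick follow the MixMatch paper; x1/x2 are image
          | tensors, y1/y2 are one-hot label vectors):
          | 
          |     import numpy as np
          | 
          |     def mixup(x1, y1, x2, y2, alpha=0.75):
          |         # blend two examples and their labels
          |         lam = np.random.beta(alpha, alpha)
          |         lam = max(lam, 1 - lam)  # keep x1 dominant
          |         return (lam * x1 + (1 - lam) * x2,
          |                 lam * y1 + (1 - lam) * y2)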
         | 
          | And an anecdote: I work on https://roboflow.ai , and we've seen
          | customers get production-ready results from datasets of <20
          | images based on techniques like these.
         | 
          | [1] https://arxiv.org/abs/1905.02249
          | [2] https://arxiv.org/pdf/1906.11172.pdf
        
         | colincooke wrote:
          | This depends on how well those 10 images represent the
          | distribution of data for your actual task. With only 10
          | samples, that's highly unlikely.
         | 
         | What you are talking about is data augmentation, a strategy we
         | can use to expand our training dataset synthetically, mostly in
         | a bid to prevent over-fitting.
        
         | DougBTX wrote:
         | > Is it possible to generate infinite data and get almost
         | perfect neural network accuracy?
         | 
         | Basic answer is no, but the reason is kind of interesting.
         | 
          | Imagine that the input to the model is a list of facts;
          | initially the facts are just:
          | 
          |     * Image 1 has class A
          |     * Image 2 has class B
          |     * etc...
          | 
          | The idea with data augmentation, in a roundabout way, is to
          | add other facts:
          | 
          |     * Flipping the image does not change its class
          |     * Translating the image does not change its class
          |     * Adding a small amount of noise to the image does not
          |       change its class
          |     * etc...
         | 
         | However, it is tricky to express those facts as inputs to the
         | model, but it is easy to generate new images based on those
         | facts which the model should be able to learn from. It would
         | likely be more efficient if those facts could be expressed
         | directly though.
         | 
         | So, by generating more data the model can progressively learn
         | those "class invariant transformations", but the model would
         | only reach perfect accuracy if _all_ class invariant
         | transformations were taught.
         | 
          | Another way to "teach" a model these rules is to embed the rule
          | into the structure of the model itself, e.g. the idea behind
          | convolutional neural networks is to embed translation
          | invariance into the model, so that it doesn't need to be
          | taught that from large batches of translated images.
        
       | master_yoda_1 wrote:
        | I am not sure how this article got ranked so high. I am
        | suspicious about reading these articles written by non-experts.
        | I would prefer to go to authentic sources and read the original
        | paper. Most of the time the information in these articles is
        | misleading and wrong.
        
         | shookness wrote:
         | Instead of speaking in generalities, can you point out what is
         | wrong in the posted article?
        
           | master_yoda_1 wrote:
            | The title is fraudulent. It fraudulently reports an 85%
            | accuracy gain, but inside it says something else: "FixMatch
            | is a recent semi-supervised approach by Sohn et al. from
            | Google Brain that improved the state of the art in semi-
            | supervised learning(SSL). It is a simpler combination of
            | previous methods such as UDA and ReMixMatch. In this post,
            | we will understand the concept of FixMatch and also see it
            | got 78% median accuracy and 84% maximum accuracy on
            | CIFAR-10 with just 10 labeled images."
        
             | master_yoda_1 wrote:
              | We should flag these fraudulent articles; I am not sure
              | the author has any credibility.
        
       | antipaul wrote:
       | I wish all papers were structured this way, by default.
       | 
       | That is, plenty of good diagrams, clear explanations and
       | intuitions, no unnecessary mathiness.
        
         | pmiller2 wrote:
         | Speaking of diagrams, I once read a short (probably no more
         | than 3-4 pages) paper with one theorem and one diagram. The
         | diagram was essential for me to understand the proof. The
         | problem was the diagram was a diagram _of_ the proof.
        
         | amitness wrote:
         | Hi,
         | 
          | Wanted to clarify that this is a summary article of the
          | paper. I wrote it to help out people who might not have the
          | mathematical rigor and research background to understand
          | research papers but would benefit from an intuitive
          | explanation.
         | 
         | The actual paper is available here:
         | https://arxiv.org/abs/2001.07685
        
           | mabbo wrote:
           | I would argue that folks like you translating the heavy
           | science into comprehensible ideas to those less deep into the
           | field are doing just as much to advance science as the
           | authors of these papers.
           | 
           | Seriously, this is fantastic work and I cannot compliment you
           | enough on it.
        
         | colincooke wrote:
          | This is a blog, not a paper; it seems you wouldn't like the
          | source material: https://arxiv.org/pdf/2001.07685.pdf
          | 
          | But you are correct! This way of showing off your work is much
          | nicer than what ends up in the paper. The blog representation
          | is a good opportunity to present the results of the paper at a
          | higher level. However, the "mathy" paper is still important so
          | that other experts in the field can understand the details of
          | the technique.
        
       | mattkrause wrote:
       | Title is (slightly) wrong.
       | 
       | As the first paragraph says: "In this post, we will understand
       | the concept of FixMatch and also see it got 78% accuracy on
       | CIFAR-10 with just 10 images."
       | 
       | Reporting the _best_ performance on a method that deliberately
       | uses just a small subset of the data is shady as heck.
        
         | colincooke wrote:
          | Agreed; also, this model fully uses the other images, just not
          | the way that traditional supervised learning would. Saying
          | "with just 10 labels" would be more accurate. Impressive
          | results, but this isn't some hyper-convergence technique that
          | somehow trains on only ten images.
        
           | mattkrause wrote:
           | This depends a lot on the application.
           | 
           | It seems like a big win for images and other stuff where
           | getting images is cheap, but labelling them is expensive.
           | Less great for (say) drug discovery where running the
           | experiments to generate the data points is the bottleneck.
        
         | kevinskii wrote:
         | I agree it's pretty sensationalistic, and I almost ignored it
         | for that reason. But it turns out that it's actually well worth
         | a read if you can get past that one flaw.
        
           | mattkrause wrote:
           | I did read it--that's how I noticed the number was wrong :-)
        
         | dang wrote:
         | Ok, we've reverted the title to that of the page, in keeping
         | with the site guidelines
         | (https://news.ycombinator.com/newsguidelines.html). When
         | changing titles, the idea is to make them less baity or
         | misleading, not more!
         | 
         | (Submitted title was "Semi-Supervised Learning: 85% accuracy on
         | CIFAR-10 with only 10 labeled images")
        
       ___________________________________________________________________
       (page generated 2020-04-03 23:00 UTC)