[HN Gopher] Illustrated FixMatch for semi-supervised learning
___________________________________________________________________
Illustrated FixMatch for semi-supervised learning
Author : amitness
Score  : 203 points
Date   : 2020-04-03 14:16 UTC (8 hours ago)
(HTM) web link (amitness.com)
(TXT) w3m dump (amitness.com)
| hadsed wrote:
| The cold hard reality of machine learning is that most useful data isn't readily available to just be collected. Semi-supervised and weakly supervised learning, data augmentation, multi-task learning: these are the things that will enable machine learning for the majority of companies out there, who need to build datasets and potentially leverage domain expertise somehow to bootstrap intelligent features in their apps. This is great work in that direction for computer vision.
|
| Even the giants are recognizing this fact and are leveraging it to great effect. Some keywords to search for good papers and projects: Overton, Snorkel, Snorkel Metal.
| najarvg wrote:
| Also Flying Squid, another interesting project from Stanford - http://hazyresearch.stanford.edu/flyingsquid
| starpilot wrote:
| I wish there was a way to augment data as easily for free text and other business data. I always see these few-shot learning papers for images, I suspect because it's easy to augment image datasets and because image recognition is interesting to laypeople. The vast majority of data we deal with in business is text/numerical, which is much harder to use in these approaches.
| amitness wrote:
| Agree with you on this. For text data, there was a paper called "UDA" (https://arxiv.org/abs/1904.12848) that did some work in this direction.
|
| They augmented text by using backtranslation. The basic idea is that you take text in English, translate it to some other language, say French, and then translate the French text back to English. Usually, you get back an English sentence that is different from the original English sentence but has the same meaning. Another approach they use to augment is to randomly replace stopwords/low TF-IDF words (intuitively, very frequent words like "a", "an", "the") with random words.
|
| You can find implementations of UDA on GitHub and try them out.
|
| I am learning these existing image semi-supervised techniques right now, and the plan is to do research on how we can transfer those ideas to text data. Let's see how it goes.
| codegladiator wrote:
| Haha, that's how we used to generate blog spam content and comments :p
| [deleted]
| edsykes wrote:
| I had a read through this and I couldn't really tell if there was something novel here?
|
| I understand that perturbations and generating new examples from labelled examples is a pretty normal part of the process when you only have a limited number of examples available.
| amitness wrote:
| The novelty is in applying two perturbations to available _unlabeled images_ and using them as part of training. This is different from what you are describing, which is applying augmentations to labeled images to increase the data size.
| daenz wrote:
| My immediate question was "how do you use unlabeled images for training?" But then I decided to read the paper :) The answer is:
|
| Two different perturbations of the same image should get the same predicted label from the model, even if it doesn't know what the correct label is. That information can be used in the training.
| computerex wrote:
| What if the model's prediction is wrong with high confidence? What if the cat is labeled as a dog for both perturbations? Then wouldn't the system train against the wrong label?
| amitness wrote:
| Nope, because of the way it works. In the beginning, when the model is being trained on the labeled data, it will make many mistakes, so its confidence for either cat or dog will be low. Hence, in that case, the unlabeled data are not used at all.
|
| As training progresses, the model will become better on the labeled data, and so it can start predicting with high confidence on unlabeled images that are trivial, similar-looking, or from the same distribution as the labeled data. So unlabeled images gradually start being used as part of training; as training progresses, more and more unlabeled data are added.
|
| The mathematics of the combined loss function and the curriculum learning part of the post cover this.
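To make the thresholding mechanism described above concrete, here is a minimal PyTorch sketch of FixMatch's loss on unlabeled images (the function and variable names are illustrative, not the authors' implementation; the 0.95 threshold does match the paper's CIFAR-10 setting):

    import torch
    import torch.nn.functional as F

    def fixmatch_unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
        # Pseudo-label from the weakly augmented view; no gradients flow here.
        with torch.no_grad():
            probs = F.softmax(model(weak_batch), dim=-1)
            confidence, pseudo_labels = probs.max(dim=-1)
            mask = (confidence >= threshold).float()  # keep only confident images

        # Train the strongly augmented view against the hardened pseudo-labels.
        logits_strong = model(strong_batch)
        loss = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
        return (loss * mask).mean()  # masked-out images contribute zero loss

The total training loss is the ordinary supervised cross-entropy on the few labeled images plus this term times a fixed weight. Early in training the mask is almost all zeros, which is exactly the implicit curriculum amitness describes.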
| sireat wrote:
| It is not the same thing, but it kind of reminds me of my naive and obvious (meaning something that came up while drinking beer) idea of generating a bunch of variations of your labeled data in cases where you do not have enough.
|
| Let's say you only have one image of a dog; you generate a bunch of color variations, sharpness adjustments, flips, transforms, etc. Voila, you have 256 images of the same dog.
|
| EDIT: I noticed that this is definitely a common idea, as others have already pointed out.
| manthideaal wrote:
| I wonder if a two-step process could work better than this: first train a variational autoencoder (or simply an autoencoder), then use it for training on the labeled samples.
|
| In (1) there is a full example of using the two-step strategy, but using more labeled data, to obtain 92% accuracy. Could someone try changing the second part to use only ten labels for the classification part and share the results?
|
| (1) https://www.datacamp.com/community/tutorials/autoencoder-cla...
|
| Edited: I found a deeper analysis in (2); in short, for CIFAR-10 the VAE semi-supervised learning approach gives poor results, but the author did not use augmentation!
|
| (2) http://bjlkeng.github.io/posts/semi-supervised-learning-with...
| jonpon wrote:
| Great summary! Reminds me a lot of Leon Bottou's work on using deep learning to learn causal invariant representations. (Video: https://www.youtube.com/watch?v=lbZNQt0Q5HA)
|
| We can view the augmentations of the image as "interventions" forcing the model to learn an invariant representation of the image.
|
| Although the blog post did not frame it as this type of problem (not sure if the paper did), I think it can definitely be seen as such, and it is really promising.
| amitness wrote:
| Interesting, thank you for sharing that. It reminds me of an approach called "PIRL" by Facebook AI. They framed the problem as learning invariant representations. You might find it interesting.
|
| https://amitness.com/2020/03/illustrated-pirl/
| fermienrico wrote:
| I don't know much about ML/deep learning and I have a burning question:
|
| Say we have 10 images as a starting point. Then we create 10,000 images from those 10 images by adding noise, filters, flipping them, skewing them, distorting them, etc. Isn't the underlying data the same (by some formal definition, like Shannon information entropy)? Would that actually improve neural networks?
|
| I've always wondered: is it possible to generate infinite data and get almost perfect neural network accuracy?
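The kind of expansion fermienrico describes takes only a few lines with standard tooling; here is a small torchvision sketch (the file name is hypothetical, and the particular transforms are just examples of label-preserving perturbations):

    from PIL import Image
    from torchvision import transforms

    img = Image.open("dog.jpg")  # hypothetical input: one labeled photo

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(degrees=15),
        transforms.ColorJitter(brightness=0.3, contrast=0.3),
        transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    ])

    # 256 distinct-looking variants of the same dog. Whether this actually
    # adds information is the question the replies below take up.
    variants = [augment(img) for _ in range(256)]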
| arketyp wrote:
| For a standard convolutional net, the low-entropy formulation of, for instance, rotation is not immediately accessible, which makes rotation a viable data augmentation and regularization strategy. Some designs try to account for natural symmetries by incorporating the related transformations as priors in the architecture.
| ogrisel wrote:
| Data augmentation will help prevent your model from overfitting a bit, but the amount of useful information you get from naively augmented data will reach diminishing returns at some point.
|
| Data augmentation alone (e.g. rotations / shifts / crops / color perturbations / cutout... of a single photo of a husky dog) will never yield the added information that is contained in new pictures showing subtle variations of the phenomenon you are trying to model (e.g. a new photo of a Dalmatian if you have no Dalmatians in your original training set).
| ska wrote:
| Short answer is no, certainly to the "perfect" part.
|
| The core problem in ML is generalization; simply put, how well does your approach work with new data it hasn't seen before? Think of it this way: there is a large set of all the potential inputs you could see, and you only get to see a small subset when you are training; what do you do so your general performance is best? Which of course you can't actually know, but you can try to estimate.
|
| There are two issues that can give you a lot of trouble here. The first is overfitting (you'll do much better on the training set than "in real life"); the second is bias in your training samples. Data augmentation (what you are talking about) is one approach to reduce parts of the former effect, and done correctly it can help.
|
| Take a simple example: imagine we were trying to recognize simple geometric shapes on images of a page; you want to find triangles, rectangles, ellipses, etc. I only give you a small set of images, say tens of shapes total.
|
| Now you suspect that "in the wild" you can have triangles at all sorts of rotations and sizes, but I've only given you a few examples. So you want your algorithm to learn the shapes, but not the sizes or orientations. If you just train on these, it may not recognize a triangle that is 2x as big as any it has seen, or rotated 20 degrees left from one it has seen, etc.
|
| One approach would be to try to find a rotation- and/or scale-invariant representation for your inputs; if you "know" that something shouldn't matter, you've now removed it from the problem. This can be hard or even mathematically impossible to do, depending on the problem space (e.g. there is no rotation- and scale-invariant manifold for photographic images). So another way you can approach it is empirically: take the examples I gave you and generate new examples in different poses and scales. You feed this into your training and should get a much more robust result, one that doesn't hew too closely to the training set (i.e., less overtraining).
|
| So this sounds great, right? What could go wrong? There are a few issues. One is that you are now enforcing things outside what you learn from the data, so if you are wrong you will make things worse.
|
| More subtly, when you do this you can tend to amplify any of the sampling biases you had originally. Imagine that I never gave you an equilateral triangle in the training set. It's quite plausible that by generating millions of inputs from a few examples, this category gets pushed closer to something symmetric, like a sphere, say.
|
| Another issue that can be subtle is that the manipulations you are doing for data augmentation can easily introduce new things to the data that you don't see, and your training can pick that up. Consider, for example, rotations of these shapes. I told you we were doing this from images, i.e. discretely sampled grids. This means that, other than certain symmetric rotations and flips, you can't do this without resampling. And you can't resample without smoothing. So if you take a dozen or so "crisp" examples and turn them into tens of thousands of "smoothed" examples, what exactly are you teaching your model? I'm also waving my hands here about how you are extracting "shapes" from "background" and, in a NN context, what your inputs actually look like... but you can introduce issues here also.
|
| There are lots of trade-offs here. It's a useful technique, but unsurprisingly it isn't a silver bullet.
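ska's resampling point is easy to demonstrate. A small NumPy/SciPy sketch (the image, angle, and interpolation order are arbitrary choices): rotating a crisp binary image by a non-right angle forces interpolation, so the augmented copies acquire smoothed gray edges that none of the originals had.

    import numpy as np
    from scipy.ndimage import rotate

    img = np.zeros((32, 32))
    img[8:24, 8:24] = 1.0  # a crisp binary square: pixel values are only 0 and 1

    rot = rotate(img, angle=20, reshape=False, order=1)  # bilinear resampling

    print(np.unique(img).size)  # 2: hard edges only
    print(np.unique(rot).size)  # many values: smoothed, gray edge pixels

A model trained mostly on such rotated copies can pick up on edge softness rather than shape, which is the kind of silent artifact ska warns about.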
| 6gvONxR4sf7o wrote:
| Your big problem with 10 images is going to be overfitting. By modifying an image and training on that too, you're effectively teaching the network that that sort of modification shouldn't change the label. It learns a kind of invariant. That invariant isn't the same as actually seeing the dog from another angle, but it's better than nothing.
| Der_Einzige wrote:
| This is already done. It's called data augmentation and is extremely helpful in computer vision.
| fermienrico wrote:
| How do we generate more "information" from a limited amount of given information? Doesn't that break some law of information theory?
| claytonjy wrote:
| It's less "generating more information" and more "presenting the same information in new ways". A more ideal model wouldn't need augmented data, but this is what works well with current architectures. It may be that practical constraints mean we never move away from augmentation, just as we'll never move towards single-layer neural nets, even though theoretically they can fit any model.
| psb217 wrote:
| With data augmentation, we're effectively injecting additional information about what sorts of transformations of the data the model should be insensitive to. The additional information comes from our (hopefully) well-informed human decisions about how to augment the data. By doing this, we can reduce the tendency of the model to pick up dependencies on patterns that are useful in the context of the (very small) training dataset but which don't work well on new data that isn't in the training set.
| jmalicki wrote:
| This is quite common; it's often called data augmentation.
|
| For example, most CNNs aren't invariant to skew, distortions, rotations, or even zoom level. So to train a neural net to recognize both 8x8 pixel birds and 10x10 pixel birds, you need to add images at both zoom levels.
|
| Of course, this is a weakness, and there is a lot of research trying to rectify it, like Hinton's capsule networks.
|
| For things like adding noise, in some cases it is used as regularization to make the model robust to noise; in others, such as GANs, models are trained to learn the difference between the generated (higher-entropy) images and the true images, to refine the model.
|
| But as the sibling noted, and you mention, the underlying data is somewhat the same, so yes, you do need a lot of diversity... but in practice these techniques can be helpful.
| rocauc wrote:
| As others have noted, this is data augmentation, and it's incredibly useful for increasing variation in training data to help decrease overfitting.
|
| It's not a silver bullet. It won't capture the natural variations that happen in the real world.
|
| But new forms of augmentation (like OP) are helping us get closer.
|
| For example, MixMatch creates "mosaic" images by combining images across the training set [1]. In object detection, bounding-box-only augmentations are improving models by introducing variation [2].
|
| And an anecdote: I work on https://roboflow.ai, and we've seen customers get production-ready results from datasets of <20 images based on techniques like these.
|
| [1] https://arxiv.org/abs/1905.02249
| [2] https://arxiv.org/pdf/1906.11172.pdf
| colincooke wrote:
| This depends on how well those 10 images represent the distribution of data for your actual task. With only 10 samples, that's highly unlikely.
|
| What you are talking about is data augmentation, a strategy we can use to expand our training dataset synthetically, mostly in a bid to prevent overfitting.
| DougBTX wrote:
| > Is it possible to generate infinite data and get almost perfect neural network accuracy?
|
| The basic answer is no, but the reason is kind of interesting.
|
| Imagine that the input to the model is a list of facts; initially the facts are just:
|
|   * Image 1 has class A
|   * Image 2 has class B
|   * etc...
|
| The idea with data augmentation, in a roundabout way, is to add other facts:
|
|   * Flipping the image does not change its class
|   * Translating the image does not change its class
|   * Adding a small amount of noise to the image does not change its class
|   * etc...
|
| However, while it is tricky to express those facts as inputs to the model, it is easy to generate new images based on those facts, which the model should be able to learn from. It would likely be more efficient if those facts could be expressed directly, though.
|
| So, by generating more data the model can progressively learn those "class invariant transformations", but the model would only reach perfect accuracy if _all_ class-invariant transformations were taught.
|
| Another way to "teach" a model these rules is to embed the rule into the structure of the model itself; e.g., the idea behind convolutional neural networks is to embed translation independence into the model, so that it doesn't need to be taught that from large batches of translated images. (See the sketch below.)
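A small PyTorch sketch of that last point. Assuming circular padding (an assumption that makes the property exact at the borders rather than approximate), a convolution applied to a shifted image yields a correspondingly shifted output, so translation never has to be learned from augmented copies:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    # Circular padding makes the layer exactly equivariant to circular shifts.
    conv = nn.Conv2d(1, 4, kernel_size=3, padding=1,
                     padding_mode="circular", bias=False)

    x = torch.randn(1, 1, 16, 16)
    x_shifted = torch.roll(x, shifts=2, dims=-1)  # shift the image 2 px right

    # Convolving the shifted image equals shifting the convolved image.
    print(torch.allclose(conv(x_shifted),
                         torch.roll(conv(x), shifts=2, dims=-1), atol=1e-6))  # True

Rotation and scale get no such structural free pass in a standard CNN, which is why, as jmalicki notes above, they are usually handled by augmentation instead.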
| master_yoda_1 wrote:
| I am not sure how this article got ranked so high. I am suspicious about reading these articles written by non-experts. I would prefer to go to authentic sources and read the original paper. Most of the time, the information in these articles is misleading and wrong.
| shookness wrote:
| Instead of speaking in generalities, can you point out what is wrong in the posted article?
| master_yoda_1 wrote:
| The title is fraudulent. It fraudulently reports an 85% accuracy gain, but inside it is something else: "FixMatch is a recent semi-supervised approach by Sohn et al. from Google Brain that improved the state of the art in semi-supervised learning (SSL). It is a simpler combination of previous methods such as UDA and ReMixMatch. In this post, we will understand the concept of FixMatch and also see it got 78% median accuracy and 84% maximum accuracy on CIFAR-10 with just 10 labeled images."
| master_yoda_1 wrote:
| We should flag these fraudulent articles. I am not sure the author has any credibility.
| antipaul wrote:
| I wish all papers were structured this way by default.
|
| That is, plenty of good diagrams, clear explanations and intuitions, no unnecessary mathiness.
| pmiller2 wrote:
| Speaking of diagrams, I once read a short (probably no more than 3-4 pages) paper with one theorem and one diagram. The diagram was essential for me to understand the proof. The problem was that the diagram was a diagram _of_ the proof.
| amitness wrote:
| Hi,
|
| I wanted to clarify that this is a summary article of the paper. I wrote it to help out people who might not have the math rigor and research background to understand research papers but would benefit from an intuitive explanation.
|
| The actual paper is available here: https://arxiv.org/abs/2001.07685
| mabbo wrote:
| I would argue that folks like you, translating the heavy science into comprehensible ideas for those less deep into the field, are doing just as much to advance science as the authors of these papers.
|
| Seriously, this is fantastic work and I cannot compliment you enough on it.
| colincooke wrote:
| This is a blog, not a paper; it seems you wouldn't like the source material: https://arxiv.org/pdf/2001.07685.pdf
|
| But you are correct! This way of showing off your work is much nicer than what ends up in the paper. The blog representation is a good opportunity to present the results of the paper at a higher level. However, the "mathy" paper is still important so that other experts in the field can understand the details of the technique.
| mattkrause wrote:
| Title is (slightly) wrong.
|
| As the first paragraph says: "In this post, we will understand the concept of FixMatch and also see it got 78% accuracy on CIFAR-10 with just 10 images."
|
| Reporting the _best_ performance of a method that deliberately uses just a small subset of the data is shady as heck.
| colincooke wrote:
| Agreed. Also, this model fully uses the other images, just not the way that traditional supervised learning would. "With just 10 labels" is more accurate. Impressive results, but this isn't some hyper-convergence technique that somehow trains on only ten images.
| mattkrause wrote:
| This depends a lot on the application.
|
| It seems like a big win for images and other areas where getting data is cheap but labelling it is expensive. Less great for (say) drug discovery, where running the experiments to generate the data points is the bottleneck.
| kevinskii wrote:
| I agree it's pretty sensationalistic, and I almost ignored it for that reason. But it turns out that it's actually well worth a read if you can get past that one flaw.
| mattkrause wrote:
| I did read it--that's how I noticed the number was wrong :-)
| dang wrote:
| Ok, we've reverted the title to that of the page, in keeping with the site guidelines (https://news.ycombinator.com/newsguidelines.html). When changing titles, the idea is to make them less baity or misleading, not more!
|
| (Submitted title was "Semi-Supervised Learning: 85% accuracy on CIFAR-10 with only 10 labeled images")
___________________________________________________________________
(page generated 2020-04-03 23:00 UTC)