[HN Gopher] Understanding deep learning requires rethinking gene...
___________________________________________________________________

Understanding deep learning requires rethinking generalization

Author : tmfi
Score  : 92 points
Date   : 2021-03-04 18:32 UTC (4 hours ago)

(HTM) web link (cacm.acm.org)
(TXT) w3m dump (cacm.acm.org)

| benlivengood wrote:
| This is only tangentially my field, so this is pure speculation.
|
| I suppose it's possible that generalized minima are numerically
| more common than overfitted minima in an over-parameterized
| model, so probabilistically SGD will find a more general minimum
| than not, regardless of regularization.

| hervature wrote:
| I think the general consensus (from my interactions) is that a
| local minimum requires the gradient to vanish. When you have
| many dimensions, it's unlikely that all of the partial
| derivatives are 0. Coupled with modern optimization methods
| (primarily momentum), this encourages the result to land in a
| shallow valley as opposed to a spiky minimum. The leap of faith
| is equating shallow=general and spiky=overfitted.

| [deleted]

| vonsydov wrote:
| The whole point of neural networks was that you don't need to
| think hard about generalization.

| [deleted]

| clircle wrote:
| > Conventional wisdom attributes small generalization error
| either to properties of the model family or to the
| regularization techniques used during training.
|
| I'd say it's more about the simplicity of the task and the
| quality of the data.

| magicalhippo wrote:
| _The experiments we conducted emphasize that the effective
| capacity of several successful neural network architectures is
| large enough to shatter the training data. Consequently, these
| models are in principle rich enough to memorize the training
| data._
|
| So they're fitting elephants[1].
|
| I've been trying to use DeepSpeech[2] lately for a project; it
| would be interesting to see the results for that.
|
| I guess it could also be a decent test for your model: retrain
| it with random labels, and if it succeeds the model is just
| memorizing, so either reduce model complexity or add more
| training data?
|
| [1]: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-
| elep...
|
| [2]: https://github.com/mozilla/DeepSpeech

| mxwsn wrote:
| Model capacity large enough to perfectly memorize/interpolate
| the data may not be a bad thing. A phenomenon known as "deep
| double descent" says that increasing model capacity relative to
| the dataset size can reduce generalization error, even after the
| model achieves perfect training performance (see work by Mikhail
| Belkin [0] and empirical demonstrations on large deep learning
| tasks by researchers from Harvard/OpenAI [1]). Other work argues
| that memorization is critical to good performance on real-world
| tasks where the data distribution is often long-tailed [2]: to
| perform well at tasks where the training set only has 1 or 2
| examples, it's best to memorize those labels rather than
| extrapolate from other data (which a lower-capacity model may
| prefer to do).
|
| [0]: https://arxiv.org/abs/1812.11118
|
| [1]: https://openai.com/blog/deep-double-descent/
|
| [2]: https://arxiv.org/abs/2008.03703
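A minimal sketch of the retrain-with-random-labels test proposed
above, assuming PyTorch and torchvision are available. CIFAR-10,
ResNet-18, and every hyperparameter here are illustrative stand-ins,
not the paper's exact setup; sweeping the corruption fraction also
mirrors the noise-level experiment quoted further down the thread.

    # Label-randomization test: corrupt a fraction of the training
    # labels, retrain from scratch, and compare train/test accuracy.
    # If training accuracy stays near 100% even on fully random
    # labels, the model has enough effective capacity to memorize
    # the training set outright.
    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as T

    def corrupt_labels(dataset, fraction, num_classes=10, seed=0):
        """Replace `fraction` of the labels with random classes."""
        g = torch.Generator().manual_seed(seed)
        targets = torch.tensor(dataset.targets)
        mask = torch.rand(len(targets), generator=g) < fraction
        targets[mask] = torch.randint(num_classes, (int(mask.sum()),),
                                      generator=g)
        dataset.targets = targets.tolist()
        return dataset

    def accuracy(model, loader, device):
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in loader:
                pred = model(x.to(device)).argmax(dim=1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        return correct / total

    def run(noise_fraction, epochs=30):
        # Illustrative setup (not the paper's): CIFAR-10 + ResNet-18.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        tf = T.ToTensor()
        train = corrupt_labels(
            torchvision.datasets.CIFAR10(".", train=True, download=True,
                                         transform=tf), noise_fraction)
        test = torchvision.datasets.CIFAR10(".", train=False,
                                            download=True, transform=tf)
        train_loader = torch.utils.data.DataLoader(train, batch_size=128,
                                                   shuffle=True)
        test_loader = torch.utils.data.DataLoader(test, batch_size=256)

        model = torchvision.models.resnet18(num_classes=10).to(device)
        opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x.to(device)), y.to(device)).backward()
                opt.step()
        print(f"noise={noise_fraction:.1f}  "
              f"train={accuracy(model, train_loader, device):.3f}  "
              f"test={accuracy(model, test_loader, device):.3f}")

    # The pattern the paper reports: training accuracy near 1.0 at
    # every noise level, test accuracy deteriorating toward chance
    # (0.1 for ten classes) as the corruption fraction grows.
    for p in (0.0, 0.5, 1.0):
        run(p)

This follows the paper's qualitative protocol rather than its exact
architectures (Inception, AlexNet, MLPs) or training budget.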
| derbOac wrote:
| I think there's something critical about the implicit data
| universe being considered in the test and training data, and in
| these randomized datasets. Memorizing elephants isn't
| necessarily a bad thing if you can be assured there are actually
| elephants in the data, or if your job is to reproduce some data
| that has some highly non-random, low-entropy (in an abstract
| sense) features.
|
| I think where the phenomenon in this paper, and deep double
| descent, starts to clash with intuition is the more realistic
| case where the adversarial data universe is structured: not
| random, but not conforming to the observed training target label
| alphabet (to borrow a term loosely from the IT literature). That
| is, it's interesting to know that these models can perfectly
| reproduce random data, but generalizing from training to test
| data isn't interesting in a real-world sense if both are
| constrained by some implicit features of the data universe
| involved in the modeling process (e.g., that the non-elephant
| data only differs randomly from the elephant data, or doesn't
| contain any non-elephants that aren't represented in the
| training data). So then you end up with this:
| https://www.theverge.com/2020/9/20/21447998/twitter-photo-
| pr...
|
| I guess it seems to me there are a lot of implicit assumptions
| about the data space and what's actually being inferred in a lot
| of these DL models. The insight about SGD is useful, but maybe
| only underscores certain things, and seems to get lost in some
| of the discussion about deep double descent. Even Rademacher
| complexity isn't taken with regard to the _entire_ data space,
| just over a uniformly random sampling of it -- so it will
| underrepresent certain corners of the data space that are highly
| nonuniform and low-entropy, which is exactly where the trouble
| lies.
|
| There's lots of fascinating stuff in this area of research, lots
| to say. Glad to see it here on HN again.

| caddemon wrote:
| Would that necessarily imply it is memorizing on the non-random
| labels, though? I know analogies to human learning are overdone,
| but I have definitely seen humans fall back on memorization when
| they are struggling to "actually learn" some topic.
|
| So, genuinely asking from a technical/ML perspective: is it
| possible a network could optimize training loss without
| memorization when possible, but as that fails end up just
| memorizing?

| hervature wrote:
| > is it possible a network could optimize training loss without
| memorization when possible
|
| What does that even mean? ML is simply a set of tools to
| minimize sum_i (f(x_i) - y_i)^2 (if you don't like squared loss,
| pick whatever you prefer). In particular, f() here is our neural
| network. There is no loss without previous data. The only thing
| the network is trying to do is "memorize" old data.

| caddemon wrote:
| It depends, of course, on how you define memorization, but the
| network doesn't necessarily need to use the entirety of every
| input to do what you are describing. I would think what people
| mean when they say "it isn't learning, just memorizing" is that
| the vast majority of the information about all previous inputs
| is being directly encoded in the network.
|
| The person I was responding to mentioned training on random
| labels, and if training still goes well then the network must be
| a memorizer. But I don't see why it couldn't be the case that a
| network is able to act as a memorizer, but doesn't if there are
| certain patterns it can generalize on in the training data.
|
| Also, there is no human learning without previous data either,
| but I wouldn't characterize all of human learning as
| memorization.
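A minimal sketch of the objective hervature describes above:
gradient descent on sum_i (f(x_i) - y_i)^2 for a toy one-hidden-layer
network, in plain NumPy. The data, width, learning rate, and step
count are all invented for illustration.

    # Empirical risk minimization: adjust the parameters of f to
    # minimize sum_i (f(x_i) - y_i)^2 over the training pairs.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(64, 1))                # inputs x_i
    y = np.sin(3 * X) + 0.1 * rng.normal(size=X.shape)  # targets y_i

    # One-hidden-layer network: f(x) = W2 tanh(W1 x + b1) + b2.
    H = 100                              # wide enough to interpolate
    W1 = rng.normal(size=(H, 1)); b1 = np.zeros((H, 1))
    W2 = rng.normal(size=(1, H)) / np.sqrt(H); b2 = np.zeros((1, 1))

    lr, N = 0.05, len(X)
    for step in range(5001):
        # Forward pass.
        a = np.tanh(W1 @ X.T + b1)       # hidden activations, (H, N)
        f = (W2 @ a + b2).T              # predictions f(x_i), (N, 1)
        err = f - y                      # residuals f(x_i) - y_i
        if step % 1000 == 0:
            print(f"step {step:5d}  train MSE {np.mean(err**2):.5f}")
        # Backward pass for the mean squared loss.
        gout = (2 * err / N).T           # dL/df, (1, N)
        gW2 = gout @ a.T
        gb2 = gout.sum(axis=1, keepdims=True)
        gz = (W2.T @ gout) * (1 - a**2)  # back through tanh, (H, N)
        gW1 = gz @ X
        gb1 = gz.sum(axis=1, keepdims=True)
        # Gradient descent step.
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2

Nothing in the loop ever references a held-out point; whether the
fitted f generalizes is a property of the model and the data, not of
the objective itself.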
| dumb1224 wrote:
| I don't understand the random-label training part. Presumably
| you train on randomized labels which have no relationship with
| the input, but surely it won't generalize well at all, given the
| small probability of predicting the labels correctly by chance
| (the setup for a permutation test, am I wrong)?

| magicalhippo wrote:
| This was the thing I misread the first time around.
|
| If you look at Table 1, you see that the models manage to fit
| the randomized labels almost 100% correctly during training, but
| crucially the test score is down in the 10% region. This is in
| stark contrast to the roughly 80-90% test score for the properly
| labeled data.
|
| So it seems to me that when faced with structured data they
| manage to generalize the structure somehow, while when faced
| with random training data they're powerful enough to simply
| memorize the training data.
|
| edit: just to point out, obviously the test score is expected to
| be bad for random labels; after all, how can you properly test
| classification of random data?
|
| So the point, as I understand it, isn't that the randomized
| input leads to poor test results, but rather that the non-
| randomized one manages to generalize despite the model being
| capable of simply memorizing the input.

| caddemon wrote:
| AFAIK that's right, it would be very unlikely to generalize on
| random labels, which is why I read the comment as suggesting the
| network shouldn't have low training loss in that situation.

| tlb wrote:
| There's a spectrum from generalization to memorization. The
| extreme case of memorization is that it would only correctly
| classify images with precisely identical pixel values to the
| ones in the training set. When we say that people have
| "memorized" something, we are still far more general than that.
| We might key off a few words, say.
|
| The behavior of highly overparameterized networks is basically
| what you suggest: they will memorize if needed (which you can
| test by randomizing labels) but will usually generalize better
| when possible.

| bloaf wrote:
| That's not my reading. I think they are saying that models which
| _can_ over-fit the data in both theory and practice _appear not
| to do so when there are in fact generalizations in the data._

| magicalhippo wrote:
| Ah, hmm, yes, good point. I forgot this piece from the article:
|
| _We observe a steady deterioration of the generalization error
| as we increase the noise level. This shows that neural networks
| are able to capture the remaining signal in the data while at
| the same time fit the noisy part using brute-force._
|
| Since they score well on the test data, they must have
| generalized to some degree. But since they're just as good at
| training on random labels, they also have the capacity to simply
| memorize the training data.

| blt wrote:
| Please add the "(still)" to the HN post title. The original
| version of the paper, without "(still)" in the title, is several
| years old.

| davnn wrote:
| 2016 for v1, to be exact. Link:
| https://arxiv.org/abs/1611.03530 and discussion:
| https://news.ycombinator.com/item?id=13566917
___________________________________________________________________
(page generated 2021-03-04 23:00 UTC)