[HN Gopher] Understanding deep learning requires rethinking gene...
       ___________________________________________________________________
        
       Understanding deep learning requires rethinking generalization
        
       Author : tmfi
       Score  : 92 points
       Date   : 2021-03-04 18:32 UTC (4 hours ago)
        
 (HTM) web link (cacm.acm.org)
 (TXT) w3m dump (cacm.acm.org)
        
       | benlivengood wrote:
       | This is only tangentially my field, so pure speculation.
       | 
        | I suppose it's possible that generalized minima are numerically
        | more common than overfitted minima in an over-parameterized
        | model, so probabilistically SGD will find a general minimum more
        | often than not, regardless of regularization.
        
         | hervature wrote:
          | I think the general consensus (from my interactions) is that a
          | local minimum requires the gradient to vanish. When you have
          | many dimensions, it's unlikely that all of its components are 0
          | at once. Coupled with modern optimization methods (primarily
          | momentum), this encourages the result to land in a shallow
          | valley as opposed to a spiky minimum. The leap of faith is
          | equating shallow = general and spiky = overfitted.
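          | 
          | As a toy illustration of one piece of that intuition (my own
          | sketch, not from the article): plain gradient descent with a
          | fixed learning rate can only settle into valleys whose
          | curvature is below roughly 2/lr, so sufficiently spiky minima
          | get skipped over. Momentum shifts the exact threshold but the
          | picture is similar.
          | 
          |   def run_gd(curvature, lr=0.1, x0=1.0, steps=50):
          |       # 1-D quadratic valley: loss = 0.5 * curvature * x^2
          |       x = x0
          |       for _ in range(steps):
          |           x -= lr * curvature * x  # gradient is curvature*x
          |       return abs(x)
          | 
          |   print(run_gd(curvature=2.0))   # flat: converges to ~0
          |   print(run_gd(curvature=25.0))  # spiky: overshoots, diverges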
        
       | [deleted]
        
       | vonsydov wrote:
       | The whole point of neural networks was that you don't need to
       | think hard about generalizations.
        
         | [deleted]
        
       | clircle wrote:
       | > Conventional wisdom attributes small generalization error
       | either to properties of the model family or to the regularization
       | techniques used during training.
       | 
       | I'd say it's more about the simplicity of the task and quality of
       | the data.
        
       | magicalhippo wrote:
       | _The experiments we conducted emphasize that the effective
       | capacity of several successful neural network architectures is
       | large enough to shatter the training data. Consequently, these
       | models are in principle rich enough to memorize the training
       | data._
       | 
       | So they're fitting elephants[1].
       | 
       | I've been trying to use DeepSpeech[2] lately for a project, would
       | be interesting to see the results for that.
       | 
        | I guess it could also be a decent sanity test for your own model?
        | Retrain it with random labels, and if it still fits them the
        | model is just memorizing, so either reduce model complexity or
        | add more training data? Rough sketch of what I mean at the end of
        | this comment.
       | 
       | [1]: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-
       | elep...
       | 
       | [2]: https://github.com/mozilla/DeepSpeech
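        | 
        | Roughly what I have in mind, as a PyTorch-style sketch.
        | make_model, X, y and num_classes are placeholders for your own
        | setup, and it's full-batch just to keep it short:
        | 
        |   import torch
        | 
        |   def fits_random_labels(make_model, X, y, num_classes,
        |                          epochs=200, lr=0.01):
        |       # replace the real labels with *fixed* random ones and
        |       # check whether training accuracy still reaches ~100%
        |       y_rand = torch.randint(0, num_classes, y.shape)
        |       model = make_model()
        |       opt = torch.optim.SGD(model.parameters(), lr=lr,
        |                             momentum=0.9)
        |       loss_fn = torch.nn.CrossEntropyLoss()
        |       for _ in range(epochs):
        |           opt.zero_grad()
        |           loss_fn(model(X), y_rand).backward()
        |           opt.step()
        |       preds = model(X).argmax(dim=1)
        |       return (preds == y_rand).float().mean().item()
        | 
        | If that returns ~1.0 the network has enough capacity to memorize
        | the training set, which is what Table 1 in the article shows for
        | the architectures they tested.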
        
         | mxwsn wrote:
          | Model capacity large enough to perfectly memorize/interpolate
          | the data may not be a bad thing. A phenomenon known as "deep
          | double descent" says that increasing model capacity relative to
          | the dataset size can reduce generalization error, even after
          | the model achieves perfect training performance (see work by
          | Mikhail Belkin [0] and empirical demonstrations on large deep
          | learning tasks by researchers from Harvard/OpenAI [1]; toy
          | illustration at the end of this comment). Other work argues
          | that memorization is critical to good performance on real-world
          | tasks where the data distribution is often long-tailed [2]: to
          | perform well on the rare cases for which the training set only
          | has 1 or 2 examples, it's best to memorize those labels, rather
          | than extrapolate from other data (which a lower capacity model
          | may prefer to do).
         | 
         | [0]: https://arxiv.org/abs/1812.11118
         | 
         | [1]: https://openai.com/blog/deep-double-descent/
         | 
         | [2]: https://arxiv.org/abs/2008.03703
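          | 
          | If you want to see the shape of this without any deep
          | learning: a toy random-features regression in the spirit of
          | [0] (my own sketch, not the paper's setup). With min-norm
          | least squares on random ReLU features, test error typically
          | spikes near the interpolation threshold p ~ n and then falls
          | again as p keeps growing.
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(0)
          |   n, n_test, d = 40, 500, 5
          |   X = rng.normal(size=(n, d))
          |   Xt = rng.normal(size=(n_test, d))
          |   w_true = rng.normal(size=d)
          |   y = X @ w_true + 0.3 * rng.normal(size=n)
          |   yt = Xt @ w_true
          | 
          |   for p in [5, 10, 20, 40, 80, 320, 1280]:
          |       W = rng.normal(size=(d, p))    # random ReLU features
          |       F = np.maximum(X @ W, 0)
          |       Ft = np.maximum(Xt @ W, 0)
          |       beta = np.linalg.pinv(F) @ y   # min-norm least squares
          |       print(p, np.mean((Ft @ beta - yt) ** 2))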
        
           | derbOac wrote:
           | I think there's something critical about the implicit data
           | universe being considered in the test and training data, and
           | in these randomized datasets. Memorizing elephants isn't
           | necessarily a bad thing if you can be assured there are
           | actually elephants in the data, or if your job is to
           | reproduce some data that has some highly non-random, low
           | entropy (in an abstract sense) features.
           | 
            | I think where the phenomenon in this paper, and deep double
            | descent, starts to clash with intuition is the more realistic
            | case where the adversarial data universe is structured, not
            | random, but not conforming to the observed training target
            | label alphabet (to borrow a term loosely from the IT
            | literature). That is, it's interesting to know that these
            | models can perfectly reproduce random data, but generalizing
            | from training to test data isn't interesting in a real-world
            | sense if both are constrained by some implicit features of
            | the data universe involved in the modeling process (e.g.,
            | that the non-elephant data only differs randomly from the
            | elephant data, or doesn't contain any non-elephants that
            | aren't represented in the training data). So then you end up
            | with this:
           | https://www.theverge.com/2020/9/20/21447998/twitter-photo-
           | pr...
           | 
            | I guess it seems to me there are a lot of implicit
            | assumptions about the data space and what's actually being
            | inferred in a lot of these DL models. The insight about SGD
            | is useful, but maybe only underscores certain things, and
            | seems to get lost in some of the discussion about DDD. Even
            | Rademacher complexity isn't taken with regard to the _entire_
            | data space, just over a uniformly random sampling of it (toy
            | estimate at the end of this comment) -- so it will
            | underrepresent certain corners of the data space that are
            | highly nonuniform, low entropy, which is exactly where the
            | trouble lies.
           | 
           | There's lots of fascinating stuff in this area of research,
           | lots to say, glad to see it here on HN again.
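            | 
            | One thing worth spelling out: the random-label result and
            | the Rademacher point are two sides of the same coin. If the
            | model class can fit arbitrary sign patterns on the sample,
            | its empirical Rademacher complexity on that sample is ~1 and
            | the usual bounds become vacuous. A toy numpy version, with
            | overparameterized linear classifiers standing in for the
            | network (my sketch, not the paper's code):
            | 
            |   import numpy as np
            | 
            |   rng = np.random.default_rng(0)
            |   n, d = 50, 200    # more parameters than samples
            |   X = rng.normal(size=(n, d))
            |   vals = []
            |   for _ in range(200):
            |       sigma = rng.choice([-1.0, 1.0], size=n)
            |       w = np.linalg.pinv(X) @ sigma   # fit the signs
            |       vals.append(np.mean(sigma * np.sign(X @ w)))
            |   print(np.mean(vals))   # ~1.0: the sample is shattered,
            |                          # so the bound says nothing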
        
         | caddemon wrote:
         | Would that necessarily imply it is memorizing on the non-random
         | labels though? I know analogies to human learning are overdone,
         | but I definitely have seen humans fall back on memorization
         | when they are struggling to "actually learn" some topic.
         | 
          | So genuinely asking from a technical/ML perspective - is it
          | possible a network could optimize training loss without
          | memorization when possible, but, when that fails, end up just
          | memorizing?
        
           | hervature wrote:
           | > is it possible a network could optimize training loss
           | without memorization when possible
           | 
            | What does that even mean? ML is simply a set of tools to
            | minimize sum_i (f(x_i) - y_i)^2 over the training examples
            | (if you don't like squared loss, pick whatever you prefer).
            | In particular, f() here is our neural network. There is no
            | loss without previous data. The only thing the network is
            | trying to do is "memorize" old data.
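            | 
            | Spelled out with the simplest possible f, a line fit by
            | plain gradient descent (toy example, mine):
            | 
            |   # minimize sum_i (f(x_i) - y_i)^2 with f(x) = w*x + b
            |   xs = [0.0, 1.0, 2.0, 3.0]
            |   ys = [0.0, 1.0, 2.0, 3.0]
            |   w, b, lr = 0.0, 0.0, 0.01
            |   for _ in range(2000):
            |       grad_w = sum(2 * (w*x + b - y) * x
            |                    for x, y in zip(xs, ys)) / len(xs)
            |       grad_b = sum(2 * (w*x + b - y)
            |                    for x, y in zip(xs, ys)) / len(xs)
            |       w -= lr * grad_w
            |       b -= lr * grad_b
            |   print(w, b)   # ~1.0, ~0.0: whatever minimizes the
            |                 # training loss, nothing more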
        
             | caddemon wrote:
              | It depends, of course, on how you are defining
              | memorization, but the network doesn't necessarily need to
              | use the entirety of every input to do what you are
              | describing. I would think what people mean when they say
              | "it isn't learning, just memorizing" is that the vast
              | majority of information about all previous inputs is being
              | directly encoded in the network.
             | 
             | The person I was responding to mentioned training on random
             | labels, and if training still goes well the network must be
             | a memorizer. But I don't see why it couldn't be the case
             | that a network is able to act as a memorizer, but doesn't
             | if there are certain patterns it can generalize on in the
             | training data.
             | 
             | Also, there is no human learning without previous data
             | either, but I wouldn't characterize all of human learning
             | as memorization.
        
               | dumb1224 wrote:
                | I don't understand the random label training part.
                | Presumably you train on randomised labels which have no
                | relationship with the input, but surely it won't
                | generalise well at all, given the small probability of
                | predicting the labels correctly by chance? (That's the
                | setup for a permutation test, am I wrong?)
        
               | magicalhippo wrote:
               | This was the thing I misread first time around.
               | 
                | If you look at Table 1, you see that the models manage to
                | fit the randomized labels almost 100% correctly, but
                | crucially the corresponding test score is down in the 10%
                | region. This is in stark contrast to the roughly 80-90%
                | test score for the properly labeled data.
               | 
               | So it seems to me that when faced with structured data
               | they manage to generalize the structure somehow, while
               | when faced with random training data they're powerful
               | enough to simply memorize the training data.
               | 
                | edit: just to point out, obviously it's to be expected
                | that the test score will be bad for the randomized
                | labels; after all, how can you properly test
                | classification of random data?
                | 
                | So the point, as I understand it, isn't that the
                | randomized labels lead to poor test results, but rather
                | that the models trained on the real labels manage to
                | generalize despite being capable of simply memorizing the
                | input.
        
               | caddemon wrote:
               | AFAIK that's right, it would be very unlikely to
               | generalize on random labels, which is why I read the
               | comment as suggesting the network shouldn't have low
               | training loss in that situation.
        
           | tlb wrote:
           | There's a spectrum of generalization-memorization. The
           | extreme case of memorization is that it would only correctly
           | classify images with precisely identical pixel values to the
           | ones in the training set. When we say that people have
           | "memorized" something, we are still far more general than
           | that. We might key off a few words, say.
           | 
           | The behavior of highly overparameterized networks is
           | basically what you suggest: they will memorize if needed
           | (which you can test by randomizing labels) but will usually
           | generalize better when possible.
        
         | bloaf wrote:
          | That's not my reading. I think they are saying that models
          | which *can* over-fit the data in both theory and practice
          | _appear not to do so when there are in fact generalizations in
          | the data._
        
           | magicalhippo wrote:
           | Ah hmm yes, good point. I forgot this piece from the article:
           | 
           |  _We observe a steady deterioration of the generalization
           | error as we increase the noise level. This shows that neural
           | networks are able to capture the remaining signal in the data
           | while at the same time fit the noisy part using brute-force._
           | 
            | Since they score well on the test data, they must have
            | generalized to some degree. But since they're just as good at
            | training on random input, they also have the capacity to just
            | memorize the training data. (Sketch of that kind of
            | label-noise setup below.)
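            | 
            | The label-noise knob they describe is easy to bolt onto the
            | random-label sketch above (PyTorch-style; corrupt_labels and
            | frac are my names, not theirs):
            | 
            |   import torch
            | 
            |   def corrupt_labels(y, num_classes, frac, seed=0):
            |       # replace a fraction `frac` of the labels with
            |       # uniformly random ones, keep the rest intact
            |       g = torch.Generator().manual_seed(seed)
            |       y = y.clone()
            |       mask = torch.rand(y.shape, generator=g) < frac
            |       n_bad = int(mask.sum())
            |       y[mask] = torch.randint(0, num_classes,
            |                               (n_bad,), generator=g)
            |       return y
            | 
            | Sweeping frac from 0 to 1 and plotting test error gives the
            | "steady deterioration" curve the article talks about.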
        
       | blt wrote:
       | Please add the (still) to the HN post title. The original version
       | of the paper without (still) in the title is several years old.
        
         | davnn wrote:
         | 2016 for v1 to be exact, link: https://arxiv.org/abs/1611.03530
         | and discussion: https://news.ycombinator.com/item?id=13566917
        
       ___________________________________________________________________
       (page generated 2021-03-04 23:00 UTC)