[HN Gopher] Proving the Lottery Ticket Hypothesis: Pruning is Al...
       ___________________________________________________________________
        
       Proving the Lottery Ticket Hypothesis: Pruning is All You Need
        
       Author : che_shr_cat
       Score  : 113 points
       Date   : 2020-03-07 03:55 UTC (19 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | anonymousDan wrote:
       | As a potentially naive thought experiment, if you just generated
       | in advance a number of random networks of similar size to the
       | pruned lottery ticket, and then trained them all in parallel,
       | would you eventually find a lottery ticket? If so how many would
       | you have to train to find a lottery ticket with high probability?
        | Why is training one big network and then pruning any better than
        | training lots of different smaller networks? Assume in all of the
        | above that you have a rough idea of how big the pruned network
        | will be.
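        | 
        | Roughly the experiment I have in mind, as a sketch (PyTorch; the
        | data, sizes, and K are all made up):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     # Toy data standing in for a real task (placeholder).
        |     X = torch.randn(1024, 20)
        |     y = (X.sum(dim=1) > 0).long()
        | 
        |     def small_net():
        |         # Sized like a hypothetical pruned "winning ticket".
        |         return nn.Sequential(nn.Linear(20, 16), nn.ReLU(),
        |                              nn.Linear(16, 2))
        | 
        |     def train(net, steps=200):
        |         opt = torch.optim.SGD(net.parameters(), lr=0.1)
        |         loss_fn = nn.CrossEntropyLoss()
        |         for _ in range(steps):
        |             opt.zero_grad()
        |             loss_fn(net(X), y).backward()
        |             opt.step()
        |         # Accuracy on the toy training set.
        |         return (net(X).argmax(dim=1) == y).float().mean().item()
        | 
        |     # Train K independently initialized small nets and compare
        |     # the best of them to the rest of the population.
        |     K = 20
        |     accs = sorted(train(small_net()) for _ in range(K))
        |     print(accs[-1], accs[:3])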
        
       | stared wrote:
        | The lottery ticket hypothesis is IMHO the single most interesting
        | finding in deep learning. It explains why deep learning works (vs
        | shallow neural nets), why initial over-parametrization is often
        | useful, why deeper is often better than shallow, etc.
       | 
       | I recommend for an overview:
       | 
        | - the original paper "The Lottery Ticket Hypothesis: Finding
        | Sparse, Trainable Neural Networks",
        | https://arxiv.org/abs/1803.03635 (its train/prune/rewind loop is
        | sketched below)
       | 
       | - "Deconstructing Lottery Tickets: Zeros, Signs, and the
       | Supermask" https://eng.uber.com/deconstructing-lottery-tickets/
        | showing that if we remove the "non-winning tickets" before
        | training, the trained network still works well
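        | 
        | For concreteness, the procedure from the original paper, as I
        | remember it: train, prune the smallest-magnitude weights, rewind
        | the survivors to their initial values, and repeat. A rough
        | PyTorch sketch (my own paraphrase, not the authors' code; the
        | per-round pruning fraction and the training loop are
        | placeholders):
        | 
        |     import copy
        |     import torch
        |     import torch.nn as nn
        | 
        |     def prune_by_magnitude(model, masks, fraction=0.2):
        |         # Drop the smallest-magnitude surviving weights in each
        |         # weight matrix; biases are left alone.
        |         new_masks = {}
        |         for name, p in model.named_parameters():
        |             if p.dim() < 2:
        |                 continue
        |             mask = masks.get(name, torch.ones_like(p))
        |             alive = p[mask.bool()].abs()
        |             k = int(fraction * alive.numel())
        |             thresh = alive.sort().values[k] if k > 0 else -1.0
        |             new_masks[name] = mask * (p.abs() > thresh).float()
        |         return new_masks
        | 
        |     def find_ticket(model, train_fn, rounds=5):
        |         init_state = copy.deepcopy(model.state_dict())
        |         masks = {}
        |         for _ in range(rounds):
        |             # train_fn must keep masked weights at zero.
        |             train_fn(model, masks)
        |             masks = prune_by_magnitude(model, masks)
        |             # Rewind surviving weights to their initial values.
        |             model.load_state_dict(init_state)
        |             with torch.no_grad():
        |                 for name, p in model.named_parameters():
        |                     if name in masks:
        |                         p.mul_(masks[name])
        |         return model, masks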
        
         | jl2718 wrote:
          | This is really not a new idea for anybody who has studied
         | numerical optimization or done a lot of much simpler linear
         | modeling. It has to do with the gradients, which are basically
         | random steps in high-dimensional spaces unless you align with a
         | 'large' eigenvector of the Hessian. This explanation didn't go
         | over well at all with the ML cargo cultists until someone gave
         | it a name. Interesting how things work.
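          | 
          | To illustrate the geometry (a toy numpy check, not a claim
          | about any particular network): in high dimensions a random
          | direction has essentially no overlap with any fixed direction
          | such as the Hessian's top eigenvector, so an unaligned step
          | buys you almost nothing.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     d = 10_000
          | 
          |     # Stand-in for the Hessian's top eigenvector: any fixed
          |     # unit direction will do for the geometric point.
          |     v = rng.normal(size=d)
          |     v /= np.linalg.norm(v)
          | 
          |     # Overlap of random unit directions with that direction.
          |     cos = []
          |     for _ in range(100):
          |         g = rng.normal(size=d)
          |         cos.append(abs(g @ v) / np.linalg.norm(g))
          | 
          |     # Both are on the order of 1/sqrt(d), i.e. ~0.01 here.
          |     print(np.mean(cos), 1 / np.sqrt(d))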
        
           | sdenton4 wrote:
           | It's also fun to consider that most neural network
           | architectures are mostly piecewise linear: relu(Mx + b) for
           | standard feedforward or convolutional layers... So one would
           | expect a lot of knowledge and folklore from (piecewise)
           | linear optimization to carry over nicely.
           | 
           | IIRC, ResNet was a good example of that: It's more efficient
           | to learn relu(x + Mx + b) than relu(Mx + b)
           | 
           | One difficulty, though, is that proofs generally /don't/
           | carry over, and IME a lot of 'classical' methods constrain
           | themselves to provable operations... So at times it can be
           | hard to tell the difference between 'useful folklore' and
           | 'tweak that makes the proofs work.'
           | 
           | I've been delving into sparse coding literature from about
           | ten years ago, and there's a lot of this kind of
           | difficulty... Interestingly, the best sparse coding
           | architectures ended up being very very similar to shallow
           | neural networks. There's a nice 'deep ksvd denoising' paper
           | from a few months back which improves handily on the older
           | sparse coding architectures by bringing in a smattering of
            | ideas from the DNN age; the rant at the end makes the case
           | for building these 'blended' architectures to get the best of
           | both worlds. https://arxiv.org/abs/1909.13164
           | 
           | I tend to think that the DNN architectures beat the provable-
           | domain architectures for Good Reasons, but the USEFUL
           | response is to build awesome blended architectures that maybe
           | go a long way towards bringing down model sizes and
           | complexity, while increasing explainability.
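            | 
            | In code, the two forms (a PyTorch sketch; dimensions are
            | arbitrary):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     class PlainBlock(nn.Module):
            |         # y = relu(Mx + b)
            |         def __init__(self, dim):
            |             super().__init__()
            |             self.linear = nn.Linear(dim, dim)
            | 
            |         def forward(self, x):
            |             return torch.relu(self.linear(x))
            | 
            |     class ResidualBlock(nn.Module):
            |         # y = relu(x + Mx + b): the same map plus an
            |         # identity shortcut, which tends to optimize
            |         # more easily.
            |         def __init__(self, dim):
            |             super().__init__()
            |             self.linear = nn.Linear(dim, dim)
            | 
            |         def forward(self, x):
            |             return torch.relu(x + self.linear(x))
            | 
            |     x = torch.randn(8, 64)
            |     print(PlainBlock(64)(x).shape)
            |     print(ResidualBlock(64)(x).shape)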
        
           | stared wrote:
            | Could you point to concrete papers showing that empirically?
           | 
           | (Many things were rediscovered over and over, sure.)
           | 
           | > random steps in high-dimensional spaces unless you align
           | with a 'large' eigenvector of the Hessian
           | 
            | In this case, the main observation is different from this
            | one, and has much more to do with sparsity (and an
           | exponentially growing number of connections with each layer).
        
             | lsorber wrote:
             | Nocedal has some papers in this direction.
        
           | nograpes wrote:
           | Do you have any good papers or textbooks where I can learn
           | about this from a numerical optimization perspective?
        
             | adampk wrote:
             | https://www.amazon.com/Numerical-Optimization-Operations-
             | Fin...
             | 
             | seems to be "the" numerical optimization textbook
        
           | moultano wrote:
           | I don't see any connection between what you've written and
           | the lottery ticket hypothesis. Could you elaborate? (In
           | particular, the core of the lottery ticket hypothesis is that
           | layers make the nets combinatorial, and so I don't see how
           | anything from simple linear modelling should transfer.)
        
       | tells wrote:
       | ELI5 someone please.
        
       | zackmorris wrote:
        | I've always felt that there is a deep connection between evolution
       | and thought, or more specifically, genetic algorithms (GAs) and
       | neural networks (NNs).
       | 
       | The state of the art when I started following AI in the late 90s
       | was random weights and hyper-parameters chosen with a GA, then
       | optimized with NN hill climbing to find the local maximum. Looks
       | like the research has continued:
       | 
       | https://www.google.com/search?q=genetic+algorithm+neural+net...
       | 
       | All I'm saying is that since we're no longer compute-bound, I'd
       | like to see more big-picture thinking. We're so obsessed with
       | getting 99% accuracy on some pattern-matching test that we
       | completely miss other options, like in this case that effective
       | subnetworks can evolve within a larger system of networks.
       | 
       | I'd like to see a mathematical proof showing that these and all
       | other approaches to AI like simulated annealing are (or can be
       | made) equivalent. Sort of like a Church-Turing thesis for machine
       | learning:
       | 
       | https://en.wikipedia.org/wiki/Church-Turing_thesis
       | 
       | If we had this, then we could use higher-level abstractions and
       | substitute simpler algorithms (like GAs) for the harder ones
       | (like NNs) and not get so lost in the minutia and terminology.
       | Once we had working solutions, we could analyze them and work
        | backwards to convert them to their optimized/complex NN
       | equivalents.
       | 
       | An analogy for this would be solving problems in our heads with
       | simpler/abstract methods like spreadsheets, functional
       | programming and higher-order functions. Then translating those
       | solutions to whatever limited/verbose imperative languages we
       | have to use for our jobs.
       | 
       | Edit: I should have said "NN gradient descent to find the local
       | minimum" but hopefully my point still stands.
       | 
       | Edit 2: I should clarify that in layman's terms, Church-Turing
       | says "every effectively calculable function is a computable
       | function" so functional programming and imperative programming
       | can solve the same problems, be used interchangeably and even be
       | converted from one form to the other.
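        | 
        | A toy version of the hybrid I mean (numpy; the "GA" is reduced
        | to a random population and the model to linear regression, so
        | this only sketches the shape of the idea):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     X = rng.normal(size=(200, 5))
        |     w_true = rng.normal(size=5)
        |     y = X @ w_true + 0.1 * rng.normal(size=200)
        | 
        |     def loss(w):
        |         return np.mean((X @ w - y) ** 2)
        | 
        |     # "Evolution" stage: sample candidate weight vectors and
        |     # keep the fittest one.
        |     population = rng.normal(size=(50, 5))
        |     best = min(population, key=loss)
        | 
        |     # Gradient-descent stage: refine the winner locally.
        |     w = best.copy()
        |     for _ in range(500):
        |         grad = 2 * X.T @ (X @ w - y) / len(y)
        |         w -= 0.05 * grad
        | 
        |     print(loss(best), loss(w))  # refined loss is much lower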
        
         | p1esk wrote:
         | What do you mean we're no longer compute bound? Why don't you
         | try replicating Microsoft's Turing-NLG 17B parameter model:
         | "... trains with only 256 NVIDIA GPUs compared to 1024 NVIDIA
         | GPUs needed by using Megatron-LM alone."
        
         | sgillen wrote:
         | > All I'm saying is that since we're no longer compute-bound,
         | I'd like to see more big-picture thinking. We're so obsessed
         | with getting 99% accuracy on some pattern-matching test that we
         | completely miss other options.
         | 
         | There are so many people working in this field now, you can be
         | sure a lot of them are doing big picture thinking.
         | 
         | > I'd like to see a mathematical proof showing that these and
         | all other approaches to AI like simulated annealing are (or can
         | be made) equivalent. Sort of like a Church-Turing thesis for
         | machine learning:
         | 
          | Maybe I'm misunderstanding what you are saying, but I think the
          | different optimization techniques/metaheuristics you're talking
          | about do actually have provably different properties.
        
       | IX-103 wrote:
       | This is really neat and has a lot of implications for porting
       | larger models to limited platforms like mobile. Unfortunately you
       | still have to train the larger network, so the gains are somewhat
       | limited. Some other papers I read show that you might be able to
       | prune the network in the middle of training, which would make
       | larger models more practical to work with.
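        | 
        | One common shape for that kind of mid-training pruning (a
        | generic sketch, not any specific paper's algorithm) is to keep a
        | binary mask per weight matrix and ratchet up its sparsity on a
        | schedule as training proceeds:
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     model = nn.Sequential(nn.Linear(100, 300), nn.ReLU(),
        |                           nn.Linear(300, 10))
        |     opt = torch.optim.SGD(model.parameters(), lr=0.1)
        |     loss_fn = nn.CrossEntropyLoss()
        |     masks = {n: torch.ones_like(p)
        |              for n, p in model.named_parameters()
        |              if p.dim() > 1}
        | 
        |     def update_masks(sparsity):
        |         # Prune the globally smallest-magnitude weights until
        |         # the requested fraction of the weights is zero.
        |         if sparsity <= 0:
        |             return
        |         with torch.no_grad():
        |             w = torch.cat([p.abs().flatten()
        |                            for n, p in model.named_parameters()
        |                            if n in masks])
        |             k = int(sparsity * w.numel())
        |             thresh = w.sort().values[k]
        |             for n, p in model.named_parameters():
        |                 if n in masks:
        |                     masks[n] = (p.abs() > thresh).float()
        | 
        |     for step in range(1000):
        |         x = torch.randn(32, 100)             # toy batch
        |         y = torch.randint(0, 10, (32,))
        |         opt.zero_grad()
        |         loss_fn(model(x), y).backward()
        |         opt.step()
        |         if step % 100 == 0:
        |             # Ramp sparsity from 0% to ~90% over training.
        |             update_masks(0.9 * step / 1000)
        |         with torch.no_grad():
        |             for n, p in model.named_parameters():
        |                 if n in masks:
        |                     p.mul_(masks[n])  # keep pruned weights zero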
        
         | ekelsen wrote:
         | The implications are unclear to me. We already know how to
         | prune models for inference. For example
         | https://arxiv.org/abs/1710.01878, along with earlier work (and
         | more recent work). There's also work showing that you can take
         | advantage of the sparsity to achieve practical speed gains:
         | https://arxiv.org/abs/1911.09723.
         | 
         | We can also train networks that are sparse from the beginning
         | of training (without requiring any special knowledge of the
         | solution): https://arxiv.org/abs/1911.11134. It remains to be
         | shown that this can be done with a speed advantage.
        
           | estebarb wrote:
            | Without reading: I think the importance is that before we
            | only had methods that _could_ do that in practice. Now we
            | know that there is an algorithm that provably _can_ do it:
            | they showed that it is always possible, not just for some
            | subset of networks.
           | 
            | On the other hand, it will trigger research on reducing the
           | size of the networks. That is important, as most researchers
           | don't have access to the computing power of Google and the
           | like.
        
             | ekelsen wrote:
             | It's unclear this algorithm would be useful in practice.
             | Training the weights will lead to a more accurate network
             | for the same amount of work at inference time.
        
           | veselin wrote:
           | I wonder, isn't the new CPU training with locality sensitive
           | hashing similar to constructing something close to a winning
           | lottery ticket?
        
           | stared wrote:
           | In most cases, there is limited support for sparse
           | operations. "Sparse Networks from Scratch: Faster Training
           | without Losing Performance" https://arxiv.org/abs/1907.04840
           | openly says "Currently, no GPU accelerated libraries that
           | utilize sparse tensors exist, and as such we use masked
           | weights to simulate sparse neural networks.".
           | 
           | However, the situation seems to be very dynamic. See:
           | 
           | - https://github.com/StanfordVL/MinkowskiEngine (Minkowski
           | Engine is an auto-diff convolutional neural network library
           | for high-dimensional sparse tensors)
           | 
           | - https://github.com/rusty1s/pytorch_sparse (an efficient
           | sparse tensor implementation for PyTorch; the official one is
            | slower than SciPy https://github.com/pytorch/pytorch/issues/16187;
           | however, I failed to install it - it is not "pip
           | install"-simple)
           | 
           | EDIT:
           | 
           | I checked it now and was able to install pytorch_sparse with
           | one command. It is a dynamic field indeed.
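            | 
            | To be concrete about "masked weights to simulate sparse
            | neural networks" (my sketch, not that paper's code): the
            | weight matrix stays dense and a fixed binary mask zeroes
            | most of it in every forward pass, so the function is
            | sparse but memory and FLOPs are not reduced:
            | 
            |     import torch
            |     import torch.nn as nn
            |     import torch.nn.functional as F
            | 
            |     class MaskedLinear(nn.Module):
            |         # Dense weights plus a fixed binary mask: the
            |         # function is sparse, the compute is not.
            |         def __init__(self, d_in, d_out, density=0.1):
            |             super().__init__()
            |             self.lin = nn.Linear(d_in, d_out)
            |             keep = torch.rand(d_out, d_in) < density
            |             self.register_buffer("mask", keep.float())
            | 
            |         def forward(self, x):
            |             # Masked entries also get zero gradient,
            |             # since they only enter via this product.
            |             w = self.lin.weight * self.mask
            |             return F.linear(x, w, self.lin.bias)
            | 
            |     layer = MaskedLinear(512, 512, density=0.1)
            |     print(layer(torch.randn(4, 512)).shape)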
        
             | madlag wrote:
             | OpenAI just announced it will port its sparse libraries to
              | PyTorch, so exciting times ahead! You can read more about
              | this (OP here): https://medium.com/huggingface/is-
             | the-future-of-neural-netwo...
        
       | xt00 wrote:
       | If "pruning is all you need" that does feel like a way of
       | explaining how intelligence could come out of a mass of neurons
       | such as our brain. Or at least that sounds like a thing that
       | makes it understandable to me. Basically add a bunch of
       | connections relatively randomly, start pruning slowly until you
       | hit a point where the system changes... I'll keep hand waving
       | until somebody who knows this stuff can chime in.. :)
        
         | Buttons840 wrote:
          | In psychology class the professor told me that the number of
          | connections in a human brain increases only twice in life,
          | during infancy and during adolescence; at all other times the
          | number of connections is decreasing.
        
       | bo1024 wrote:
       | I have a question. They show that any given depth-ell network,
       | computing F, is w.h.p. approximated by some subnetwork of a
       | random depth-2ell network.
       | 
       | But there is a theorem that even depth-2 networks can approximate
       | _any_ continuous function F. If the assumptions were the same,
       | then their theorem would imply any continuous function F is
       | w.h.p. approximated by some subnetwork of a depth-4 network.
       | 
       | So what is the difference in assumptions, i.e. what's the
       | significance of F being computed by a depth-ell network? What
       | functions can a depth-ell+1 network approximate that a depth-ell
       | network can't? I'd guess it has to do with Lipschitz assumptions
       | and bounded parameters but would be awesome if someone can
       | clarify!
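        | 
        | For concreteness, the shapes of the two statements as I
        | understand them (constants and exact norms elided, so this may
        | be off in the details):
        | 
        |     Universal approximation (depth 2): for every continuous
        |     $F$ on a compact set and every $\epsilon > 0$ there is a
        |     depth-2 net $g$ of some width $N(F, \epsilon)$ with
        |     $\sup_x |F(x) - g(x)| \le \epsilon$, but nothing bounds
        |     $N$ over the class of continuous functions.
        | 
        |     This paper (roughly): if $F$ is itself a ReLU net of depth
        |     $\ell$ and width $n$ with bounded weights, then a random
        |     net of depth $2\ell$ and width polynomial in $n$, $\ell$,
        |     $1/\epsilon$ contains, w.h.p., a subnetwork $g$, obtained
        |     by pruning alone, with $|F(x) - g(x)| \le \epsilon$ on a
        |     bounded domain.
        | 
        | If that reading is right, the extra assumption is exactly that F
        | has a small bounded-weight representation to begin with, which
        | depth-2 universal approximation does not give you for free.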
        
       ___________________________________________________________________
       (page generated 2020-03-07 23:00 UTC)