[HN Gopher] Proving the Lottery Ticket Hypothesis: Pruning is Al...
___________________________________________________________________
Proving the Lottery Ticket Hypothesis: Pruning is All You Need

Author : che_shr_cat
Score  : 113 points
Date   : 2020-03-07 03:55 UTC (19 hours ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| anonymousDan wrote:
| As a potentially naive thought experiment: if you just generated in advance a number of random networks of similar size to the pruned lottery ticket, and then trained them all in parallel, would you eventually find a lottery ticket? If so, how many would you have to train to find a lottery ticket with high probability? Why is training one big network and then pruning it any better than training lots of different smaller networks? Assume in all of the above that you have a rough idea of how big the pruned network will be.
|
| stared wrote:
| The lottery ticket hypothesis is IMHO the single most interesting finding in deep learning. It explains why deep learning works (vs. shallow neural nets), why initial over-parametrization is often useful, and why deeper is often better than shallower, etc.
|
| I recommend for an overview:
|
| - the original paper, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", https://arxiv.org/abs/1803.03635
|
| - "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask", https://eng.uber.com/deconstructing-lottery-tickets/ , showing that if we remove "non-winning tickets" before the training, the trained network still works well
|
| jl2718 wrote:
| This is really not a new idea for anybody who has studied numerical optimization or done a lot of much simpler linear modeling. It has to do with the gradients, which are basically random steps in high-dimensional spaces unless you align with a 'large' eigenvector of the Hessian. This explanation didn't go over well at all with the ML cargo cultists until someone gave it a name. Interesting how things work.
|
| sdenton4 wrote:
| It's also fun to consider that most neural network architectures are mostly piecewise linear: relu(Mx + b) for standard feedforward or convolutional layers... So one would expect a lot of knowledge and folklore from (piecewise) linear optimization to carry over nicely.
|
| IIRC, ResNet was a good example of that: it's more efficient to learn relu(x + Mx + b) than relu(Mx + b).
|
| One difficulty, though, is that proofs generally /don't/ carry over, and IME a lot of 'classical' methods constrain themselves to provable operations... So at times it can be hard to tell the difference between 'useful folklore' and 'tweak that makes the proofs work.'
|
| I've been delving into the sparse coding literature from about ten years ago, and there's a lot of this kind of difficulty... Interestingly, the best sparse coding architectures ended up being very similar to shallow neural networks. There's a nice 'deep KSVD denoising' paper from a few months back which improves handily on the older sparse coding architectures by bringing in a smattering of ideas from the DNN age; the rant at the end makes the case for building these 'blended' architectures to get the best of both worlds.
| https://arxiv.org/abs/1909.13164
|
| I tend to think that the DNN architectures beat the provable-domain architectures for Good Reasons, but the USEFUL response is to build awesome blended architectures that maybe go a long way towards bringing down model sizes and complexity, while increasing explainability.
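|
| In code, the contrast sdenton4 points at looks roughly like this (a minimal PyTorch sketch; the module names and the single square layer are made up for illustration, not taken from the ResNet paper):
|
|     import torch
|     import torch.nn as nn
|
|     class PlainBlock(nn.Module):
|         # relu(Mx + b): a standard feedforward layer
|         def __init__(self, dim):
|             super().__init__()
|             self.linear = nn.Linear(dim, dim)
|
|         def forward(self, x):
|             return torch.relu(self.linear(x))
|
|     class ResidualBlock(nn.Module):
|         # relu(x + Mx + b): the same layer plus an identity skip
|         # connection, so M only has to learn a perturbation of the
|         # identity rather than the whole map
|         def __init__(self, dim):
|             super().__init__()
|             self.linear = nn.Linear(dim, dim)
|
|         def forward(self, x):
|             return torch.relu(x + self.linear(x))
|
| The only difference is the skip connection in the second forward pass; the folklore is that near-identity maps like this are what keep very deep piecewise-linear stacks trainable.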
| stared wrote:
| Could you point to concrete papers showing that empirically?
|
| (Many things were rediscovered over and over, sure.)
|
| > random steps in high-dimensional spaces unless you align with a 'large' eigenvector of the Hessian
|
| In this case, the main observation is different from that one, and has much more to do with sparsity (and an exponentially growing number of connections with each layer).
|
| lsorber wrote:
| Nocedal has some papers in this direction.
|
| nograpes wrote:
| Do you have any good papers or textbooks where I can learn about this from a numerical optimization perspective?
|
| adampk wrote:
| https://www.amazon.com/Numerical-Optimization-Operations-Fin...
|
| seems to be "the" numerical optimization textbook
|
| moultano wrote:
| I don't see any connection between what you've written and the lottery ticket hypothesis. Could you elaborate? (In particular, the core of the lottery ticket hypothesis is that layers make the nets combinatorial, and so I don't see how anything from simple linear modelling should transfer.)
|
| tells wrote:
| ELI5 someone please.
|
| zackmorris wrote:
| I've always felt that there is a deep connection between evolution and thought, or more specifically, between genetic algorithms (GAs) and neural networks (NNs).
|
| The state of the art when I started following AI in the late 90s was random weights and hyper-parameters chosen with a GA, then optimized with NN hill climbing to find the local maximum. It looks like the research has continued:
|
| https://www.google.com/search?q=genetic+algorithm+neural+net...
|
| All I'm saying is that since we're no longer compute-bound, I'd like to see more big-picture thinking. We're so obsessed with getting 99% accuracy on some pattern-matching test that we completely miss other options, like, in this case, that effective subnetworks can evolve within a larger system of networks.
|
| I'd like to see a mathematical proof showing that these and all other approaches to AI, like simulated annealing, are (or can be made) equivalent. Sort of like a Church-Turing thesis for machine learning:
|
| https://en.wikipedia.org/wiki/Church-Turing_thesis
|
| If we had this, then we could use higher-level abstractions and substitute simpler algorithms (like GAs) for the harder ones (like NNs) and not get so lost in the minutiae and terminology. Once we had working solutions, we could analyze them and work backwards to convert them to their optimized/complex NN equivalents.
|
| An analogy for this would be solving problems in our heads with simpler/abstract methods like spreadsheets, functional programming and higher-order functions, then translating those solutions to whatever limited/verbose imperative languages we have to use for our jobs.
|
| Edit: I should have said "NN gradient descent to find the local minimum" but hopefully my point still stands.
|
| Edit 2: I should clarify that in layman's terms, Church-Turing says "every effectively calculable function is a computable function", so functional programming and imperative programming can solve the same problems, be used interchangeably, and even be converted from one form to the other.
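|
| The two-stage recipe zackmorris describes looks roughly like this (a toy PyTorch sketch; the population size, mutation scale, and tiny architecture are arbitrary choices for illustration, not from any particular paper):
|
|     import copy
|     import torch
|     import torch.nn as nn
|
|     def fitness(net, X, y):
|         # fitter individuals have lower loss
|         with torch.no_grad():
|             return -nn.functional.mse_loss(net(X), y).item()
|
|     def make_net(n_in):
|         return nn.Sequential(nn.Linear(n_in, 16), nn.ReLU(),
|                              nn.Linear(16, 1))
|
|     def evolve_then_descend(X, y, pop=20, gens=10, steps=200):
|         # stage 1: GA over random initializations
|         population = [make_net(X.shape[1]) for _ in range(pop)]
|         for _ in range(gens):
|             population.sort(key=lambda n: fitness(n, X, y),
|                             reverse=True)
|             survivors = population[:pop // 2]
|             children = []
|             for parent in survivors:
|                 child = copy.deepcopy(parent)  # mutate a copy
|                 with torch.no_grad():
|                     for p in child.parameters():
|                         p.add_(0.05 * torch.randn_like(p))
|                 children.append(child)
|             population = survivors + children
|         best = max(population, key=lambda n: fitness(n, X, y))
|         # stage 2: gradient descent from the winning initialization
|         opt = torch.optim.SGD(best.parameters(), lr=1e-2)
|         for _ in range(steps):
|             opt.zero_grad()
|             nn.functional.mse_loss(best(X), y).backward()
|             opt.step()
|         return best
|
| The GA only has to find a promising basin; gradient descent then does the local fine-tuning, which is the division of labor zackmorris attributes to late-90s practice.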
| p1esk wrote:
| What do you mean we're no longer compute-bound? Why don't you try replicating Microsoft's Turing-NLG 17B parameter model: "... trains with only 256 NVIDIA GPUs compared to 1024 NVIDIA GPUs needed by using Megatron-LM alone."
|
| sgillen wrote:
| > All I'm saying is that since we're no longer compute-bound, I'd like to see more big-picture thinking. We're so obsessed with getting 99% accuracy on some pattern-matching test that we completely miss other options.
|
| There are so many people working in this field now, you can be sure a lot of them are doing big-picture thinking.
|
| > I'd like to see a mathematical proof showing that these and all other approaches to AI like simulated annealing are (or can be made) equivalent. Sort of like a Church-Turing thesis for machine learning:
|
| Maybe I'm misunderstanding what you are saying, but I think the different optimization techniques/metaheuristics you're talking about do actually have provably different properties.
|
| IX-103 wrote:
| This is really neat and has a lot of implications for porting larger models to limited platforms like mobile. Unfortunately you still have to train the larger network, so the gains are somewhat limited. Some other papers I read show that you might be able to prune the network in the middle of training, which would make larger models more practical to work with.
|
| ekelsen wrote:
| The implications are unclear to me. We already know how to prune models for inference. For example https://arxiv.org/abs/1710.01878, along with earlier work (and more recent work). There's also work showing that you can take advantage of the sparsity to achieve practical speed gains: https://arxiv.org/abs/1911.09723.
|
| We can also train networks that are sparse from the beginning of training (without requiring any special knowledge of the solution): https://arxiv.org/abs/1911.11134. It remains to be shown that this can be done with a speed advantage.
|
| estebarb wrote:
| Without reading: I think the importance is that before, we had methods that _could_ do that; now we know that there is an algorithm that _can_ do that. They proved that it is always possible, not just for some subset of networks.
|
| On the other hand, it will trigger research on reducing the size of the networks. That is important, as most researchers don't have access to the computing power of Google and the like.
|
| ekelsen wrote:
| It's unclear this algorithm would be useful in practice. Training the weights will lead to a more accurate network for the same amount of work at inference time.
|
| veselin wrote:
| I wonder, isn't the new CPU training with locality-sensitive hashing similar to constructing something close to a winning lottery ticket?
|
| stared wrote:
| In most cases, there is limited support for sparse operations. "Sparse Networks from Scratch: Faster Training without Losing Performance" https://arxiv.org/abs/1907.04840 openly says "Currently, no GPU accelerated libraries that utilize sparse tensors exist, and as such we use masked weights to simulate sparse neural networks."
|
| However, the situation seems to be very dynamic. See:
|
| - https://github.com/StanfordVL/MinkowskiEngine (Minkowski Engine is an auto-diff convolutional neural network library for high-dimensional sparse tensors)
|
| - https://github.com/rusty1s/pytorch_sparse (an efficient sparse tensor implementation for PyTorch; the official one is slower than SciPy, see https://github.com/pytorch/pytorch/issues/16187; however, I failed to install it - it is not "pip install"-simple)
|
| EDIT: I checked it now and was able to install pytorch_sparse with one command. It is a dynamic field indeed.
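|
| Simulating sparsity with masked weights, as in the paper stared quotes, is simple to do; a minimal PyTorch sketch (the MaskedLinear class, its layer sizes, and the density value are made up for illustration):
|
|     import torch
|     import torch.nn as nn
|
|     class MaskedLinear(nn.Linear):
|         # a dense layer with a fixed random 0/1 mask over its
|         # weights; compute stays dense, but only unmasked weights
|         # ever contribute or train
|         def __init__(self, in_features, out_features, density=0.05):
|             super().__init__(in_features, out_features)
|             mask = (torch.rand(out_features, in_features)
|                     < density).float()
|             self.register_buffer("mask", mask)  # buffer, not a param
|
|         def forward(self, x):
|             # masked entries contribute zero and get zero gradient
|             return nn.functional.linear(x, self.weight * self.mask,
|                                         self.bias)
|
|     layer = MaskedLinear(784, 300)  # ~5% of weights participate
|
| This reproduces the statistical behaviour of a sparse layer while still doing the full dense multiply, which is exactly the limitation the quote points at: without real sparse kernels there is no speed win.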
| madlag wrote:
| OpenAI just announced it will port its sparse libraries to PyTorch, so exciting times ahead! You can read more about this here (OP here): https://medium.com/huggingface/is-the-future-of-neural-netwo...
|
| xt00 wrote:
| If "pruning is all you need", that does feel like a way of explaining how intelligence could come out of a mass of neurons such as our brain. Or at least that sounds like a thing that makes it understandable to me. Basically add a bunch of connections relatively randomly, start pruning slowly until you hit a point where the system changes... I'll keep hand-waving until somebody who knows this stuff can chime in.. :)
|
| Buttons840 wrote:
| In psychology class the professor told me that the number of connections in a human brain increases only twice in life, during infancy and during adolescence; at all other times the number of connections is decreasing.
|
| bo1024 wrote:
| I have a question. They show that any given depth-ℓ network, computing F, is w.h.p. approximated by some subnetwork of a random depth-2ℓ network.
|
| But there is a theorem that even depth-2 networks can approximate _any_ continuous function F. If the assumptions were the same, then their theorem would imply that any continuous function F is w.h.p. approximated by some subnetwork of a depth-4 network.
|
| So what is the difference in assumptions, i.e. what's the significance of F being computed by a depth-ℓ network? What functions can a depth-(ℓ+1) network approximate that a depth-ℓ network can't? I'd guess it has to do with Lipschitz assumptions and bounded parameters, but it would be awesome if someone could clarify!
___________________________________________________________________
(page generated 2020-03-07 23:00 UTC)