[HN Gopher] Researchers unveil a pruning algorithm to shrink deep learning models
___________________________________________________________________

Researchers unveil a pruning algorithm to shrink deep learning models

Author : headalgorithm
Score  : 66 points
Date   : 2020-05-07 16:29 UTC (1 day ago)

(HTM) web link (news.mit.edu)
(TXT) w3m dump (news.mit.edu)

| asparagui wrote:
| Comparing Rewinding and Fine-tuning in Neural Network Pruning
|
| https://arxiv.org/abs/2003.02389

| CShorten wrote:
| Check out our interview on Machine Learning Street Talk with Jonathan Frankle explaining rewinding and why you can only reset the learning rate rather than the weights! https://www.youtube.com/watch?v=SfjJoevBbjU&t=1177s

| brilee wrote:
| This technique figures out how to find a subset of neural network edge weights that can replicate most of the full model's performance, and it is indeed quite simple. The catch is that this sparse subset of edge weights has no structure, so it doesn't get efficient execution on a GPU. If you're on a CPU with no specialized matrix math, this does in fact cut down on execution costs, but if you have an embedded ML chip, this doesn't really help.
|
| Distillation is another shrinkage technique that is also pretty foolproof, but it lets you pick an arbitrary architecture, so you can make full use of whatever hardware you're deploying to.

| walrus wrote:
| I'm just speculating (and haven't read the paper yet), but it may be possible to achieve similar speedups on GPUs by pruning the smallest 20% of blocks of size >= KxK to produce block-sparse weights [0], rather than pruning the smallest 20% of individual weights.
|
| [0] https://openai.com/blog/block-sparse-gpu-kernels/

| lonelappde wrote:
| Does a GPU run a pruned network slower than an unpruned one, or just not as much faster as the shrinkage would suggest?

| fxtentacle wrote:
| A naive GPU implementation of a non-block-sparse network will be just as slow as the full, unpruned network.

| buildbot wrote:
| If they are simply masking out low-value weights after doing full training, they've basically recreated what I worked on for my graduate thesis. In contrast to this work, what we worked on starts from iteration 0 and constantly adapts the lottery mask. You can reset weights back to their initial values, or set them back to their initial values tied to a decay term that zeroes those initial values slowly, to get computational sparsity if needed.
| https://arxiv.org/abs/1806.06949
| https://open.library.ubc.ca/cIRcle/collections/ubctheses/24/...
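A minimal sketch of the magnitude-masking loop discussed above, assuming PyTorch; the 20% prune fraction and the rewind-to-early-weights step are illustrative choices, not the paper's exact procedure. The surviving positions are scattered across the weight matrix, which is why dense GPU kernels see little benefit from this kind of sparsity.

    import torch

    def magnitude_mask(weight, prune_frac=0.2):
        # Keep the largest (1 - prune_frac) of the entries by absolute value.
        # The resulting mask is unstructured: pruned positions are scattered.
        flat = weight.abs().flatten()
        k = max(1, int(prune_frac * flat.numel()))
        threshold = flat.kthvalue(k).values
        return weight.abs() > threshold

    def prune_and_rewind(trained_weight, early_weight, prune_frac=0.2):
        # One round of prune-and-rewind: mask the smallest trained weights,
        # then reset the survivors to their early-training values before
        # retraining. Repeating the round shrinks the network further.
        mask = magnitude_mask(trained_weight, prune_frac)
        return early_weight * mask, mask

    # Example for a single layer's weight matrix (values are placeholders).
    w_trained = torch.randn(256, 784)
    w_early = torch.randn(256, 784)  # snapshot saved early in training
    w_pruned, mask = prune_and_rewind(w_trained, w_early)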
| vikramkr wrote:
| As someone not familiar with AI - I'm wondering if this is really as simple and revolutionary as the article states? MIT is kind of known for highly optimistic press releases that oversell sometimes, which makes it hard to know what's actually a real breakthrough.

| deepnotderp wrote:
| It's like 99% hyperbole; the gain is like 20% versus fine-tuning. An interesting idea, but incremental in terms of improvements.

| itronitron wrote:
| There is a lot of pie-in-the-sky PR filler in the article, and it's very light on details. It likely is as simple as the article states, but probably not very revolutionary. At some point the pruned model will cease to be useful.

| orange3xchicken wrote:
| Yeah, the original work this paper follows up on (by the same group: https://arxiv.org/abs/1803.03635) received a lot of attention in 2018 when it was uploaded to arXiv and spawned a lot of follow-up work.
|
| Even though it's been demonstrated that these lottery tickets (sparse trainable subnetworks) exist and can approximate the complete network arbitrarily well, their properties are still not really understood. What is understood is that these subnetworks depend heavily on the initialization of the network, but that training the entire network together is necessary for generalization. These findings generally advocate for two-stage pruning approaches as opposed to continuous regularization/sparsification throughout training. The question is how best to find these lottery tickets, and how to encourage them from the get-go.
|
| A lot of this work is also related to training adversarially robust networks. A composition of ReLU layers corresponds to a piecewise linear function, where the number of 'pieces' is roughly exponential in the number of neurons. It's well known that standard training of networks results in a highly non-linear piecewise linear function that is easily fooled by adversarial examples. The robustness of a neural network against adversarial examples is typically characterized by its smoothness. One question is how to train the network, or prune neurons, to encourage smoothness and reduce the complexity (i.e. the number of linear pieces) of the network.

| MarkusQ wrote:
| > ...and repeat, until the model is as tiny as you want.
|
| Cool! So if I repeat long enough I can get any network down to a single neuron (as long as I really want to)? That is awesome!

| mkolodny wrote:
| Not quite. The Lottery Ticket Hypothesis paper showed that models could shrink to around 10% of their original size without a loss of accuracy [0]. So around a million neurons instead of 10 million.
|
| [0] https://arxiv.org/abs/1903.01611v1

| fishmaster wrote:
| I mean, yeah... but the performance will suffer accordingly.

| natch wrote:
| What are the real-world trade-offs here compared to quantization of models? Quantization is super easy, but I take it this has some advantages?
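For comparison with natch's question above, a minimal sketch of post-training dynamic quantization in PyTorch; the toy model below is hypothetical, and quantization stores weights in 8 bits rather than removing them, so it is largely orthogonal to pruning and the two can be combined.

    import torch
    import torch.nn as nn

    # Hypothetical small network standing in for a trained model.
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

    # Dynamic quantization: Linear weights are stored as int8 and
    # activations are quantized on the fly at inference time.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    # The quantized model is a drop-in replacement for inference.
    x = torch.randn(1, 784)
    y = quantized(x)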