[HN Gopher]  Researchers unveil a pruning algorithm to shrink de...
       ___________________________________________________________________
        
        Researchers unveil a pruning algorithm to shrink deep learning
       models
        
       Author : headalgorithm
       Score  : 66 points
       Date   : 2020-05-07 16:29 UTC (1 days ago)
        
 (HTM) web link (news.mit.edu)
 (TXT) w3m dump (news.mit.edu)
        
       | asparagui wrote:
       | Comparing Rewinding and Fine-tuning in Neural Network Pruning
       | 
       | https://arxiv.org/abs/2003.02389
        
       | CShorten wrote:
       | Check out our interview on Machine Learning Street Talk with
       | Jonathan Frankle explaining Rewinding and why you can only reset
       | the Learning Rate rather than the weights!
       | https://www.youtube.com/watch?v=SfjJoevBbjU&t=1177s
        
       | brilee wrote:
       | This technique figures out how to find a subset of neural network
       | edge weights that can replicate most of the full model's
       | performance, and is indeed quite simple. The catch is that this
       | sparse subset of NN edge weights has no structure, so it doesn't
       | get efficient execution on a GPU. If you're on a CPU with no
       | specialized matrix math, this does in fact cut down on execution
       | costs, but if you have an embedded ML chip, this doesn't really
       | help.
       | 
       | Distillation is another shrinkage technique that is also pretty
       | foolproof, but allows you to pick an arbitrary architecture -
       | this way, you can make full use of whatever hardware you're
       | deploying too.
        
         | walrus wrote:
         | I'm just speculating (and haven't read the paper yet), but it
         | may be possible to achieve similar speedups on GPUs by pruning
         | the smallest 20% of blocks of size >=KxK to produce block-
         | sparse weights[0], rather than pruning the smallest 20% of
         | weights.
         | 
         | [0] https://openai.com/blog/block-sparse-gpu-kernels/
        
         | lonelappde wrote:
         | Does a GPU run a pruned network slower than unpruned, or just
         | not as faster as the modern shrinkages would suggest?
        
           | fxtentacle wrote:
           | A naive GPU implementation on a non-block sparse network will
           | be just as slow as the full un-pruned network.
        
       | buildbot wrote:
       | If they are simply masking out low values weights after doing
       | full training, they've basically recreated what I worked on for
       | my graduate thesis. In contrast to this work, what we worked on
       | starts from iteration 0, and constantly adapts the lottery mask.
       | You can reset weights back to their initial value or set them
       | back to their initial value tied to a decay term To zero initial
       | values slowly to get computational sparsity if needed.
       | https://arxiv.org/abs/1806.06949
       | https://open.library.ubc.ca/cIRcle/collections/ubctheses/24/...
        
       | vikramkr wrote:
       | As someone not familiar with AI - I'm wondering if this is really
       | as simple and revolutionary as the article states? MIT is kind of
       | known for highly optimistic press releases that maybe oversell
       | sometimes, which makes it hard to know what's actually a real
       | breakthrough sometimes
        
         | deepnotderp wrote:
         | It's like 99% hyperbole, the gain is like 20% versus fine
         | tuning. An interesting idea, but incremental in terms of
         | improvements.
        
         | itronitron wrote:
         | There is a lot of pie-in-the-sky PR filler in the article, and
         | it's very light on details. It likely is as simple as the
         | article states, but probably not very revolutionary. At some
         | point the learned model will cease to be useful.
        
         | orange3xchicken wrote:
         | Yeah, the original work this paper follows up (by the same
         | group: https://arxiv.org/abs/1803.03635) received a lot of
         | attention in 2018 when it was uploaded to Arxiv & spawned a lot
         | of followup work. Even though it's been demonstrated that these
         | lottery tickets - sparse trainable subnetworks exist and can
         | approximate the complete nn arbitrarily well, their properties
         | are still not really understood. What is understood is that
         | these subnetworks depend heavily on the initialization of the
         | network, but that training the entire network together is
         | necessary for generalization. These findings generally advocate
         | for two-stage pruning approaches as opposed to cts
         | regularization/sparsification throughout training. The question
         | is how best to find these lottery tickets, and encourage them
         | from the get-go.
         | 
         | A lot of this work is also related to training adversarially
         | robust networks. A composition of ReLU layers corresponds to a
         | piecewise linear function, where the # of 'pieces' is like
         | exponential in the # of neurons. It's well known that standard
         | training of networks results in a highly non-linear pwl that is
         | easily fooled by adversarial examples. The robustness of a
         | neural network against adversarial examples is typically
         | characterized by its smoothness. One question is how to train
         | the network or prune neurons to encourage smoothness & reduce
         | the complexity (i.e. # of linear pieces) of the nn.
        
       | MarkusQ wrote:
       | > ...and repeat, until the model is as tiny as you want.
       | 
       | Cool! So if I repeat long enough I can get any network down to a
       | single neuron (as long as I really want to)? That is awesome!
        
         | mkolodny wrote:
         | Not quite. The Lottery Ticket Hypothesis paper showed that
         | models could shrink to around 10% of their original size
         | without a loss of accuracy [0]. So around a million neurons
         | instead of 10 million.
         | 
         | [0] https://arxiv.org/abs/1903.01611v1
        
         | fishmaster wrote:
         | I mean yeah... but the performance will be accordingly.
        
       | natch wrote:
       | What are the real world tradeoffs here compared to quantization
       | of models? Quantization is super easy but I take it this has some
       | advantages?
        
       ___________________________________________________________________
       (page generated 2020-05-08 23:00 UTC)