[HN Gopher] Gated Linear Networks
       ___________________________________________________________________
        
       Gated Linear Networks
        
       Author : asparagui
       Score  : 109 points
       Date   : 2020-06-15 15:23 UTC (7 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | The_rationalist wrote:
        | Where could this shine? Could it beat SOTA on NLP tasks?
        
       | Immortal333 wrote:
       | "We show that this architecture gives rise to universal learning
       | capabilities in the limit, with effective model capacity
       | increasing as a function of network size in a manner comparable
       | with deep ReLU networks."
       | 
        | What exactly does this statement mean?
        
         | T-A wrote:
         | Presumably (haven't read the paper yet) that their network
         | provably becomes a universal function approximator in the limit
         | of infinite size.
         | 
         | Reading... actually the proof seems to be in
         | 
         | https://arxiv.org/abs/1712.01897
        
         | jawarner wrote:
         | As the network size increases, it can learn more complex
         | functions. When the network gets bigger and bigger, it gets
         | closer to being able to learn any arbitrary function.
        
         | fxtentacle wrote:
          | They mean that if you add parameters, the learning capability
          | of their approach grows by a similar amount as if you added
          | the same number of parameters to a conv+ReLU network (the
          | standard approach).
         | 
          | That "universal" is a weird claim in my opinion, but they mean
          | that with enough parameters, this architecture can learn any
          | function.
        
           | Immortal333 wrote:
            | I was able to get the second part of the statement, but I
            | haven't seen the use of "in the limit" in a statement like
            | this.
           | 
            | Yes, universal approximation is a strong claim. NNs have
            | already been proven to be universal approximators
            | theoretically.
        
             | friendly_aixi wrote:
              | The result here is stronger, in the sense that typical NN
              | universality results are statements about capacity alone
              | (and not about how you optimise the network). Here, the
              | result holds with respect to both capacity and a choice of
              | suitable no-regret online convex optimisation algorithm
              | (e.g. online gradient descent). Of course, this is just
              | one desirable property of a general purpose learning
              | algorithm.
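              | 
              | If it helps make "no-regret online convex optimisation"
              | concrete, here is a toy sketch of one such algorithm
              | (projected online gradient descent) in Python -- mine, not
              | from the paper:
              | 
              |     import numpy as np
              | 
              |     def projected_ogd(grad, x0, radius, T, lr0=0.1):
              |         """Projected online gradient descent, a standard
              |         no-regret algorithm for online convex optimisation
              |         over a Euclidean ball. grad(t, x) should return a
              |         (sub)gradient of the round-t loss at x."""
              |         x = np.asarray(x0, dtype=float)
              |         for t in range(1, T + 1):
              |             # step size ~ 1/sqrt(t) gives O(sqrt(T)) regret
              |             x = x - (lr0 / np.sqrt(t)) * grad(t, x)
              |             norm = np.linalg.norm(x)
              |             if norm > radius:
              |                 x = x * (radius / norm)  # project to ball
              |             yield x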
        
             | fxtentacle wrote:
             | "in limit" here means as you approach the edge case of
             | having an unlimited number of parameters.
        
             | jacksnipe wrote:
             | "In limit" is just shorthand for "as N tends to infinity"
             | (not necessarily N, but you get the idea).
        
         | janosett wrote:
         | > "We show that this architecture"
         | 
         | They are demonstrating a new technique, Gated Linear Networks.
         | 
         | > gives rise to universal learning capabilities in the limit
         | 
          | They claim to show that with an unbounded amount of time and
          | memory (network size / # params), this architecture can be
          | used to learn/approximate any function.
         | 
         | > with effective model capacity increasing as a function of
         | network size
         | 
         | Model capacity here refers to the ability to memorize a mapping
         | between inputs and outputs. They show that a network with more
         | layers/weights will "memorize" more.
         | 
         | > in a manner comparable with deep ReLU networks
         | 
         | "Deep ReLU networks" are referring to commonly used modern deep
         | neural network architectures. ReLU is a popular activation
         | function:
         | https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
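          | 
          | To make "deep ReLU network" concrete, here is a toy forward
          | pass (illustrative only, nothing to do with the paper's
          | architecture):
          | 
          |     import numpy as np
          | 
          |     def relu(x):
          |         # ReLU(x) = max(0, x), applied elementwise
          |         return np.maximum(0.0, x)
          | 
          |     def deep_relu_forward(x, weights, biases):
          |         """Plain deep ReLU network: affine map, ReLU, repeat,
          |         then a final linear output layer."""
          |         h = x
          |         for W, b in zip(weights[:-1], biases[:-1]):
          |             h = relu(W @ h + b)
          |         return weights[-1] @ h + biases[-1]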
        
       | caretak3r wrote:
        | As a relative neophyte in this realm, this is fascinating to
        | read. Comparing this to the models/methods used to derive said
        | properties is good education for me.
        
       | fxtentacle wrote:
        | That is an amazing paper and a great result, and new neural
        | architectures are long overdue.
       | 
       | But I don't believe that this has any significance in practice.
       | 
       | GPU memory is the limiting factor for most current AI approaches.
       | And that's where the typical convolutional architectures shine,
       | because they effectively compress the input data, then work on
       | the compressed representation, then decompress the results. With
        | gated linear networks, I'm required to always work on the full
        | input data, because it's a one-step prediction. As a result,
        | I'll run out of GPU memory before I reach a learning capacity
        | that is comparable to conv nets.
        
         | friendly_aixi wrote:
         | Convolution is a linear operation; in the case of images, you
         | can view it as a multiplication with a doubly block circulant
         | matrix. I can't see any barriers to hybrid approaches here,
         | though it seems difficult to avoid using backpropagation for
         | credit assignment within the convolutional layers.
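          | 
          | A quick 1-D numpy sketch of that first point (mine, not from
          | the paper): circular convolution is literally multiplication
          | by a circulant matrix, and the 2-D image case is the doubly
          | block circulant analogue.
          | 
          |     import numpy as np
          | 
          |     n = 8
          |     x = np.random.randn(n)          # 1-D "image"
          |     w = np.array([1.0, -2.0, 0.5])  # convolution kernel
          |     k = np.zeros(n)
          |     k[:len(w)] = w                  # kernel padded to length n
          | 
          |     # Circulant matrix: column m is the kernel cyclically
          |     # shifted by m, i.e. C[i, m] = k[(i - m) % n].
          |     C = np.stack([np.roll(k, m) for m in range(n)], axis=1)
          | 
          |     # Circular convolution computed directly:
          |     # y[i] = sum_j w[j] * x[(i - j) % n]
          |     y = np.array([sum(w[j] * x[(i - j) % n]
          |                       for j in range(len(w)))
          |                   for i in range(n)])
          | 
          |     assert np.allclose(C @ x, y)    # matmul == convolution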
         | 
         | Re: significance, how about their application in regression:
         | https://arxiv.org/abs/2006.05964 ? Or in contextual bandits:
         | https://arxiv.org/abs/2002.11611 ?
         | 
         | Disclaimer: I am one of the authors (Joel).
        
         | BrokrnAlgorithm wrote:
          | What about findings w.r.t. online learning? I find the
          | continuous learning quality of algorithms to be a topic that
          | often seems to be more of a side concern, although it carries
          | a lot of relevance in applied settings.
        
           | fxtentacle wrote:
           | I believe that to be a red herring. Their approach cannot
           | learn any features that provide a lower-dimensional
            | approximation of the input data. As a result, there is no
            | intermediate representation which could change and thereby
            | negatively affect previously learned classifiers.
           | 
           | But if I train 10 independent traditional networks, I also
           | won't have newly learned data affect old performance. So in
           | effect they give up the possibility to do transfer learning
           | in exchange for avoiding the disadvantages of transfer
           | learning. But that's a bad tradeoff.
           | 
           | With their approach you always train from scratch, which
           | brings with it the need for huge training data sets.
           | 
            | So I can train a bird classifier on the traditional
            | architecture with 500 labeled images and a pretrained resnet.
            | Or I use a million bird images and this approach.
        
             | friendly_aixi wrote:
             | 1) Indeed, GLNs don't learn features... but I would claim
             | they do learn some notion of an intermediate
             | representation, it's just different from the DL mainstream
              | -- in particular it's closely related to the inverse Radon
             | transform in medical imaging.
             | 
              | 2) Inputs which are similar in terms of cosine similarity
              | will map to similar (data dependent) products of weight
              | matrices, and thus behave similarly, which of course can
              | affect performance in both good and bad ways. With the
              | results we show on permuted MNIST, it's, well... just not
              | particularly likely that they will interfere. This is a
              | good thing -- why should completely different data
              | distributions interfere with one another? The point is
              | that the method is resilient to catastrophic forgetting
              | when the cosine similarity between data items from
              | different tasks is small. This highlights the different
              | kind of inductive bias a halfspace-gated GLN has compared
              | to a deep ReLU network (see the toy sketch at the end of
              | this comment).
             | 
              | 3) Re: the bird example, that's slightly unfair. I am sure one
             | could easily make use of the pre-trained resnet to provide
             | informative features to a GLN -- it's early days for this
             | method, hybrid systems haven't been investigated, so I
             | don't know whether it would work better than current SOTA
             | methods for image classification. But I would be pretty
             | confident that some simple combination would work better
             | than chopping the head off a pretrained network and fitting
             | an SVM on top. This is all speculation on my part though.
             | :)
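              | 
              | For readers unfamiliar with halfspace gating, here is a
              | heavily simplified toy sketch of the idea (not our exact
              | formulation): the side information picks one of several
              | weight vectors by checking which side of a few hyperplanes
              | it falls on, and only the selected weights are used and
              | updated.
              | 
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(0)
              | 
              |     class HalfspaceGatedNeuron:
              |         def __init__(self, in_dim, side_dim, m=3):
              |             # m random gating hyperplanes over side info
              |             self.planes = rng.normal(size=(m, side_dim))
              |             # one weight vector per context (2**m of them)
              |             self.weights = np.zeros((2 ** m, in_dim))
              | 
              |         def context(self, z):
              |             # which side of each hyperplane z falls on
              |             bits = (self.planes @ z >= 0).astype(int)
              |             return int(bits @ (2 ** np.arange(len(bits))))
              | 
              |         def predict(self, x, z):
              |             w = self.weights[self.context(z)]
              |             return 1.0 / (1.0 + np.exp(-w @ x))  # sigmoid
              | 
              |         def update(self, x, z, label, lr=0.1):
              |             # local online logistic update; only the active
              |             # context's weights change, so dissimilar side
              |             # info barely interferes
              |             c = self.context(z)
              |             p = self.predict(x, z)
              |             self.weights[c] -= lr * (p - label) * x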
        
               | BrokrnAlgorithm wrote:
                | Good point as well - sometimes it's not about
                | dimensionality reduction but more about a persistent
                | representation; having this geared towards highly non-
                | stationary environments is a nice thing to have.
        
             | BrokrnAlgorithm wrote:
              | Still, there are a lot of domains where transfer learning
              | is not the most applicable setting - I'm thinking of highly
              | noisy and non-stationary settings such as finance. In some
              | of these domains, especially time series, lack of data is
              | often not the issue, e.g. high-frequency datasets.
             | 
             | Having models constantly re-train as the default setting is
             | essentially what a rolling regression would do - having a
             | rolling regression that doesn't catastrophically forget
             | would be quite valuable.
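              | 
              | For concreteness, by rolling regression I mean something
              | like the sketch below: refit on a sliding window at every
              | step, which by construction forgets everything outside the
              | window.
              | 
              |     import numpy as np
              | 
              |     def rolling_regression(X, y, window=100):
              |         """Refit ordinary least squares on a sliding
              |         window of the most recent observations."""
              |         coefs = []
              |         for t in range(window, len(y)):
              |             Xw, yw = X[t - window:t], y[t - window:t]
              |             beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
              |             coefs.append(beta)
              |         return np.array(coefs)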
        
       ___________________________________________________________________
       (page generated 2020-06-15 23:00 UTC)