[HN Gopher] Gated Linear Networks
___________________________________________________________________

Gated Linear Networks

Author : asparagui
Score  : 109 points
Date   : 2020-06-15 15:23 UTC (7 hours ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| The_rationalist wrote:
| Where could this shine? Could it beat SOTA on NLP tasks?

| Immortal333 wrote:
| "We show that this architecture gives rise to universal learning
| capabilities in the limit, with effective model capacity increasing
| as a function of network size in a manner comparable with deep ReLU
| networks."
|
| What exactly does this statement mean?

| T-A wrote:
| Presumably (haven't read the paper yet) that their network provably
| becomes a universal function approximator in the limit of infinite
| size.
|
| Reading... actually the proof seems to be in
| https://arxiv.org/abs/1712.01897

| jawarner wrote:
| As the network size increases, it can learn more complex functions.
| As the network gets bigger and bigger, it gets closer to being able
| to learn any arbitrary function.

| fxtentacle wrote:
| They mean that if you add parameters, the learning capability of
| their approach grows by a similar amount as if you added the same
| number of parameters to a conv+ReLU network (the standard approach).
|
| That "universal" is a weird claim in my opinion, but they mean that
| with enough parameters, this architecture can learn anything.

| Immortal333 wrote:
| I was able to get the second part of the statement, but I haven't
| seen "in the limit" used in a statement like this before.
|
| Yes, universal approximation is a strong claim. NNs have already
| been proven to be universal approximators in theory.

| friendly_aixi wrote:
| The result here is stronger, in the sense that typical NN
| universality results are statements with respect to just capacity
| (and not how you optimise them). Here, the result holds with respect
| to both capacity and a choice of suitable no-regret online convex
| optimisation algorithm (e.g. online gradient descent). Of course,
| this is just one desirable property of a general-purpose learning
| algorithm.

| fxtentacle wrote:
| "In the limit" here means as you approach the edge case of having an
| unlimited number of parameters.

| jacksnipe wrote:
| "In the limit" is just shorthand for "as N tends to infinity" (not
| necessarily N, but you get the idea).

| janosett wrote:
| > "We show that this architecture"
|
| They are demonstrating a new technique, Gated Linear Networks.
|
| > gives rise to universal learning capabilities in the limit
|
| They claim to show that with an unbounded amount of time and memory
| (network size / # params) this architecture can be used to
| learn/approximate any function.
|
| > with effective model capacity increasing as a function of network
| size
|
| Model capacity here refers to the ability to memorize a mapping
| between inputs and outputs. They show that a network with more
| layers/weights will "memorize" more.
|
| > in a manner comparable with deep ReLU networks
|
| "Deep ReLU networks" refers to commonly used modern deep neural
| network architectures. ReLU is a popular activation function:
| https://en.wikipedia.org/wiki/Rectifier_(neural_networks)

| caretak3r wrote:
| As a relative neophyte in this realm, this is fascinating to read.
| Comparing this to the models/methods used to derive said properties
| is good education for me.
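For readers unfamiliar with the no-regret online convex optimisation
algorithms friendly_aixi mentions above, a minimal sketch of online
gradient descent follows. The quadratic loss stream and all names here
are illustrative assumptions, not from the paper.

    import numpy as np

    def online_gradient_descent(grad_fn, dim, rounds, base_lr=0.5):
        """Play w_t, observe the round-t loss, take a gradient step.

        A step size decaying like base_lr / sqrt(t) is the classic
        choice that gives O(sqrt(T)) regret against the best fixed w.
        """
        w = np.zeros(dim)
        for t in range(1, rounds + 1):
            g = grad_fn(t, w)  # gradient of this round's loss at w
            w = w - (base_lr / np.sqrt(t)) * g
        return w

    # Illustrative stream: losses f_t(w) = ||w - c_t||^2 with targets
    # c_t drawn around a fixed point; OGD tracks that point.
    rng = np.random.default_rng(0)
    targets = 1.0 + 0.1 * rng.standard_normal((1000, 3))
    w_final = online_gradient_descent(
        lambda t, w: 2.0 * (w - targets[t - 1]), dim=3, rounds=1000)
    print(w_final)  # roughly [1, 1, 1]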
| fxtentacle wrote:
| That is an amazing paper, a great result, and new neural
| architectures are long overdue.
|
| But I don't believe that this has any significance in practice.
|
| GPU memory is the limiting factor for most current AI approaches.
| And that's where the typical convolutional architectures shine,
| because they effectively compress the input data, then work on the
| compressed representation, then decompress the results. With gated
| linear networks, I'm required to always work on the full input data,
| because it's a one-step prediction. As a result, I'll run out of GPU
| memory before I reach a learning capacity that is comparable to conv
| nets.

| friendly_aixi wrote:
| Convolution is a linear operation; in the case of images, you can
| view it as a multiplication with a doubly block circulant matrix. I
| can't see any barriers to hybrid approaches here, though it seems
| difficult to avoid using backpropagation for credit assignment
| within the convolutional layers.
|
| Re: significance, how about their application in regression:
| https://arxiv.org/abs/2006.05964 ? Or in contextual bandits:
| https://arxiv.org/abs/2002.11611 ?
|
| Disclaimer: I am one of the authors (Joel).
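friendly_aixi's point that convolution is just a matrix multiplication
is easy to check numerically. Below is a minimal 1-D sketch; the image
case uses a doubly block circulant matrix, but the idea is the same.
The values are illustrative.

    import numpy as np

    n = 8
    x = np.arange(n, dtype=float)        # a 1-D "image"
    kernel = np.array([1.0, -2.0, 1.0])  # a small conv filter

    # First column of the circulant matrix: the zero-padded kernel.
    col = np.zeros(n)
    col[:kernel.size] = kernel

    # Circulant matrix: column j is the first column rolled down by j.
    C = np.column_stack([np.roll(col, j) for j in range(n)])

    # Circular convolution as a plain matrix product...
    y_matmul = C @ x

    # ...agrees with the FFT form of circular convolution.
    y_fft = np.real(np.fft.ifft(np.fft.fft(col) * np.fft.fft(x)))
    assert np.allclose(y_matmul, y_fft)

A "valid"-mode convolution corresponds to a banded Toeplitz submatrix
of C, which is why conv layers slot naturally into any method built
around linear maps.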
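Since the rest of the thread turns on how halfspace gating behaves,
here is a minimal sketch of a single GLN-style neuron as the paper
describes the mechanism: fixed random hyperplanes over the side
information select one weight vector per region, that vector
geometrically mixes the probabilities from the layer below, and
learning is a purely local gradient step on log loss, with no
backpropagation. Class and parameter names are illustrative
assumptions, and details such as bias inputs, weight projection, and
layer stacking are omitted.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def logit(p):
        p = np.clip(p, 1e-6, 1.0 - 1e-6)
        return np.log(p / (1.0 - p))

    class HalfspaceGatedNeuron:
        def __init__(self, n_inputs, side_dim, n_hyperplanes=4,
                     lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            # Fixed random hyperplanes: the gates are never trained.
            self.H = rng.standard_normal((n_hyperplanes, side_dim))
            # One weight vector per region of the halfspace partition.
            self.W = np.full((2 ** n_hyperplanes, n_inputs),
                             1.0 / n_inputs)
            self.lr = lr

        def context(self, z):
            # Which side of each hyperplane the side info z falls on
            # indexes one of 2^n_hyperplanes weight vectors.
            bits = (self.H @ z > 0).astype(int)
            return int(bits @ (1 << np.arange(bits.size)))

        def predict(self, p, z):
            # Geometric mixing: sigmoid of a weighted sum of logits.
            return sigmoid(self.W[self.context(z)] @ logit(p))

        def update(self, p, z, target):
            c = self.context(z)
            x = logit(p)
            pred = sigmoid(self.W[c] @ x)
            # Local online gradient step on log loss; only the weight
            # vector for the active region moves, so data landing in
            # other regions (e.g. other tasks) is left untouched.
            self.W[c] -= self.lr * (pred - target) * x
            return pred

Inputs that are similar in the cosine sense fall on the same side of
most hyperplanes, hence into the same region and under the same
weights, which is the geometry behind friendly_aixi's cosine-similarity
argument further down the thread.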
| BrokrnAlgorithm wrote:
| What about findings w.r.t. online learning? I find the continual-
| learning quality of algorithms to be a topic that often seems to be
| more of a side concern, although it carries a lot of relevance in
| applied settings.

| fxtentacle wrote:
| I believe that to be a red herring. Their approach cannot learn any
| features that provide a lower-dimensional approximation of the input
| data. As a result, there is no intermediate representation which
| could change and thereby negatively affect previously learned
| classifiers.
|
| But if I train 10 independent traditional networks, I also won't
| have newly learned data affect old performance. So in effect they
| give up the possibility of transfer learning in exchange for
| avoiding the disadvantages of transfer learning. But that's a bad
| tradeoff.
|
| With their approach you always train from scratch, which brings with
| it the need for huge training data sets.
|
| So I can train a bird classifier on the traditional architecture
| with 500 labeled images and a pretrained ResNet. Or I use a million
| bird images and this approach.

| friendly_aixi wrote:
| 1) Indeed, GLNs don't learn features... but I would claim they do
| learn some notion of an intermediate representation, it's just
| different from the DL mainstream -- in particular it's closely
| related to the inverse Radon transform in medical imaging.
|
| 2) Inputs which are similar in terms of cosine similarity will map
| to similar (data-dependent) products of weight matrices, and thus
| behave similarly, which of course can affect performance in both
| good and bad ways. With the results we show on permuted MNIST, it's,
| well... just not particularly likely that they will interfere. This
| is a good thing -- why should completely different data
| distributions interfere with one another? The point is the method is
| resilient to catastrophic forgetting when the cosine similarity
| between data items from different tasks is small. This highlights
| the different kind of inductive bias a halfspace-gated GLN has
| compared to a deep ReLU network.
|
| 3) Re the bird example, that's slightly unfair. I am sure one could
| easily make use of the pretrained ResNet to provide informative
| features to a GLN -- it's early days for this method, hybrid systems
| haven't been investigated, so I don't know whether it would work
| better than current SOTA methods for image classification. But I
| would be pretty confident that some simple combination would work
| better than chopping the head off a pretrained network and fitting
| an SVM on top. This is all speculation on my part though. :)

| BrokrnAlgorithm wrote:
| Good point as well -- sometimes it's not about dimensionality
| reduction but more about persistent representation; having this
| geared towards highly non-stationary environments is a nice thing to
| have.

| BrokrnAlgorithm wrote:
| Still, there are a lot of domains where transfer learning is not the
| most applicable setting -- I'm thinking of highly noisy and non-
| stationary settings such as finance. In some of these domains,
| especially time series, lack of data is often not the issue, e.g.
| high-frequency datasets.
|
| Having models constantly re-train as the default setting is
| essentially what a rolling regression would do -- having a rolling
| regression that doesn't catastrophically forget would be quite
| valuable.
___________________________________________________________________
(page generated 2020-06-15 23:00 UTC)