[HN Gopher] Every Model Learned by Gradient Descent Is Approxima...
       ___________________________________________________________________
        
       Every Model Learned by Gradient Descent Is Approximately a Kernel
       Machine
        
       Author : scottlocklin
       Score  : 282 points
       Date   : 2020-12-05 14:40 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | scottlocklin wrote:
       | Looking at you, deep learning.
        
         | api wrote:
         | There is probably a ton of isomorphism between different
         | models. It may come down to what is easiest to understand and
         | fastest to implement in code.
        
           | segfaultbuserr wrote:
           | See also:
           | 
           | A visual proof that neural networks can approximate any
           | function
           | 
           | https://news.ycombinator.com/item?id=19708620
        
             | api wrote:
             | So a "neural network" is actually a type of parameterized
             | mathematical function that can be fit to any curve
             | including higher dimensional surfaces, etc.?
        
               | laingc wrote:
               | Yes.
        
               | xksteven wrote:
               | That's correct. See https://en.wikipedia.org/wiki/Univers
               | al_approximation_theore... for more details
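               | 
               | For intuition, a toy numpy sketch (my own, not from
               | that article): one hidden layer of random tanh units,
               | with only the output layer fit by least squares,
               | already approximates a smooth 1-D curve:
               | 
               |     import numpy as np
               | 
               |     rng = np.random.default_rng(0)
               |     x = np.linspace(-3, 3, 200)[:, None]
               |     W = rng.normal(size=(1, 50))   # random hidden
               |     b = rng.normal(size=50)        # weights, fixed
               |     H = np.tanh(x @ W + b)         # hidden layer
               |     target = np.sin(x)
               |     # fit the output weights by least squares
               |     c, *_ = np.linalg.lstsq(H, target, rcond=None)
               |     approx = H @ c                 # ~ sin(x)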
        
         | wongarsu wrote:
         | A linear SVM can in turn be expressed as a very shallow neural
         | network. The main difference is that with SVMs you put all your
         | effort into transforming inputs for the model (e.g. all the
         | popular kernels) while with neural networks usually most of the
         | effort goes into clever model architectures.
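         | 
         | As a sketch of that equivalence (my own toy code, not from
         | the paper): a linear SVM is a single linear unit trained
         | with hinge loss plus L2, one example per (sub)gradient step:
         | 
         |     import numpy as np
         | 
         |     def hinge_step(w, b, x, y, lam=1e-2, lr=1e-2):
         |         # y in {-1, +1}; the "network" is w @ x + b
         |         violated = y * (x @ w + b) < 1.0
         |         gw = lam * w - (y * x if violated else 0.0)
         |         gb = -y if violated else 0.0
         |         return w - lr * gw, b - lr * gb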
        
       | Anka33 wrote:
       | It's just a bunch of if clauses.
        
       | GlenTheMachine wrote:
       | I'm confused by findings like this one, and I'm hoping someone
       | here can educate me.
       | 
       | There are many known universal approximators. Deep networks are
       | one. SVMs are one. Heck, cubic splines are one, and they've been
       | in use for nearly a hundred years IIRC.
       | 
       | The problem has never been one of finding a sufficiently powerful
       | approximator. It has been training that approximator. My
       | understanding of the significant advancement made by deep
       | learning is that we finally figured out how to train a specific
       | kind of universal approximator in a way such that it finds very
       | good separation surfaces for what used to be impossible-to-solve
       | classification problems.
       | 
       | But it should be no surprise to anyone that there exist, in
       | theory, other universal approximators that approximately
       | reproduce the same separation surfaces, should it? I'd expect
       | _any_ universal approximator to be powerful enough to reproduce
       | the separation surfaces, hence the meaning of the word
       | "universal". The problem was always finding the right weights,
       | not finding the right approximator architecture.
       | 
       | Am I missing something?
        
         | xksteven wrote:
         | My reading was that the point is to show "how" they're
         | equivalent, as opposed to how to actually construct such an
         | approximator or learn it.
         | 
         | Similar to showing a problem falls in NP, you can reduce the
         | problem down to another problem in NP and be done with it.
        
           | sdenton4 wrote:
           | Agree, but also think the result may be too general to be
           | useful. Proving that you can rewrite any network learned with
           | gradient descent this way kinda suggests that the
           | architecture doesn't matter, but we know that's not true. Eg,
           | why are networks with skip connections SO much better than
           | networks without? What about batch normalization? This makes
           | me suspicious that it's a nice theoretical result a bit too
           | high level to be useful. Yes, it was proved years ago that
           | you can approximate an arbitrary function with a wide
           | enough two-layer net, but it's not a terribly practical way
           | to approach the world. Now we have architectures much
           | better than two-layer networks, and, for that matter, SVMs.
           | 
           | There are a number of problems with SVMs; complexity for
           | training and inference scales with the amount of training
           | data, which is pretty sad panda for complex problems.
           | 
           | Extremely spicy/cynical take: it's not cool to say "you all
           | should go look at all these possible applications" when the
           | thrust of the paper is to prop up the relevance of an
           | obsolete approach. You gotta do the actual work to close the
           | gap if you still want your PhD to be worth something...
           | 
           | That said, I haven't read the paper terribly closely, and am
           | always happy to be proven wrong!
        
             | runT1ME wrote:
             | >suggests that the architecture doesn't matter, but we know
             | that's not true. Eg, why are networks with skip connections
             | SO much better than networks without? What about batch
             | normalization?
             | 
             | Is this true though, or does network architecture only
             | matter in terms of _efficiency_? This is non-rhetorical; I
             | really don't know much about deep learning. :) I guess I'm
             | asking: with enough data and compute, is architecture
             | still relevant?
        
               | sdenton4 wrote:
               | These things matter a lot in practice. Imagine a giant
               | million dimensional loss surface, where each point is a
               | set of weights for the model. Then the gradient is
               | pushing us around on this surface, trying to find a
               | 'minimum.' Current understanding (for a while, actually)
               | is that you never really hit minima so much as giant
               | mostly-flat regions where further improvement maybe takes
               | a million years. The loss surfaces for models with skip
               | connections seem to be much, much nicer.
               | 
               | https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9
               | 067...
               | 
               | In effect, there's a big gap between an existence proof
               | and actually workable models, and the tricks of the trade
               | do quite a lot to close the gap. (And there are almost
               | certainly more tricks that we're still not aware of! I'm
               | still amazed at how late in the game batch normalization
               | was discovered.)
               | 
               | OTOH, so long as you're using the basic tricks of the
               | trade, IME architecture doesn't really matter much. Our
               | recent kaggle competition for birdsong identification was
               | a great example of this: pretty much everyone reported
               | that the difference between five or so 'best practices'
               | feature extraction architectures (various permutations of
               | resnet/efficientnet) was negligible.
        
             | mycall wrote:
             | > why are networks with skip connections SO much better
             | than networks without?
             | 
             | What are the leading theories for why this seems to be the
             | case? Fewer nodes to capture and direct decisions?
        
               | sdenton4 wrote:
               | Oh there's plenty of good explanation in the neural
               | network literature (my eli5: the skip connections make
               | the default mapping an identity instead of a zero
               | mapping; you can start by doing no harm, and improve from
               | there). The method was inspired by ideas from
               | differential equations. All I'm saying is that the
               | "everything is secretly an svm" viewpoint is probably too
               | coarse to explain these interesting and effective
               | structural differences.
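               | 
               | A two-line sketch of that "do no harm" point (toy
               | numpy of my own, not from any paper):
               | 
               |     import numpy as np
               | 
               |     def plain(x, W):      # W ~ 0  =>  output ~ 0
               |         return np.maximum(0, x @ W)
               | 
               |     def residual(x, W):   # W ~ 0  =>  output ~ x
               |         return x + np.maximum(0, x @ W)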
        
             | nightski wrote:
             | I'd be curious whether re-framing a trained neural network
             | model as an SVM gives you insight into its support vectors
             | and maybe a little understanding of why the NN works the
             | way it does.
        
         | riku_iki wrote:
         | > It has been training that approximator.
         | 
         | There's also the question of how compactly the approximator
         | itself represents the function, i.e. data compression. For
         | cubic splines to approximate some NN you would likely need an
         | enormous number of intervals covering the input space.
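         | 
         | Back-of-the-envelope (a tensor-product spline grid with 10
         | knots per axis, numbers assumed for illustration):
         | 
         |     knots_per_dim, dim = 10, 100
         |     n_coefficients = knots_per_dim ** dim   # 10**100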
        
           | VHRanger wrote:
           | Also, cubic splines don't extrapolate like the other
           | mentioned models
        
         | make3 wrote:
         | An issue with deep learning is that deep models are very
         | hard to analyse from a theoretical mathematical perspective;
         | it is hard to prove things about them.
         | 
         | Kernels have been studied thoroughly from a theoretical
         | perspective and people have proven things about them.
         | 
         | The goal of papers such as these is to find ways in which deep
         | neural networks and kernel methods are similar, so that
         | theoretical results and tools found for kernel methods may be
         | adapted to deep neural networks.
        
         | talolard wrote:
         | 50/50 I can shed some light or am way over my head:
         | 
         | The point of the paper is that if you train a deep learning
         | model with gradient descent then the resulting model is
         | effectively a kernel machine, regardless of model architecture.
         | 
         | The nice thing about a kernel machine is that it is simple
         | (just one hidden layer), and we are able to analyze a kernel
         | machine more effectively and conveniently.
         | 
         | So, I think the contribution here isn't "these sets of
         | universal approximators are equivalent" but rather "We have
         | this effective but opaque deep learning thing; it turns out
         | it's actually a kernel machine in retrospect, so we can bring
         | 'kernel tooling' to analyze the deep learning model."
        
         | sebasv_ wrote:
         | Finding the right architecture, or more generally the right
         | model, is very much still the main problem.
         | 
         | You should be careful with the meaning you ascribe to the word
         | 'universal'. The list of universal approximators is massive,
         | and the sub-list of universal approximators that can be trained
         | with OLS is still substantial. Still these models can differ
         | significantly:
         | 
         | - How efficient are they (in #parameters required for a certain
         | error) for specific tasks? There is a known 'maximum
         | efficiency' for general tasks, but in high dimensions this
         | efficiency is terrible, such that many models will fail
         | terribly on high-dimensional data. Hence, you should pick a
         | model that is exceptionally good for a specific task, although
         | it might be less efficient for other tasks.
         | 
         | - How well can the model cope with noise? If your dependent
         | variable is severely distorted (think financial data) then you
         | need a model that can balance between interpolating datapoints
         | and averaging out the noise.
         | 
         | Just to name my two favorite properties. The first one is _kind
         | of_ related to learnability, since an inefficient model is
         | often pretty much impossible to learn.
        
       | cs702 wrote:
       | "Here we show that every model learned by this method [SGD],
       | regardless of architecture, is approximately equivalent to a
       | kernel machine [i.e., a support vector machine or SVM] with a
       | particular type of kernel" -- a type of kernel which Domingos,
       | the author, calls a "path kernel."
       | 
       | As defined in the paper, a "path kernel" measures, for any two
       | data points, how similarly a model varies (specifically, how
       | similarly the model's gradients change) at those two data points
       | during training via SGD. This isn't exactly your usual, plain-
       | vanilla, radial-basis type of kernel.
       | 
       | We've known for a long time that SVMs are universal
       | approximators, i.e., in theory they can approximate any target
       | function. The importance of this work is that it has found a new,
       | surprising, deep connection between _any_ model trained via SGD
       | and SVMs, which are well understood :-)
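       | 
       | Roughly (my own sketch, approximating the paper's integral
       | over the gradient-flow path by a sum over saved training
       | checkpoints; grad_out is a hypothetical helper returning the
       | flattened gradient of the model's output at x with respect to
       | the weights w):
       | 
       |     import numpy as np
       | 
       |     def path_kernel(grad_out, checkpoints, x1, x2):
       |         return sum(np.dot(grad_out(w, x1), grad_out(w, x2))
       |                    for w in checkpoints)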
        
         | nuclearnice1 wrote:
         | Great explanation of the intuitive understanding of the path
         | kernel which seems to be the main takeaway from this paper.
         | 
         | One minor technical correction: the proof relies on the
         | continuous gradient-flow model, not SGD. So it's proven for
         | GD and likely true for SGD, and your intuitive explanation
         | likely still holds, but it's not obvious.
        
           | QuesnayJr wrote:
           | They sketch the argument for SGD, but they don't know if it
           | actually holds (see Remark 5 in the paper).
        
           | cs702 wrote:
           | You're absolutely right. I substituted SGD for GD without
           | giving it any thought because everyone uses SGD!
        
       | edwardjhu wrote:
       | The claim here is a bit misleading, as already pointed out by
       | other comments, since the kernel is an evolving one that is
       | essentially learned after seeing the data.
       | 
       | Contrary to many related works that compare wide neural networks
       | to kernel methods, our recent work shows that one can study a
       | feature learning infinite-width limit with realistic learning
       | rate.
       | 
       | https://arxiv.org/abs/2011.14522
       | 
       | We identified what separates the kernel regime (e.g., NTK) and
       | the feature learning regime. In the infinite-width limit, OP's
       | work could belong to either regime depending on the
       | parametrization, i.e. the path kernel is either equal to the NTK
       | or performing feature learning.
       | 
       | It's an incredibly interesting research topic. Please feel free
       | to comment with thoughts on our work :)
        
       | bicepjai wrote:
       | Watch your stock NVIDIA :)
        
         | rlili wrote:
         | Why, aren't GPUs pretty useful in training SVMs as well?
        
           | abeppu wrote:
           | Well, and though the author shows an approximate equivalence,
           | which helps us understand a class of models better, it's not
           | obvious that it's preferable to use SVMs of the type
           | described. In particular, it seems like often it would be
           | preferable to deal with model weights (even if they are
           | mathematically a "superposition" of datapoints) than to ship
           | around and revisit the whole dataset.
        
           | xksteven wrote:
           | SVMs are solved via convex optimization methods which have
           | taken more time to get on the GPU train.
           | 
           | On the other hand there are GPU accelerated SVM training such
           | as: https://github.com/Xtra-Computing/thundersvm
           | 
           | A github or Google search will reveal other GPU accelerated
           | SVM training.
        
             | knuthsat wrote:
             | SVMs can also be trained using gradient descent.
        
       | abeppu wrote:
       | From a skim, I think one asterisk, which may be a gap between
       | the claim in the title and what's actually shown in the paper,
       | is that the theorem focuses on gradient descent-trained models
       | that minimize a loss function that is a sum of per-example
       | losses L(y_i, y*_i) over a given dataset. While that's clearly
       | very broad, I
       | think it _doesn't_ include things like GANs, where parts of the
       | model produce fake data to train against.
        
         | Flashtoo wrote:
         | The claim also applies to GANs as you can simply use a masking
         | function to indicate which inputs were used for each
         | optimization timestep, like the author suggests for stochastic
         | gradient descent in remark 5.
        
       | api wrote:
       | Another thought: are neural nets just a weird analog / floating
       | point kind of probabilistic data structure or lossy compression
       | algorithm?
        
       | sarosh wrote:
       | See also W. Brendel and M. Bethge. "Approximating CNNs with bag-
       | of-local-features models works surprisingly well on ImageNet."
       | https://arxiv.org/abs/1904.00760 (which is referenced in this
       | paper) and explains "[t]his suggests that the improvements of
       | DNNs over previous bag-of-feature classifiers in the last few
       | years is mostly achieved by better fine-tuning rather than by
       | qualitatively different decision strategies."
        
       | stosto88 wrote:
       | Pretty sure it means a complex algorithm can be approximated by a
       | super simple model.
        
         | Iv wrote:
         | Thing is, gradient descent is not really a complex algorithm.
        
       | 6gvONxR4sf7o wrote:
       | This is a really cool new bit of intuition:
       | 
       | > A key property of path kernels is that they combat the curse of
       | dimensionality by incorporating derivatives into the kernel: two
       | data points are similar if the candidate function's derivatives
       | at them are similar, rather than if they are close in the input
       | space. This can greatly improve kernel machines' ability to
       | approximate highly variable functions (Bengio et al., 2005). It
       | also means that points that are far in Euclidean space can be
       | close in gradient space, potentially improving the ability to
       | model complex functions. (For example, the maxima of a sine wave
       | are all close in gradient space, even though they can be
       | arbitrarily far apart in the input space.)
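       | 
       | A toy reading of that sine example (mine, not the paper's):
       | for a model y(x) = a*sin(x) + b, the gradient with respect to
       | the weights (a, b) is (sin(x), 1), which is exactly (1, 1) at
       | every maximum of the sine, however far apart those maxima are
       | in input space:
       | 
       |     import numpy as np
       | 
       |     xs = np.pi / 2 + 2 * np.pi * np.arange(5)  # sine maxima
       |     grads = np.stack([np.sin(xs), np.ones_like(xs)], axis=1)
       |     # every row of grads is (1.0, 1.0)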
        
       | FRGabriel wrote:
       | So in the end, it rephrases a statement from "Neural Tangent
       | Kernel: Convergence and Generalization in Neural Networks"
       | [https://arxiv.org/abs/1806.07572], and in a way which is kind
       | of misleading.
       | 
       | The assertion has been known to the community since at least
       | 2018, if not well before.
       | 
       | I find this article and the buzz around it a little awkward.
        
         | QuesnayJr wrote:
         | To be fair, the article you cite is from 2018, and the OP is
         | from 2012 (if the arXiv dates are right).
        
           | ajtulloch wrote:
           | The OP article was submitted on Mon, 30 Nov 2020 23:02:47 -
           | the arXiv identifier 2012.xyz means it was published in the
           | 12th month of 2020, not 2012.
        
             | QuesnayJr wrote:
             | Oh God that's embarrassing. I really thought that's how
             | arXiv identifiers worked, and have been under the wrong
             | impression for some time.
        
           | [deleted]
        
       | screye wrote:
       | I do have a tangential technical question for someone who knows
       | more math than I do.
       | 
       | A kernel SVM always finds (one of) the _globally optimal
       | separating hyperplanes_ in the kernel space.
       | 
       | A gradient descent model explicitly converges to one of the
       | nearest _local minimas_ by definition.
       | 
       | Does this paper conclude that the local minimas that neural
       | networks converge to are among the many equivalent global
       | optima? Wouldn't this be a major revelation by itself?
        
         | dumb1224 wrote:
         | I don't have a strong math background but I think during the
         | optimisation process of finding the hyperplane, the solver
         | (algorithm that attempts to find best separating hyperplane)
         | uses soft margin to allow mis-classified instances. Its
         | tolerance is controlled by a hyper-parameter, so it will
         | compromise to find the best fit within the set parameter. So
         | it is a 'best solution' with a condition. However, there are
         | many variations in how different solvers implement and handle
         | it.
         | 
         | Example:https://towardsdatascience.com/support-vector-machine-
         | simply...
        
         | altarius wrote:
         | I don't think the output of the conversion is guaranteed to be
         | equivalent to a hyperplane learned by an SVM.
         | 
         | I didn't have time to read the paper, but reading the abstract
         | I don't see a claim that the gradient descent model
         | approximated by a kernel machine is equivalent to an optimal
         | fit obtained by SVM maximum margin hyperplane fitting.
         | 
         | I assume one likely ends up with different hyperplane fits from
         | converting a NN/gradient-desc-learned model to kernel machine
         | vs learning a kernel machine directly via SVM learning.
        
         | edjrage wrote:
         | Nitpick: _Minima_ is already plural.
        
         | kaczordon wrote:
         | Doesn't gradient descent use a convex cost function so that it
         | always generates a global minimum?
        
           | wxnx wrote:
           | No, not necessarily. The objective functions used to train
           | neural networks are generally non-convex (the nets themselves
           | being non-convex as well), but are traditionally trained
           | using stochastic gradient descent (and its variants).
        
             | mrtranscendence wrote:
             | This really threw me off when I first started learning
             | about ANNs, coming from a traditional econometrics
             | background. B-but ... the parameters aren't identified!
        
         | wxnx wrote:
         | In the context of this paper, you can think of the "gradient
         | descent step" as optimizing the parameters of the kernel (i.e.
         | the parameters that generate the kernel space). There are no
         | explicit optimality guarantees beyond those of standard
         | gradient descent.
         | 
         | The "SVM step" would still find a global optima within the
         | kernel space, but the qualifications of the previous step mean
         | that the kernel space generated might be useless.
        
         | andreareina wrote:
         | Isn't one of the features of high-dimensional spaces that local
         | minima are rare and there's usually _some_ direction that
         | slopes towards a lower loss?
        
           | dragontamer wrote:
           | I mean, obviously not in the general case. Cryptography is
           | designed to be a high dimensional space with pretty much no
           | slope.
           | 
           | I'd assume that a lot of the binary decision tree / chip
           | level optimizations are similar: almost no slope worth
           | analyzing.
        
             | sp332 wrote:
             | Sure, and unbreakable crypto is notoriously difficult to
             | make. I wouldn't expect a situation like that to come up in
             | a real-world problem.
        
               | dragontamer wrote:
               | Binary decision trees are basically super optimizers:
               | minimizing the number of gates needed to make a Chip's
               | logic (multiply circuits or whatever)
               | 
               | > Sure, and unbreakable crypto is notoriously difficult
               | to make
               | 
               | Not really. It's the unbreakable AND efficient
               | requirement combined that's hard.
               | 
               | Just run anything over a random xor shift add kernel
               | about 1 million times and it's unbreakable (so long as
               | xor / shift forms a bijection). It's just inefficient.
               | 
               | Finding the smallest number of iterations that works is
               | the hard part. Usually, people try to break it (ex, AES
               | was originally broken over 4 iterations) then double or
               | triple your best attempt, and you are set.
               | 
               | AES is now broken over 5 iterations IIRC, but still a
               | long way to go to break full 8 iteration AES.
               | 
               | More modern ciphers (SHA512) are like 80 iterations of a
               | simpler kernel.
               | 
               | Even 'broken' ciphers like TEA or RC4 are hard to break
               | in practice if you just increase the iteration count up
               | the wazoo. The problem is: AES gets to security in just
               | 8 iterations.
               | 
               | So to be better than AES requires both efficiency AND
               | security. After all, 100 iterations of AES is going to be
               | more secure if you don't care about efficiency. (10x more
               | iterations than the default spec)
        
       | burlesona wrote:
       | This sounds interesting, but a little over my head. Can anyone
       | offer an explanation for a software engineer with no AI/ML
       | background?
        
         | moralestapia wrote:
         | Not an explanation, but a benefit could be that SVMs can be
         | evaluated much faster and are more explainable (* Citation
         | needed, I know).
        
           | eugenhotaj wrote:
           | Don't kernel SVMs need a full pass through the data they were
           | trained on to make predictions? How is that faster?
        
             | cscheid wrote:
             | No, they require a full pass over the _support vectors_,
             | which are potentially a much smaller set. (That's part of
             | why everyone was so excited about SVMs when they were
             | invented.) The support vectors are the training points
             | with nonzero dual weight: those on or violating the
             | margin, i.e. sufficiently close to the decision boundary.
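             | 
             | Quick check with scikit-learn (counts depend on the data
             | and on C, of course):
             | 
             |     from sklearn.datasets import make_blobs
             |     from sklearn.svm import SVC
             | 
             |     X, y = make_blobs(n_samples=1000, centers=2,
             |                       random_state=0)
             |     clf = SVC(kernel="rbf").fit(X, y)
             |     # dataset size vs. number of support vectors
             |     print(len(X), clf.n_support_.sum())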
        
               | eugenhotaj wrote:
               | Fair enough, but the number of support vectors for
               | non-trivial problems is still pretty large (as I
               | understand it, but I could be wrong), e.g. 20-30% of the
               | dataset. Having to iterate over 30% of, say, ImageNet on
               | each batch of predictions seems infeasible.
        
             | alexilliamson wrote:
             | You only need the "Support Vectors" to make predictions,
             | not the whole dataset.
        
             | somurzakov wrote:
             | Neural nets, on the other hand, require multiple passes
             | through the data (epochs). If we can train a model in one
             | epoch instead of 10000 epochs, that's a breakthrough!
        
               | sdenton4 wrote:
               | Epochs are more about the training data than the model...
               | If you've got a big enough dataset, one epoch or less is
               | fine!
        
               | eugenhotaj wrote:
               | True, but it sounds like you're just shifting computation
               | from training to inference. And I'm not sure that's a
               | very good trade off to make, you're likely to predict on
               | much more data than you trained on (e.g. ranking models
               | at google, fb, etc)
        
               | somurzakov wrote:
               | Not sure I get your point; both DNNs and SVMs require
               | one forward pass for inference, so there is no
               | difference there. If an SVM model can converge in one
               | epoch, how is that not more efficient than the status
               | quo with DNNs?
        
               | eugenhotaj wrote:
               | For kernel SVMs, one needs to keep around part of the
               | training data (the support vectors) right? With DNNs,
               | after training, all you need are the model parameters.
               | For very large datasets, keeping around even a small part
               | of your training data may not be feasible.
               | 
               | Furthermore, the number of parameters does not
               | (necessarily) grow with the size of the training data;
               | parameters can be reused if you get more data, and can
               | be quantized/pruned/etc. There's
               | not really an easy way to do these things with SVMs as
               | far as I understand.
        
         | peteretep wrote:
         | "You don't need a fancy model, you can just find a similar
         | example directly from the training data and get similar
         | results"
        
           | ironSkillet wrote:
           | If you look at how they actually "translate" the fancy model
           | to the simple one, it requires fully fitting the original
           | model (and keeping track of the evolution of gradients over
           | the training). So it wouldn't make training more efficient,
           | but perhaps it would be useful in inference or probing the
           | characteristics of the original model.
        
           | lumost wrote:
           | This has always anecdotally appeared to be the case when
           | investigating the predictions of neural nets. Particularly
           | when it comes time to answer the question "what does this
           | model not handle"
        
           | smallnamespace wrote:
           | Defining 'similar' robustly is the meat of the problem, and
           | what we're finding deep NNs to do well.
        
           | [deleted]
        
         | hansvm wrote:
         | Part of the reason this is interesting is that deep learning
         | has to have some kind of inductive bias to perform as well as
         | we've seen (given a learning problem, it's biased toward
         | learning in particular ways). In general though, a neural
         | network can approximate _any_ function, so reining in that
         | complexity and uncovering which functions are actually
         | learnable (or efficiently learnable) by deep learning is an
         | important research direction. This paper says that the
         | functions uncovered by deep learning (with caveats) are
         | precisely those which are close to functions represented by a
         | different learning technique, which is notable because this new
         | class of functions is not "all functions" and because the
         | characterization is explainable in some sense, giving insight
         | into how deep learning works. That connection also winds up
         | having a bunch of other interesting implications that somebody
         | else can cover if they'd like.
        
       | ajtulloch wrote:
       | For everyone saying "oh wow, we can go back to SVMs now, huge
       | speedups, etc" - that's more than a little premature. This is
       | purely a formal equivalence, but not very useful computationally.
       | 
       | The math here is pretty much first-year undergraduate level
       | calculus, and it's worth going through section 2 since it's quite
       | clearly written (thanks Prof Domingos).
       | 
       | Essentially, what the author does is show that any model
       | trained with "infinitely-small step size full-batch gradient
       | descent" (i.e. a model following a gradient flow) can be
       | written in a "kernelized" form
       | 
       |     y(x) = \sum_{(x_i, y_i) \in L} a_i K(x, x_i) + b.
       | 
       | The intuition most people have for SVMs is that the constants a_i
       | are, well, _constant_, that the a_i are sparse, and that the
       | kernel function K(x, x_i) is cheap to compute (partly why it's
       | called the 'kernel trick').
       | 
       | However, none of those properties apply here, which is why this
       | isn't particularly exciting computationally. The "trick" is that
       | both a_i and K(x, x_i) involve path integrals along the gradient
       | flow for the input x, and so doing a single inference is
       | approximately as expensive as training the model from scratch (if
       | I understand this correctly).
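       | 
       | To make the contrast concrete (a sketch of the prediction form
       | only, not of the paper's construction of the a_i and K):
       | 
       |     import numpy as np
       | 
       |     def predict(x, xs, a, b, K):
       |         # y(x) = sum_i a_i K(x, x_i) + b
       |         return sum(ai * K(x, xi)
       |                    for ai, xi in zip(a, xs)) + b
       | 
       |     # RBF SVM: K is a cheap closed form, most a_i are zero.
       |     rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2))
       |     # Path kernel: evaluating K(x, x_i) (and the a_i) means
       |     # replaying the whole training trajectory.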
        
       | pietroppeter wrote:
       | It is worth noting that the author also wrote a very nice book
       | popularizing machine learning, "The Master Algorithm".
       | 
       | https://homes.cs.washington.edu/~pedrod/
        
       | hobofan wrote:
       | Is this a new finding? I'm not an expert on the deeper
       | mathematical side of ML, but I remember a friend of mine already
       | telling me about something that sounded exactly like this (ca.
       | 2016-17).
        
         | ironSkillet wrote:
         | The fact that kernel machines can approximate other models
         | isn't new. I think the novel idea is that they have explicitly
         | constructed the "translation" from the arbitrary model to
         | the kernel machine, and it's quite clean.
        
       | Straw wrote:
       | The kernel they find is a function of the gradient descent path,
       | which is a function of the data. So no, it's nothing at all
       | like a normal kernel machine, where we pick the kernel before
       | seeing the
       | data.
       | 
       | It also only applies to the continuous limit of non-stochastic
       | GD, far from the real training methods used.
       | 
       | We don't gain any understanding either; understanding implies
       | predictive power about some new situation, and I don't see
       | any, nor does the paper suggest any.
       | 
       | Looks like yet another attempt to attract attention by
       | "understanding" NNs. Look, humans can't explain or understand how
       | we drive, speak, translate, play chess, etc, so why should we
       | expect to understand how models that do these work? Of course, we
       | can understand the principles of the training process, and in
       | fact we already do; the theory of SGD is well understood.
        
         | xiphias2 wrote:
         | ,,Look, humans can't explain or understand how we drive, speak,
         | translate, play chess, etc, so why should we expect to
         | understand how models that do these work?''
         | 
         | I agree with you, but also it's amazing how much DeepMind has
         | achieved by putting neuroscientists and machine learning
         | experts in the same room, and trying to make systems that work
         | inside the human brain work efficiently on metal.
         | 
         | If you look at this talk from 2010, Demis was already listing
         | attention as an example (which was responsible for the recent
         | improvement in protein folding prediction):
         | 
         | https://www.youtube.com/watch?v=F5PSyu7booU
        
           | Straw wrote:
           | Absolutely, modern NN architectures have been inspired by
           | biological ones, despite their massive differences.
           | 
           | Even in cases like attention, the modern version (the one
           | that actually works in GPT-3, AlphaFold2, etc.) has little
           | in common with either the English word or what we think of
           | as attention. It's a formula with two matmuls and a
           | softmax: softmax(AB)C. In particular, it doesn't
           | necessarily look anywhere at all; it's just a weighted sum
           | of the inputs. Nothing like the hard attention used by the
           | human visual cortex. It's not even that different from a
           | convolution where you allow the weights to be a function of
           | the input.
           | 
           | So the inspiration might have come from humans, but the
           | actual architectures have largely come from pure trial and
           | error, with limited, difficult to explain intuition on what
           | tends to work.
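           | 
           | In code, that formula is just (the scaled dot-product
           | version; a generic sketch, not any particular model's
           | implementation):
           | 
           |     import numpy as np
           | 
           |     def softmax(z):
           |         e = np.exp(z - z.max(-1, keepdims=True))
           |         return e / e.sum(-1, keepdims=True)
           | 
           |     def attention(Q, K, V):
           |         d = K.shape[-1]
           |         return softmax(Q @ K.T / np.sqrt(d)) @ V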
        
             | xiphias2 wrote:
             | Actually self attention is a generalization of convolution:
             | 
             | https://openreview.net/pdf?id=HJlnC1rKPB
             | 
             | ,,This work provides evidence that attention layers can
             | perform convolution and, indeed, they often learn to do so
             | in practice. Specifically, we prove that a multi-head self-
             | attention layer with sufficient number of heads is at least
             | as expressive as any convolutional layer. Our numerical
             | experiments then show that self-attention layers attend to
             | pixel-grid patterns similarly to CNN layers, corroborating
             | our analysis.''
        
             | smallnamespace wrote:
             | How do we really know that brains use hard attention?
        
               | Straw wrote:
               | Eyes only have high resolution in a tiny spot.
        
           | Isinlor wrote:
           | As far as I'm aware, attention does not even attempt
           | biological plausibility, nor was it in any way inspired by
           | biology. The issues attention addresses are very specific
           | to the sequential nature of so-called Recurrent Neural
           | Networks. The first issue is known as exploding / vanishing
           | gradients: basically, as you keep multiplying some vector
           | with matrices you will either explode that vector to
           | infinity or squeeze it to zero, and the same happens with
           | derivatives. The second issue is that you cannot
           | parallelize a sequential operation. Attention addresses
           | these issues by removing recurrence, using a specific
           | invented mathematical structure. There was no name
           | for it, but attention gives good intuition for what that
           | mathematical structure is trying to do. Kind of like quantum
           | chromodynamics uses the term "colors" in a way that has
           | nothing to do with light, photons or even electromagnetic
           | force.
        
             | whymauri wrote:
             | >As far as I'm aware, attention does not even attempt
             | biological plausibility, nor was it in any way inspired by
             | biology.
             | 
             | It may not have been the intention, but associative memory
             | is one of the only mechanisms that computational
             | neuroscientists can agree on broadly. There's been recent
             | work on energy-based models that suggest biologically
             | plausible methods adjacent to attention. [0]
             | 
             | [0] https://arxiv.org/abs/2008.06996
        
       | how_strange wrote:
       | If, as in this paper, we allow ourselves to set the kernel after
       | seeing the data, then the statement in the title is trivial: if
       | my learning algorithm outputs function f, I can take the kernel
       | K(x,x')=f(x)*f(x').
       | 
       | The result is interesting insofar as the path kernel is
       | interesting, which requires some more thought.
        
         | moultano wrote:
         | If I'm understanding correctly, it doesn't just set the kernel
         | after seeing the data, but also after training the entire
         | model, because the path kernel can't be defined without the
         | optimization process to define the path.
         | 
         | I can't tell if this paper is a useful insight or not.
        
       ___________________________________________________________________
       (page generated 2020-12-06 23:00 UTC)