[HN Gopher] Every Model Learned by Gradient Descent Is Approxima...
___________________________________________________________________
Every Model Learned by Gradient Descent Is Approximately a Kernel
Machine

Author : scottlocklin
Score  : 282 points
Date   : 2020-12-05 14:40 UTC (1 day ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| scottlocklin wrote:
| Looking at you, deep learning.

| api wrote:
| There is probably a ton of isomorphism between different models.
| It may come down to what is easiest to understand and fastest to
| implement in code.

| segfaultbuserr wrote:
| See also:
|
| A visual proof that neural networks can approximate any function
|
| https://news.ycombinator.com/item?id=19708620

| api wrote:
| So a "neural network" is actually a type of parameterized
| mathematical function that can be fit to any curve, including
| higher dimensional surfaces, etc.?

| laingc wrote:
| Yes.

| xksteven wrote:
| That's correct. See
| https://en.wikipedia.org/wiki/Universal_approximation_theore...
| for more details.

| wongarsu wrote:
| A linear SVM can in turn be expressed as a very shallow neural
| network. The main difference is that with SVMs you put all your
| effort into transforming inputs for the model (e.g. all the
| popular kernels), while with neural networks usually most of the
| effort goes into clever model architectures.

| Anka33 wrote:
| It's just a bunch of if clauses.

| GlenTheMachine wrote:
| I'm confused by findings like this one, and I'm hoping someone
| here can educate me.
|
| There are many known universal approximators. Deep networks are
| one. SVMs are one. Heck, cubic splines are one, and they've been
| in use for nearly a hundred years IIRC.
|
| The problem has never been one of finding a sufficiently powerful
| approximator. It has been training that approximator. My
| understanding of the significant advancement made by deep
| learning is that we finally figured out how to train a specific
| kind of universal approximator in a way such that it finds very
| good separation surfaces for what used to be impossible-to-solve
| classification problems.
|
| But it should be no surprise to anyone that there exist, in
| theory, other universal approximators that approximately
| reproduce the same separation surfaces, should it? I'd expect
| _any_ universal approximator to be powerful enough to reproduce
| the separation surfaces; hence the meaning of the word
| "universal". The problem was always finding the right weights,
| not finding the right approximator architecture.
|
| Am I missing something?

| xksteven wrote:
| My understanding of the paper was that it shows "how" they're
| equivalent, as opposed to how to actually construct such an
| approximator or learn it.
|
| Similar to showing a problem falls in NP: you can reduce the
| problem down to another problem in NP and be done with it.
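To make the universal-approximation point in this subthread concrete,
here is a minimal toy sketch (illustrative code only, not from the
paper or the linked proof): a one-hidden-layer tanh network fit to a
sine curve with plain full-batch gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
    y = np.sin(x)

    H = 50                                    # hidden width
    W1 = rng.normal(0.0, 1.0, (1, H)); b1 = np.zeros(H)
    W2 = rng.normal(0.0, 0.1, (H, 1)); b2 = np.zeros(1)
    lr = 0.05

    for step in range(5000):
        h = np.tanh(x @ W1 + b1)              # hidden activations
        pred = h @ W2 + b2
        err = pred - y
        # backprop for mean squared error
        gW2 = h.T @ err / len(x); gb2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h ** 2)
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2

    print("final MSE:", float(((pred - y) ** 2).mean()))

A wide enough hidden layer can drive this error arbitrarily low, which
is the universal approximation theorem in action; as the rest of the
thread points out, the hard part in practice is the training, not the
existence result.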
| sdenton4 wrote:
| Agree, but also think the result may be too general to be useful.
| Proving that you can rewrite any network learned with gradient
| descent this way kinda suggests that the architecture doesn't
| matter, but we know that's not true. Eg, why are networks with
| skip connections SO much better than networks without? What about
| batch normalization? This makes me suspicious that it's a nice
| theoretical result a bit too high level to be useful. Yes, it was
| proved years ago that you can fit an arbitrary function with a
| wide enough two-layer net, but it's not a terribly practical way
| to approach the world. Now we have architectures much better than
| two-layer networks, and, for that matter, SVMs.
|
| There are a number of problems with SVMs; complexity for training
| and inference scales with the amount of training data, which is
| pretty sad panda for complex problems.
|
| Extremely spicy/cynical take: it's not cool to say "you all
| should go look at all these possible applications" when the
| thrust of the paper is to prop up the relevance of an obsolete
| approach. You gotta do the actual work to close the gap if you
| still want your PhD to be worth something...
|
| That said, I haven't read the paper terribly closely, and am
| always happy to be proven wrong!

| runT1ME wrote:
| > suggests that the architecture doesn't matter, but we know
| that's not true. Eg, why are networks with skip connections SO
| much better than networks without? What about batch
| normalization?
|
| Is this true though, or does network architecture only matter in
| terms of _efficiency_? This is non-rhetorical, I really don't
| know much about deep learning. :) I guess I'm asking: with enough
| data and compute, is architecture still relevant?

| sdenton4 wrote:
| These things matter a lot in practice. Imagine a giant million-
| dimensional loss surface, where each point is a set of weights
| for the model. Then the gradient is pushing us around on this
| surface, trying to find a 'minimum.' Current understanding (for a
| while, actually) is that you never really hit minima so much as
| giant mostly-flat regions where further improvement maybe takes a
| million years. The loss surfaces for models with skip connections
| seem to be much, much nicer.
|
| https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067...
|
| In effect, there's a big gap between an existence proof and
| actually workable models, and the tricks of the trade do quite a
| lot to close the gap. (And there are almost certainly more tricks
| that we're still not aware of! I'm still amazed at how late in
| the game batch normalization was discovered.)
|
| OTOH, so long as you're using the basic tricks of the trade, IME
| architecture doesn't really matter much. Our recent kaggle
| competition for birdsong identification was a great example of
| this: pretty much everyone reported that the difference between
| five or so 'best practices' feature extraction architectures
| (various permutations of resnet/efficientnet) was negligible.

| mycall wrote:
| > why are networks with skip connections SO much better than
| networks without?
|
| What are the leading theories for why this seems to be the case?
| Fewer nodes to capture and direct decisions?

| sdenton4 wrote:
| Oh, there's plenty of good explanation in the neural network
| literature (my eli5: the skip connections make the default
| mapping an identity instead of a zero mapping; you can start by
| doing no harm, and improve from there). The method was suggested
| by knowledge from differential equations. All I'm saying is that
| the "everything is secretly an SVM" viewpoint is probably too
| coarse to explain these interesting and effective structural
| differences.
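That eli5 can be shown in a few lines (a toy illustration of my own,
not from the thread): with the residual branch's weights at zero, a
skip-connection block computes exactly the identity, while a plain
block computes the zero map.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    x = rng.normal(size=(4, d))
    W = np.zeros((d, d))                   # residual branch initialized at zero

    def plain_block(x):
        return np.maximum(x @ W, 0.0)      # zero map at init: wipes out the input

    def residual_block(x):
        return x + np.maximum(x @ W, 0.0)  # identity at init: "does no harm"

    print(np.allclose(plain_block(x), 0.0))   # True
    print(np.allclose(residual_block(x), x))  # True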
| nightski wrote:
| I'd be curious if re-framing a trained neural network model as an
| SVM gives you insight into its support vectors, and maybe a
| little understanding of why the NN works the way it does?

| riku_iki wrote:
| > It has been training that approximator.
|
| There is also the question of how the approximator itself is
| represented, and of data compression. For cubic splines to
| approximate some NN, you would likely need an enormous number of
| intervals covering the input space.

| VHRanger wrote:
| Also, cubic splines don't extrapolate like the other mentioned
| models.

| make3 wrote:
| An issue with deep neural networks is that they are very hard to
| analyse from a theoretical mathematical perspective, to prove
| things about them.
|
| Kernels have been studied thoroughly from a theoretical
| perspective and people have proven things about them.
|
| The goal of papers such as these is to find ways in which deep
| neural networks and kernel methods are similar, so that
| theoretical results and tools found for kernel methods may be
| adapted to deep neural networks.

| talolard wrote:
| 50/50 I can shed some light or am way over my head:
|
| The point of the paper is that if you train a deep learning model
| with gradient descent then the resulting model is effectively a
| kernel machine, regardless of model architecture.
|
| The nice thing about a kernel machine is that it is simple (just
| one hidden layer) and we are able to analyze a kernel machine
| more effectively and conveniently.
|
| So, I think the contribution here isn't "these sets of universal
| approximators are equivalent" but rather "We have this effective
| but opaque deep learning thing; it turns out it's actually a
| kernel machine in retrospect, so we can bring 'kernel tooling' to
| analyze the deep learning model."

| sebasv_ wrote:
| Finding the right architecture, or more generally the right
| model, is very much still the main problem.
|
| You should be careful with the meaning you ascribe to the word
| 'universal'. The list of universal approximators is massive, and
| the sub-list of universal approximators that can be trained with
| OLS is still substantial. Still, these models can differ
| significantly:
|
| - How efficient are they (in #parameters required for a certain
| error) for specific tasks? There is a known 'maximum efficiency'
| for general tasks, but in high dimensions this efficiency is
| terrible, such that many models will fail terribly on high-
| dimensional data. Hence, you should pick a model that is
| exceptionally good for a specific task, although it might be less
| efficient for other tasks.
|
| - How well can the model cope with noise? If your dependent
| variable is severely distorted (think financial data) then you
| need a model that can balance between interpolating datapoints
| and averaging out the noise.
|
| Just to name my two favorite properties. The first one is _kind
| of_ related to learnability, since an inefficient model is often
| pretty much impossible to learn.

| cs702 wrote:
| "Here we show that every model learned by this method [SGD],
| regardless of architecture, is approximately equivalent to a
| kernel machine [i.e., a support vector machine or SVM] with a
| particular type of kernel" -- a type of kernel which Domingos,
| the author, calls a "path kernel."
|
| As defined in the paper, a "path kernel" measures, for any two
| data points, how similarly a model varies (specifically, how
| similarly the model's gradients change) at those two data points
| during training via SGD. This isn't exactly your usual, plain-
| vanilla, radial-basis type of kernel.
|
| We've known for a long time that SVMs are universal
| approximators, i.e., in theory they can approximate any target
| function. The importance of this work is that it has found a new,
| surprising, deep connection between _any_ model trained via SGD
| and SVMs, which are well understood :-)

| nuclearnice1 wrote:
| Great explanation of the intuitive understanding of the path
| kernel, which seems to be the main takeaway from this paper.
|
| One minor technical correction: the proof relies on continuous
| gradient flow, not SGD. So it's proven for GD and likely true for
| SGD, and your intuitive explanation likely still holds, but it's
| not obvious.

| QuesnayJr wrote:
| They sketch the argument for SGD, but they don't know if it
| actually holds (see Remark 5 in the paper).

| cs702 wrote:
| You're absolutely right. I substituted SGD for GD without giving
| it any thought because everyone uses SGD!
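A rough numerical sketch of that "gradient similarity" idea (my own
toy construction, simplified from the paper's definition): accumulate,
along the gradient-descent path, the dot products between the model's
gradients at pairs of training points. Up to normalization, the
paper's path kernel is this quantity in the continuous-time limit.

    import numpy as np

    rng = np.random.default_rng(0)

    # tiny model: f_w(x) = w1 * tanh(w0 * x)
    def f(w, x):
        return w[1] * np.tanh(w[0] * x)

    def grad_f(w, x):                       # d f_w(x) / d w, per input point
        t = np.tanh(w[0] * x)
        return np.stack([w[1] * (1 - t ** 2) * x, t])

    X = np.array([-2.0, -0.5, 0.5, 2.0])
    Y = np.sin(X)
    w = rng.normal(size=2)
    lr = 0.01

    K = np.zeros((len(X), len(X)))          # accumulated "path kernel"
    for step in range(2000):
        G = grad_f(w, X)                    # (2, n): model gradients at each point
        K += lr * (G.T @ G)                 # tangent kernel now, summed over the path
        residual = f(w, X) - Y
        w -= lr * G @ residual / len(X)     # GD step on mean squared error

    print(np.round(K, 2))   # large entries = points whose gradients stayed similar

Note how the kernel is only available after training: it is built from
the optimization trajectory itself, which is the point several
commenters below make.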
| edwardjhu wrote:
| The claim here is a bit misleading, as already pointed out by
| other comments, since the kernel is an evolving one that is
| essentially learned after seeing the data.
|
| Contrary to many related works that compare wide neural networks
| to kernel methods, our recent work shows that one can study a
| feature-learning infinite-width limit with a realistic learning
| rate.
|
| https://arxiv.org/abs/2011.14522
|
| We identified what separates the kernel regime (e.g., NTK) from
| the feature learning regime. In the infinite-width limit, OP's
| work could belong to either regime depending on the
| parametrization, i.e. the path kernel is either equal to the NTK
| or performing feature learning.
|
| It's an incredibly interesting research topic. Please feel free
| to comment with thoughts on our work :)

| bicepjai wrote:
| Watch your stock, NVIDIA :)

| rlili wrote:
| Why, aren't GPUs pretty useful in training SVMs as well?

| abeppu wrote:
| Well, and though the author shows an approximate equivalence,
| which helps us understand a class of models better, it's not
| obvious that it's preferable to use SVMs of the type described.
| In particular, it often seems preferable to deal with model
| weights (even if they are mathematically a "superposition" of
| datapoints) than to ship around and revisit the whole dataset.

| xksteven wrote:
| SVMs are solved via convex optimization methods, which have taken
| more time to get on the GPU train.
|
| On the other hand, there are GPU-accelerated SVM trainers such
| as: https://github.com/Xtra-Computing/thundersvm
|
| A GitHub or Google search will reveal other GPU-accelerated SVM
| implementations.

| knuthsat wrote:
| SVMs can also be trained using gradient descent.

| abeppu wrote:
| From a skim, I think one asterisk which may be a gap between the
| claim in the title and what's actually shown in the paper is that
| the theorem focuses on gradient descent-trained models which
| minimize a loss function that is the sum of losses L(y_i, y*_i)
| on points from a given dataset. While that's clearly very broad,
| I think it _doesn't_ include things like GANs, where parts of the
| model produce fake data to train against.

| Flashtoo wrote:
| The claim also applies to GANs, as you can simply use a masking
| function to indicate which inputs were used for each optimization
| timestep, like the author suggests for stochastic gradient
| descent in Remark 5.

| api wrote:
| Another thought: are neural nets just a weird analog / floating
| point kind of probabilistic data structure or lossy compression
| algorithm?

| sarosh wrote:
| See also W. Brendel and M. Bethge, "Approximating CNNs with
| bag-of-local-features models works surprisingly well on
| ImageNet,"
| https://arxiv.org/abs/1904.00760 (which is referenced in this
| paper), which explains: "[t]his suggests that the improvements of
| DNNs over previous bag-of-feature classifiers in the last few
| years is mostly achieved by better fine-tuning rather than by
| qualitatively different decision strategies."

| stosto88 wrote:
| Pretty sure it means a complex algorithm can be approximated by a
| super simple model.

| Iv wrote:
| Thing is, gradient descent is not really a complex algorithm.

| 6gvONxR4sf7o wrote:
| This is a really cool new bit of intuition:
|
| > A key property of path kernels is that they combat the curse
| of dimensionality by incorporating derivatives into the kernel:
| two data points are similar if the candidate function's
| derivatives at them are similar, rather than if they are close in
| the input space. This can greatly improve kernel machines'
| ability to approximate highly variable functions (Bengio et al.,
| 2005). It also means that points that are far in Euclidean space
| can be close in gradient space, potentially improving the ability
| to model complex functions. (For example, the maxima of a sine
| wave are all close in gradient space, even though they can be
| arbitrarily far apart in the input space.)

| FRGabriel wrote:
| So in the end, it rephrases a statement from "Neural Tangent
| Kernel: Convergence and Generalization in Neural Networks"
| [https://arxiv.org/abs/1806.07572], and in a way which is kind of
| misleading.
|
| The assertion has been known to the community since at least
| 2018, if not well before.
|
| I find this article and the buzz around it a little awkward.

| QuesnayJr wrote:
| To be fair, the article you cite is from 2018, and the OP is from
| 2012 (if the arXiv dates are right).

| ajtulloch wrote:
| The OP article was submitted on Mon, 30 Nov 2020 23:02:47 - the
| arXiv identifier 2012.xyz means it was published in the 12th
| month of 2020, not 2012.

| QuesnayJr wrote:
| Oh God, that's embarrassing. I really thought that's how arXiv
| identifiers worked, and have been under the wrong impression for
| some time.

| [deleted]

| screye wrote:
| I do have a tangential technical question for someone who knows
| more math than I do.
|
| A kernel SVM always finds (one of) the _global best fit lines_
| in the kernel space.
|
| A gradient descent model explicitly converges to one of the
| nearest _local minimas_ by definition.
|
| Does this paper conclude that the local minimas that neural
| networks converge to are one of the many equivalent global
| maxima? Won't this be a major revelation by itself?

| dumb1224 wrote:
| I don't have a strong math background, but I think during the
| optimisation process of finding the hyperplane, the solver
| (algorithm that attempts to find the best separating hyperplane)
| uses a soft margin to allow misclassified instances. Its
| tolerance is controlled by a hyper-parameter, so it will
| compromise to find the best fit within the set parameter. So it
| is a 'best solution' with a condition. However, there are many
| variations of implementations from different solvers to handle
| it.
|
| Example:
| https://towardsdatascience.com/support-vector-machine-simply...

| altarius wrote:
| I don't think the output of the conversion is guaranteed to be
| equivalent to a hyperplane learned by an SVM.
|
| I didn't have time to read the paper, but reading the abstract I
| don't see a claim that the gradient descent model approximated by
| a kernel machine is equivalent to an optimal fit obtained by SVM
| maximum-margin hyperplane fitting.
|
| I assume one likely ends up with different hyperplane fits from
| converting a NN/gradient-descent-learned model to a kernel
| machine vs learning a kernel machine directly via SVM learning.

| edjrage wrote:
| Nitpick: _Minima_ is already plural.

| kaczordon wrote:
| Doesn't gradient descent use a convex cost function so that it
| always finds a global minimum?

| wxnx wrote:
| No, not necessarily. The objective functions used to train
| neural networks are generally non-convex (the nets themselves
| being non-convex as well), but are traditionally trained using
| stochastic gradient descent (and its variants).

| mrtranscendence wrote:
| This really threw me off when I first started learning about
| ANNs, coming from a traditional econometrics background. B-but
| ... the parameters aren't identified!

| wxnx wrote:
| In the context of this paper, you can think of the "gradient
| descent step" as optimizing the parameters of the kernel (i.e.
| the parameters that generate the kernel space). There are no
| explicit optimality guarantees beyond those of standard gradient
| descent.
|
| The "SVM step" would still find a global optimum within the
| kernel space, but the qualifications of the previous step mean
| that the kernel space generated might be useless.

| andreareina wrote:
| Isn't one of the features of high-dimensional spaces that local
| minima are rare and there's usually _some_ direction that slopes
| towards a lower loss?

| dragontamer wrote:
| I mean, obviously not in the general case. Cryptography is
| designed to be a high-dimensional space with pretty much no
| slope.
|
| I'd assume that a lot of the binary decision tree / chip-level
| optimizations are similar: almost no slope worth analyzing.

| sp332 wrote:
| Sure, and unbreakable crypto is notoriously difficult to make. I
| wouldn't expect a situation like that to come up in a real-world
| problem.

| dragontamer wrote:
| Binary decision trees are basically superoptimizers: minimizing
| the number of gates needed to make a chip's logic (multiply
| circuits or whatever).
|
| > Sure, and unbreakable crypto is notoriously difficult to make
|
| Not really. It's the unbreakable AND efficient requirement
| combined that's hard.
|
| Just run anything over a random xor-shift-add kernel about 1
| million times and it's unbreakable (so long as xor / shift forms
| a bijection). It's just inefficient.
|
| Finding the smallest number of iterations that works is the hard
| part. Usually, people try to break it (ex: AES was originally
| broken over 4 iterations), then double or triple your best
| attempt, and you are set.
|
| AES is now broken over 5 iterations IIRC, but still a long way
| to go to break full 8-iteration AES.
|
| More modern ciphers (SHA512) are like 80 iterations of a simpler
| kernel.
|
| Even 'broken' ciphers like TEA or RC4 are hard to break in
| practice if you just increase the iteration count up the wazoo.
| The problem is: AES gets to security in just 8 iterations. So to
| be better than AES requires both efficiency AND security. After
| all, 100 iterations of AES is going to be more secure if you
| don't care about efficiency. (10x more iterations than the
| default spec)
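Returning to the convexity question above: a tiny numeric illustration
(toy code of my own, not from the thread) of why neural network losses
are non-convex. Sign-flipping the parameters of a small net yields an
identical function, hence identical loss, but the midpoint between the
two parameter settings is strictly worse, which a convex loss cannot
do.

    import numpy as np

    X = np.linspace(-2, 2, 50)
    Y = np.sin(X)

    def loss(w):                          # MSE of f_w(x) = w1 * tanh(w0 * x)
        pred = w[1] * np.tanh(w[0] * X)
        return float(((pred - Y) ** 2).mean())

    w_a = np.array([1.0, 1.0])
    w_b = -w_a                            # sign-flip symmetry: same function
    mid = (w_a + w_b) / 2                 # the zero vector

    print(loss(w_a), loss(w_b))           # equal
    print(loss(mid))                      # larger than both -> not convex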
| burlesona wrote:
| This sounds interesting, but a little over my head. Can anyone
| offer an explanation for a software engineer with no AI/ML
| background?

| moralestapia wrote:
| Not an explanation, but a benefit could be that SVMs can be
| evaluated much faster and are more explainable (* Citation
| needed, I know).

| eugenhotaj wrote:
| Don't kernel SVMs need a full pass through the data they were
| trained on to make predictions? How is that faster?

| cscheid wrote:
| No, they require a full pass over the _support vectors_, which
| are potentially a much smaller set. (That's part of why everyone
| was so excited about SVMs when they were invented.) The support
| vectors are the training values with nonzero hinge loss, or
| alternatively, training values sufficiently close to the
| decision boundary.

| eugenhotaj wrote:
| Fair enough, but the number of support vectors for nontrivial
| problems is still pretty large (as I understand, but could be
| wrong), e.g. 20-30% of the dataset. Having to iterate over 30%
| of, say, ImageNet on each batch of predictions seems infeasible.

| alexilliamson wrote:
| You only need the "Support Vectors" to make predictions, not the
| whole dataset.

| somurzakov wrote:
| Neural nets at the same time require multiple passes through the
| data (epochs). If we can train a model in one epoch instead of
| 10000 epochs, that's a breakthrough!

| sdenton4 wrote:
| Epochs are more about the training data than the model... If
| you've got a big enough dataset, one epoch or less is fine!

| eugenhotaj wrote:
| True, but it sounds like you're just shifting computation from
| training to inference. And I'm not sure that's a very good
| trade-off to make; you're likely to predict on much more data
| than you trained on (e.g. ranking models at Google, FB, etc).

| somurzakov wrote:
| Not sure I get your point; both DNNs and SVMs require one
| forward pass for inference, so there is no difference. If an SVM
| model can converge in one epoch, how is that not more efficient
| than the status quo with DNNs?

| eugenhotaj wrote:
| For kernel SVMs, one needs to keep around part of the training
| data (the support vectors), right? With DNNs, after training,
| all you need are the model parameters. For very large datasets,
| keeping around even a small part of your training data may not
| be feasible.
|
| Furthermore, the number of parameters does not (necessarily)
| grow with the size of the training data; the parameters can be
| reused if you get more data, and can be quantized/pruned/etc.
| There's not really an easy way to do these things with SVMs as
| far as I understand.

| peteretep wrote:
| "You don't need a fancy model, you can just find a similar
| example directly from the training data and get similar
| results."

| ironSkillet wrote:
| If you look at how they actually "translate" the fancy model to
| the simple one, it requires fully fitting the original model
| (and keeping track of the evolution of gradients over the
| training). So it wouldn't make training more efficient, but
| perhaps it would be useful in inference or in probing the
| characteristics of the original model.

| lumost wrote:
| This has always anecdotally appeared to be the case when
| investigating the predictions of neural nets. Particularly when
| it comes time to answer the question "what does this model not
| handle?"

| smallnamespace wrote:
| Defining 'similar' robustly is the meat of the problem, and it's
| what we're finding deep NNs do well.
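To make the inference-cost point above concrete, here is a schematic
kernel-SVM decision function (my own toy sketch; the support vectors,
coefficients, and kernel parameters are made-up values, not from any
trained model). Prediction touches only the stored support vectors,
never the full training set.

    import numpy as np

    def rbf(a, b, gamma=0.5):                        # a standard RBF kernel
        return np.exp(-gamma * np.sum((a - b) ** 2))

    # after training, a kernel SVM keeps only these -- typically a small
    # fraction of the training set (illustrative values below)
    support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
    alphas = np.array([0.7, 0.9, 0.4])               # learned dual coefficients
    labels = np.array([1.0, -1.0, 1.0])
    b = 0.1

    def predict(x):
        score = sum(a * y * rbf(sv, x)
                    for a, y, sv in zip(alphas, labels, support_vectors)) + b
        return np.sign(score)

    print(predict(np.array([0.5, 0.8])))   # cost ~ #support vectors, not #training points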
| [deleted]

| hansvm wrote:
| Part of the reason this is interesting is that deep learning has
| to have some kind of inductive bias to perform as well as we've
| seen (given a learning problem, it's biased toward learning in
| particular ways). In general though, a neural network can
| approximate _any_ function, so reining in that complexity and
| uncovering which functions are actually learnable (or
| efficiently learnable) by deep learning is an important research
| direction. This paper says that the functions uncovered by deep
| learning (with caveats) are precisely those which are close to
| functions represented by a different learning technique, which
| is notable because this new class of functions is not "all
| functions" and because the characterization is explainable in
| some sense, giving insight into how deep learning works. That
| connection also winds up having a bunch of other interesting
| implications that somebody else can cover if they'd like.

| ajtulloch wrote:
| For everyone saying "oh wow, we can go back to SVMs now, huge
| speedups, etc" - that's more than a little premature. This is
| purely a formal equivalence, and not very useful computationally.
|
| The math here is pretty much first-year undergraduate-level
| calculus, and it's worth going through Section 2 since it's
| quite clearly written (thanks Prof Domingos).
|
| Essentially, what the author does is show that any model trained
| with "infinitely-small step size full-batch gradient descent"
| (i.e. a model following a gradient flow) can be written in a
| "kernelized" form
|
|     y(x) = \sum_{(x_i, y_i) \in L} a_i K(x, x_i) + b.
|
| The intuition most people have for SVMs is that the constants
| a_i are, well, _constant_, that the a_i are sparse, and that the
| kernel function K(x, x_i) is cheap to compute (partly why it's
| called the 'kernel trick').
|
| However, none of those properties apply here, which is why this
| isn't particularly exciting computationally. The "trick" is that
| both a_i and K(x, x_i) involve path integrals along the gradient
| flow for the input x, and so doing a single inference is
| approximately as expensive as training the model from scratch
| (if I understand this correctly).

| pietroppeter wrote:
| It is worth noting that the author also wrote a very nice
| popular machine learning book, "The Master Algorithm".
|
| https://homes.cs.washington.edu/~pedrod/

| hobofan wrote:
| Is this a new finding? I'm not an expert on the deeper
| mathematical side of ML, but I remember a friend of mine already
| telling me about something that sounded exactly like this (ca.
| 2016-17).

| ironSkillet wrote:
| The fact that kernel machines can approximate other models isn't
| new. I think the novel idea is that they have explicitly
| constructed the "translation" from the arbitrary model to the
| kernel machine, and it's quite clean.
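A toy numerical check of the construction ajtulloch describes (my own
code, a first-order approximation only, not the paper's exact
derivation): for a small enough step, one full-batch GD step changes
the model's output at a new point by a tangent-kernel-weighted sum
over the training points. Integrating that relation along the whole
training path is what produces the paper's a_i and path kernel K.

    import numpy as np

    rng = np.random.default_rng(1)

    def f(w, x):                    # tiny 3-parameter model
        return w[2] * np.tanh(w[0] * x + w[1])

    def grad_f(w, x):               # d f_w(x) / d w
        t = np.tanh(w[0] * x + w[1])
        return np.array([w[2] * (1 - t ** 2) * x, w[2] * (1 - t ** 2), t])

    X = np.array([-1.0, 0.3, 1.2]); Y = np.array([0.5, -0.2, 0.9])
    w = rng.normal(size=3)
    lr = 1e-4                       # tiny step: close to the gradient-flow regime

    x_new = 0.7
    residual = f(w, X) - Y          # dL/df for 0.5 * squared error
    w_next = w - lr * grad_f(w, X) @ residual   # one full-batch GD step

    actual = f(w_next, x_new) - f(w, x_new)
    # kernel view: -lr * sum_i residual_i * <grad f(x_new), grad f(x_i)>
    tangent_k = grad_f(w, x_new) @ grad_f(w, X)
    predicted = -lr * residual @ tangent_k

    print(actual, predicted)        # agree to first order in the step size

This also illustrates the inference-cost caveat: recovering y(x) in
kernelized form means replaying gradients along the entire training
path.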
| Straw wrote:
| The kernel they find is a function of the gradient descent path,
| which is a function of the data. So no, it's nothing at all like
| a normal kernel machine, where we pick the kernel before seeing
| the data.
|
| It also only applies to the continuous limit of non-stochastic
| GD, far from the real training methods used.
|
| We don't gain any understanding either; understanding implies
| predictive power about some new situation, and I don't see any,
| nor does the paper suggest one.
|
| Looks like yet another attempt to attract attention by
| "understanding" NNs. Look, humans can't explain or understand
| how we drive, speak, translate, play chess, etc, so why should
| we expect to understand how the models that do these things
| work? Of course, we can understand the principles of the
| training process, and in fact we already do: the theory of SGD
| is well understood.

| xiphias2 wrote:
| "Look, humans can't explain or understand how we drive, speak,
| translate, play chess, etc, so why should we expect to
| understand how the models that do these things work?"
|
| I agree with you, but it's also amazing how much DeepMind has
| achieved by putting neuroscientists and machine learning experts
| in the same room, and trying to make systems that work inside
| the human brain work efficiently on metal.
|
| If you look at this talk from 2010, Demis was already listing
| attention as an example (it was responsible for the recent
| improvement in protein folding prediction, for instance):
|
| https://www.youtube.com/watch?v=F5PSyu7booU

| Straw wrote:
| Absolutely, modern NN architectures have been inspired by
| biological ones, despite their massive differences.
|
| Even in cases like attention, the modern version (the one that
| actually works in GPT-3, AlphaFold2, etc.) has little in common
| with either the English word or what we think of as attention.
| It's a formula with two matmuls and a softmax: softmax(AB)C. In
| particular, it doesn't necessarily look anywhere at all - it's
| just a weighted sum of the inputs. Nothing like the hard
| attention used by the human visual cortex. It's not even that
| different from a convolution where you allow the weights to be a
| function of the input.
|
| So the inspiration might have come from humans, but the actual
| architectures have largely come from pure trial and error, with
| limited, difficult-to-explain intuition on what tends to work.

| xiphias2 wrote:
| Actually, self-attention is a generalization of convolution:
|
| https://openreview.net/pdf?id=HJlnC1rKPB
|
| "This work provides evidence that attention layers can perform
| convolution and, indeed, they often learn to do so in practice.
| Specifically, we prove that a multi-head self-attention layer
| with sufficient number of heads is at least as expressive as any
| convolutional layer. Our numerical experiments then show that
| self-attention layers attend to pixel-grid patterns similarly to
| CNN layers, corroborating our analysis."

| smallnamespace wrote:
| How do we really know that brains use hard attention?

| Straw wrote:
| Eyes only have high resolution in a tiny spot.

| Isinlor wrote:
| As far as I'm aware, attention does not even attempt biological
| plausibility, nor was it in any way inspired by biology. The
| issues attention addresses are very specific to the sequential
| nature of so-called Recurrent Neural Networks. The first issue
| is known as exploding / vanishing gradients: basically, as you
| keep multiplying some vector with matrices, you will either
| explode that vector to infinity or squeeze it to zero, and the
| same happens with derivatives. The second issue is that you
| cannot parallelize a sequential operation. Attention addresses
| these issues by removing recurrence, using a specific invented
| mathematical structure. There was no name for it, but
| "attention" gives good intuition for what that mathematical
| structure is trying to do. Kind of like quantum chromodynamics
| uses the term "colors" in a way that has nothing to do with
| light, photons or even the electromagnetic force.
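For reference, the softmax(AB)C formula Straw mentions above, written
out as a toy numpy sketch (my own simplification; real implementations
add 1/sqrt(d) scaling, masking, and multiple heads):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # softmax(QK^T)V: each output row is a weighted average of rows of V
        weights = softmax(Q @ K.T)                # rows sum to 1
        return weights @ V

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8))
    K = rng.normal(size=(6, 8))
    V = rng.normal(size=(6, 8))
    print(attention(Q, K, V).shape)               # (4, 8): weighted sums of inputs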
| whymauri wrote:
| > As far as I'm aware, attention does not even attempt
| biological plausibility, nor was it in any way inspired by
| biology.
|
| It may not have been the intention, but associative memory is
| one of the only mechanisms that computational neuroscientists
| can agree on broadly. There's been recent work on energy-based
| models that suggests biologically plausible methods adjacent to
| attention. [0]
|
| [0] https://arxiv.org/abs/2008.06996

| how_strange wrote:
| If, as in this paper, we allow ourselves to set the kernel after
| seeing the data, then the statement in the title is trivial: if
| my learning algorithm outputs function f, I can take the kernel
| K(x, x') = f(x) * f(x').
|
| The result is interesting insofar as the path kernel is
| interesting, which requires some more thought.

| moultano wrote:
| If I'm understanding correctly, it doesn't just set the kernel
| after seeing the data, but also after training the entire model,
| because the path kernel can't be defined without the
| optimization process to define the path.
|
| I can't tell if this paper offers a useful insight or not.
___________________________________________________________________
(page generated 2020-12-06 23:00 UTC)