[HN Gopher] Gradient Descent Models Are Kernel Machines
       ___________________________________________________________________
        
       Gradient Descent Models Are Kernel Machines
        
       Author : dilap
       Score  : 77 points
       Date   : 2021-02-08 19:41 UTC (3 hours ago)
        
 (HTM) web link (infoproc.blogspot.com)
 (TXT) w3m dump (infoproc.blogspot.com)
        
       | scythmic_waves wrote:
       | The paper discussed here showed up on reddit a few months back
       | [1]. Another paper appeared shortly after claiming the exact
       | opposite [2]. Some discussion of this contradiction can be found
       | in this part of the thread: [3].
       | 
       | I myself am very interested in [2]. It's fairly dense, but I've
       | been meaning to go through it and the larger Tensor Programs
       | framework ever since.
       | 
       | [1]
       | https://www.reddit.com/r/MachineLearning/comments/k7wj5s/r_e...
       | 
       | [2]
       | https://www.reddit.com/r/MachineLearning/comments/k8h01q/r_w...
       | 
       | [3]
       | https://www.reddit.com/r/MachineLearning/comments/k8h01q/r_w...
        
         | Grimm1 wrote:
         | Thank you! Number 2 was exactly what I was looking for and I
         | just couldn't find the link.
        
       | ur-whale wrote:
       | > Gradient Descent Models Are Kernel Machines
       | 
       | ... that also happen to actually work.
        
       | tlb wrote:
       | If you're surprised by this result because you've used kernel
       | machines and didn't find them very good at generalization, keep
       | in mind that it assumes a kernel function that accurately
       | reflects the similarity of input samples. Most work with kernel
       | machines just uses Euclidean distance. For image recognition,
       | for instance, the kernel would have to rate two images of dogs
       | as more similar to each other than an image of a dog is to an
       | image of a cat.
       | 
       | With a sufficiently magical kernel function, indeed you can get
       | great results with a kernel machine. But it's not so easy to
       | write a kernel function for a domain like image processing, where
       | shifts, scales, and small rotations shouldn't affect similarity
       | much. Let alone for text processing, where it should recognize 2
       | sentences with similar meaning as similar.
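       | 
       | To make the "just uses Euclidean distance" case concrete, here's
       | a minimal sketch of a kernel machine (kernel ridge regression
       | with an RBF kernel over Euclidean distance); plain NumPy, with
       | made-up data and hyperparameters, not anything from the paper:
       | 
       |   import numpy as np
       | 
       |   def rbf_kernel(A, B, gamma=1.0):
       |       # similarity = exp(-gamma * squared Euclidean distance)
       |       d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
       |       return np.exp(-gamma * d2)
       | 
       |   rng = np.random.default_rng(0)
       |   X = rng.normal(size=(50, 3))      # training inputs
       |   y = np.sin(X).sum(axis=1)         # training targets
       | 
       |   # fit dual weights: (K + lambda*I) alpha = y
       |   K = rbf_kernel(X, X)
       |   alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)
       | 
       |   # prediction is a weighted sum of similarities to training
       |   # points: f(x) = sum_i alpha_i * k(x, x_i)
       |   X_new = rng.normal(size=(5, 3))
       |   y_pred = rbf_kernel(X_new, X) @ alpha
       | 
       | The whole model lives in that fixed kernel; the point above is
       | that for images or text, Euclidean distance is a poor stand-in
       | for the similarity you actually care about.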
        
         | yudlejoza wrote:
         | I may or may not be surprised by the result, but I'm
         | definitely not surprised by yet another 'thing A is thing B'
         | claim in machine learning.
         | 
         | Every Tom, machine-learner, and Harry is an expert at proving
         | to the whole world that thing A is thing B. The only problem
         | is that nobody hires them at a million dollars a year in total
         | compensation.
        
           | memming wrote:
           | But the converse is not true in this case, so still
           | interesting.
        
         | mywittyname wrote:
         | >With a sufficiently magical kernel function, indeed you can
         | get great results with a kernel machine. But it's not so easy
         | to write a kernel function for a domain like image processing,
         | where shifts, scales, and small rotations shouldn't affect
         | similarity much. Let alone for text processing, where it should
         | recognize 2 sentences with similar meaning as similar.
         | 
         | I think the key issue at hand is that a model trained by
         | gradient descent is easier to fit than one built around a
         | hand-crafted kernel function. Someone could absolutely devise
         | a mechanism for backpropagating errors through a parameterized
         | kernel, but at that point it is basically a neural network.
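         | 
         | A minimal sketch of that last point (PyTorch, made-up shapes
         | and hyperparameters, purely illustrative): make the kernel an
         | inner product of a learned feature map and backpropagate
         | through it, and the "kernel machine" is effectively a small
         | neural network.
         | 
         |   import torch
         |   import torch.nn as nn
         | 
         |   # k_theta(x, x') = <phi_theta(x), phi_theta(x')>
         |   phi = nn.Sequential(nn.Linear(3, 32), nn.ReLU(),
         |                       nn.Linear(32, 16))
         | 
         |   def kernel(A, B):
         |       return phi(A) @ phi(B).T   # learned similarity
         | 
         |   X = torch.randn(50, 3)
         |   y = torch.sin(X).sum(dim=1)
         | 
         |   opt = torch.optim.SGD(phi.parameters(), lr=1e-2)
         |   for _ in range(200):
         |       K = kernel(X, X)
         |       # kernel-ridge-style prediction, differentiable in theta
         |       alpha = torch.linalg.solve(K + 1e-2 * torch.eye(50), y)
         |       loss = ((K @ alpha - y) ** 2).mean()
         |       opt.zero_grad()
         |       loss.backward()
         |       opt.step()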
        
       | throwawaysea wrote:
       | > This result makes it very clear that without regularity imposed
       | by the ground truth mechanism which generates the actual data
       | (e.g., some natural process), a neural net is unlikely to perform
       | well on an example which deviates strongly (as defined by the
       | kernel) from all training examples.
       | 
       | Is this another way of saying that neural networks are just
       | another statistical estimation method and not a path to general
       | artificial intelligence? Or is it saying that problems like
       | self-driving cars are a poor fit for the current state of the
       | art in AI, since we have to ensure that reality doesn't deviate
       | from the training examples? Or both?
       | 
       | I'd love to understand the "real life" implications of this
       | finding better.
        
         | phreeza wrote:
         | Is it not possible to achieve AGI with a good statistical
         | estimation method?
        
           | viraptor wrote:
           | Depends on whether you think "a human is unlikely to perform
           | well on an example which deviates strongly (as defined by
           | experience) from all training examples" is true.
        
         | Grimm1 wrote:
         | When this was discussed about two months ago, the conclusion I
         | took away was that there aren't many at the moment, beyond a
         | somewhat formal equivalence.
         | 
         | https://news.ycombinator.com/item?id=25314830
        
         | 6gvONxR4sf7o wrote:
         | Neither. It's more akin to the fundamental theorem of
         | calculus: if you follow your model's weights from point A to
         | point B along some differentiable path, you can integrate the
         | steps and express the result as a single jump from A to B
         | (with the jump written in terms of those integrals). It's a
         | super interesting viewpoint on gradient descent and the models
         | trained with it, one that could be really useful for
         | understanding those models abstractly, but it isn't saying
         | anything about suitability for different tasks.
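         | 
         | A rough sketch of the identity behind that viewpoint, for the
         | continuous-time (gradient flow) case; the notation here is
         | mine, not necessarily the paper's:
         | 
         |   % gradient flow on L(w) = \sum_i \ell(y_i, f_w(x_i)):
         |   %   dw/dt = -\nabla_w L(w)
         |   % chain rule at a fixed test point x:
         |   \frac{d}{dt} f_{w(t)}(x)
         |     = -\sum_i \ell'(y_i, f_{w(t)}(x_i)) \,
         |       \nabla_w f_{w(t)}(x) \cdot \nabla_w f_{w(t)}(x_i)
         |   % integrate from 0 to T ("one jump from A to B"):
         |   f_{w(T)}(x) = f_{w(0)}(x)
         |     - \int_0^T \sum_i \ell'(y_i, f_{w(t)}(x_i)) \,
         |       K_{w(t)}(x, x_i) \, dt
         | 
         | where K_w(x, x') = \nabla_w f_w(x) \cdot \nabla_w f_w(x') is
         | the tangent kernel; the paper folds the time integral into a
         | "path kernel" so the right-hand side reads like
         | \sum_i a_i K(x, x_i) + b.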
        
       | SubiculumCode wrote:
       | Some days I feel that, in the final analysis, everything is
       | just linear regression.
        
         | jtmcmc wrote:
         | given the properly transformed space that may very well be
         | true...
        
         | wenc wrote:
         | As someone who has studied nonlinear nonconvex optimization, I
         | don't think linear regression is the final word here. In the
         | universe of optimization problems for curve-fitting, the linear
         | case is only one case (albeit a very useful one).
         | 
         | Often, though, it is insufficient. The next step up is
         | piecewise linearity, and then convexity. It is said that
         | convexity is a way of weakening the linearity requirement
         | while leaving the problem tractable.
         | 
         | Many real world systems are nonlinear (think physics models),
         | and often nonconvex. You can approximate them using locally
         | linear functions to be sure, but you lose a lot of fidelity in
         | the process. Sometimes this is ok, sometimes this is not, so it
         | depends on the final application.
         | 
         | It happens that linear regression is good enough for a lot of
         | stuff out there, but there are many places where it doesn't
         | work.
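         | 
         | A toy illustration of the "properly transformed space" point
         | above (NumPy, made-up data): the same linear least-squares
         | solver misses a nonlinear relationship on raw inputs but
         | recovers it on expanded features.
         | 
         |   import numpy as np
         | 
         |   rng = np.random.default_rng(0)
         |   x = rng.uniform(-3, 3, size=200)
         |   y = np.sin(x) + 0.1 * rng.normal(size=200)  # nonlinear truth
         | 
         |   # plain linear regression: y ~ a*x + b
         |   A_lin = np.column_stack([x, np.ones_like(x)])
         |   c_lin, *_ = np.linalg.lstsq(A_lin, y, rcond=None)
         | 
         |   # same solver, polynomial feature space: 1, x, ..., x^5
         |   A_poly = np.column_stack([x**d for d in range(6)])
         |   c_poly, *_ = np.linalg.lstsq(A_poly, y, rcond=None)
         | 
         |   mse = lambda A, c: np.mean((A @ c - y) ** 2)
         |   print(mse(A_lin, c_lin), mse(A_poly, c_poly))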
        
           | nightcracker wrote:
           | > Many real world systems are nonlinear (think physics
           | models)
           | 
           | Technically, only if you don't zoom in too far: quantum
           | mechanics is linear.
        
           | SubiculumCode wrote:
           | Yes, there are non-linear functions... but often these
           | result from combinations of linear functions.
        
           | SubiculumCode wrote:
           | Or rather, that all statistical methods are dressed up
           | regression.
        
         | IdiocyInAction wrote:
         | FC neural nets are iterated linear regression, in some sense.
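         | 
         | Loosely, a sketch of that reading (NumPy, made-up sizes): each
         | layer is a linear map, the same object a linear regression
         | fits, with a nonlinearity in between.
         | 
         |   import numpy as np
         | 
         |   rng = np.random.default_rng(0)
         |   W1, b1 = rng.normal(size=(3, 32)), np.zeros(32)
         |   W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)
         | 
         |   def fc_net(X):
         |       h = np.maximum(X @ W1 + b1, 0.0)  # linear map + ReLU
         |       return h @ W2 + b2                # another linear map
         | 
         |   print(fc_net(rng.normal(size=(5, 3))).shape)  # (5, 1)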
        
         | tobmlt wrote:
         | Hey, hey! We can over-fit higher-order approximations too. The
         | nerve of some of ya.
         | 
         | (Absolutely just kidding around here)
        
         | tqi wrote:
         | https://twitter.com/theotheredmund/status/134945323076219699...
        
       | 6gvONxR4sf7o wrote:
       | Discussion of the actual paper here:
       | 
       | https://news.ycombinator.com/item?id=25314830
       | 
       | It's a really neat one for people who care about what's going on
       | under the hood, but not immediately applicable to the more
       | applied folks. I saw some good quotes at the time to the tune of
       | "I can't wait to see the papers citing this one in a year or
       | two."
        
       | kdisorte wrote:
       | Quoting an expert from the last time this was posted:
       | 
       | So, in the end, it rephrases a statement from "Neural Tangent
       | Kernel: Convergence and Generalization in Neural Networks"
       | [https://arxiv.org/abs/1806.07572], and in a way that is kind of
       | misleading.
       | 
       | The assertion has been known to the community at least since
       | 2018, if not well before.
       | 
       | I find this article, and the buzz around it, a little awkward.
        
       | robrenaud wrote:
       | Yannic Kilcher's paper-explained series covers this paper pretty
       | well. I feel like I have a decent understanding of it just from
       | watching the video with a few pauses/rewinds.
       | 
       | https://www.youtube.com/watch?v=ahRPdiCop3E
        
       ___________________________________________________________________
       (page generated 2021-02-08 23:00 UTC)