       COMMENT PAGE FOR:
 (HTM)   Gradient descent visualization
       
       
        umvi wrote 20 hours 48 min ago:
         I thought gradient descent always followed the steepest slope, but
         this looks like a physics simulation where marbles are rolling down
         hills. In mathematical gradient descent, do you really oscillate
         around the minimum like a marble rocking back and forth in a pit?
        
        Edit: Oh, I see. The animations are "momentum" gradient descent
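         
         A minimal sketch of the difference (plain vs. momentum updates on
         f(x) = x^2; the function, step size, and momentum factor are
         illustrative choices, not taken from the repo):
         
           import numpy as np
           
           def grad(x):            # gradient of f(x) = x^2
               return 2 * x
           
           lr, beta = 0.1, 0.9     # illustrative step size and momentum factor
           x_plain, x_mom, v = 3.0, 3.0, 0.0
           
           for step in range(30):
               x_plain -= lr * grad(x_plain)      # vanilla: monotone approach
               v = beta * v - lr * grad(x_mom)    # momentum: velocity builds up
               x_mom += v                         # ... and overshoots past 0
               print(f"{step:2d}  plain={x_plain:+.4f}  momentum={x_mom:+.4f}")
         
         Printing the iterates shows the momentum run changing sign around
         the minimum while the plain run approaches it monotonically.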
       
        j_bum wrote 23 hours 13 min ago:
        I love creating things to solidify my intuition about topics. When I
        learned about gradient descent, I saw this repo and was inspired to
        create my own toy Python package for visualizing gradient descent.
        
        My package, which uses PyVista for visualization:
        
 (HTM)  [1]: https://github.com/JacobBumgarner/grad-descent-visualizer
       
        levocardia wrote 1 day ago:
        These are nice animations. However I've always hesitated to get too
        enamored with these simple 2D visualizations of gradient descent,
        because one of the strange takeaways from deep learning is that
        behavior in high dimensions is very different from behavior in low
        dimensions.
        
         In a 2D problem with many local minima, like the Eggholder function
         [1], gradient descent is hopeless. But neural net optimization in
         high dimensions really is a similar situation with many local
         minima, yet gradient descent does great.
        
        Gradient descent in high dimensions also seems to have the ability to
        "step over" areas of high loss, which you can see by looking at the
        loss of a linear interpolation between weights at successive steps of
        gradient descent. This, again, seems like extremely strange behavior
        with no low-dimensional analogue.
        
 (HTM)  [1]: https://www.sfu.ca/~ssurjano/egg.html
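         
         A minimal sketch of the interpolation measurement described above,
         assuming you already have two consecutive weight snapshots w_prev
         and w_next and a loss_fn; all of these names are placeholders:
         
           import numpy as np
           
           def loss_along_segment(loss_fn, w_prev, w_next, n_points=21):
               """Loss on the straight line between two weight snapshots."""
               alphas = np.linspace(0.0, 1.0, n_points)
               return [loss_fn((1 - a) * w_prev + a * w_next) for a in alphas]
           
           # If the interpolated losses rise above both endpoints, the step
           # "stepped over" a ridge of high loss rather than walking around it.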
       
          DeathArrow wrote 12 hours 3 min ago:
          >behavior in high dimensions is very different from behavior in low
          dimensions
          
           Keeping that in mind, it is still useful. Maybe one useful
           addition would be to reduce dimensionality: show that if, along
           some chosen dimensions, there is no local minimum, then there is
           no local minimum in the full higher-dimensional space we started
           with either.
       
          Grimblewald wrote 12 hours 16 min ago:
           Not super strange if you think of it going the other way. Think of
           a saddle-like construct: in one dimension you get stuck easily, in
           another there is a way forward where, over time, the previously
           "stuck" dimension is freed up again. In higher dimensions you have
           a much higher chance of having such alternative pathways. During
           training this looks like brief plateaus in loss reduction followed
           by a sudden rapid decrease again.
       
          epipolar wrote 14 hours 56 min ago:
          Might I suggest an insightful seminar for anyone curious about why
          gradient descent is even tractable in such high dimensions:
          
           Stanford Seminar - A Picture of the Prediction Space of Deep
           Networks, by Pratik Chaudhari:
          
 (HTM)    [1]: https://youtu.be/ZD2cL-QoI5g?si=hVFU5_4CeoLYtyuB
       
          uoaei wrote 18 hours 18 min ago:
           Behavior in high dimensions is even weirder than you might think,
           because you're nearly always close to a saddle point. There are
           far more saddle points than minima or maxima (model the signs of
           the Hessian's eigenvalues at a critical point as a binomial
           distribution -- how likely is it that they're all positive or all
           negative?).
           
           However, usually only a relatively small subset of dimensions is
           important (read: represents a stable optimum); see the "lottery
           ticket" hypothesis for more.
           
           What's also weird is that "convergence" in these cases isn't
           really convergence; it's just slowly floating between saddle
           points as you begin to overfit. Hence early stopping.
          
          > Gradient descent in high dimensions also seems to have the ability
          to "step over" areas of high loss
          
           This has much less to do with gradients per se and much more to do
           with step size. It's an artifact of discretization, not of the
           algorithm itself.
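           
           A rough illustration of the eigenvalue-sign argument above (random
           symmetric matrices as a crude stand-in for the Hessian at a
           critical point; the independent-signs picture is itself a
           simplification):
           
             import numpy as np
             
             rng = np.random.default_rng(0)
             
             def frac_extrema(dim, trials=2000):
                 """Fraction of random symmetric matrices whose eigenvalues
                 all share one sign (i.e. would be a min or a max)."""
                 count = 0
                 for _ in range(trials):
                     a = rng.standard_normal((dim, dim))
                     eig = np.linalg.eigvalsh((a + a.T) / 2)
                     count += bool(np.all(eig > 0) or np.all(eig < 0))
                 return count / trials
             
             for d in (2, 5, 10, 20):
                 print(d, frac_extrema(d))   # drops fast: almost all saddles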
       
          huevosabio wrote 22 hours 46 min ago:
           My understanding of this phenomenon in DL is that it's not due to
           anything intrinsic to gradient descent; the same principles and
           understanding apply.
           
           Rather, it is that with very large dimensionality, the probability
           that you spuriously get all derivatives to be zero is vanishingly
           small. That is, local minima are less likely because you need
           every dimension i to agree that df/dx_i = 0.
          
          I may be wrong though!
       
            dxbydt wrote 22 hours 0 min ago:
            if the probability that you get a derivative close to 0 is small,
            say only 10%, then you need just 3 dimensions to get that
            multiplicative probability equal to a tenth of a percent. 3 is
            hardly “very large dimensionality”
            
            you can assign different numbers, but still you will find you
            don’t need more than say 10 dimensions for this effect to happen.
       
          cafaxo wrote 1 day ago:
          Does gradient descent really do well for deep learning when the
          gradient is computed with respect to the whole dataset? I assumed
          that the noise in SGD played an important role for escaping local
          minima.
       
            cameldrv wrote 22 hours 29 min ago:
            There aren't really local minima in most deep networks.  When you
            get into millions/billions of parameters, there will essentially
            always be some directions that point downwards.  You have to get
            really really close to the true minimum for there to be no
            direction to go that improves the loss.
            
            Incidentally this same phenomenon is IMO how evolution is able to
            build things like the eye.  Naively you'd think that since you need
            so many parts arranged so well, it's impossible to find a step by
            step path where fitness goes up at every step, i.e. if you just
            have a retina with no lens or vice-versa, it doesn't work. 
            However, due to the high dimensionality of DNA, there is
            essentially guaranteed to be a path with monotonically increasing
            fitness just because there are so many different possible paths
            from A to B in the high dimensional space that at least one is
            bound to work.
            
            Now this isn't strictly true for every high dimensional system. 
            You need to have a degree of symmetry or redundancy in the encoding
            for it to work.  For example, in convolutional neural networks, you
            see this phenomenon where some filters get "stuck" in training, and
            those are local minima for that subspace.  What happens though is
            that if one filter gets stuck, the network will just use another
            one that had a better initialization.  This is why pruning works,
            lottery tickets, etc.  Things like residual connections enhance
            this effect since you can even be stuck in a whole layer and the
            training process can just bypass it.
            
             You see the same thing with life, where you could put a sequence
             for the same protein in different parts of the genome and it
             could still be produced, regardless of the position.  There are
             also many different ways to encode the exact same protein, and
             many different possible proteins that will have the same shape
             in the critical areas.  Life finds a way.
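             
             A minimal sketch of the residual-connection point (a
             hypothetical toy block in numpy, not from any particular
             network): when a layer's weights are "stuck" near zero, the skip
             connection still passes the input through, so training can route
             around it.
             
               import numpy as np
               
               def residual_block(x, W):
                   """y = x + relu(W x): a dead W leaves y equal to x."""
                   return x + np.maximum(W @ x, 0.0)
               
               x = np.array([1.0, -2.0, 0.5])
               W_stuck = np.zeros((3, 3))         # a "stuck" / dead layer
               print(residual_block(x, W_stuck))  # identical to x: bypassed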
       
              whatever1 wrote 19 hours 32 min ago:
               If that were the case, we would be finding globally optimal
               solutions for complicated non-convex optimization problems.
               
               The reality is different: you need to really explore the space
               to find the truly globally optimal solution.
               
               A better explanation is that for ML you don't want a globally
               optimal solution that overindexes on your training set. You
               want a suboptimal solution that might also fit an unseen data
               set.
       
                blt wrote 18 hours 2 min ago:
                That is what people thought until around 2018, but it was
                wrong. It turns out that deep learning optimization problems
                 have many global optima. In fact, when the number of
                 parameters exceeds the number of data points, SGD reliably
                 finds parameters that interpolate the training data with
                 zero loss. Surprisingly, most of these solutions generalize
                 well, and overfitting is not a big problem.
                
                In other words, deep learning is a very special nonconvex
                optimization problem. A lot of our old intuition about
                optimization for ML is invalid in the overparameterized regime.
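                 
                 A tiny illustration of the interpolation claim, under
                 simplifying assumptions (random-feature least squares with
                 more parameters than data points, rather than an actual deep
                 net trained with SGD):
                 
                   import numpy as np
                   
                   rng = np.random.default_rng(0)
                   n, d, p = 20, 5, 200        # 20 examples, 200 parameters
                   
                   X = rng.standard_normal((n, d))
                   y = rng.standard_normal(n)  # even random labels get fit
                   
                   # random features, then the minimum-norm least-squares fit
                   F = np.tanh(X @ rng.standard_normal((d, p)))
                   w = np.linalg.lstsq(F, y, rcond=None)[0]
                   print(np.abs(F @ w - y).max())  # ~0: zero training loss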
       
                  l33tman wrote 8 hours 25 min ago:
                   Why DL generalizes well is still an open research problem
                   AFAIK. I've read numerous papers that try to argue one way
                   or another about why this works, and they are all
                   interesting! One paper (that I found compelling, even
                   though it didn't propose a thorough solution) showed that
                   SGD successfully navigated around "bad" local minima (with
                   bad generalization) and ended up in a "good" local minimum
                   (that generalized well). Their interpretation was that,
                   due to the S in SGD, it will only find wide loss basins;
                   thus the conclusion was that solutions that generalize
                   well seem to have wider basins (for some reason).
                   
                   Another paper showed that networks trained on roughly the
                   same dataset but started from different random
                   initializations had a symmetry relation in the loss
                   landscape given by a permutation of the weights. You could
                   always find a permutation that allowed you to linearly
                   interpolate between the two weight sets without climbing
                   over a loss mountain. Also very interesting, even if it
                   wasn't directly related to generalization performance. It
                   has potential applications in network merging, I guess.
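                   
                   The permutation symmetry itself is easy to see in a toy
                   one-hidden-layer net (a hypothetical numpy sketch, not the
                   paper's method): permuting the hidden units, with the same
                   permutation applied to both weight matrices, leaves the
                   function unchanged.
                   
                     import numpy as np
                     
                     rng = np.random.default_rng(0)
                     W1 = rng.standard_normal((4, 3))
                     W2 = rng.standard_normal((2, 4))
                     x = rng.standard_normal(3)
                     
                     perm = rng.permutation(4)       # reorder hidden units
                     W1p, W2p = W1[perm], W2[:, perm]
                     
                     out = W2 @ np.tanh(W1 @ x)
                     outp = W2p @ np.tanh(W1p @ x)
                     print(np.allclose(out, outp))   # True: same function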
       
                  patrick451 wrote 16 hours 56 min ago:
                  I have read this in several places and want to learn more. Do
                  you have a reference handy?
       
                    blt wrote 2 hours 7 min ago:
                     [1] was an empirical paper that inspired much theoretical
                     follow-up. [2] is one such follow-up, and the references
                     therein should point to many of the other key works in
                     the years between. [3] introduces the neural tangent
                     kernel (NTK), a theoretical tool used in much of this
                     work. (Not everyone agrees that reliance on NTK is the
                     right way towards long-term theoretical progress.) [4]
                     is a more recent paper I haven't read yet that goes into
                     more detail on interpolation. Its authors were well
                     known in more "clean" parts of ML theory (e.g. bandits)
                     and recently began studying deep learning.
                    
                     [1] Understanding deep learning requires rethinking
                         generalization. Zhang et al., arXiv, 2016.
                     [2] Stochastic Mirror Descent on Overparameterized
                         Nonlinear Models: Convergence, Implicit
                         Regularization, and Generalization. Azizan et al.,
                         arXiv, 2019.
                     [3] Neural Tangent Kernel: Convergence and
                         Generalization in Neural Networks. Jacot et al.,
                         NeurIPS, 2018.
                     [4] A Universal Law of Robustness via Isoperimetry.
                         Bubeck et al., NeurIPS, 2021.
                    
 (HTM)              [1]: https://arxiv.org/abs/1611.03530
 (HTM)              [2]: https://arxiv.org/abs/1906.03830
 (HTM)              [3]: https://proceedings.neurips.cc/paper/2018/hash/5a4...
 (HTM)              [4]: https://proceedings.neurips.cc/paper/2021/hash/f19...
       
                    nephanth wrote 3 hours 8 min ago:
                     This is something I saw a talk about a while ago. There
                     are probably more recent papers on this topic; you might
                     want to browse the citations of this one:
                    
 (HTM)              [1]: https://arxiv.org/abs/2003.00307
       
              dheera wrote 22 hours 5 min ago:
              > There aren't really local minima in most deep networks.
              
              How so?
              
               If there were no local minima other than the global one,
               convex optimization methods far faster than SGD or Adam could
               be used. The only reason methods like SGD and Adam exist is
               that training deep networks is a non-convex optimization
               problem.
       
                aaplok wrote 21 hours 36 min ago:
                There are many nonconvex functions where every local minimum is
                global, and even many nonconvex functions with a unique local
                minimum (which is de facto global). Convex methods can fail on
                those.
                
                The reason why GD and friends are a good choice in deep
                learning is that computing the gradient is cheap (and
                approximating it even more so). Every descent method relies on
                solving a subproblem of sorts, typically projecting the current
                iterate on a sublevel set of an approximation of the function,
                 for some definition of projection. With GD, the subproblem is
                 as cheap as it gets: just subtract a scaled-down copy of the
                 gradient. Subproblems in other algorithms are a lot more
                 expensive computationally, particularly in high dimensions.
                 So, more efficient in the sense of requiring fewer function
                 evaluations, yes, but at the cost of doing a lot more work
                 at each step.
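                 
                 To make the cost contrast concrete, a hedged sketch using
                 Newton's method as one example of a more expensive
                 subproblem (grad_f and hess_f are placeholder callables):
                 
                   import numpy as np
                   
                   def gd_step(x, grad_f, lr=0.1):
                       # O(n) work: one gradient evaluation, one update
                       return x - lr * grad_f(x)
                   
                   def newton_step(x, grad_f, hess_f):
                       # solving an n x n linear system costs O(n^3)
                       return x - np.linalg.solve(hess_f(x), grad_f(x))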
       
        can16358p wrote 1 day ago:
         As an extremely visual thinker/learner, thank you for creating this!
         
         Many things that were too abstract as formulas on paper are MUCH
         easier to understand this way.
       
        GistNoesis wrote 1 day ago:
         Here is a project I created for myself (some time ago) to help
         visualize the gradient as a vector field. [1] It is probably best
         used as support material, with someone teaching the right mental
         picture along the way.
         
         A great exercise is to have the student (or yourself) draw this
         visualization with pen and paper, both in 2D and 3D, for various
         functions. You can then make the connection to the usual tangent
         line on the curve of a derivative.
        
 (HTM)  [1]: https://github.com/GistNoesis/VisualizeGradient
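         
         For anyone who wants to sketch this kind of picture themselves, a
         minimal matplotlib quiver example (not taken from the linked repo;
         the function is an arbitrary choice):
         
           import numpy as np
           import matplotlib.pyplot as plt
           
           g = np.linspace(-2, 2, 25)
           xs, ys = np.meshgrid(g, g)
           z = xs**2 + 0.5 * ys**2           # f(x, y) = x^2 + y^2/2
           gx, gy = 2 * xs, ys               # its gradient components
           
           plt.contour(xs, ys, z, levels=15)
           plt.quiver(xs, ys, -gx, -gy)      # arrows point downhill
           plt.title("Negative gradient field of x^2 + y^2/2")
           plt.show()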
       
        alok-g wrote 1 day ago:
        This is great!
        
        Which open source license is this under?  (Absence of license by
        default implies 'copyrighted', which in this case could be in conflict
        with the Qt open source license.  Note:  I am not a lawyer.)
       
        albertzeyer wrote 1 day ago:
         Note that such a 2D parameter space often gives the wrong intuition
         when thinking about applying gradient descent to a high-dimensional
         parameter space.
         
         Also, mini-batch stochastic gradient descent behaves more
         stochastically than plain gradient descent.
       
          pictureofabear wrote 7 hours 45 min ago:
           This visualization appears to be for batch gradient descent only,
           not stochastic or mini-batch.
           
           That said, your critique would also apply to the familiar
           visualizations of gravity warping space-time, yet those are still
           widely used because they help with an initial understanding of the
           idea -- why, for example, you hold one variable constant when
           taking the partial derivative with respect to another, and what
           effect changing the step size has on gradient descent's ability to
           find a local minimum.
       
          dukeofdoom wrote 1 day ago:
           I vaguely remember something like this being explained in one of
           my math classes (calculus or linear equations). Not sure if the
           teacher was referring to the same problem:
           
           If you were walking down a mountain following an algorithm that
           told you just to keep going down, you might get stuck in a local
           minimum, since you might reach a valley. Sometimes you need to go
           up to get out of the valley in order to keep going down.
       
            Imnimo wrote 1 day ago:
            In very high dimensional spaces (like trying to optimize a neural
            network with billions of parameters), to be "in a valley", you must
            be in a valley with respect to every one of the billions of
            dimensions. It turns out that in many practical settings, loss
             landscapes are pretty well-behaved in this regard, and you can
             almost always find some direction in which to continue going
             downward, one that lets you go around the next hill rather than
             over it.
            
            This 2015 paper has some good examples (although it does sort of
            sweep some issues under the rug):
            
 (HTM)      [1]: https://arxiv.org/pdf/1412.6544
       
              jonathan_landy wrote 18 hours 1 min ago:
               Is the claim that there aren't many local minima for
               high-dimensional problems, e.g. in neural network loss
               functions?
       
                Imnimo wrote 16 hours 27 min ago:
                Yes. To be more specific, it's that nearly all points where the
                derivative is zero are saddle points rather than minima. Note
                that some portion of this nice behavior seems to be due to
                design choices in modern architectures, like residual
                connections, rather than being a general fact about all high
                dimensional problems. [1] This paper has some nice
                visualizations.
                
 (HTM)          [1]: https://arxiv.org/pdf/1712.09913
       
            Feathercrown wrote 1 day ago:
            When I took an ML class we solved that with a temperature
            parameter, to allow random deviations out of a local minimum. I
            wonder if there are novel methods now or if they're just improved
            versions of temperature?
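             
             A minimal sketch of the temperature idea (a simulated-annealing
             style acceptance rule on a toy 1-D function; the function, step
             sizes, and cooling schedule are all illustrative, not a modern
             training method):
             
               import math, random
               
               def f(x):                  # toy objective with local minima
                   return x**2 + 3 * math.sin(3 * x)
               
               x, T = 2.0, 1.0
               for step in range(2000):
                   cand = x + random.gauss(0, 0.3)
                   delta = f(cand) - f(x)
                   # always accept downhill moves; accept uphill moves
                   # with probability exp(-delta / T)
                   if delta < 0 or random.random() < math.exp(-delta / T):
                       x = cand
                   T *= 0.995             # cool: fewer uphill escapes later
               print(x, f(x))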
       
          GaggiX wrote 1 day ago:
          >gives often the wrong intuition when thinking about applying
          gradient descent on high-dimensional parameter space
          
          Can you give some examples?
       
            DeathArrow wrote 12 hours 11 min ago:
             Probably there are very few local minima, since the chance that
             all the partial derivatives are 0 decreases with the number of
             dimensions.
       
            euW3EeBe wrote 23 hours 14 min ago:
             Not an expert in this field, but I'm guessing this is related to
             the unintuitive nature of geometry in high-dimensional spaces.
             
             One rough example I can think of is the fact that the number of
             ways to move away from the origin increases exponentially (or
             even faster?) as the dimensionality goes up. There is _way_ more
             volume away from the origin than near it; I've seen this
             explained as something like "most of the volume of a
             high-dimensional orange is in the peel". One result of this is
             that samples of a standard Gaussian end up forming a "shell"
             rather than the "ball" you would expect (this phenomenon is
             called concentration of measure in general).
            
            Also, very roughly, high-dimensional objects have lots of corners,
            and these corners are also very sharp. I would guess that gradient
            descent would get stuck in these corners and have a hard time
            getting out.
            
            Some links related to this:
            
             - Spikey Spheres: [1]
             - Thinking outside the 10-dimensional box (3blue1brown): [2]
             - A fairly long talk about HMC that covers some problems that
               come up when sampling high-dimensional distributions: [3]
            
 (HTM)      [1]: http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres#HN2
 (HTM)      [2]: https://www.youtube.com/watch?v=zwAD6dRSVyI
 (HTM)      [3]: https://www.youtube.com/watch?v=pHsuIaPbNbY
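             
             A quick way to see the "shell" effect numerically (an
             illustrative numpy check, not from the linked talks):
             
               import numpy as np
               
               rng = np.random.default_rng(0)
               for d in (2, 10, 100, 1000):
                   samples = rng.standard_normal((10000, d))
                   r = np.linalg.norm(samples, axis=1)
                   # norms concentrate around sqrt(d) with roughly constant
                   # spread: a thin shell rather than a filled ball
                   print(d, round(r.mean(), 2), round(r.std(), 2))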
       
          GrantMoyer wrote 1 day ago:
           Agreed, but I still think this is a useful tool for building
           intuition about a subset of the problems gradient descent faces.
           The animation in the readme is similar to what I picture in my
           head for local minima, for instance.
       
        DrNosferatu wrote 1 day ago:
        Looks great!
        
         Killer feature: allow arbitrary surfaces to be used (with a choice
         of 2D projection if in higher dimensions). Also, integrate with
         Python to allow use in production, and allow arbitrary zoom and
         level of detail for the surface.
       
       