[HN Gopher] Differentiable Programming - A Simple Introduction
       ___________________________________________________________________
        
       Differentiable Programming - A Simple Introduction
        
       Author : dylanbfox
       Score  : 97 points
       Date   : 2022-04-12 10:21 UTC (1 day ago)
        
 (HTM) web link (www.assemblyai.com)
 (TXT) w3m dump (www.assemblyai.com)
        
       | yauneyz wrote:
       | My professor has talked about this. He thinks that the real gem
       | of the deep learning revolution is the ability to take the
       | derivative of arbitrary code and use that to optimize. Deep
       | learning is just one application of that, but there are tons
       | more.
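        | 
        | To make that concrete, here's a minimal sketch of the idea
        | using JAX (illustrative only; the function and numbers are
        | made up):
        | 
        |     import jax
        | 
        |     def f(x):
        |         # ordinary Python control flow, no special API
        |         total = 0.0
        |         for _ in range(3):
        |             if x > 1.0:
        |                 total = total + x ** 2
        |             else:
        |                 total = total + 3.0 * x
        |             x = x - 0.5
        |         return total
        | 
        |     df = jax.grad(f)   # derivative of the whole program
        |     print(df(2.3))     # ~10.8; grad sees concrete values,
        |                        # so plain if/for are fine here
        |                        # (jit would need lax.cond instead)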
        
         | SleekEagle wrote:
         | That's part of why Julia is so exciting! Building it
         | specifically to be a differentiable programming language opens
         | so many doors ...
        
           | mountainriver wrote:
            | Julia wasn't really built specifically to be differentiable;
            | it was just built in a way that gives you access to the IR,
            | which is what Zygote relies on. Enzyme AD is the most
            | exciting to me because it can make any LLVM language
            | differentiable.
        
             | SleekEagle wrote:
             | Ah I see, thank you for clarifying. And thank you for
             | bringing Enzyme to my attention - I've never seen it
             | before!
        
         | melony wrote:
         | I am just happy that the previously siloed fields of operations
         | research and various control theory sub-disciplines are now
         | incentivized to pool their research together thanks to the
          | funding in ML. Also, a lot of expensive, proprietary
          | optimization software in industry is finally getting some
          | competition.
        
           | SleekEagle wrote:
           | Hm I didn't know different areas of control theory were
           | siloed. Learning about control theory in graduate school was
           | awesome and it seems like a field that would benefit from ML
            | a lot. I know RL agents are used for control problems like
            | cartpole, but I would've thought it would be more
            | widespread! Do you think the development of Differentiable
            | Programming (i.e. the realization that it generalizes beyond
            | pure ML/DL) was really the missing piece?
           | 
           | Also, just curious, what are your studies in?
        
             | melony wrote:
             | Control theory has a very, very long parallel history
             | alongside ML. ML, specifically probabilistic and
             | reinforcement learning, uses a lot of dynamic programming
             | ideas and Bellman equations in its theoretical modeling.
              | Look up the term cybernetics; it's an old pre-internet-era
              | term for control theory and optimization. The
             | Soviets even had a grand scheme to build networked
             | factories that could be centrally optimized and resource
             | allocated. Their Slavic communist AWS-meets-Walmart efforts
             | spawned a Nobel laureate; Kantorovich was given the award
             | for inventing linear programming.
             | 
             | Unfortunately the CS field is only just rediscovering
             | control theory while it has been a staple of EE for years.
              | However, there hadn't been many new innovations in the
              | field until recently, when ML became the hottest new thing.
        
               | SleekEagle wrote:
               | This is some insanely cool history! I had no idea the
               | Soviets had such a technical vision, that's actually
               | pretty amazing. I've heard the term "cybernetics" but
               | honestly just thought it was some movie-tech term, lol.
               | 
               | It seems really weird that control theory is in EE
               | departments considering it's sooo much more mathematical
                | than most EE subdisciplines except signal processing. I
               | remember a math professor of mine telling us about
               | optimization techniques that control systems
               | practitioners would know more about than applied
               | mathematicians because they were developed specifically
                | for the field; I can't remember what the techniques
                | were, though ...
        
               | melony wrote:
                | There is this excellent HN-recommended novel called _Red
                | Plenty_ that dramatised those efforts on the other side
                | of the Iron Curtain.
               | 
               | https://news.ycombinator.com/item?id=8417882
               | 
               | > _It seems really weird that control theory is in EE
                | departments considering it's sooo much more mathematical
                | than most EE subdisciplines except signal processing._
               | 
                | I agree. Apparently Bellman's reason for calling dynamic
                | programming what it is was that he needed grant funding
                | during the Cold War and was advised to give his
                | mathematical theories a more "interesting" name.
               | 
                | https://en.m.wikipedia.org/wiki/Dynamic_programming#History
               | 
                | The generalised form of the Bellman equation (co-
                | formulated by Kalman, of Kalman filter fame) is in some
                | ways to control theory and EE what the maximum
                | likelihood function is to ML.
               | 
               | https://en.m.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E
               | 2%8...
        
               | SleekEagle wrote:
               | Looks really cool, added to my amazon cart. Thanks for
               | the rec!
               | 
                | That's hilarious and sadly insightful. I remember thinking
               | "what the hell is so 'dynamic' about this?" the first
               | time I learned about dynamic programming. Although
               | "memoitative programming" sounds pretty fancy too, lol
        
         | potbelly83 wrote:
         | How do you differentiate a string? Enum?
        
           | adgjlsfhk1 wrote:
            | Generally you consider them to be piecewise constant.
        
             | tome wrote:
             | Or more precisely, discrete.
        
           | 6gvONxR4sf7o wrote:
            | The answer to that is a huge part of the NLP field. The
            | current approach is to break the string down into
            | constituent parts and map each of them into a high-
            | dimensional space. "cat" becomes a large vector whose
            | position is continuous and therefore differentiable; "the
            | cat" probably becomes a pair of vectors.
        
           | titanomachy wrote:
           | If you were dealing with e.g. English words rather than
           | arbitrary strings, one approach would be to treat each word
           | as a point in n-dimensional space. Then you can use
           | continuous (and differentiable) functions to output into that
           | space.
        
       | noobermin wrote:
       | The article is okay but it would have helped to have labelled the
       | axes of the graphs.
        
       | choeger wrote:
       | Nice article, but the intro is a little lengthy.
       | 
       | I have one remark, though: If your language allows for automatic
       | differentiation already, why do you bother with a neural network
       | in the first place?
       | 
       | I think you should have a good reason why you choose a neural
       | network for your approximation of the inverse function and why it
        | has exactly that number of layers. For instance, why shouldn't a
       | simple polynomial suffice? Could it be that your neural network
       | ends up as an approximation of the Taylor expansion of your
       | inverse function?
        
         | SleekEagle wrote:
         | I think for more complicated examples like RL control systems a
         | neural network is the natural choice. If you can incorporate
         | physics into your world model then you'd need differentiable
         | programming + NNs, right? Or am I misunderstanding the
          | question?
         | 
          | If you're talking about the specific cannon problem, you don't
          | need to do any learning at all; you can just solve the
          | kinematics, so in some sense you could ask why you're using
          | _any_ approximation function.
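          | 
          | For what it's worth, the "just solve the kinematics" route is
          | itself a nice differentiable-programming demo: write the
          | physics and take gradients straight through it. A minimal
          | sketch with JAX (made-up numbers, not the article's code):
          | 
          |     import jax
          |     import jax.numpy as jnp
          | 
          |     TARGET, SPEED, G = 100.0, 35.0, 9.81
          | 
          |     def miss(angle):
          |         # ideal projectile range at launch angle (radians)
          |         reach = SPEED ** 2 * jnp.sin(2.0 * angle) / G
          |         return (reach - TARGET) ** 2
          | 
          |     angle = 0.3
          |     for _ in range(200):  # gradient descent on the physics
          |         angle = angle - 1e-5 * jax.grad(miss)(angle)
          |     print(angle)          # ~0.46 rad lands on the target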
        
       | infogulch wrote:
       | The most interesting thing I've seen on AD is "The simple essence
       | of automatic differentiation" (2018) [1]. See past discussion
       | [2], and talk [3]. I think the main idea is that by compiling to
       | categories and pairing up a function with its derivative, the
       | pair becomes trivially composable in forward mode, and the whole
       | structure is easily converted to reverse mode afterwards.
       | 
       | [1]: https://dl.acm.org/doi/10.1145/3236765
       | 
       | [2]: https://news.ycombinator.com/item?id=18306860
       | 
       | [3]: Talk at Microsoft Research:
       | https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations
       | listed here: https://github.com/conal/essence-of-ad
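        | 
        | For a rough flavor of the "pair a function with its
        | derivative and compose" idea, here is a plain-Python dual-
        | number sketch (just the forward-mode intuition, not the
        | categorical construction from the paper):
        | 
        |     import math
        |     from dataclasses import dataclass
        | 
        |     @dataclass
        |     class Dual:
        |         val: float   # f(x)
        |         der: float   # f'(x)
        | 
        |     def mul(a, b):   # product rule travels with the value
        |         return Dual(a.val * b.val,
        |                     a.val * b.der + a.der * b.val)
        | 
        |     def sin_d(a):    # chain rule for sin
        |         return Dual(math.sin(a.val),
        |                     math.cos(a.val) * a.der)
        | 
        |     # composing functions composes the derivatives for free
        |     x = Dual(2.0, 1.0)      # seed dx/dx = 1
        |     y = sin_d(mul(x, x))    # d/dx sin(x*x) at x = 2
        |     print(y.val, y.der)     # y.der == 4 * cos(4)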
        
         | tome wrote:
         | > the whole structure is easily converted to reverse mode
         | afterwards.
         | 
          | Unfortunately it's not. Elliott never actually demonstrates in
         | the paper how to implement such an algorithm, and it's _very_
         | hard to write compiler transformations in  "categorical form".
         | 
          | (Disclosure: I'm the author of another paper on AD.)
        
           | amkkma wrote:
           | which paper?
        
           | orbifold wrote:
           | I think JAX effectively demonstrates that this is indeed
           | possible. The approach they use is to first linearise the
           | JAXPR and then transpose it, pretty much in the same fashion
            | as the Elliott paper did.
        
       | PartiallyTyped wrote:
        | The nice thing about differentiable programming is that we can
        | use all sorts of optimizers beyond plain gradient descent, some
        | of which offer quadratic convergence instead of linear!
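        | 
        | For example, a bare-bones Newton iteration built from
        | jax.grad and jax.hessian (toy objective, purely
        | illustrative):
        | 
        |     import jax
        |     import jax.numpy as jnp
        | 
        |     def f(x):  # small convex toy objective
        |         return jnp.sum((x - 3.0) ** 2 + 0.1 * x ** 4)
        | 
        |     x = jnp.array([0.0, 1.0])
        |     for _ in range(10):
        |         g = jax.grad(f)(x)
        |         # the full Hessian is fine for a handful of params
        |         H = jax.hessian(f)(x)
        |         x = x - jnp.linalg.solve(H, g)   # Newton step
        |     print(x)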
        
         | SleekEagle wrote:
          | Yes exactly! This is huge. Hessian optimization is really easy
          | with JAX; I haven't tried it in Julia though.
        
           | PartiallyTyped wrote:
            | And very fast, given that you compile the procedure! I am
            | considering writing an article on this and posting it here
            | because I have seen enormous improvements over non-jitted
            | code, and that's excluding jax.vmap.
        
             | SleekEagle wrote:
             | There's a comparison of JAX with PyTorch for Hessian
             | calculation here!
             | 
             | https://www.assemblyai.com/blog/why-you-should-or-
             | shouldnt-b...
             | 
             | Would definitely be interested in an article like that if
             | you decide to write it
        
           | ChrisRackauckas wrote:
           | Here's Hessian-Free Newton-Krylov on neural ODEs with Julia: 
           | https://diffeqflux.sciml.ai/dev/examples/second_order_adjoin.
           | .. . It's just standard tutorial stuff at this point.
        
         | applgo443 wrote:
         | Why can't we use this quadratic convergence in deep learning?
        
           | PartiallyTyped wrote:
            | Well, quadratic convergence usually requires the Hessian, or
            | an approximation of it, and that's difficult to get in deep
            | learning due to memory constraints and the difficulty of
            | computing second-order derivatives.
            | 
            | Computing the derivatives is not very difficult with e.g.
            | JAX, but ... you get back to the memory issue. The Hessian is
            | a square matrix, so in deep learning, if we have a million
            | parameters, the Hessian is a million-by-million matrix, i.e.
            | about a trillion entries...
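            | 
            | (A common workaround, sketched below with JAX, is to use
            | Hessian-vector products, so the full matrix is never
            | materialized; that's roughly what the Hessian-free methods
            | mentioned elsewhere in the thread build on.)
            | 
            |     import jax
            |     import jax.numpy as jnp
            | 
            |     def loss(p):  # stand-in for a model's loss
            |         return jnp.sum(jnp.sin(p) ** 2)
            | 
            |     def hvp(f, x, v):
            |         # Hessian-vector product via grad-of-grad;
            |         # never builds the n x n Hessian
            |         g = lambda p: jnp.vdot(jax.grad(f)(p), v)
            |         return jax.grad(g)(x)
            | 
            |     params = jnp.ones(1_000_000)  # a million params
            |     vec = jnp.ones(1_000_000)
            |     print(hvp(loss, params, vec)[:3])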
        
             | tome wrote:
             | Not only does it have 1 trillion elements, you also have to
             | invert it!
        
               | PartiallyTyped wrote:
                | Indeed! BFGS (and its variants) approximate the inverse
               | but they have other issues that make them prohibitively
               | expensive.
        
               | SleekEagle wrote:
               | https://c.tenor.com/enoxmmTG1wEAAAAC/heart-attack-in-
               | pain.gi...
        
       ___________________________________________________________________
       (page generated 2022-04-13 23:01 UTC)