[HN Gopher] Differentiable Programming - A Simple Introduction
___________________________________________________________________
 
Differentiable Programming - A Simple Introduction
 
Author : dylanbfox
Score  : 97 points
Date   : 2022-04-12 10:21 UTC (1 day ago)
 
(HTM) web link (www.assemblyai.com)
(TXT) w3m dump (www.assemblyai.com)
 
| yauneyz wrote:
| My professor has talked about this. He thinks that the real gem
| of the deep learning revolution is the ability to take the
| derivative of arbitrary code and use that to optimize. Deep
| learning is just one application of that, but there are tons
| more.
| SleekEagle wrote:
| That's part of why Julia is so exciting! Building it
| specifically to be a differentiable programming language opens
| so many doors ...
| mountainriver wrote:
| Julia wasn't really built specifically to be differentiable,
| it was just built in a way that gives you access to the IR,
| which is what Zygote uses. Enzyme AD is the most exciting to
| me because any LLVM language can be made differentiable.
| SleekEagle wrote:
| Ah I see, thank you for clarifying. And thank you for
| bringing Enzyme to my attention - I've never seen it
| before!
| melony wrote:
| I am just happy that the previously siloed fields of operations
| research and various control theory sub-disciplines are now
| incentivized to pool their research together thanks to the
| funding in ML. Also, a lot of the expensive, proprietary
| optimization software in industry is finally getting some
| competition.
| SleekEagle wrote:
| Hm, I didn't know different areas of control theory were
| siloed. Learning about control theory in graduate school was
| awesome, and it seems like a field that would benefit from ML
| a lot. I know they use RL agents for control tasks, e.g.
| cartpole, but I would've thought it would be more widespread!
| Do you think the development of Differentiable Programming
| (i.e. the observation that it generalizes beyond pure ML/DL)
| was really the missing piece?
|
| Also, just curious, what are your studies in?
| melony wrote:
| Control theory has a very, very long parallel history
| alongside ML. ML, specifically probabilistic and
| reinforcement learning, uses a lot of dynamic programming
| ideas and Bellman equations in its theoretical modeling.
| Look up the term cybernetics; it is an old pre-internet-era
| term for control theory and optimization. The Soviets even
| had a grand scheme to build networked factories that could
| be centrally optimized and resource-allocated. Their Slavic
| communist AWS-meets-Walmart efforts spawned a Nobel
| laureate; Kantorovich was given the award for inventing
| linear programming.
|
| Unfortunately, the CS field is only just rediscovering
| control theory, even though it has been a staple of EE for
| years. However, there hadn't been many new innovations in
| the field until recently, when ML became the hottest new
| thing.
| SleekEagle wrote:
| This is some insanely cool history! I had no idea the
| Soviets had such a technical vision; that's actually
| pretty amazing. I've heard the term "cybernetics" but
| honestly just thought it was some movie-tech term, lol.
|
| It seems really weird that control theory is in EE
| departments considering it's sooo much more mathematical
| than most EE subdisciplines except signal processing. I
| remember a math professor of mine telling us about
| optimization techniques that control systems practitioners
| would know more about than applied mathematicians because
| they were developed specifically for the field; can't
| remember what the techniques were, though ...
| melony wrote:
| There is this excellent HN-recommended novel called
| _Red Plenty_ that dramatised the efforts on the other
| side of the Atlantic.
|
| https://news.ycombinator.com/item?id=8417882
|
| > _It seems really weird that control theory is in EE
| departments considering it's sooo much more mathematical
| than most EE subdisciplines except signal processing._
|
| I agree; apparently Bellman's reason for calling dynamic
| programming what it is was that he needed grant funding
| during the Cold War days and was advised to give his
| mathematical theories a more "interesting" name.
|
| https://en.m.wikipedia.org/wiki/Dynamic_programming#History
|
| The generalised form of the Bellman Equation (co-formulated
| by Kalman of Kalman filter fame) is to control theory and
| EE in some ways what the Maximum Likelihood function is
| to ML.
|
| https://en.m.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E2%8...
| SleekEagle wrote:
| Looks really cool, added to my Amazon cart. Thanks for
| the rec!
|
| That's hilarious and sadly insightful. I remember thinking
| "what the hell is so 'dynamic' about this?" the first
| time I learned about dynamic programming. Although
| "memoitative programming" sounds pretty fancy too, lol.
| potbelly83 wrote:
| How do you differentiate a string? Enum?
| adgjlsfhk1 wrote:
| Generally you consider them to be piecewise constant.
| tome wrote:
| Or more precisely, discrete.
| 6gvONxR4sf7o wrote:
| The answer to that is a huge part of the NLP field. The
| current answer is that you break the string down into
| constituent parts and map each of them into a high-
| dimensional space. "cat" becomes a large vector whose
| position is continuous and therefore differentiable. "the
| cat" probably becomes a pair of vectors.
| titanomachy wrote:
| If you were dealing with e.g. English words rather than
| arbitrary strings, one approach would be to treat each word
| as a point in n-dimensional space. Then you can use
| continuous (and differentiable) functions to output into that
| space.
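A minimal JAX sketch of the embedding idea in the subthread above;
the toy vocabulary, the 4-dimensional embedding size, and the
stand-in loss are all made up for illustration:

    import jax
    import jax.numpy as jnp

    # Hypothetical toy vocabulary: each token maps to an integer id.
    vocab = {"the": 0, "cat": 1, "sat": 2}

    # Embedding table: one row per token; each row is a point in a
    # continuous 4-dimensional space (learned in a real system).
    key = jax.random.PRNGKey(0)
    emb_table = jax.random.normal(key, (len(vocab), 4))

    def score(table, token_ids):
        # The id lookup is discrete, but the output is smooth in
        # the table entries, so gradients flow into the rows used.
        vecs = table[token_ids]        # "the cat" -> two vectors
        return jnp.sum(vecs ** 2)      # stand-in for a real loss

    ids = jnp.array([vocab["the"], vocab["cat"]])
    grads = jax.grad(score)(emb_table, ids)  # d(score)/d(table)

Only the rows that are actually looked up receive nonzero gradient,
which is how embedding tables get trained in practice.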
| noobermin wrote:
| The article is okay, but it would have helped to have labelled
| the axes of the graphs.
| choeger wrote:
| Nice article, but the intro is a little lengthy.
|
| I have one remark, though: if your language already allows for
| automatic differentiation, why do you bother with a neural
| network in the first place?
|
| I think you should have a good reason why you choose a neural
| network for your approximation of the inverse function, and why
| it has exactly that number of layers. For instance, why
| shouldn't a simple polynomial suffice? Could it be that your
| neural network ends up as an approximation of the Taylor
| expansion of your inverse function?
| SleekEagle wrote:
| I think for more complicated examples like RL control systems
| a neural network is the natural choice. If you can incorporate
| physics into your world model then you'd need differentiable
| programming + NNs, right? Or am I misunderstanding the
| question?
|
| If you're talking about the specific cannon problem, you don't
| need to do any learning at all; you can just solve the
| kinematics, so in some sense you could ask why you're using
| _any_ approximation function.
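For the cannon problem the thread mentions, a differentiable
program plus plain gradient descent might look something like this
sketch; the drag-free physics, target distance, learning rate, and
step count are all illustrative assumptions, not the article's
actual code:

    import jax
    import jax.numpy as jnp

    G = 9.81        # gravity, m/s^2
    V0 = 50.0       # fixed muzzle speed, m/s
    TARGET = 200.0  # hypothetical target distance, m

    def landing_distance(angle):
        # Closed-form range of a drag-free projectile; any
        # differentiable simulation could stand in for this.
        return (V0 ** 2) * jnp.sin(2.0 * angle) / G

    def loss(angle):
        return (landing_distance(angle) - TARGET) ** 2

    grad_loss = jax.grad(loss)

    angle = jnp.array(0.3)  # radians, arbitrary initial guess
    for _ in range(500):    # plain gradient descent on the angle
        angle = angle - 1e-6 * grad_loss(angle)

    # angle approaches ~0.45 rad, whose range is ~200 m

Gradient descent here plays the same role as training a network,
just with a single physical parameter instead of millions of
weights.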
| infogulch wrote:
| The most interesting thing I've seen on AD is "The simple
| essence of automatic differentiation" (2018) [1]. See past
| discussion [2], and talk [3]. I think the main idea is that by
| compiling to categories and pairing up a function with its
| derivative, the pair becomes trivially composable in forward
| mode, and the whole structure is easily converted to reverse
| mode afterwards.
|
| [1]: https://dl.acm.org/doi/10.1145/3236765
|
| [2]: https://news.ycombinator.com/item?id=18306860
|
| [3]: Talk at Microsoft Research:
| https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations
| listed here: https://github.com/conal/essence-of-ad
| tome wrote:
| > the whole structure is easily converted to reverse mode
| afterwards.
|
| Unfortunately it's not. Elliott never actually demonstrates in
| the paper how to implement such an algorithm, and it's _very_
| hard to write compiler transformations in "categorical form".
|
| (Disclosure: I'm the author of another paper on AD.)
| amkkma wrote:
| Which paper?
| orbifold wrote:
| I think JAX effectively demonstrates that this is indeed
| possible. The approach they use is to first linearise the
| JAXPR and then transpose it, pretty much in the same fashion
| as the Elliott paper did.
| PartiallyTyped wrote:
| The nice thing about differentiable programming is that we can
| use all sorts of optimizers other than gradient descent, some
| of which offer quadratic convergence instead of linear!
| SleekEagle wrote:
| Yes, exactly! This is huge. Hessian optimization is really easy
| with JAX; I haven't tried it in Julia, though.
| PartiallyTyped wrote:
| And very fast, given that you compile the procedure! I am
| considering writing an article on this and posting it here,
| because I have seen enormous improvements over non-jitted
| code, and that was without even using jax.vmap.
| SleekEagle wrote:
| There's a comparison of JAX with PyTorch for Hessian
| calculation here!
|
| https://www.assemblyai.com/blog/why-you-should-or-shouldnt-b...
|
| Would definitely be interested in an article like that if
| you decide to write it.
| ChrisRackauckas wrote:
| Here's Hessian-Free Newton-Krylov on neural ODEs with Julia:
| https://diffeqflux.sciml.ai/dev/examples/second_order_adjoin...
| It's just standard tutorial stuff at this point.
| applgo443 wrote:
| Why can't we use this quadratic convergence in deep learning?
| PartiallyTyped wrote:
| Well, quadratic convergence usually requires the Hessian, or
| an approximation of it, and that's difficult to get in deep
| learning due to memory constraints and the difficulty of
| computing second-order derivatives.
|
| Computing the derivatives is not very difficult with e.g.
| JAX, but ... you get back to the memory issue. The Hessian is
| a square matrix, so in deep learning, if we have a million
| parameters, then the Hessian has a trillion entries...
| tome wrote:
| Not only does it have a trillion elements, you also have to
| invert it!
| PartiallyTyped wrote:
| Indeed! BFGS (and its derivatives) approximate the inverse,
| but they have other issues that make them prohibitively
| expensive.
| SleekEagle wrote:
| https://c.tenor.com/enoxmmTG1wEAAAAC/heart-attack-in-pain.gi...
___________________________________________________________________
(page generated 2022-04-13 23:01 UTC)