[HN Gopher] Trade-Offs in Automatic Differentiation: TensorFlow,... ___________________________________________________________________ Trade-Offs in Automatic Differentiation: TensorFlow, PyTorch, Jax, and Julia Author : ChrisRackauckas Score : 175 points Date : 2021-12-25 11:50 UTC (11 hours ago) (HTM) web link (www.stochasticlifestyle.com) (TXT) w3m dump (www.stochasticlifestyle.com) | carterschonwald wrote: | Part of the challenge is that most formulations of (reverse-mode) autodiff wind up requiring extra runtime data structures for the backwards computation step. | | There's been some great work in this space in the past 5 years. | | I've got some stuff I worked out this fall that I'm overdue to write up and share some prototypes of: there is a way to do reverse-mode autodiff isolated to just being an invisible compiler pass! Without any of the extra complexity in what are otherwise equivalent formulations. | The_rationalist wrote: | https://github.com/breandan/kotlingrad#coroutines | snicker7 wrote: | Even though this post's thesis is "trade-offs", it doesn't really talk about any technical advantages that Python's AD ecosystem (TensorFlow, PyTorch, JAX) has over Julia's (Zygote.jl, Diffractor.jl). | adgjlsfhk1 wrote: | It touches on the main one, which is simplicity: it's much easier to write an AD system for a more static language. | civilized wrote: | Maybe it has no technical advantages, unless you count being very popular and in an accessible language as a technical advantage (which it definitely could be, depending on your definition of "technical"). | | Julia is designed for advanced numerical computing and Python isn't. The metaprogramming affordances needed for AD are much better developed in Julia than they ever will be in Python. And let's not forget the immense utility of multiple dispatch in Julia, another feature Python will probably never have. So it's not surprising that Julia is simply way more capable.
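The "extra runtime data structures" carterschonwald mentions are usually a tape (a Wengert list). A minimal sketch of the idea in plain Python — all names here are illustrative, not taken from any real AD library:

```python
class Var:
    """A scalar that records its operations for the backward pass."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (Var, local_gradient) pairs -- the "tape"
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Reverse sweep: push the adjoint back through the recorded parents.
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x = Var(3.0)
y = Var(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # -> 5.0 3.0
```

Tape-building systems like PyTorch and autograd record essentially this per-operation information at runtime, just with optimized kernels and a topologically ordered sweep instead of naive recursion.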
| jaggirs wrote: | One disadvantage of the language itself is the need for compilation, which isn't that fast in my limited experience. But I would love to hear how much this affects iteration speed. | calaphos wrote: | The same issue exists with Jax. XLA compilation can take up quite a bit of time, especially on larger NN models. And there's no persistent compile cache, so even if you don't change the jitted function you need to wait for compilation again when you restart the process. | alevskaya wrote: | Jax does actually already support a persistent compilation cache for TPU, and support for caching GPU compiles is being worked on currently. | civilized wrote: | Yeah, I'd imagine that for things both Python and Julia can do AD-wise, Python may be preferable since it's interpreted and thus gives instant feedback, but all the numerical heavy lifting in packages like Jax and PyTorch is done in fast C++. So you should be getting a more appealing environment for experimentation without losing out on speed. | mountainriver wrote: | The Julia crowd touting multiple dispatch all the time is so strange; it's actually one of the main reasons the language hasn't had much uptake, from what I can tell. | | Python is just more approachable and natural to people. Julia should learn from that. | civilized wrote: | Python is just more OO in style, so people who have been taught OOP in school are comfortable with it. That will include the vast majority of generic SWEs writing generic CRUD apps. | | But personally I find OOP ugly and unnatural, and Julia's model elegant and natural. And far more powerful - Julia programmers are using multiple dispatch to build out scientific computing to a sophistication not seen in any other language.
| | It might not be your cup of tea if you need to see object.method() in your code, but if you're more mentally flexible and want to build the next generation of technical computing tools, Julia is the place to be right now. | mountainriver wrote: | Yeah, I'm definitely mentally flexible and have coded in many paradigms. I don't love OO and generally don't write that way, but multiple dispatch as a primary design pattern is odd. | | I've tried it for close to a year and the ergonomics still felt off. It reminds me of how the Scala crowd talked about functional programming, and we've seen how that turned out. | | I hear this from a lot of people that try Julia, and yet the Julia crowd's answer is always that they are dumb. Sounds a lot like the Scala crowd... | civilized wrote: | I think 90% of the ergonomics issue is that people want dot notation and tab-autocomplete in their IDE so they can type obj.<tab> and get the methods that operate on obj. Which I agree, some version of that should exist, and there's no real reason it can't exist in Julia. The tooling is just not as mature as other languages'. | | Julia is far ahead in affordances to write fancy technical code and fairly behind in simple things, like standard affordances to write more ordinary code, or the ability to quickly load in data and make a plot. | | I just think it's a misdiagnosis to blame multiple dispatch for this issue. It's much more about the Julia community prioritizing the needs of their target market. | cl3misch wrote: | > fun fact, the Jax folks at Google Brain did have a Python source code transform AD at one point but it was scrapped essentially because of these difficulties | | I assume you mean autograd? | | https://github.com/HIPS/autograd | ChrisRackauckas wrote: | No, autograd acts similarly to PyTorch in that it builds a tape that it reverses, while PyTorch just comes with more optimized kernels (and kernels that act on GPUs).
The AD that I was referencing was tangent (https://github.com/google/tangent). It was an interesting project, but it's hard to see who the audience is. Generating Python source code makes things harder to analyze, and you cannot JIT compile the generated code unless you could JIT compile Python. So you might as well first trace to a JIT-compilable sublanguage and do the actions there, which is precisely what Jax does. In theory tangent is a bit more general, and maybe you could mix it with Numba, but then it's hard to justify. If it's more general then it's not for the standard ML community, for the same reason as the Julia tools, but then it had better do better than the Julia tools in the specific niche that they are targeting. That generality means that it cannot use XLA, and thus from day 1 it wouldn't get the extra compiler optimizations that something which uses XLA does (Jax). Jax just makes much more sense for the people who were building it; it chose its niche very well. | brilee wrote: | FYI - Tangent evolved into TF2's AutoGraph. | fault1 wrote: | it's quite interesting how, at least in ML, the transformer architecture has 'won out' for the time being; it appears to be everywhere these days: https://threadreaderapp.com/thread/1468370605229547522.html | | the advantage of transformers (computationally) seems to be how little sophistication the attention mechanism needs from AD systems (and how well it appears to scale with data). it's also a very static architecture from a data flow/control flow perspective. | | as far as I understand, this is far different from systems needing to be modeled in continuous time, especially things like SDEs. I am curious if things like delay embeddings will ever be modeled in terms of mechanisms similar to attention, however. | liuliu wrote: | Like other comments note, CNNs and LSTMs are still in wide use.
If you dig deep enough, position encoding doesn't really capture time-based series information that well. | blovescoffee wrote: | Could you elaborate? I've built some causal CNNs but never used a transformer for time series data. What are the challenges? | jowday wrote: | Outside of research, transformers are rarely used for computer vision problems and CNNs remain the go-to architecture. And you actually need to do some hacks to get transformers to work with computer vision at a meaningful scale (splitting images into patches and convolving the patches to produce features to feed into the transformer). | fault1 wrote: | > some hacks to get transformers to work with computer vision at a meaningful scale (splitting images into patches and convolving the patches to produce features to feed into the transformer). | | sounds a lot like 'classical computer vision'. e.g., when I learned the subject (mid 2000s), topological features were all the rage: https://en.wikipedia.org/wiki/Digital_topology | xiphias2 wrote: | It will be interesting when Tesla and Waymo move to the transformer architecture, but as you wrote, my guess is that it's not yet in production for vision tasks. | jowday wrote: | I'm not sure they will, at least not with the research in the state it is presently. Researchers are interested in vision transformers because they're competitive with CNNs if you give them enough training data - they don't drastically outperform them. | | Right now switching over to them would require a ton of code changes, relearning intuitions, debugging, profiling, etc. for not a ton of benefit. | xiphias2 wrote: | Sure, I think the same, but the tweets came from Andrej Karpathy; he's watching this space like an eagle. | liuliu wrote: | Tesla did, as mentioned in their AI Day. It is not a full transformer (aka ViT). They use a transformer decoder to synthesize data from different cameras and decode 3D coordinates directly (a la DETR).
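For reference on liuliu's point about position encoding: a quick sketch of the sinusoidal scheme from "Attention Is All You Need" (the dimension count here is chosen arbitrarily for illustration). The encoding is a function of the token index alone, so irregular sampling intervals in a time series are invisible to it:

```python
import math

def positional_encoding(pos, d_model=8):
    """Sinusoidal positional encoding: for each dimension pair i,
    emit sin/cos of pos scaled by 10000^(i/d_model)."""
    enc = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc

# Positions are encoded purely by index: a series sampled at
# t = 0.0s, 0.1s, 5.0s gets the same three encodings as one sampled
# at t = 0s, 1s, 2s -- one sense in which real elapsed time is lost.
print(positional_encoding(0))  # first entries: 0.0, 1.0, ...
```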
| xiphias2 wrote: | Thanks, sounds great, I'll read the DETR paper | joconde wrote: | I've looked into transformers for semantic segmentation, but the patching aspect seems to make it hard too. Do you have some sources that describe these hacks in detail? | lowdose wrote: | You could do a code search on GitHub. I'm pretty lazy when it comes to coding; I always seem to find a repo that has implemented an MVP of what I already had in mind. There are some gold nuggets on GitHub, like Google's DDSP implementation, which they published anonymously in academia. | The_rationalist wrote: | Kotlingrad makes any autodiff library look pale in comparison. | mark_l_watson wrote: | Interesting read. I was very disappointed when the Swift TensorFlow project withered away. A good general-purpose programming language combined with deep learning seemed like a great idea. Apple actually provides a good dev experience with Swift and CoreML (for fun I wrote a Swift/SwiftUI/CoreML app that uses two deep learning models and is in the App Store). | | Wolfram Language takes an approach similar to Apple's in providing a good number of pre-trained models, but I haven't yet discovered any automatic differentiation examples. | | Of the frameworks described in the article, I find Julia most interesting, but I need to use Python and TensorFlow in my work. | ChrisRackauckas wrote: | I remember reading an early S4TF manifesto talking about natural language processing, image processing, etc., thinking: that cannot be your audience, because that audience already has AD systems which support their domain. Building something that is more general for the sake of being more general is never a good idea; that's bad engineering. It had a lot of great ideas, and indeed the dev experience seemed nice. But I would venture to guess that the Google overlords had to question what the true value of S4TF was in that light.
"Standard ML" cannot be your | target audience if you want to work on AD extensions. | | Following this thread, you can also see what how the Julia | tools evolved. If you see the paper that was the synthesis for | Zygote.jl, it was all about while loops and scalar operations | (https://arxiv.org/abs/1810.07951). Why did that not completely | change ML? Well, ML doesn't use those kinds of operations. I | would say the project kind of started as a "tool looking for a | problem". It did get a bit lucky that it found a problem: | scientific applications need to be able to use automatic | differentiation without rewriting the whole codebase to an ML | library, leading to the big Julia AD manifesto of language-wide | differentiable programming by directly acting on the Julia | source itself rather than a language subset (http://ceur- | ws.org/Vol-2587/article_8.pdf). Zygote was a good AD, but not a | great AD, why? Because it could not hit this goal, mostly | because of its lack of mutation handling. Yes, it does handle | standard ML just fine, but does not justify its added | complexity. | | What has actually kept Julia AD research going is that some | scientific machine learning applications, specifically physics- | informed neural networks (PINNs), require very high order | derivatives. For example, to solve the PDE u_t = u_xx with | neural networks, you need to take the third derivative of the | neural network. With Jax this can only be done with a separate | language subset (https://openreview.net/pdf?id=SkxEF3FNPH), and | thus a new AD for Julia to replace Zygote, known as | Diffractor.jl, was devised to automatically incorporate higher | order AD optimizations as part of the regular usage | (https://www.youtube.com/watch?v=mQnSRfseu0c). It is these PINN | SciML applications that have funded its development and is its | built-in audience: it solves a problem nothing else does, even | if it is potentially niche. 
Similarly with Enzyme: it solved the problem of how to do mutation well, which is why you can see in the paper that its applications are mostly ODE and PDE solvers (Euler, RK4, the Bruss semilinear PDE) (https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b...). TorchScript and Jax do not handle this domain well, so it has an audience, which may (or may not?) be niche. | | A big part of writing this blog post was to highlight this to the Julia AD crew that I regularly work with. What will keep these projects alive is understanding the engineering trade-offs that are made and who the audience is. The complexity has a cost, so it had better have a benefit. If that target is lost, if any benefit is a theoretical "but you may need more features some day", then the projects will lose traction. The project needs to be two-fold: identify new architectures and applications that would benefit from expanded language support from AD, and build good support for those projects. Otherwise it is just training a transformer in Julia vs training a transformer in Python, and that is not justifiable. | p1esk wrote: | My impression from your comment is that you don't care that much about "standard" ML users. As a "standard" ML user (pytorch/jax), and a potential Julia user in the future, this is not what I like to hear. | borodi wrote: | The idea, I imagine, is to differentiate what the Julia ML stack offers over what is already in Python. If it offers the same thing, but without the funding from Facebook or Google, why bother switching? It has to offer something more. | The_rationalist wrote: | Note however that Kotlin's differentiable programming effort is backed by Facebook https://ai.facebook.com/blog/paving-the-way-for-software-20-... and that there is a mature library for autodiff: https://github.com/breandan/kotlingrad | 2sk21 wrote: | I am really glad to hear this. As it happens, my main post-retirement project has been to learn Swift to try out CoreML.
I'm really enjoying learning this so far. | hnarayanan wrote: | This is really interesting to me. Could you please share your learning pathway? | albertzeyer wrote: | This misses some discussion of tf.function, which does a Python AST-level transformation to a static TF computation graph, including dynamic control flow like loops and conditional branches. | chillee wrote: | tf.function is largely morally equivalent to TorchScript, which he does discuss. | spacetracks wrote: | "The second factor, and probably the more damning one, is that most ML codes don't actually use that much dynamism." I would argue that this is true precisely because it is not available in an AD system. When I tell friends and coworkers about what Zygote can do, they light up and start describing different use cases they have that could benefit from AD. Diff eq solving is a big one. | KKKKkkkk1 wrote: | This is because continuous optimization is useless when crossing a discontinuity, which is what control flow creates. Even in a trivial situation like ReLU, where the control flow is mimicking a continuous transition, you have the "dead ReLU" problem, where you have to start training on the correct side of the discontinuity and make sure to never cross. | ogogmad wrote: | I don't know whether this belongs here, but... | | Formally, there is a generalisation of differentiation which can handle functions like ReLU (i.e. locally Lipschitz non-differentiable functions) by allowing a derivative to be set-valued. It's called the _Clarke gradient_. The Clarke gradient of ReLU at 0 is the closed interval [0,1]. Note that the Clarke gradient doesn't satisfy the chain rule (except in a weakened form), which might seriously mess up some assumptions about autodiff. Is this generalised derivative useful in autodiff? | | I imagine that this is a largely theoretical tool that's useful in analysing algorithms but useless for actually computing things.
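As a small hedge on that last guess: subgradient methods do compute with set-valued derivatives numerically, by picking one element of the set at each step. A toy sketch on f(x) = |x - 3| (the function and the step-size schedule are made up for illustration):

```python
def f(x):
    return abs(x - 3.0)           # nondifferentiable at x = 3

def subgrad(x):
    # Any element of the subdifferential is a valid choice; at the
    # kink the set is the whole interval [-1, 1] and we pick 0.
    if x > 3.0:
        return 1.0
    if x < 3.0:
        return -1.0
    return 0.0

x = 0.0
for k in range(1, 2001):
    x -= (1.0 / k) * subgrad(x)   # classic diminishing step size ~ 1/k

print(round(x, 2))  # oscillates toward, and settles near, the minimizer x = 3
```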
| medo-bear wrote: | I haven't heard of the Clarke gradient before. Convex analysis has something called the subgradient [0]; is it different? | | [0] https://en.m.wikipedia.org/wiki/Subderivative | ogogmad wrote: | The subgradient in convex analysis is a special case of the Clarke gradient. The subgradient is precisely the Clarke gradient for convex functions. Convex functions are always locally Lipschitz except in weird cases. | | [edit] | | Question: Are there numerical applications in which the subgradient is actually computed, or is it a purely analytical tool? | agnosticmantis wrote: | (Stochastic) subgradient methods are used in practice to optimize non-differentiable convex functions. They have a slower convergence rate than (stochastic) gradient descent, though. | | See for example: https://www.stat.cmu.edu/~ryantibs/convexopt-F15/lectures/07... | fault1 wrote: | yes, see: https://juliadiff.org/ChainRulesCore.jl/dev/maths/nondiff_po... | | and related neat usages in set-based optimization methods in MathOptInterface (part of JuMP.jl): https://matbesancon.xyz/post/2020-12-24-chains_sets2/ | sockfish wrote: | I often wonder why so much effort is being put into shoehorning everything into a single language. Wouldn't it make much more sense to use a fully differentiable DSL for machine learning / XLA, then call it from whatever host language you use? This approach has worked really well for SQL for the past couple of decades. | woadwarrior01 wrote: | You might like dex[1]. | | [1]: https://github.com/google-research/dex-lang | niklasd wrote: | Has it worked really well? I feel ORMs are a sign it hasn't. Though I really enjoy having learned SQL and being able to interact with almost all relational databases. | handzhiev wrote: | Imo ORM is mostly a sign that (for some odd reason) many developers don't want to learn / use SQL. | | But what actual problem is ORM solving beyond that?
| viraptor wrote: | They solve 99% of your queries in much less time while allowing you to drop down to SQL when you really want/need it. | dnautics wrote: | ORMs solve "don't accidentally introduce an SQL injection". | mbStavola wrote: | Neither do prepared statements. | baq wrote: | Boilerplate. Writing serializers and deserializers by hand is not an efficient use of developer time. | | Related to ORMs, but not quite on topic - query building. Type-checked queries, parts of which can be passed around business logic, are very powerful and flexible. | ithkuil wrote: | There are more and more libraries that let you write SQL and bind the results into native records (objects, structs) in the host language. I find it an interesting middle ground. | vkkhare wrote: | What do people think of the automatic differentiation support Facebook was building for Kotlin? | | They called it differentiable programming: https://ai.facebook.com/blog/paving-the-way-for-software-20-... | secondcoming wrote: | This [0] video has quite an interesting proposal to add AD to compilers. | | [0] https://youtu.be/1QQj1mAV-eY ___________________________________________________________________ (page generated 2021-12-25 23:00 UTC)