[HN Gopher] Trade-Offs in Automatic Differentiation: TensorFlow,...
       ___________________________________________________________________
        
       Trade-Offs in Automatic Differentiation: TensorFlow, PyTorch, Jax,
       and Julia
        
       Author : ChrisRackauckas
       Score  : 175 points
       Date   : 2021-12-25 11:50 UTC (11 hours ago)
        
 (HTM) web link (www.stochasticlifestyle.com)
 (TXT) w3m dump (www.stochasticlifestyle.com)
        
       | carterschonwald wrote:
        | Part of the challenge is that most formulations of (reverse-
        | mode) autodiff wind up requiring extra runtime data structures
        | for the backwards computation step.
       | 
       | There's been some great work in this space in the past 5 years.
       | 
        | I've got some stuff I worked out this fall that I'm overdue to
        | write up and share as prototypes: there is a way to do reverse-
        | mode autodiff as nothing more than an invisible compiler pass,
        | without any of the extra complexity of what are otherwise
        | equivalent formulations.
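        | 
        | To make the "extra runtime data structures" point concrete,
        | here is a minimal sketch of a tape-based (Wengert list) reverse
        | mode in Julia. Purely illustrative: `Node`, `record!`, and
        | `backprop!` are made-up names, not any real library's API.
        | 
        |     # Minimal tape-based reverse-mode AD sketch.
        |     mutable struct Node
        |         value::Float64
        |         grad::Float64
        |         backward::Function  # sends this node's grad to inputs
        |     end
        |     Node(v) = Node(v, 0.0, _ -> nothing)
        | 
        |     const TAPE = Node[]  # the extra runtime structure
        | 
        |     function record!(v, backward)
        |         n = Node(v, 0.0, backward)
        |         push!(TAPE, n)
        |         return n
        |     end
        | 
        |     Base.:*(a::Node, b::Node) = record!(a.value * b.value,
        |         n -> (a.grad += n.grad * b.value;
        |               b.grad += n.grad * a.value))
        |     Base.:+(a::Node, b::Node) = record!(a.value + b.value,
        |         n -> (a.grad += n.grad; b.grad += n.grad))
        | 
        |     function backprop!(out::Node)
        |         out.grad = 1.0
        |         for n in reverse(TAPE)  # walk the tape backwards
        |             n.backward(n)
        |         end
        |     end
        | 
        |     x, y = Node(2.0), Node(3.0)
        |     z = x * y + x
        |     backprop!(z)  # now x.grad == 4.0, y.grad == 2.0
        | 
        | The point is the TAPE vector: the forward pass has to allocate
        | and retain it so the backwards sweep has something to walk.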
        
         | The_rationalist wrote:
         | https://github.com/breandan/kotlingrad#coroutines
        
       | snicker7 wrote:
        | Even though this post's thesis is "trade-offs", it doesn't
        | really talk about any technical advantages that Python's AD
        | ecosystem (TensorFlow, PyTorch, JAX) has over Julia's
        | (Zygote.jl, Diffractor.jl).
        
         | adgjlsfhk1 wrote:
          | It touches on the main one, which is simplicity: it's much
          | easier to write an AD system for a more static language.
        
         | civilized wrote:
         | Maybe it has no technical advantages, unless you count being
         | very popular and in an accessible language as a technical
         | advantage (which it definitely could be depending on your
         | definition of "technical").
         | 
         | Julia is designed for advanced numerical computing and Python
         | isn't. The metaprogramming affordances needed for AD are much
         | better developed in Julia than they ever will be in Python. And
         | let's not forget the immense utility of multiple dispatch in
         | Julia, another feature Python will probably never have. So it's
         | not surprising that Julia is simply way more capable.
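          | 
          | To make that concrete, here is a tiny sketch of the dispatch-
          | based style, loosely in the spirit of ForwardDiff.jl (the
          | `Dual` type here is made up for illustration):
          | 
          |     # A toy dual number: generic code picks up derivatives
          |     # just by us adding methods to the Base operators.
          |     struct Dual
          |         val::Float64  # value
          |         der::Float64  # derivative
          |     end
          | 
          |     Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val,
          |                                      a.der + b.der)
          |     Base.:*(a::Dual, b::Dual) =
          |         Dual(a.val * b.val, a.der * b.val + a.val * b.der)
          |     Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)
          | 
          |     f(x) = sin(x * x)      # generic, knows nothing of Dual
          |     f(Dual(2.0, 1.0)).der  # f'(2) == 4cos(4), via dispatch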
        
           | jaggirs wrote:
            | One disadvantage of the language itself is the need for
            | compilation, which isn't that fast in my limited experience.
           | But I would love to hear how much this affects iteration
           | speed.
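            | 
            | For a sense of the mechanics: the cost is paid on the first
            | call of each function/type combination, so it mostly shows
            | up as first-call latency. A sketch (actual timings vary by
            | machine and Julia version):
            | 
            |     f(x) = sum(abs2, x)  # any freshly defined function
            |     x = rand(1000)
            | 
            |     @time f(x)  # first call: includes JIT compile time
            |     @time f(x)  # second call: just run time (microseconds)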
        
             | calaphos wrote:
              | The same issue exists with Jax. XLA compilation can take
              | up quite a bit of time, especially on larger NN models.
              | And there's no persistent compile cache, so even if you
              | don't change the jitted function, you have to wait for
              | compilation again whenever you restart the process.
        
               | alevskaya wrote:
               | Jax does actually already support a persistent
               | compilation cache for TPU, and support for caching GPU
               | compiles is being worked on currently.
        
             | civilized wrote:
              | Yeah, I'd imagine that for things both Python and Julia
              | can do AD-wise, Python may be preferable since it's
              | interpreted and thus gives instant feedback, while all the
              | numerical heavy lifting in packages like Jax and PyTorch
              | is done in fast C++. So you should get a more appealing
              | environment for experimentation without losing out on
              | speed.
        
           | mountainriver wrote:
            | The Julia crowd touting multiple dispatch all the time is so
            | strange; from what I can tell, it's actually one of the main
            | reasons the language hasn't had much uptake.
            | 
            | Python is just more approachable and natural to people.
            | Julia should learn from that.
        
             | civilized wrote:
             | Python is just more OO style, so people who have been
             | taught OOP in school are comfortable with it. That will
             | include the vast majority of generic SWEs writing generic
             | CRUD apps.
             | 
             | But personally I find OOP ugly and unnatural, and Julia's
             | model elegant and natural. And far more powerful - Julia
             | programmers are using multiple dispatch to build out
             | scientific computing to a sophistication not seen in any
             | other language.
             | 
             | It might not be your cup of tea if you need to see
             | object.method() in your code, but if you're more mentally
             | flexible and want to build the next generation of technical
             | computing tools, Julia is the place to be right now.
        
               | mountainriver wrote:
                | Yeah, I'm definitely mentally flexible and have coded in
                | many paradigms. I don't love OO and generally don't
                | write that way, but multiple dispatch as a primary
                | design pattern is odd.
                | 
                | I've tried it for close to a year and the ergonomics
                | still felt off. It reminds me of how the Scala crowd
                | talked about functional programming, and we've seen how
                | that turned out.
                | 
                | I hear this from a lot of people who try Julia, and yet
                | the Julia crowd's answer is always that they are dumb.
                | Sounds a lot like the Scala crowd...
        
               | civilized wrote:
                | I think 90% of the ergonomics issue is that people want
                | dot notation and tab-autocomplete in their IDE, so they
                | can type obj.<tab> and get the methods that operate on
                | obj. I agree some version of that should exist, and
                | there's no real reason it can't exist in Julia; the
                | tooling is just not as mature as other languages'.
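                | 
                | For what it's worth, a rough equivalent already exists
                | in the REPL, just not spelled with a dot: a sketch
                | using the stdlib's `methodswith`:
                | 
                |     using InteractiveUtils  # loaded by default in REPL
                | 
                |     # methods whose signatures mention a given type
                |     methodswith(Dict)
                |     # include methods defined on supertypes too
                |     methodswith(AbstractRange; supertypes=true)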
               | 
                | Julia is far ahead in affordances for writing fancy
                | technical code, and fairly behind in simple things, like
                | standard affordances for writing more ordinary code or
                | the ability to quickly load in data and make a plot.
               | 
               | I just think it's a misdiagnosis to blame multiple
               | dispatch for this issue. It's much more about the Julia
               | community prioritizing the needs of their target market.
        
       | cl3misch wrote:
       | > fun fact, the Jax folks at Google Brain did have a Python
       | source code transform AD at one point but it was scrapped
       | essentially because of these difficulties
       | 
       | I assume you mean autograd?
       | 
       | https://github.com/HIPS/autograd
        
         | ChrisRackauckas wrote:
          | No, autograd acts similarly to PyTorch in that it builds a
          | tape that it reverses, while PyTorch just comes with more
          | optimized kernels (and kernels that act on GPUs). The AD that
          | I was referencing was tangent
          | (https://github.com/google/tangent). It was an interesting
          | project, but it's hard to see who the audience is. Generating
          | Python source code makes things harder to analyze, and you
          | cannot JIT compile the generated code unless you can JIT
          | compile Python. So you might as well first trace to a JIT-
          | compilable sublanguage and do the transformations there,
          | which is precisely what Jax does. In theory tangent is a bit
          | more general, and maybe you could mix it with Numba, but then
          | it's hard to justify. If it's more general, then it's not for
          | the standard ML community, for the same reason as the Julia
          | tools; but then it had better do better than the Julia tools
          | in the specific niche that they are targeting. That
          | generality means that it cannot use XLA, and thus from day 1
          | it wouldn't get the extra compiler optimizations that
          | something built on XLA (like Jax) gets. Jax just makes much
          | more sense for the people who were building it; it chose its
          | niche very well.
        
           | brilee wrote:
           | FYI - Tangent evolved into TF2's AutoGraph.
        
       | fault1 wrote:
        | it's quite interesting how, at least in ML, the transformer
        | architecture has 'won out' for the time being; it appears to be
        | everywhere these days:
        | https://threadreaderapp.com/thread/1468370605229547522.html
       | 
       | the advantage of transformers (computationally) seems to be how
       | little sophistication the attention mechanism needs from AD
       | systems (and how well it appears to scale with data). it's also a
       | very static architecture in terms of a data flow/control flow
       | perspective.
       | 
        | as far as I understand, this is far different from systems that
        | need to be modeled in continuous time, especially things like
        | SDEs. I am curious whether things like delay embeddings will
        | ever be modeled in terms of mechanisms similar to attention,
        | however.
        
         | liuliu wrote:
          | As other comments note, CNNs and LSTMs are still in wide use
          | today. If you dig deep enough, position encoding doesn't
          | really capture time-based series information that well.
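          | 
          | For context, the standard sinusoidal encoding just tags each
          | position with fixed sin/cos features. A sketch of the
          | "Attention Is All You Need" formula (sizes made up); note it
          | encodes the integer index, not actual timestamps or irregular
          | gaps:
          | 
          |     # PE(t, 2i)   = sin(t / 10000^(2i/d))
          |     # PE(t, 2i+1) = cos(t / 10000^(2i/d))
          |     function positional_encoding(npos::Int, d::Int)
          |         pe = zeros(npos, d)
          |         for t in 0:npos-1, i in 0:2:d-1
          |             freq = 1 / 10000^(i / d)
          |             pe[t+1, i+1] = sin(t * freq)
          |             i + 2 <= d && (pe[t+1, i+2] = cos(t * freq))
          |         end
          |         return pe
          |     end
          | 
          |     pe = positional_encoding(128, 64)  # rows are positions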
        
           | blovescoffee wrote:
            | Could you elaborate? I've built some causal CNNs but never
            | used a transformer for time series data. What are the
            | challenges?
        
         | jowday wrote:
          | Outside of research, transformers are rarely used for computer
          | vision problems and CNNs remain the go-to architecture. And
          | you actually need to do some hacks to get transformers to work
          | with computer vision at a meaningful scale (splitting images
          | into patches and convolving the patches to produce features to
          | feed into the transformer).
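          | 
          | The patching step itself is simple in isolation. A rough
          | sketch of the ViT-style front end (the 16x16 patch size is
          | just the usual default, and a real model uses a learned
          | projection or strided conv rather than this plain reshape):
          | 
          |     # Split an HxWxC image into flattened PxP patch tokens.
          |     function to_patches(img::Array{Float32,3}, P::Int=16)
          |         H, W, C = size(img)
          |         @assert H % P == 0 && W % P == 0
          |         patches = [vec(img[i:i+P-1, j:j+P-1, :])
          |                    for i in 1:P:H, j in 1:P:W]
          |         # each token is a length P*P*C vector; a learned
          |         # projection then maps it to the model dimension
          |         return reduce(hcat, vec(patches))
          |     end
          | 
          |     tokens = to_patches(rand(Float32, 224, 224, 3))
          |     size(tokens)  # (768, 196): 196 tokens of length 768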
        
           | fault1 wrote:
            | > some hacks to get transformers to work with computer
            | vision at a meaningful scale (splitting images into patches
            | and convolving the patches to produce features to feed into
            | the transformer).
            | 
            | sounds a lot like 'classical computer vision'. e.g., when I
            | learned the subject (mid 2000s), topological features were
            | all the rage: https://en.wikipedia.org/wiki/Digital_topology
        
           | xiphias2 wrote:
            | It will be interesting when Tesla and Waymo move to a
            | transformer architecture, but as you wrote, my guess is that
            | it's not yet in production for vision tasks.
        
             | jowday wrote:
             | I'm not sure they will, at least not with the research in
             | the state it is presently. Researchers are interested in
             | vision transformers because they're competitive with CNNs
             | if you give them enough training data - they don't
             | drastically outperform them.
             | 
             | Right now switching over to them would require a ton of
             | code changes, relearning intuitions, debugging, profiling,
             | etc. for not a ton of benefit.
        
               | xiphias2 wrote:
                | Sure, I think the same, but the tweets came from Andrej
                | Karpathy, and he's watching this space like an eagle.
        
             | liuliu wrote:
              | Tesla did, as mentioned in their AI Day. It is not a full
              | transformer (a la ViT); they use a transformer decoder to
              | synthesize data from the different cameras and decode 3D
              | coordinates directly (a la DETR).
        
               | xiphias2 wrote:
               | Thanks, sounds great, I'll read the DETR paper
        
           | joconde wrote:
           | I've looked into transformers for semantic segmentation, but
           | the patching aspect seems to make it hard too. Do you have
           | some sources that describe these hacks in detail?
        
             | lowdose wrote:
              | You could do a code search on GitHub. I'm pretty lazy in
              | that aspect of coding; I always seem to find a repo that
              | has implemented an MVP of what I already had in mind.
              | There are some gold nuggets on GitHub, like Google's DDSP
              | implementation, which they published anonymously for
              | academic review.
        
       | The_rationalist wrote:
        | Kotlingrad makes other autodiff libraries look pale in
        | comparison...
        
       | mark_l_watson wrote:
       | Interesting read. I was very disappointed when the Swift
       | TensorFlow project withered away. A good general purpose
       | programming language combined with deep learning seemed like a
        | great idea. Apple actually provides a good dev experience with
        | Swift and CoreML (for fun I wrote a Swift/SwiftUI/CoreML app
        | that uses two deep learning models and is now in the App Store).
        | 
        | Wolfram Language takes an approach similar to Apple's in
        | providing a good number of pre-trained models, but I haven't yet
        | discovered any automatic differentiation examples.
       | 
       | Of the frameworks described in the article, I find Julia most
       | interesting but I need to use Python and TensorFlow in my work.
        
         | ChrisRackauckas wrote:
          | I remember reading an early S4TF manifesto talking about
          | natural language processing, image processing, etc., and
          | thinking: that cannot be your audience, because that audience
          | already has AD systems which support their domain. Building
          | something that is more general for the sake of being more
          | general is never a good idea; that's bad engineering. It had
          | a lot of great ideas, and indeed the dev experience seemed
          | nice. But I would venture to guess that the Google overlords
          | had to question what the true value of S4TF was in that
          | light. "Standard ML" cannot be your target audience if you
          | want to work on AD extensions.
          | 
          | Following this thread, you can also see how the Julia tools
          | evolved. If you look at the paper that was the synthesis for
          | Zygote.jl, it was all about while loops and scalar operations
          | (https://arxiv.org/abs/1810.07951). Why did that not
          | completely change ML? Well, ML doesn't use those kinds of
          | operations. I would say the project kind of started as a
          | "tool looking for a problem". It did get a bit lucky that it
          | found a problem: scientific applications need to be able to
          | use automatic differentiation without rewriting the whole
          | codebase to an ML library, leading to the big Julia AD
          | manifesto of language-wide differentiable programming by
          | directly acting on the Julia source itself rather than a
          | language subset (http://ceur-ws.org/Vol-2587/article_8.pdf).
          | Zygote was a good AD but not a great AD. Why? Because it
          | could not hit this goal, mostly because of its lack of
          | mutation handling. Yes, it does handle standard ML just fine,
          | but that does not justify its added complexity.
         | 
          | What has actually kept Julia AD research going is that some
          | scientific machine learning applications, specifically
          | physics-informed neural networks (PINNs), require very high
          | order derivatives. For example, to solve the PDE u_t = u_xx
          | with neural networks, you need to take the third derivative
          | of the neural network. With Jax this can only be done with a
          | separate language subset
          | (https://openreview.net/pdf?id=SkxEF3FNPH), and thus a new AD
          | for Julia to replace Zygote, known as Diffractor.jl, was
          | devised to automatically incorporate higher order AD
          | optimizations as part of regular usage
          | (https://www.youtube.com/watch?v=mQnSRfseu0c). It is these
          | PINN SciML applications that have funded its development, and
          | they are its built-in audience: it solves a problem nothing
          | else does, even if it is potentially niche. Similarly with
          | Enzyme: it solved the problem of how to do mutation well,
          | which is why the applications in its paper are mostly ODE and
          | PDE solvers (Euler, RK4, the Bruss semilinear PDE)
          | (https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b...).
          | Torchscript and Jax do not handle this domain well, so it has
          | an audience, which may (or may not?) be niche.
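          | 
          | To make the higher-order requirement concrete, here is a
          | sketch of the nesting involved for the u_xx term, using a toy
          | one-hidden-layer network (sizes and weights made up; the hard
          | part in practice is making this fast and composable with
          | reverse mode for the training step):
          | 
          |     using ForwardDiff
          | 
          |     # toy scalar network u(x): one tanh hidden layer
          |     W1, b1, W2 = randn(8), randn(8), randn(8)
          |     u(x) = sum(W2 .* tanh.(W1 .* x .+ b1))
          | 
          |     u_x(x)  = ForwardDiff.derivative(u, x)    # du/dx
          |     u_xx(x) = ForwardDiff.derivative(u_x, x)  # d2u/dx2
          | 
          |     # PINN-style residual loss at sample points; training
          |     # then differentiates this w.r.t. the weights, which is
          |     # the third derivative level
          |     xs = range(0, 1; length=16)
          |     loss() = sum(abs2, u_xx(x) for x in xs)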
         | 
          | A big part of writing this blog post was to highlight this to
          | the Julia AD crew that I regularly work with. What will keep
          | these projects alive is understanding the engineering trade-
          | offs that are made and who the audience is. The complexity
          | has a cost, so it had better have a benefit. If that target
          | is lost, if any benefit is a theoretical "but you may need
          | more features some day", then the projects will lose
          | traction. The project needs to be two-fold: identify new
          | architectures and applications that would benefit from
          | expanded language support from AD, and build good support for
          | those projects. Otherwise it is just training a transformer
          | in Julia vs training a transformer in Python, and that is not
          | justifiable.
        
           | p1esk wrote:
           | My impression from your comment is that you don't care that
           | much about "standard" ML users. As a "standard" ML user
           | (pytorch/jax), and a potential Julia user in the future, this
           | is not what I like to hear.
        
             | borodi wrote:
              | The idea, I imagine, is to differentiate what the Julia ML
              | stack offers over what is already in Python. If it offers
              | the same thing, but without the funding from Facebook or
              | Google, why bother switching? It has to offer something
              | more.
        
         | The_rationalist wrote:
          | Note, however, that Facebook is backing differentiable
          | programming in Kotlin
          | https://ai.facebook.com/blog/paving-the-way-for-software-20-...
          | and that there is a mature library for autodiff:
          | https://github.com/breandan/kotlingrad
        
         | 2sk21 wrote:
          | I am really glad to hear this. As it happens, my main post-
          | retirement project has been to learn Swift to try out CoreML.
          | I'm really enjoying it so far.
        
           | hnarayanan wrote:
           | This is really interesting to me. Could you please share your
           | learning pathway?
        
       | albertzeyer wrote:
        | This misses some discussion of tf.function, which does a Python
        | AST-level transformation to a static TF computation graph,
        | including dynamic control flow like loops and conditional
        | branches.
        
         | chillee wrote:
          | tf.function is largely morally equivalent to Torchscript,
          | which he does discuss.
        
       | spacetracks wrote:
       | "The second factor, and probably the more damning one, is that
       | most ML codes don't actually use that much dynamism." I would
       | argue that this in true precisely because it is not available in
       | an AD system. When I tell friends and coworkers about what zygote
       | can do they light up and start describing different use cases
       | they have that could benefit from AD. Diff eq solving is a big
       | one.
        
         | KKKKkkkk1 wrote:
         | This is because continuous optimization is useless when
         | crossing a discontinuity, which is what control flow creates.
         | Even in a trivial situation like ReLU, where the control flow
         | is mimicking a continuous transition, you have the "dead ReLU"
         | problem, where you have to start training on the correct side
         | of the discontinuity and make sure to never cross.
        
           | ogogmad wrote:
           | I don't know whether this belongs here, but...
           | 
           | Formally, there is a generalisation of differentiation which
           | can handle functions like ReLU (i.e. locally Lipschitz non-
           | differentiable functions) by allowing a derivative to be set-
           | valued. It's called the _Clarke gradient_. The Clarke
           | gradient of ReLU at 0 is the closed interval [0,1]. Note that
            | the Clarke gradient doesn't satisfy the chain rule (except
           | in a weakened form) which might seriously mess up some
           | assumptions about autodiff. Is this generalised derivative
           | useful in autodiff?
           | 
           | I imagine that this is a largely theoretical tool that's
           | useful in analysing algorithms but useless for actually
           | computing things.
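            | 
            | In practice, AD systems commit to one element of that set.
            | A sketch of the usual convention for ReLU (any g in [0,1]
            | would be a valid choice at the kink; this picks 0):
            | 
            |     relu(x) = max(zero(x), x)
            | 
            |     # Clarke gradient of relu at 0 is [0, 1]; the AD rule
            |     # just commits to one element, which is fine almost
            |     # everywhere
            |     drelu(x) = x > 0 ? one(x) : zero(x)
            | 
            |     drelu(0.0)  # 0.0: a choice, not *the* derivative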
        
             | medo-bear wrote:
              | i haven't heard of the clarke gradient before. convex
              | analysis has something called the subgradient [0]; is it
              | different?
             | 
             | [0] https://en.m.wikipedia.org/wiki/Subderivative
        
               | ogogmad wrote:
               | The subgradient in convex analysis is a special case of
               | the Clarke gradient. The subgradient is precisely the
               | Clarke gradient for convex functions. Convex functions
               | are always locally Lipschitz except in weird cases.
               | 
               | [edit]
               | 
               | Question: Are there numerical applications in which the
               | subgradient is actually computed, or is it a purely
               | analytical tool?
        
               | agnosticmantis wrote:
               | (Stochastic) subgradient methods are used in practice to
               | optimize non-differentiable convex functions. They have a
               | slower convergence rate than (stochastic) gradient
               | descent though.
               | 
                | See for example:
                | https://www.stat.cmu.edu/~ryantibs/convexopt-F15/lectures/07...
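                | 
                | A minimal sketch of the method on f(x) = |x|, whose
                | subgradient is sign(x) (picking 0 at the kink); the
                | diminishing 1/k step size is what gives the slower
                | rate:
                | 
                |     function subgrad_min(x; iters=1000)
                |         for k in 1:iters
                |             g = sign(x)       # a subgradient of |x|
                |             x -= (1 / k) * g  # diminishing step
                |         end
                |         return x
                |     end
                | 
                |     subgrad_min(5.0)  # close to the minimizer 0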
        
               | fault1 wrote:
                | yes, see:
                | https://juliadiff.org/ChainRulesCore.jl/dev/maths/nondiff_po...
                | 
                | and related neat usages in set-based optimization methods
                | in MathOptInterface (part of JuMP.jl):
                | https://matbesancon.xyz/post/2020-12-24-chains_sets2/
        
       | sockfish wrote:
       | I often wonder why so much effort is being put into shoehorning
       | everything into a single language. Wouldn't it make much more
        | sense to use a fully differentiable DSL for machine learning /
        | XLA, then call it from whatever host language you use? This
       | approach has worked really well for SQL for the past couple of
       | decades.
        
         | woadwarrior01 wrote:
         | You might like dex[1].
         | 
         | [1]: https://github.com/google-research/dex-lang
        
         | niklasd wrote:
         | Has it worked really well? I feel ORMs are a sign it hasn't.
         | Though I really enjoy having learned SQL and being able to
         | interact with almost all relational databases.
        
           | handzhiev wrote:
           | Imo ORM is mostly a sign that (for some odd reason) many
           | developers don't want to learn / use SQL.
           | 
           | But what actual problem is ORM solving beyond that?
        
             | viraptor wrote:
              | They solve 99% of your queries in much less time while
              | allowing you to drop down to SQL when you really want/need
              | it.
        
             | dnautics wrote:
             | ORMs solve "don't accidentally introduce an SQL injection"
        
               | mbStavola wrote:
                | So do prepared statements.
        
             | baq wrote:
             | Boilerplate. Writing serializers and deserializers by hand
             | is not an efficient use of developer time.
             | 
             | Related to ORMs, but not quite on topic - query building.
             | Type checked queries, parts of which can be passed around
             | business logic, are very powerful and flexible.
        
               | ithkuil wrote:
                | There are more and more libraries that let you write SQL
                | and bind the results into native records (objects,
                | structs) in the host language. I find it an interesting
                | middle ground.
        
       | vkkhare wrote:
        | What do people think of the automatic differentiation support
        | Facebook was trying out for Kotlin?
        | 
        | They called it differentiable programming:
        | https://ai.facebook.com/blog/paving-the-way-for-software-20-...
        
       | secondcoming wrote:
        | This video [0] has quite an interesting proposal for adding AD
        | to compilers.
       | 
       | [0] https://youtu.be/1QQj1mAV-eY
        
       ___________________________________________________________________
       (page generated 2021-12-25 23:00 UTC)