[HN Gopher] Deep neural networks as computational graphs (2018)
       ___________________________________________________________________
        
       Deep neural networks as computational graphs (2018)
        
       Author : softwaredoug
       Score  : 79 points
        Date   : 2023-05-21 13:09 UTC (1 day ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | dynamite-ready wrote:
        | Wasn't that the paradigm behind TensorFlow?
        
         | dkislyuk wrote:
          | As another commenter said, viewing a neural network as a
          | computation graph is how all automatic differentiation
          | engines work (particularly reverse mode, where one needs to
          | traverse back through all the previous computations to
          | correctly apply the gradient), and several libraries
          | predating TensorFlow followed this idea. The initial
          | contribution of TensorFlow and PyTorch was more about making
          | the developer interface much cleaner and enabling training
          | on a wider range of hardware by shipping a bunch of useful
          | kernels as part of the library.
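          | 
          | A minimal reverse-mode sketch (toy Python, not any
          | particular engine's internals): each operation records its
          | inputs and the local derivatives, and backward() walks the
          | recorded graph in reverse, accumulating gradients.
          | 
          |     class Node:
          |         def __init__(self, value, parents=()):
          |             self.value = value
          |             self.parents = parents  # (input, local_grad) pairs
          |             self.grad = 0.0
          | 
          |         def __add__(self, other):
          |             return Node(self.value + other.value,
          |                         ((self, 1.0), (other, 1.0)))
          | 
          |         def __mul__(self, other):
          |             return Node(self.value * other.value,
          |                         ((self, other.value), (other, self.value)))
          | 
          |         def backward(self, upstream=1.0):
          |             # naive recursion; real engines topologically sort
          |             self.grad += upstream
          |             for node, local in self.parents:
          |                 node.backward(upstream * local)
          | 
          |     x = Node(3.0)
          |     y = x * x + x   # y = x^2 + x
          |     y.backward()
          |     print(x.grad)   # 2*3 + 1 = 7.0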
        
           | dekhn wrote:
            | I've always thought of most function computations as
            | computational graphs - at least, when I used MathLink to
            | connect Mathematica to Python, it basically gave you a
            | protocol to break any Mathematica function into its
            | recursively expanded definition. Konrad Hinsen suggested
            | using Python's built-in operator overloading, so if you
            | said "1 + Symbol('x')" it would get converted to "Plus(1,
            | Symbol('x'))" and then sent over MathLink to Mathematica,
            | which would evaluate Plus(1, x) and return an expression
            | which I'd then convert back to a Python object
            | representation.
           | 
            | I don't think we talked about doing any sort of automated
            | diff (in my day we figured out our own derivatives!), but
            | after I made a simple eigendecomp of a matrix of floats,
            | the Mathematica folks contributed an example that did an
            | eigendecomp of a matrix with symbols (i.e., some of the
            | terms weren't 5.7 but "1-x"). It still kind of blows my
            | mind today how much Mathematica can do with computation
            | graphs.
           | 
           | IIUC this is the basis of LISP as well.
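            | 
            | A minimal sketch of that overloading trick (hypothetical
            | class names, not the actual MathLink bindings):
            | 
            |     class Expr:
            |         def __init__(self, head, *args):
            |             self.head, self.args = head, args
            |         def __repr__(self):
            |             return f"{self.head}[{', '.join(map(repr, self.args))}]"
            | 
            |     class Symbol:
            |         def __init__(self, name):
            |             self.name = name
            |         def __repr__(self):
            |             return self.name
            |         def __add__(self, other):
            |             return Expr('Plus', self, other)
            |         def __radd__(self, other):
            |             return Expr('Plus', other, self)
            | 
            |     print(1 + Symbol('x'))  # Plus[1, x], ready to ship off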
        
         | ggr2342 wrote:
         | AFAIK it was first brought forward by PyTorch. I may be wrong
         | though.
        
           | mhh__ wrote:
            | PyTorch's insight was to be very dynamic and simple. The
            | underlying model is basically the same, with the caveat
            | that TensorFlow uses/used something called a "tape".
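            | 
            | For reference, TF2's eager-mode tape in its standard form
            | (tf.GradientTape records the ops executed inside the
            | block):
            | 
            |     import tensorflow as tf
            | 
            |     x = tf.Variable(3.0)
            |     with tf.GradientTape() as tape:
            |         y = x * x + x
            |     print(tape.gradient(y, x))  # tf.Tensor(7.0, ...)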
        
           | HarHarVeryFunny wrote:
            | Before PyTorch there was Torch (which was Lua-based, hence
            | the Py- prefix of the follow-on PyTorch). With Torch, like
            | the later TensorFlow 1.0, you first explicitly constructed
            | the computational graph and then executed it. PyTorch's
            | later innovation was define-by-run: you just ran the code
            | corresponding to your computational graph, and the
            | framework traced your code and built the graph behind the
            | scenes... this was so much more convenient that PyTorch
            | quickly became more popular than TensorFlow, and by the
            | time TF 2.0 "eager mode" copied this approach it was too
            | late (although TF shot itself in the foot in many ways,
            | which also accelerated its demise).
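            | 
            | Define-by-run in a nutshell (plain PyTorch): the graph is
            | built as the ops execute, then traversed in reverse.
            | 
            |     import torch
            | 
            |     x = torch.tensor(3.0, requires_grad=True)
            |     y = x * x + x   # traced as it runs
            |     y.backward()    # reverse-mode pass over the traced graph
            |     print(x.grad)   # tensor(7.)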
        
           | cgearhart wrote:
            | It certainly pre-dates PyTorch. Applying computational
            | graphs to neural networks was the central purpose of the
            | Theano framework in the early 2010s. TensorFlow borrowed
            | heavily from those ideas, and Theano shut down a year or
            | two later: the two were so close conceptually, and a grad
            | research lab can't compete with the resources of Google.
        
           | 6gvONxR4sf7o wrote:
           | It's been a core part of automatic differentiation for
           | decades.
        
             | ggr2342 wrote:
              | Oh! Can you give some pointers on where to read more
              | about it?
        
               | 6gvONxR4sf7o wrote:
                | The Wikipedia page for automatic differentiation and
                | its references would be a good start:
                | https://en.wikipedia.org/wiki/Automatic_differentiation
                | 
                | In particular, the section "Beyond forward and
                | reverse accumulation" tells you how hard it is to
                | deal with the graph in optimal ways, hence the
                | popularity of simpler forward and backward traversals
                | of the graph.
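                | 
                | For a taste of forward mode, the classic dual-number
                | trick (toy code): carry (value, derivative) pairs
                | through the computation.
                | 
                |     class Dual:
                |         def __init__(self, val, der=0.0):
                |             self.val, self.der = val, der
                |         def _lift(self, x):
                |             return x if isinstance(x, Dual) else Dual(x)
                |         def __add__(self, other):
                |             o = self._lift(other)
                |             return Dual(self.val + o.val,
                |                         self.der + o.der)
                |         __radd__ = __add__
                |         def __mul__(self, other):
                |             o = self._lift(other)
                |             return Dual(self.val * o.val,
                |                         self.val * o.der + self.der * o.val)
                |         __rmul__ = __mul__
                | 
                |     x = Dual(2.0, 1.0)   # seed dx/dx = 1
                |     y = x * x + 3 * x    # y = x^2 + 3x
                |     print(y.val, y.der)  # 10.0 7.0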
        
               | PartiallyTyped wrote:
               | The original 2017 PyTorch paper is probably a good one.
        
       | gavi wrote:
        | I highly recommend the video from karpathy:
        | https://www.youtube.com/watch?v=VMj-3S1tku0 - it explains the
        | same idea spelled out in code, and it really helped me
        | understand the mathematical expression graph of neural
        | networks. He uses graphviz to visualize the expressions,
        | initially calculates gradients manually, and then implements
        | backpropagation.
        
       | ttul wrote:
        | Indeed, the self-attention mechanism within the transformer
        | architecture is often described as a graph, where tokens are
        | the nodes and the key and query vectors are combined to
        | establish the affinity that each token has for every other
        | token.
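        | 
        | A minimal sketch of that token-affinity matrix (toy numpy,
        | single head, no masking): softmax(Q K^T / sqrt(d)).
        | 
        |     import numpy as np
        | 
        |     def affinities(Q, K):
        |         scores = Q @ K.T / np.sqrt(K.shape[-1])
        |         e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        |         return e / e.sum(axis=-1, keepdims=True)
        | 
        |     Q = np.random.randn(4, 8)  # 4 tokens, head dim 8
        |     K = np.random.randn(4, 8)
        |     A = affinities(Q, K)       # A[i, j]: weight of edge i -> j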
        
         | blackbear_ wrote:
          | These are two different things, actually. The graph you are
          | talking about is constructed from the inputs of the
          | attention layer (i.e., nodes=tokens, edges=affinity), while
          | the article talks about graphs that represent the actual
          | operations performed in the attention layer (i.e.,
          | nodes=weights and intermediate results, edges=mathematical
          | operations that combine them).
        
       | garganzol wrote:
        | The article is a thought-provoking piece of art. A few
        | personal observations from me:
        | 
        | a) An acyclic computational graph is equivalent to a program
        | without loops, and is thus not Turing-complete.
        | 
        | b) A cyclic computational graph is equivalent to a program
        | with recursion. The closest analogy from the programming
        | world is a program without loops but with recursive calls.
        | This means it is Turing-complete, as loops are fully
        | interchangeable with recursion (just different facets of the
        | same thing).
        | 
        | As easy as 2+2=4. This means that a neural network is
        | essentially a program. The only difference is the way the
        | program is written: neural networks do it themselves by
        | "learning".
        | 
        | Those simple observations explain everything. The brain is
        | essentially a biological Turing-complete processing unit with
        | full plasticity, as you can build any network (program) by
        | changing the "weights" of the neurons. This also means that
        | full AI is possible; it is just a matter of the available
        | computational power. Why is that? Because Turing completeness
        | guarantees the absence of boundaries for achieving
        | progressively sophisticated system abilities.
        
         | gcr wrote:
          | Most neural networks these days aren't recursive, however.
          | 
          | GPT-3 and friends certainly aren't, unless you count the
          | inference loop as recursion, which is bounded both by
          | vocabulary size and by maximum token length (hardcoded for
          | now).
        
           | garganzol wrote:
            | This only means that there is so much more room for AI to
            | grow and improve in the future.
            | 
            | As far as I know, having recursion in neural networks
            | poses a dilemma: on one hand, recursive networks are more
            | capable; on the other, they are harder to train and can
            | have stability issues caused by occasional positive
            | feedback loops.
        
       | data_maan wrote:
        | This says nothing new. This idea has been around since the
        | '80s, if not earlier, and is used extensively in automatic
        | differentiation textbooks.
        | 
        | But for an audience that seems to think it rides on the
        | frontier of knowledge and that we have only just discovered
        | these things (when, in fact, it's our ignorance of earlier
        | research that is the reason we are re-discovering them), such
        | a Medium post might be like a drug :D
        
         | garganzol wrote:
          | Sometimes simple (and old) ideas start to click and ignite
          | a tornado when the timing is right.
          | 
          | While many may argue that the notion of a computational
          | graph has been associated with neural networks (NNs) from
          | the very beginning, this post is quite novel. It sheds
          | light on how NNs and general computation theory are
          | interconnected; they are basically the same thing.
          | 
          | For me, it was an instant blast: the computational graph
          | reminds me of a typical intermediate representation (IR)
          | tree in a compiler. And because it is a formal graph, all
          | the mathematical benefits of graph theory suddenly start to
          | click. For instance, graph theory formalizes the
          | definitions of cyclic directed graphs versus acyclic
          | directed graphs. If we imagine for a moment that a graph
          | vertex represents a unit of computation, it immediately
          | starts to resonate with lambda calculus. It quickly becomes
          | evident that when the graph is cyclic, it represents a
          | recursion in terms of lambda calculus and thus becomes
          | Turing-complete.
          | 
          | If the computational graph is acyclic, Turing completeness
          | is not achievable.
          | 
          | If you are going to say that this is an obvious and
          | well-known observation - I will be surprised, because it is
          | not. And all that enlightenment was possible thanks to this
          | article combined with a bit of knowledge about graphs,
          | compilers, and lambda calculus. (It just so happened that
          | I'm relatively well-versed in those topics due to
          | professional involvement.)
        
       | blackbear_ wrote:
        | I cannot stress enough how foundational of an idea it is to
        | store mathematical expressions as graphs! Analyzing and
        | manipulating such graphs makes it possible, among other
        | things, to automatically obtain derivatives of a function
        | with respect to its parameters, which is one of the
        | cornerstones of most AI methods.
        | 
        | Before this was invented, people had to manually find
        | expressions for these derivatives and implement the code to
        | compute them. Needless to say, this took far too much time
        | and was likely to result in suboptimal, unoptimized code,
        | while nowadays nobody has to deal with this anymore (except
        | on their homework)!
        | 
        | For those curious how this works, I present a simple
        | implementation on my blog [1].
        | 
        | [1]
        | https://e-dorigatti.github.io/math/deep%20learning/2020/04/0...
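        | 
        | In the same spirit, a toy sketch (not the blog's
        | implementation): differentiate an expression graph by
        | recursively applying the sum and product rules, yielding a
        | new graph for the derivative.
        | 
        |     class Const:
        |         def __init__(self, v): self.v = v
        |         def eval(self): return self.v
        |         def diff(self, wrt): return Const(0.0)
        | 
        |     class Var:
        |         def __init__(self, v): self.v = v
        |         def eval(self): return self.v
        |         def diff(self, wrt):
        |             return Const(1.0 if self is wrt else 0.0)
        | 
        |     class Add:
        |         def __init__(self, a, b): self.a, self.b = a, b
        |         def eval(self): return self.a.eval() + self.b.eval()
        |         def diff(self, wrt):
        |             return Add(self.a.diff(wrt), self.b.diff(wrt))
        | 
        |     class Mul:
        |         def __init__(self, a, b): self.a, self.b = a, b
        |         def eval(self): return self.a.eval() * self.b.eval()
        |         def diff(self, wrt):  # product rule
        |             return Add(Mul(self.a.diff(wrt), self.b),
        |                        Mul(self.a, self.b.diff(wrt)))
        | 
        |     x = Var(3.0)
        |     y = Add(Mul(x, x), x)    # y = x^2 + x
        |     print(y.diff(x).eval())  # 7.0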
        
         | BaculumMeumEst wrote:
         | Excellent post, thank you for writing it.
        
         | joe_the_user wrote:
          | _I cannot stress enough how foundational of an idea it is
          | to store mathematical expressions as graphs!.._
          | 
          | Understanding a function's computational graph is certainly
          | useful, but storing a function as a computation graph is,
          | in fact, quite expensive. Deep learning systems don't store
          | their computational graphs; the graphs are _implicit_ in
          | their computation process. Deep learning systems instead
          | store their functions as tensors: generalized
          | arrays/matrices. This allows both efficient storage and
          | efficient computation. It's quite possible to do automatic
          | differentiation on these structures as well - that is the
          | basis of "backpropagation".
          | 
          | It's important to distinguish useful conceptual structures
          | like computation graphs (or symbolic representations) from
          | the structures that are necessary for efficient
          | computation. Automatic differentiation itself is important
          | because the traditional symbolic differentiation one learns
          | in calculus "blows up" - its cost can grow exponentially
          | with expression length - when attempted on a large
          | expression, whereas automatic differentiation has a cost
          | not much higher than the base cost of computing the given
          | function (though if that base cost is high, you lose your
          | advantage too).
          | 
          | Just sayin'
        
           | [deleted]
        
           | necroforest wrote:
            | You can't store a function as a tensor; the tensors are
            | the inputs/outputs that flow along the edges of the
            | graph. TF stores things directly as a graph:
            | https://github.com/tensorflow/tensorflow/blob/master/tensorf...
            | while JAX represents computations in an intermediate
            | representation:
            | https://jax.readthedocs.io/en/latest/jaxpr.html
        
         | curiousgal wrote:
          | Another major application is in finance! Computing risk
          | metrics for hedging purposes used to be based on a "bump
          | and reprice" method, i.e. you bump an input by a small
          | amount and see how that affects the price (the definition
          | of a derivative). But now, in modern quant libraries,
          | implementing the pricing as a graph lets you get all of the
          | sensitivities (derivatives) for free!
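          | 
          | Bump-and-reprice in miniature (toy pricer with a made-up
          | formula): one full repricing per bumped input, versus a
          | single reverse-mode pass for all sensitivities with AD.
          | 
          |     def price(spot, vol, rate):
          |         # stand-in for an expensive pricing model
          |         return spot * vol * (1.0 - rate)
          | 
          |     base, h = [100.0, 0.2, 0.05], 1e-4
          |     greeks = []
          |     for i in range(len(base)):   # one bump per input
          |         up = base[:]; up[i] += h
          |         dn = base[:]; dn[i] -= h
          |         greeks.append((price(*up) - price(*dn)) / (2 * h))
          |     print(greeks)  # central differences: spot, vol, rate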
        
           | mhh__ wrote:
            | Not quite. Some people want bump Greeks. For example, a
            | company may have a standard bump.
           | 
           | Make of that what you will.
        
           | KMag wrote:
            | Not quite free. Last I checked, adding automatic
            | differentiation to GPU computation for pricing exotic
            | derivatives roughly doubled the computation cost.
            | (Though traditional bump-and-reprice has exponential cost
            | in the number of dimensions.)
        
           | blackbear_ wrote:
           | That's cool! But it relies on the model being an accurate
           | representation of reality, right? What kind of models are we
           | talking about anyways, and what inputs do they use? (Asking
           | out of curiosity, not skepticism)
        
             | mhh__ wrote:
              | A model is a computation based on its initial
              | assumptions. Sometimes that's fine.
              | 
              | The key thing is that the (good) practitioners know the
              | finance and know the models. If a model is obviously
              | wrong, that's a sign in itself - a simple model doesn't
              | fit the market: you might be about to lose some money,
              | or you could take on some risk from other people and
              | get rewarded for it (people are panicking).
              | 
              | Weirder derivatives and so on can get more dangerous,
              | of course. However, even the really famous example from
              | the '07 crash (pricing CDOs and the CDS against them)
              | was arguably due more to deliberate ignorance of
              | widespread fraud and fictitious numbers than to the
              | models per se. Garbage in, garbage out (the model was
              | also not great, but still).
        
       ___________________________________________________________________
       (page generated 2023-05-22 23:00 UTC)