[HN Gopher] Deep neural networks as computational graphs (2018)
___________________________________________________________________

Deep neural networks as computational graphs (2018)

Author : softwaredoug
Score  : 79 points
Date   : 2023-05-21 13:09 UTC (1 day ago)

(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)

| dynamite-ready wrote:
| Wasn't that the paradigm behind Tensorflow?

| dkislyuk wrote:
| As another commenter said, viewing a neural network as a
| computation graph is how all automatic differentiation engines
| work (particularly reverse-mode, where one needs to traverse
| through all the previous computations to correctly apply the
| gradient), and there were several libraries predating Tensorflow
| following this idea. The initial contribution of Tensorflow and
| PyTorch was more about making the developer interface much
| cleaner and enabling training on a wider range of hardware by
| developing a bunch of useful kernels as part of the library.

| dekhn wrote:
| I've always seen most computations of functions as computational
| graphs - at least, when I used MathLink to connect Mathematica
| to Python, it basically gives you a protocol to break any
| Mathematica function into its recursively expanded definition.
| Konrad Hinsen suggested using Python's built-in operator
| overloading, so if you said "1 + Symbol('x')" it would get
| converted to "Plus(1, Symbol('x'))" and then sent over MathLink
| to Mathematica, which would evaluate Plus(1, x) and return an
| expression which I'd then convert back to a Python object
| representation.
|
| I don't think we talked about doing any sort of automated diff
| (in my day we figured out our own derivatives!) but after I made
| a simple eigendecomp of a matrix of floats, the Mathematica
| folks contributed an example that did an eigendecomp of a matrix
| with symbols (i.e., some of the terms weren't 5.7 but "1-x").
| It still kind of blows my mind today how much Mathematica can do
| with computation graphs.
|
| IIUC this is the basis of LISP as well.

| ggr2342 wrote:
| AFAIK it was first brought forward by PyTorch. I may be wrong
| though.

| mhh__ wrote:
| PyTorch's insight was to be very dynamic and simple. The
| underlying model is basically the same, with the caveat that
| Tensorflow uses/used something called a "tape".

| HarHarVeryFunny wrote:
| Before PyTorch there was Torch (which was Lua-based, hence the
| Py- prefix of the follow-on PyTorch). With Torch, like the later
| TensorFlow 1.0, you first explicitly constructed the
| computational graph, then executed it. PyTorch's later
| innovation was define-by-run, where you just ran the code
| corresponding to your computational graph and the framework
| traced your code and built the graph behind the scenes... this
| was so much more convenient that PyTorch quickly became more
| popular than TensorFlow, and by the time TF 2.0 "eager mode"
| copied this approach it was too late (although TF shot itself in
| the foot in many ways, which also accelerated its demise).
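To make the graph-recording idea in this thread concrete, here is
a minimal Python sketch (not from the article or the thread) of how
operator overloading can trace a computation into a graph instead
of evaluating it: the trick behind dekhn's "Plus(1, Symbol('x'))"
example and, in spirit, the define-by-run tracing HarHarVeryFunny
describes. The Symbol class and its field names are invented for
illustration.

    class Symbol:
        """A node in an expression graph, built implicitly by
        operator overloading rather than by doing arithmetic."""

        def __init__(self, name, parents=(), op=None):
            self.name = name        # printable form of the (sub)expression
            self.parents = parents  # nodes this one was computed from
            self.op = op            # operation that produced it, if any

        @staticmethod
        def _wrap(value):
            return value if isinstance(value, Symbol) else Symbol(repr(value))

        def __add__(self, other):
            other = Symbol._wrap(other)
            return Symbol("Plus(%s, %s)" % (self.name, other.name),
                          (self, other), "Plus")

        def __radd__(self, other):  # plain number on the left, e.g. 1 + x
            return Symbol._wrap(other) + self

    x = Symbol("x")
    expr = 1 + x      # no addition happens; a graph node is recorded
    print(expr.name)  # prints: Plus(1, x)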
| cgearhart wrote:
| It certainly pre-dates PyTorch. Applying computational graphs to
| neural networks was the central purpose of the Theano framework
| in the early 2010s. TensorFlow heavily followed those ideas, and
| Theano shut down a year or two later because they were so close
| conceptually and a grad research lab can't compete with the
| resources of Google.

| 6gvONxR4sf7o wrote:
| It's been a core part of automatic differentiation for decades.

| ggr2342 wrote:
| Oh! Can you give some pointers on where to read more about it?

| 6gvONxR4sf7o wrote:
| The Wikipedia page for automatic differentiation and its
| references would be a good start.
| https://en.wikipedia.org/wiki/Automatic_differentiation
|
| In particular, the section "Beyond forward and reverse
| accumulation" tells you how hard it is to deal with the graph in
| optimal ways, hence the popularity of simpler forward and
| backward traversals of the graph.

| PartiallyTyped wrote:
| The original 2017 PyTorch paper is probably a good one.

| gavi wrote:
| I highly recommend the video from karpathy
| https://www.youtube.com/watch?v=VMj-3S1tku0 - it explains the
| same idea spelled out in code. It really helped me understand
| the mathematical expression graph of neural networks. He uses
| graphviz to show the expressions, initially calculates gradients
| manually, and then implements backpropagation.

| ttul wrote:
| Indeed, the self-attention mechanism within the transformer
| architecture is often described as a graph, where tokens are the
| nodes and the key and query vectors are combined to establish
| the affinity that each token has for each other token.

| blackbear_ wrote:
| These are two different things, actually. The graph you are
| talking about is constructed using the inputs of the attention
| layer (i.e., nodes = tokens, edges = affinity), while the
| article talks about graphs that represent the actual operations
| performed in the attention layer (i.e., nodes = weights and
| intermediate results, edges = mathematical operations that
| combine them).

| garganzol wrote:
| The article is a thought-provoking piece of art. A few personal
| observations from me:
|
| a) An acyclic computational graph is equivalent to a program
| without loops and is thus not Turing-complete.
|
| b) A cyclic computational graph is equivalent to a program with
| recursion. The closest analogy from the programming world is a
| program without loops but with recursive calls. This means that
| it has Turing completeness, as loops are fully interchangeable
| with recursion (just different facets of the same thing).
|
| As easy as 2+2=4. This means that a neural network is
| essentially a program. The only difference is the way the
| program is written: neural networks do it themselves by
| "learning".
|
| Those simple observations explain everything. The brain is
| essentially a biological Turing-complete processing unit with
| full plasticity, as you can build any network (program) by
| changing the "weights" of the neurons. This also means that full
| AI is possible; it is just a matter of the available
| computational power. Why is that? Because Turing completeness
| guarantees the absence of boundaries for achieving progressively
| sophisticated system abilities.

| gcr wrote:
| Most neural networks these days aren't recursive, however.
|
| GPT-3 and friends certainly aren't, unless you count the
| inference loop as recursion, which is bounded both by vocabulary
| size and maximum token length (hardcoded for now).

| garganzol wrote:
| This only means that there is so much more space for AI to grow
| and improve in the future.
|
| As far as I know, having recursion in neural networks poses a
| dilemma: on the one hand, recursive networks are more capable;
| on the other, they are harder to train and may have stability
| issues caused by occasional positive feedback loops.
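As a concrete footnote to the cyclic-versus-acyclic discussion
above: in practice, frameworks handle the cycle in a recurrent
network by unrolling it for a fixed number of steps, which turns it
back into an acyclic graph that reverse-mode differentiation can
traverse. A minimal numpy sketch of the unrolling itself (the
shapes and names here are arbitrary, for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    W_h = 0.5 * rng.normal(size=(4, 4))  # hidden-to-hidden: the "cycle"
    W_x = 0.5 * rng.normal(size=(4, 3))  # input-to-hidden

    def unroll(inputs, h):
        """h_t = tanh(W_h h_{t-1} + W_x x_t): each step appends one
        more layer to what is now an acyclic graph."""
        states = []
        for x in inputs:
            h = np.tanh(W_h @ h + W_x @ x)
            states.append(h)
        return states

    inputs = [rng.normal(size=3) for _ in range(5)]
    states = unroll(inputs, h=np.zeros(4))
    print(len(states), states[-1].shape)  # prints: 5 (4,)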
| data_maan wrote:
| This says nothing new. This idea has been around since the 80s,
| if not earlier, and is covered extensively in automatic
| differentiation textbooks.
|
| But for an audience that seems to think it rides on the frontier
| of knowledge and that we have just discovered things (when, in
| fact, it's our ignorance of earlier research that is the reason
| we are re-discovering them), such a medium post might be like a
| drug :D

| garganzol wrote:
| Sometimes simple (and old) ideas start to click and ignite a
| tornado when the timing is right.
|
| While many may argue that the notion of a computational graph
| was associated with neural networks (NN) from the very
| beginning, this post is quite novel. It sheds light on how NNs
| and general computation theory are actually interconnected; they
| are basically the same thing.
|
| For me, it was an instant blast: the computational graph reminds
| me of a typical intermediate representation (IR) tree of a
| compiler. And because it is a formal graph, all the mathematical
| benefits of graph theory suddenly start to click. For instance,
| graph theory formalizes the definitions of cyclic directed
| graphs versus acyclic directed graphs. If we imagine for a
| moment that a graph vertex represents a unit of computation, it
| immediately starts to resonate with lambda calculus. It quickly
| becomes evident that when the graph is cyclic, it represents
| recursion in terms of lambda calculus and thus becomes
| Turing-complete.
|
| If a computational graph is acyclic, Turing completeness is not
| achievable.
|
| If you are going to say that this is an obvious and well-known
| observation - I will be surprised, because it is not. And all
| that enlightenment was possible thanks to this article combined
| with a bit of knowledge about graphs, compilers, and lambda
| calculus. (It just so happened that I'm relatively well-versed
| in those topics due to professional involvement.)

| blackbear_ wrote:
| I cannot stress enough how foundational an idea it is to store
| mathematical expressions as graphs! Analyzing and manipulating
| such graphs makes it possible, among other things, to
| automatically obtain derivatives of a function with respect to
| its parameters, which is one of the cornerstones of most AI
| methods.
|
| Before this was invented, people had to manually find
| expressions for these derivatives and implement the code to
| compute them. Needless to say, this took far too much time and
| was likely to result in sub-optimal and unoptimized code, while
| nowadays nobody has to deal with it anymore (except on their
| homework)!
|
| For those curious about how this works, I present a simple
| implementation on my blog [1].
|
| [1]
| https://e-dorigatti.github.io/math/deep%20learning/2020/04/0...

| BaculumMeumEst wrote:
| Excellent post, thank you for writing it.
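In the same spirit as the implementation blackbear_ links above
(and the Karpathy video recommended earlier in the thread), here is
a minimal self-contained sketch of reverse-mode differentiation
over an expression graph. It is not taken from either source; the
Value class is invented for illustration.

    class Value:
        """A scalar node in an expression graph that also knows how
        to push gradients back to its parents (reverse mode)."""

        def __init__(self, data, parents=()):
            self.data = data
            self.grad = 0.0
            self._parents = parents
            self._backward = lambda: None

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def backward():  # d(a+b)/da = d(a+b)/db = 1
                self.grad += out.grad
                other.grad += out.grad
            out._backward = backward
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def backward():  # d(a*b)/da = b and d(a*b)/db = a
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = backward
            return out

        def backward(self):
            # Topologically sort the graph, then apply the chain
            # rule from the output back toward the leaves.
            order, seen = [], set()
            def visit(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        visit(p)
                    order.append(v)
            visit(self)
            self.grad = 1.0
            for v in reversed(order):
                v._backward()

    x, y = Value(2.0), Value(3.0)
    z = x * y + x          # dz/dx = y + 1 = 4, dz/dy = x = 2
    z.backward()
    print(x.grad, y.grad)  # prints: 4.0 2.0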
| joe_the_user wrote:
| _I cannot stress enough how foundational an idea it is to store
| mathematical expressions as graphs!..._
|
| Understanding a function's computational graph is certainly
| useful, but storing a function as a computation graph is, in
| fact, quite expensive. Deep learning systems don't store their
| computational graphs; the graphs are _implicit_ in their
| computation process. Deep learning systems instead store their
| functions as tensors: generalized arrays/matrices. This allows
| both efficient storage and efficient computation. It's quite
| possible to do automatic differentiation on these structures as
| well - that is the basis of "backpropagation".
|
| It's important to distinguish between useful conceptual
| structures like computation graphs (or symbolic representations)
| and the structures that are necessary for efficient computation.
| Automatic differentiation itself is important because the
| traditional symbolic differentiation one learns in calculus
| "blows up" - its cost can grow exponentially with the expression
| length - when attempted on a large expression, whereas automatic
| differentiation costs not much more than the base cost of
| computing the given function (though if that base cost is high,
| you lose your advantage too).
|
| Just sayin'

| [deleted]

| necroforest wrote:
| You can't store a function as a tensor. The tensors are the
| inputs/outputs that flow along the edges of the graph. TF stores
| things directly as a graph:
| https://github.com/tensorflow/tensorflow/blob/master/tensorf...
| while Jax represents computations in an intermediate
| representation: https://jax.readthedocs.io/en/latest/jaxpr.html

| curiousgal wrote:
| Another major application is in finance! Computing risk metrics
| for hedging purposes used to be based on a "bump and reprice"
| method, i.e. you bump the input by a small amount and see how
| that affects the price (the definition of a derivative). But now
| in modern quant libraries, implementing the pricing as a graph
| allows you to get all of the sensitivities (derivatives) for
| free!

| mhh__ wrote:
| Not quite. Some people want bump Greeks. For example, a company
| may have a standard bump.
|
| Make of that what you will.

| KMag wrote:
| Not quite free. Last I checked, adding automatic differentiation
| to GPU computation for pricing exotic derivatives roughly
| doubled the computation cost. (Though traditional
| bump-and-reprice has exponential cost in the number of
| dimensions.)

| blackbear_ wrote:
| That's cool! But it relies on the model being an accurate
| representation of reality, right? What kind of models are we
| talking about anyway, and what inputs do they use? (Asking out
| of curiosity, not skepticism.)

| mhh__ wrote:
| A model is a computation based on its initial assumptions.
| Sometimes that's fine.
|
| The key thing is that the (good) practitioners know the finance
| and know the models. If a simple model obviously doesn't fit the
| market, that is a sign in itself: you might be about to lose
| some money, or you could take on some risk from other people and
| get rewarded for it (people are panicking).
|
| Weirder derivatives and so on can get more dangerous, of course.
| However, even the really famous example from the '07 crash
| (pricing CDOs and the CDS against them) was arguably due more to
| deliberate ignorance of widespread fraud and fictitious numbers
| than to the models per se. Garbage in, garbage out (the model
| was also not great, but still).
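Returning to curiousgal's point above: a toy sketch of the "bump
and reprice" baseline that graph-based sensitivities improve on.
The price function here is an invented stand-in, not a real pricing
model; the point is only that bumping costs one extra full
repricing per input, whereas a single reverse pass over a pricing
graph yields every sensitivity at once.

    def price(spot, vol):
        """Invented stand-in for a pricing model (a toy payoff)."""
        return max(spot - 100.0, 0.0) + 10.0 * vol

    def bump_and_reprice(spot, vol, h=1e-4):
        """Finite-difference sensitivities: one repricing per input."""
        base = price(spot, vol)
        delta = (price(spot + h, vol) - base) / h  # sensitivity to spot
        vega = (price(spot, vol + h) - base) / h   # sensitivity to vol
        return delta, vega

    print(bump_and_reprice(105.0, 0.2))  # approximately (1.0, 10.0)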