[HN Gopher] Predictive coding has been unified with backpropagation
___________________________________________________________________

Predictive coding has been unified with backpropagation

Author : cabalamat
Score  : 241 points
Date   : 2021-04-05 12:02 UTC (10 hours ago)

(HTM) web link (www.lesswrong.com)
(TXT) w3m dump (www.lesswrong.com)

| xzvf wrote:
| At scale, Evolutionary Strategies (ES) are a very good
| approximation of the gradient as well. I don't recommend jumping
| to conclusions and unifications just yet.
| 
| jnwatson wrote:
| The author's point is that predictive coding is a plausible
| mechanism by which biological neurons work. ES are not.
| 
| ANNs have deviated widely from their biological inspiration,
| most notably in the way that information flows, since
| backpropagation requires two-way flow and biological axons are
| one-directional.
| 
| If predictive coding and backpropagation are shown to have
| similar power, then there's a rough idea that the way that ANNs
| work isn't too far from how brains work (with lots and lots of
| caveats).
| 
| whimsicalism wrote:
| > If predictive coding and backpropagation are shown to have
| similar power, then there's a rough idea that the way that
| ANNs work isn't too far from how brains work (with lots and
| lots of caveats).
| 
| So many caveats that I don't even really think that is a true
| statement.
| 
| blueyes wrote:
| I'm glad people are talking about this, and the similarity
| between predictive coding and the action of biological neurons
| is interesting. But we shouldn't fetishize predictive coding.
| There's a wider discussion going on, and several theories as to
| how backpropagation might work in the brain.
| 
| https://www.cell.com/trends/cognitive-sciences/fulltext/S136...
| 
| https://www.nature.com/articles/s41583-020-0277-3
| 
| andyxor wrote:
| There is no evidence of backpropagation in the brain.
| 
| See Professor Edmund T. Rolls's books on biologically plausible
| neural networks:
| 
| "Brain Computations: What and How" (2020)
| https://www.amazon.com/gp/product/0198871104
| 
| "Cerebral Cortex: Principles of Operation" (2018)
| https://www.oxcns.org/b12text.html
| 
| "Neural Networks and Brain Function" (1997)
| https://www.oxcns.org/b3_text.html
| 
| ShamelessC wrote:
| "There is just one problem: [biological neural networks] are
| physically incapable of running the backpropagation
| algorithm."
| 
| From the linked article.
| 
| 0lmer wrote:
| But is predictive coding perceived as a valid theory of how
| cortical neurons function? There was a paper from 2017 drawing
| similar conclusions about backprop approximation with Spike-
| Timing-Dependent Plasticity: https://arxiv.org/abs/1711.04214
| It looks more grounded in current models of neuronal
| functioning. Nevertheless, it has changed nothing in the field
| of deep learning since then.
| 
| jwmullally wrote:
| Some general background on STDP for the thread:
| 
| Biological neurons don't just emit constant 0...1 float values;
| they communicate using time-sensitive bursts of voltage known
| as "spike trains". Spiking Neural Networks (SNNs) are a closer
| approximation of natural networks than typical ML ANNs. [0]
| gives a quick overview.
| 
| Spike-Timing-Dependent Plasticity is a local learning rule
| experimentally observed in biological neurons. It's a form of
| Hebbian learning, aka "neurons that fire together wire
| together."
| 
| Summary from [1]; the top graph gives a clear picture of how
| the rule works.
| 
| > _With STDP, repeated presynaptic spike arrival a few
| milliseconds before postsynaptic action potentials leads in
| many synapse types to Long-Term Potentiation (LTP) of the
| synapses, whereas repeated spike arrival after postsynaptic
| spikes leads to Long-Term Depression (LTD) of the same
| synapse._
| 
| ---
| 
| [0]: https://towardsdatascience.com/deep-learning-versus-
| biologic...
| 
| [1]: http://www.scholarpedia.org/article/Spike-
| timing_dependent_p...
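| 
| For concreteness, the standard pairwise form of that rule fits
| in a few lines of Python (the amplitudes and time constants
| below are illustrative placeholders, not values from [1]):
| 
|   import numpy as np
| 
|   A_PLUS, A_MINUS = 0.01, 0.012      # LTP / LTD amplitudes
|   TAU_PLUS, TAU_MINUS = 20.0, 20.0   # decay time constants (ms)
| 
|   def stdp_dw(t_pre, t_post):
|       """Weight change for one pre/post spike pair.
|       Pre fires before post (dt > 0) -> potentiation (LTP).
|       Post fires before pre (dt < 0) -> depression (LTD)."""
|       dt = t_post - t_pre
|       if dt > 0:
|           return A_PLUS * np.exp(-dt / TAU_PLUS)
|       return -A_MINUS * np.exp(dt / TAU_MINUS)
| 
| Note that the rule only ever looks at the two neurons on either
| side of the synapse, which is what "local learning rule" means
| here.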
| 
| andyxor wrote:
| As long as the model requires a delta rule, or "teacher
| signal"-based error correction, it is not biologically
| plausible.
| 
| adamnemecek wrote:
| I think that this sort of forward-backward thing is a very
| general idea. There's a one-to-many relationship called the
| adjoint, and a many-to-one relationship called the norm.
| 
| I wrote something about this here:
| https://github.com/adamnemecek/adjoint
| 
| tsmithe wrote:
| In fact, the compositional structure underlying predictive
| coding [0,1] is abstractly the same as that underlying backprop
| [2]. (Disclaimer: [0,1] are my own papers; I'm working on a
| more precise and extensive version of [1] right now!)
| 
| [0] https://arxiv.org/abs/2006.01631
| [1] https://arxiv.org/abs/2101.10483
| [2] https://arxiv.org/abs/1711.10455
| 
| eli_gottlieb wrote:
| Hurry and publish before I have manuscripts ready applying
| these results.
| 
| tsmithe wrote:
| Hey, Eli :-)
| 
| I'm working on it; I'll send you an e-mail. Things quickly
| turned out to be more general than I realized last year.
| 
| selimthegrim wrote:
| What were you going to say about Young tableaux?
| 
| adamnemecek wrote:
| Dynamic programming and reinforcement learning are just
| diagonalizations of the Young tableau. This is related to the
| spectral theorem.
| 
| jdonaldson wrote:
| Yeah, I don't like this title. Coding for backprop is worth
| getting excited about, but please don't assume it supersedes
| all forms of "predictive coding". Plenty of predictive learning
| techniques do just fine without it, including our own brains.
| 
| In keeping with the No-Free-Lunch theorem, it's also highly
| desirable in general to have a variety of approaches at hand
| for solving certain predictive coding problems. Yes, this makes
| ML (as a field) cumbersome, but it also prevents us from
| painting ourselves into a corner.
| 
| nerdponx wrote:
| Is this "coding for backprop", or "coding for the same results
| as backprop"?
| 
| klmadfejno wrote:
| > Predictive coding is the idea that BNNs generate a mental
| model of their environment and then transmit only the
| information that deviates from this model. Predictive coding
| considers error and surprise to be the same thing. Hebbian
| theory is a specific mathematical formulation of predictive
| coding.
| 
| This is an excellent, concise explanation. It sounds intuitive,
| like something that could work. I'd love to try and dabble with
| this. Any resources?
| 
| cs702 wrote:
| EDIT: Before you read my comment below, please see
| https://news.ycombinator.com/item?id=26702815 and
| https://openreview.net/forum?id=PdauS7wZBfC for a different
| view.
| 
| --
| 
| If the results hold, they seem significant enough to me that
| I'd go as far as saying the authors of the paper would end up
| getting an important award at some point, not just for
| _unifying the fields of biological and artificial
| intelligence_, but also for making it trivial to train models
| in a fully distributed manner, with _all learning done
| locally_ -- if the results hold.
| 
| Here's the paper: "Predictive Coding Approximates Backprop
| along Arbitrary Computation Graphs"
| 
| https://arxiv.org/abs/2006.04182
| 
| I'm making my way through it right now.
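| 
| As far as I can tell so far, the core mechanism fits in a few
| lines of numpy. A toy sketch of the idea (layer sizes, learning
| rates, and iteration counts are made up; this is my reading of
| the paper, not code from it):
| 
|   import numpy as np
| 
|   rng = np.random.default_rng(0)
|   f, df = np.tanh, lambda a: 1 - np.tanh(a) ** 2
| 
|   # Tiny MLP: x0 -> x1 -> x2, with weights W[0], W[1]
|   sizes = [4, 8, 2]
|   W = [rng.normal(0, 0.5, (sizes[i + 1], sizes[i]))
|        for i in range(2)]
|   x0 = rng.normal(size=(4, 1))
|   target = rng.normal(size=(2, 1))
| 
|   # Clamp input and output, then relax the hidden activities by
|   # descending the energy F = sum_l ||eps_l||^2 / 2, where
|   # eps_l = x_l - f(W_{l-1} x_{l-1}) is each layer's local
|   # prediction error.
|   x = [x0, f(W[0] @ x0), target]
|   for _ in range(200):   # converges in ~100-200 iterations
|       eps = [x[l + 1] - f(W[l] @ x[l]) for l in range(2)]
|       x[1] += 0.1 * (-eps[0] + W[1].T @ (eps[1] * df(W[1] @ x[1])))
| 
|   # At equilibrium the prediction errors play the role of
|   # backprop's deltas, so each weight update needs only
|   # quantities local to its own layer:
|   eps = [x[l + 1] - f(W[l] @ x[l]) for l in range(2)]
|   dW = [(eps[l] * df(W[l] @ x[l])) @ x[l].T for l in range(2)]
|   W = [W[l] + 0.01 * dW[l] for l in range(2)]
| 
| The striking part is that last step: every update uses only
| pre- and post-synaptic quantities, yet it approximates the
| update backprop would have made.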
| 
| klmadfejno wrote:
| I'm trying to imagine how that works. Imagine you've got a
| neural net. One node identifies the number of feet. One node
| identifies the number of wings. One node identifies color. This
| feeds into a layer that tries to predict what animal it is.
| 
| With backprop, you can sort of assume that given enough scale
| your algo will identify these important features. With local
| learning, wouldn't you get a tendency to identify the easily
| identifiable features many times? Is there a need for a sort of
| middleman, like a one-armed-bandit kind of thing, that makes a
| decision to spawn and despawn child nodes to explore the space
| more?
| 
| TheRealPomax wrote:
| The fallacy there is the idea that "one node" does anything
| useful, rather than optimizing itself in a way such that you
| have _no idea_ what it actually codes for. At the emergent
| level, you see it contribute to coding for wing detection, or
| color detection, or more likely to seventeen different things
| that are supposedly unrelated; it just happens to be generating
| values that somehow contribute to a result for the features the
| various constellations detect.
| 
| (Meaning it might also actually cause one or more
| constellations to perform worse than if it weren't
| contributing, and realistically, you'll never know.)
| 
| SamBam wrote:
| > Is there a need for a sort of middleman, like a one-armed-
| bandit kind of thing, that makes a decision to spawn and
| despawn child nodes to explore the space more?
| 
| What's the one-armed bandit? (Besides a slot machine.)
| 
| My knowledge of this field is rusty, but I actually wrote my
| MSc thesis on novel ways to get Genetic Algorithms to more
| efficiently explore the space without getting stuck, so it
| sounds up my alley.
| 
| fancy_pantser wrote:
| I wonder if you could think of it as a type of optimal-stopping
| problem locally at each node and explore-exploit (multi-armed
| bandit) globally? For example, if each node knows when to halt
| once it hits a [probably local] minimum, the results can be
| shared at that point and the best-performing models can be
| cross-pollinated, or whatever the mechanism is at that point.
| Since copying the models and continuing without gaining ground
| are both wastes of time, you want to dial in that local halting
| point precisely. An overseeing scheduler would record
| epoch-level results and make the decisions, of course.
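| 
| The global half of that is the textbook bandit setting. A
| sketch of the standard UCB1 rule in Python (here the "reward"
| for picking a node would be whatever loss improvement it
| reports back to the scheduler; all names are made up):
| 
|   import math
| 
|   def ucb1_pick(pulls, rewards, t):
|       """Pick the arm (node) with the best upper confidence
|       bound. pulls[i]: times arm i was tried; rewards[i]: its
|       summed payoff; t: total pulls so far."""
|       for i, n in enumerate(pulls):
|           if n == 0:
|               return i   # try every arm once before anything else
|       return max(range(len(pulls)),
|                  key=lambda i: rewards[i] / pulls[i]
|                      + math.sqrt(2 * math.log(t) / pulls[i]))
| 
| The square-root term is the exploration bonus: rarely-tried
| nodes still get picked even if their average payoff looks
| mediocre so far.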
| 
| babel_ wrote:
| Interesting follow-up reading:
| 
| "Relaxing the Constraints on Predictive Coding Models"
| (https://arxiv.org/abs/2010.01047), from the same authors. It
| looks at ways to remove neurological implausibility from PCM
| while achieving comparable results. Sadly they only do MNIST in
| this one, and are not as ambitious in testing on multiple
| architectures and problems/datasets, but the results are still
| very interesting and it covers some of the important
| theoretical and biological concerns.
| 
| "Predictive Coding Can Do Exact Backpropagation on
| Convolutional and Recurrent Neural Networks"
| (https://arxiv.org/abs/2103.03725), from different authors. It
| uses an alternative formulation that always converges to the
| backprop result within a fixed number of iterations, rather
| than approximately converging "in practice" within 100-200
| iterations. Not only is this a stronger guarantee, it means
| they achieve inference speeds within spitting distance of
| backprop, levelling the playing field.
| 
| It'd be interesting to see what a combination of these two
| could do, and at this point I feel like a logical next step
| would be to provide some setting in popular ML libraries such
| that backprop can be switched out for PCM. Being able to verify
| this research just by adding a single extra line for the PCM
| version, and perhaps replicating state-of-the-art
| architectures, would be quite valuable.
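| 
| I'm picturing something like this (an entirely hypothetical
| knob; no ML library exposes such a switch today):
| 
|   # hypothetical API -- "grad_estimator" does not exist anywhere
|   model.fit(data, grad_estimator="predictive_coding")  # vs "backprop"
| 
| If the two really do converge to the same updates, that one
| line is all a replication study should need to change.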
| 
| abraxas wrote:
| I'm going to personally flog any researcher who titles their
| next paper "Predictive Coding Is All You Need". You've been
| warned.
| 
| cs702 wrote:
| There are already 60+ of those, and counting, all but one of
| them since Vaswani et al.'s transformer paper:
| 
| https://arxiv.org/search/?query=is+all+you+need&searchtype=a...
| 
| eutropia wrote:
| Here's a more recent paper (March 2021) which cites the above
| paper: https://arxiv.org/abs/2103.04689 "Predictive Coding Can
| Do Exact Backpropagation on Any Neural Network"
| 
| cs702 wrote:
| Yup. I'd expect to see many more citations going forward. In
| particular, I'd be excited to see how this ends up getting used
| in practice, e.g., training and running very large models on
| distributed, massively parallel "neuromorphic" hardware.
| 
| JackFr wrote:
| My background is that of an interested amateur, but
| 
| > also for making it trivial to train models in a fully
| distributed manner, with all learning done locally
| 
| seems like a really huge development.
| 
| At the same time, I remain pretty skeptical of claims of
| unifying the fields of biological and artificial intelligence.
| I think the recent tremendous successes in AI & ML lead to an
| unjustified overconfidence that we are close to understanding
| the way biological systems must work.
| 
| himinlomax wrote:
| Indeed, it's worth mentioning that we still have absolutely no
| idea how memory works.
| 
| andyxor wrote:
| We know a lot about memory, but most AI researchers are simply
| ignorant of neuroscience and cognitive psychology and stick
| with their comfort zone.
| 
| Saying "we have no idea" is just being lazy.
| 
| andyxor wrote:
| The thing is, about every week there is a paper published with
| groundbreaking claims, with this question in particular being
| very popular, trying to unify neuroscience and deep learning in
| some way, in search of the computational foundations of AI.
| Mostly this is driven by the success of DL in certain
| industrial applications.
| 
| Unfortunately, most of these papers are heavy on theory but
| light on empirical evidence. If we follow the path of the
| natural sciences, theory has to agree with evidence. Otherwise
| it's just another theory unconstrained by reality, or worse,
| pseudo-science.
| 
| autopoiesis wrote:
| The paper (arXiv:2103.04689) linked by eutropia above has some
| empirical evidence on the ML side, showing that the performance
| of predictive coding is not so far off backprop. And there is
| no shortage of suggestions for how neural circuits might work
| around the strict requirements of backprop-like algorithms.
| 
| cs702's original comment above is excessively hyperbolic: the
| compositional structure of Bayesian inversion is well known and
| is known to coincide structurally with the backward/forward
| structure of automatic differentiation. And there have been
| many papers before this one showing how predictive coding
| approximates backprop in other cases, so it is no surprise that
| it can do so on arbitrary graphs, too. I agree with the ICLR
| reviewers that this paper is borderline and not in itself a
| major contribution. But that does not mean that this whole
| endeavour, of trying to find explicit mathematical connections
| between biological and artificial learning, is ill-motivated.
| 
| eli_gottlieb wrote:
| > the compositional structure of Bayesian inversion is well
| known
| 
| /u/tsmithe's results on that are _well known_, now? I can
| scarcely find anyone to collaborate with who understands them!
| 
| YeGoblynQueenne wrote:
| Note that the paper was rejected for publication at ICLR 2021:
| 
| https://openreview.net/forum?id=PdauS7wZBfC
| 
| hctaw wrote:
| I don't know enough about biology or ML to know if what I'm
| posting below is totally wrong, but here goes.
| 
| "Backprop" == "feedback" of a non-linear dynamical system.
| Feedback is a mathematical description of the behavior of
| systems, not a literal one.
| 
| I don't know if BNNs are incapable of backprop any more than an
| RLC filter is incapable of "feedback", when analyzing the ODE
| of the latter tells you that there's a feedback path (which is
| what, physically? The return path for charge?).
| 
| So what makes BNNs incapable of feedback? Are they mechanically
| and electrically insulated from each other? How do they share
| information, and what is the return path?
| 
| Other than that, I wish more unification was done between ML
| algorithms and dynamical systems, just in general. There's too
| much crossover to ignore.
| 
| andyxor wrote:
| The backprop learning algorithm requires information non-local
| to the synapse to be propagated from the output of the network
| backwards, to affect neurons deep in the network.
| 
| There is simply no evidence for this global feedback loop,
| global error correction, or delta-rule training in the
| neurophysiological data collected over the last 80 years of
| intensive research. [1]
| 
| As for "why": biological learning is primarily shaped by
| evolution, driven by energy-expenditure constraints and the
| survival of the most efficient adaptation engines. One can
| speculate that iterative optimization akin to the kind run by
| GPUs in ANNs is far too energy-inefficient to be sustainable in
| a living organism.
| 
| A good discussion of the biological constraints on learning
| (from a CompSci perspective) can be found in Leslie Valiant's
| book [2]. Prof. Valiant is the author of PAC [3], one of the
| few theoretically sound models of modern ML, so he's worth
| listening to.
| 
| [1] https://news.ycombinator.com/item?id=26700536
| 
| [2] https://www.amazon.com/Circuits-Mind-Leslie-G-
| Valiant/dp/019...
| 
| [3]
| https://en.wikipedia.org/wiki/Probably_approximately_correct...
| 
| hctaw wrote:
| I think there's a significant difference worth illustrating:
| "there is no feedback path in the brain" is not at all
| equivalent to "learning by feedback is not possible in the
| brain."
| 
| It's well known in dynamics that feed-forward networks are no
| longer feed-forward when outputs are coupled to inputs. An
| example would be a hypothetically feed-forward network of
| neurons in an animal, with environmental conditioning teaching
| it the consequences of its actions.
| 
| I'm very curious about the biological constraints, but I'd
| reiterate my point above that feedback is a mathematical or
| logical abstraction for analyzing the behavior of the things we
| call networks, which are themselves abstractions. There's a
| distinction between the physical behavior of the things we see
| and the mathematical models we construct to describe them, as
| with electromechanical systems where physically no such
| coupling from output to input appears to exist, yet its
| existence is crucially important analytically.
| 
| khawkins wrote:
| > Other than that, I wish more unification was done between ML
| algorithms and dynamical systems, just in general. There's too
| much crossover to ignore.
| 
| Check out "Deep relaxation: partial differential equations for
| optimizing deep neural networks" by Pratik Chaudhari, Adam
| Oberman, Stanley Osher, Stefano Soatto and Guillaume Carlier.
| 
| https://link.springer.com/article/10.1007/s40687-018-0148-y
| 
| nerdponx wrote:
| The article says this:
| 
| > The backpropagation algorithm requires information to flow
| forward and backward along the network. But biological neurons
| are one-directional. An action potential goes from the cell
| body down the axon to the axon terminals to another cell's
| dendrites. An action potential never travels backward from a
| cell's terminals to its body.
| 
| The point of the research here is that backpropagation turns
| out not to be necessary to fit a neural network, and that it
| can be approximated with predictive coding, which does not
| require end-to-end backwards information flow.
| 
| candiodari wrote:
| Yeah, but then you run into the problem of computation speed.
| Any given neuron in the middle of your brain does 1 computation
| per second absolute maximum, and 1 per 10 seconds is more
| realistic. Further out (the vast majority of your brain), 1 per
| 100 seconds is a lot. And it slows down as you age.
| 
| This means brains must have a _bloody_ good update rule. Even
| at 1 billion operations per second, you'd only accumulate ~4e17
| operations by the time you're 12 -- about 2 million training
| steps per neuron, or about half that assuming you sleep. You
| cannot get to the level of a 12-year-old in 4e17 operations,
| because GPT-3's training used more than that, and while it's
| impressive, it doesn't have anything on a 12-year-old.
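| 
| (The 4e17 figure checks out as a back-of-the-envelope bound:
| 
|   seconds = 12 * 365 * 24 * 3600   # ~3.8e8 seconds in 12 years
|   ops = 1e9 * seconds              # at 1e9 ops/sec: ~3.8e17
| 
| so even granting the brain a generous one billion update
| operations per second, ~4e17 is the entire budget.)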
| 
| salawat wrote:
| So... I don't understand.
| 
| > An action potential goes from the cell body down the axon to
| the axon terminals to another cell's dendrites.
| 
| How do you figure that doesn't allow backprop?
| 
| A neuronal bit is a loop of neurons. Information absolutely can
| back-propagate. If it couldn't, how does anyone think it'd be
| at all possible to learn how to get better at anything?
| 
| Neuron fires dendrite to axon, a second neuron fires dendrite
| to axon, an axon branches back to the previous neuron's
| dendrites; rinse, repeat, or add more intervening neurons...
| Trying to rule out backprop based on the morphology of a single
| neuron is kinda missing the point.
| 
| It's all about the level of connection between neurons, and how
| long, or whether, a signal returns unmodified to the
| progenitor, which affects the stability of the encoded
| information or behavior. At least, that's the best I've been
| able to plausibly model it. I haven't exactly figured out how
| to shove a bunch of measuring sticks in there to confirm or
| deny it, but I just can't see how a unidirectional
| action-potential-forwarding element implies a lack of backprop
| in a graph of connections fully capable of developing cycles.
| 
| nmca wrote:
| Interesting discussion on the ICLR OpenReview, resulting in a
| reject:
| 
| https://openreview.net/forum?id=PdauS7wZBfC
| 
| justicezyx wrote:
| [1] is another well-received paper, but I want to point out
| that ICLR should really have an industry track.
| 
| The type of research in [1] (an exhaustive analytic study of
| various parameters in RL training) is clearly beyond a typical
| academic environment, and probably also beyond normal industry
| labs. Note the paper was from Google Brain.
| 
| The study consumes a lot of people's time, and computing time.
| It's no doubt very useful and valuable. But I don't think such
| papers should be judged by the same group of reviewers as the
| other work from normal universities.
| 
| [1] https://openreview.net/forum?id=nIAxjsniDzg
| 
| justicezyx wrote:
| Copied from that URL, the final review comment, which 1)
| summarizes the other reviews and 2) describes the rationale for
| the rejection:
| 
| ```
| This paper extends recent work (Whittington & Bogacz, 2017,
| Neural Computation, 29(5), 1229-1262) by showing that
| predictive coding (Rao & Ballard, 1999, Nature Neuroscience,
| 2(1), 79-87) as an implementation of backpropagation can be
| extended to arbitrary network structures. Specifically, the
| original paper by Whittington & Bogacz (2017) demonstrated that
| for MLPs, predictive coding converges to backpropagation using
| local learning rules. These results were important/interesting,
| as predictive coding has been shown to match a number of
| experimental results in neuroscience, and locality is an
| important feature of biologically plausible learning
| algorithms.
| 
| The reviews were mixed. Three out of four reviews were above
| the threshold for acceptance, but two of those were only just
| above it. Meanwhile, the fourth review gave a score of clear
| reject. There was general agreement that the paper was
| interesting and technically valid. But the central criticisms
| of the paper were:
| 
| (1) Lack of biological plausibility. The reviewers pointed to a
| few biologically implausible components of this work. For
| example, the algorithm uses local learning rules in the same
| sense that backpropagation does, i.e., if we assume that there
| exist feedback pathways with weights symmetric to the
| feedforward pathways, then the algorithm is local. Similarly,
| it is assumed that there are paired error neurons, which is
| biologically questionable.
| 
| (2) Speed of convergence. The reviewers noted that this model
| requires many more iterations to converge on the correct
| errors, and questioned the utility of a model that involves
| this much additional computational overhead.
| 
| The authors included some new text regarding biological
| plausibility and speed of convergence. They also included some
| new results to address some of the other concerns. However,
| there is still a core concern about the importance of this work
| relative to the original Whittington & Bogacz (2017) paper. It
| is nice to see those original results extended to arbitrary
| graphs, but is that enough of a major contribution for
| acceptance at ICLR? Given that there are still major issues
| related to (1) in the model, it is not clear that this
| extension to arbitrary graphs is a major contribution for
| neuroscience. And, given the issues related to (2) above, it is
| not clear that this contribution is important for ML.
| Altogether, given these considerations and the high bar for
| acceptance at ICLR, a "reject" decision was recommended.
| However, the AC notes that this was a borderline case.
| ```
| 
| The core reason is that the proposed model lacks biological
| plausibility; or, ignoring that weakness, that the model is
| computationally more intensive.
| 
| I have NOT read the paper, but the review seems mostly based on
| "feeling"; i.e., the reviewers feel that this work is not above
| the bar. Note that I am not criticizing the reviewers here. In
| my past reviewing career (maybe 100+ papers, which I did until
| 6 years ago), most submissions were junk. For the ones that
| were truly good work, checking all the boxes -- new result,
| hard problem, solid validation -- it was easy to accept.
| 
| A few other papers, which all seemed to fall into the "feeling"
| category, looked right in every respect, but they were always
| borderline, and the review results could vary substantially
| based on the reviewers' own backgrounds.
| 
| marmaduke wrote:
| The review is great: it contains all the interesting points and
| counterpoints, in a much more succinct format than the article
| itself.
| 
| ilaksh wrote:
| Does anyone know of a simple code example that demonstrates the
| original predictive coding concept from 1999? Ideally applied
| to some type of simple image/video problem.
| 
| I thought I saw a Matlab explanation of that '99 paper but have
| not found it again.
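| 
| The skeleton I have in mind is something like this (my own
| rough numpy guess at the linear case, surely not faithful to
| the actual Rao & Ballard model, so corrections welcome):
| 
|   import numpy as np
| 
|   rng = np.random.default_rng(0)
|   I = rng.random((64, 100))         # 100 flattened 8x8 image patches
|   U = rng.normal(0, 0.1, (64, 16))  # basis: 16 top-down "causes"
|   r = np.zeros((16, 100))           # latent representation per patch
| 
|   for _ in range(500):
|       e = I - U @ r                    # prediction error: what the
|                                        # top-down model failed to explain
|       r += 0.01 * (U.T @ e - 0.1 * r)  # inference: update the causes
|       U += 0.001 * (e @ r.T) / 100     # learning: Hebbian-style basis update
| 
| Only the errors e would need to travel up the hierarchy, which
| is the "transmit only the surprise" idea from the article.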
| 
| phreeza wrote:
| This was already shown for MLPs some years ago, and it is not
| really that surprising that it applies to many other
| architectures. Note that while learning can take place locally,
| it does still require an upward and downward stream of
| information flow, which is not supported by the neuroanatomy in
| all cases. So while it is an interesting avenue of research, I
| don't think it's anywhere near as revolutionary as this blog
| post makes it out to be.
| 
| AbrahamParangi wrote:
| This is an overly strong claim for the paper (which is good!)
| that backs it.
| 
| If anyone is interested in the reader's-digest version of the
| original paper, check out
| https://www.youtube.com/watch?v=LB4B5FYvtdI
| 
| fouric wrote:
| > Predictive coding is the idea that BNNs generate a mental
| model of their environment and then transmit only the
| information that deviates from this model. Predictive coding
| considers error and surprise to be the same thing.
| 
| This reminds me of a Slate Star Codex article on Friston [1].
| 
| [1] https://slatestarcodex.com/2018/03/04/god-help-us-lets-
| try-t...
___________________________________________________________________
(page generated 2021-04-05 23:00 UTC)