[HN Gopher] Researchers propose faster, more efficient alternati...
___________________________________________________________________
Researchers propose faster, more efficient alternatives to backpropagation

Author : webmaven
Score  : 145 points
Date   : 2020-12-20 17:12 UTC (5 hours ago)

(HTM) web link (venturebeat.com)
(TXT) w3m dump (venturebeat.com)

| dr_j_ wrote:
| So assign random values to connection weights and then 'spin' those weights to a combination of other random values that hopefully perform a bit more favourably... isn't this just random search?
|
| quotemstr wrote:
| It's not a random search through the parameter space:
|
| "But how do we select a good network from these K^n different networks? Brute-force evaluation of all possible configurations is clearly not feasible due to the massive number of different hypotheses. Instead, we present an algorithm, shown in Figure 1, that iteratively searches the best combination of connection values for the entire network by optimizing the given loss. To do this, the method learns a real-valued quality score for each weight option. These scores are used to select the weight value of each connection during the forward pass. The scores are then updated in the backward pass based on the loss value in order to improve training performance over iterations."
|
| It's actually pretty clever.
|
| ssivark wrote:
| Random search is a technical term in optimization with a very specific meaning (which unfortunately does not mean searching random locations in parameter space a la brute force). It's more in the spirit of randomly deciding the direction in which to try to take the next step, thereby implicitly deriving a gradient component by sampling.
|
| https://en.m.wikipedia.org/wiki/Random_search
|
| sdenton4 wrote:
| It reminds me of Bayesian model sampling, where you have a distribution over possible weights and 'draw' a model from the distribution for each evaluation...
| A problem is that there may be interesting co-dependencies amongst the weights which independent sampling will have a hard time getting right.
|
| thweroiuorier wrote:
| Meh. Seems like a lot of hot air over nothing significant.
|
| sesuximo wrote:
| Can someone ELI am an undergrad? I don't see how gradient descent "forgets" anything.
|
| saiojd wrote:
| Gradient descent doesn't per se, but retraining ("fine-tuning") on another dataset forgets most of the training done on the first dataset.
|
| joshgel wrote:
| Is this true? My understanding was that in fine-tuning, you'd only retrain some of the layers. And even if you retrain all the layers, the starting point for the layers is not random. If it really was all forgotten then fine-tuning would not be orders of magnitude faster...
|
| thunderbird120 wrote:
| Gradient descent optimizes performance of a model on a given dataset. If you stop training on one dataset and start training on another one, your model will become more optimized for the second dataset and less optimized for the first. This will usually result in degraded performance on classes of data found more commonly in the first dataset but not the second. This is what people mean by "forgetting". It doesn't matter how much of the model you fine-tune; the effect is still present, though the effect size varies.
|
| unishark wrote:
| It's not a complicated concept, just a stretch of the concept of memory. Training in deep learning is done in batches. So "learning" (i.e. the gradient updates to weights) that happens due to your early batches of data can be undone by the gradient updates for later batches.
|
| The gradient in machine learning is based on the loss. Specifically, the update direction (the negative gradient) is the one that reduces the loss the fastest. So the updates are driven not just by the most recent batches, but specifically by the recent data that is predicted incorrectly.
| It doesn't have any "confidence" from the memory of what was predicted right previously, for example; it currently only cares about changing to suit the most recent batches.
|
| sdenton4 wrote:
| Seems like you could just use better active learning strategies to get around the issue, though... Keep your usual dataset, but progressively build a reservoir of 'important' examples while training (where important == high loss or near decision boundary, for example). Then when building batches, mix in some examples from the broad training set and some from the reservoir.
|
| cl3misch wrote:
| I find the nomenclature in this article a bit weird.
|
| > Another disadvantage of backpropagation is its tendency to become stuck in the local minima of the loss function. Mathematically, the goal in training a model is converging on the global minimum, the point in the loss function where the model has optimized its ability to make predictions.
|
| "Backpropagation" is the method for computing the gradient of a loss function with respect to the weights. But the article repeatedly uses the term as if it were the whole optimization algorithm, running into local minima.
|
| joe_the_user wrote:
| Wikipedia: "The term backpropagation strictly refers only to the algorithm for computing the gradient, not how the gradient is used; however, the term is often used loosely to refer to the entire learning algorithm, including how the gradient is used, such as by stochastic gradient descent."
|
| -- Meaning follows usage.
|
| https://en.wikipedia.org/wiki/Backpropagation
|
| enriquto wrote:
| This is wrong, stupid and extremely confusing.
|
| edoceo wrote:
| Could you provide a definition that is correct, smart and sorta helpful?
|
| atty wrote:
| As someone who uses this in their day job, I have no problem using loose terminology to describe a well known procedure.
| Most of the time when I am referring to the act of optimization, it doesn't matter exactly what method I am using, and I can use backprop as a stand-in. If I'm talking about the technical details of my work, I will state the specific optimization strategy. Everyone on my team does similar things, and no one is confused or misled. Use rigorous language when necessary, and use colloquial language when appropriate.
|
| enriquto wrote:
| > Use rigorous language when necessary, and use colloquial language when appropriate.
|
| Do you think a peer-reviewed publication is formal enough to warrant precise language? These publications are not only read by specialists in the field. I use automatic differentiation in my daily work, but I'm not familiar with machine learning. Thus I am very confused when "backpropagation" is used to mean an optimization algorithm.
|
| EDIT: It is as if physicists used the term "special relativity" to talk about "quantum mechanics" because, after all, quantum mechanics happens in Lorentzian spacetime. Now for specialists of quantum physics it may make sense, since they are using "special relativity" to distinguish it from fancier quantum theories that combine field theory with GR. But for normal people it would certainly be misleading. Using "backpropagation" to include optimization has the same feeling.
|
| miemo23 wrote:
| QM doesn't happen in Lorentzian spacetime, QFT does though
|
| enriquto wrote:
| yes, sorry, I meant quantum field theory
|
| savant_penguin wrote:
| Some people use backpropagation and gradient descent interchangeably. It's really confusing though
|
| jhrmnn wrote:
| I wouldn't say it's confusing--it's wrong and suggests at the least sloppiness in terminology
|
| ethbr0 wrote:
| > _sloppiness in terminology_
|
| venturebeat.com
|
| Not the worst, but we're not talking _Nature_ or _Spectrum_ here.
|
| option wrote:
| of course it is wrong.
| One nice thing about math is that things are defined precisely, and backpropagation and, say, SGD or Adam are different things
|
| mlthoughts2018 wrote:
| I found it very weird that the SLIDE algorithm from early 2019 isn't mentioned. Maybe I missed it or maybe it is compared just deeper in the referenced publications?
|
| SLIDE seems way, way superior to any of the listed solutions or approaches, as far as I could tell on a first read through.
|
| https://arxiv.org/abs/1903.03129
|
| quotemstr wrote:
| AIUI, that only works on sparse networks
|
| mlthoughts2018 wrote:
| But there's also been a lot of research suggesting most SOTA dense networks are arbitrarily replicable with sparse networks, and may even be better in the sense of less overfitting. Perhaps things like GPT are still an exception, but for most applications SLIDE should work to train networks just as effective as naively specified dense architectures.
|
| quotemstr wrote:
| Yeah. I think part of the problem is just that SLIDE represents a Kuhnesque paradigm shift and these things take time. I really want to play with SLIDE myself but just haven't had a chance.
|
| manjunaths wrote:
| https://beyondbackprop.github.io/
|
| This is the NeurIPS workshop that the article is talking about.
|
| webmaven wrote:
| Aside from this being an excellent overview of NeurIPS 2020 papers on this topic, I found it curious that several of them were anonymous.
|
| Are anonymously submitted papers becoming (more) common? If so, what's driving this?
|
| gwern wrote:
| I'd assume they just haven't been unblinded yet.
|
| webmaven wrote:
| Hmm. Shouldn't papers all be unblinded at once when acceptances/rejections are sent out?
|
| dkislyuk wrote:
| These are workshop submissions (which typically implies a more lightweight review process, for more exploratory work), and it is possible the same submissions are currently in blind review for other conferences in their final form.
|
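The terminology point argued in the threads above (backpropagation computes the gradient; gradient descent, SGD or Adam decide what to do with it) can be made concrete with a small sketch. This is only an illustration under stated assumptions: the one-weight model, the toy data, and every name below (`DATA`, `backprop`, `gd_step`, `momentum_step`, `train`) are hypothetical and come from neither the article nor any library. The same `backprop` function feeds two different update rules.

```python
# Toy model y_hat = w * x with mean squared error. All names are hypothetical.
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x


def backprop(w, data):
    """Backpropagation only: return (loss, dloss/dw) via the chain rule."""
    loss, grad = 0.0, 0.0
    n = len(data)
    for x, y in data:
        y_hat = w * x                    # forward pass
        err = y_hat - y
        loss += err ** 2 / n
        dloss_dyhat = 2.0 * err / n      # backward pass: d loss / d y_hat
        grad += dloss_dyhat * x          # chain rule: d y_hat / d w = x
    return loss, grad


def gd_step(w, grad, lr=0.05):
    """Optimizer 1: plain (full-batch) gradient descent."""
    return w - lr * grad


def momentum_step(w, grad, velocity, lr=0.05, beta=0.9):
    """Optimizer 2: gradient descent with heavy-ball momentum."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity


def train(optimizer, steps=300):
    """Run either optimizer; both consume the same backprop gradient."""
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        _, grad = backprop(w, DATA)
        if optimizer == "gd":
            w = gd_step(w, grad)
        else:
            w, velocity = momentum_step(w, grad, velocity)
    return w
```

Calling `train("gd")` and `train("momentum")` should both drive `w` toward 2 on this toy problem. Swapping the optimizer never touches `backprop`, which is the distinction the commenters are drawing.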
| zipotm wrote:
| Bullshit, backpropagation was discovered by Rosenblatt...
|
| bra-ket wrote:
| I'd say Gottfried W. Leibniz is the true author, as it all comes down to calculus. The particular implementation for "neural nets" is just a special case of function minimization by taking derivatives.
|
| contingencies wrote:
| I like zoom-out views. To push what you describe further, it is essentially what ancient humans or their non-hominid forebears did subconsciously when calculating optimum motion trajectories to catch or spear prey while hunting... merely a version in formal notation ... we can thank the zero of India (https://en.wikipedia.org/wiki/0#History), the Persians (https://en.wikipedia.org/wiki/Algorithm#Etymology), the Islamic renaissance in Europe (https://mitpress.mit.edu/books/islamic-science-and-making-eu...) and numerous others for the slow development of the requisite formal maths. But a rose by any other name would smell as sweet. And perhaps, in the context of the stupefyingly deferred emergence of zero, even nameless!
|
| lock-free wrote:
| I mean if we want to get pedantic I'm pretty sure Shannon used "backpropagation" for machine learning before either was called such.
|
| Feedback for the purpose of regulating the state of a machine in response to input dates to antiquity, if we're really getting absurd. The formal definition is also debatable, I think Maxwell has the strongest claim.
|
| duvenaud wrote:
| I'm one of the speakers at the workshop mentioned [1]. The article is a bit of a concept salad. I'm not familiar with all of the papers mentioned but am happy to try to answer questions.
|
| [1] https://beyondbackprop.github.io/
|
| pretty_dumm_guy wrote:
| Hi Professor,
|
| Good day,
|
| I was wondering whether it would be possible for you to provide an overview of the different methods that you think might have a better shot at replacing the backpropagation algorithm?
|
| duvenaud wrote:
| Sure.
| First of all, I want to say that backprop, by which I mean reverse-mode differentiation for computing gradients, combined with gradient descent for updating parameters, is pretty great. In a sense it's the last thing we should be trying to replace, since pretty much the whole deep learning revolution was about replacing hand-designed functions with ones that can be optimized in this way.
|
| Reverse-mode differentiation has about the same time cost as whatever function you're optimizing, no matter how many parameters you need gradients for. This is about as good as one could hope for, and is what lets it scale to billions of parameters.
|
| The main downside of reverse-mode differentiation (and one of the reasons it's biologically implausible) is that it requires storing all the intermediate numbers that were computed when evaluating the function on the forward pass. So its memory cost grows with the complexity of the function being optimized.
|
| So the main practical problem with reverse-mode differentiation + gradient descent is the memory requirement, and much of the research presented in the workshop is about ways to get around this. A few of the major approaches are:
|
| 1) Only storing a subset of the forward activations, to get noisier gradients at less memory cost. This is what the "Randomized Automatic Differentiation" paper does. You can also save memory and get exact gradients if you reconstruct the activations as you need them (called checkpointing), but this is slower.
|
| 2) Only training one layer at a time. This is what the "Layer-wise Learning" papers are doing. I suppose you could also say that this is what the "feedback alignment" papers are doing.
|
| 3) If the function being optimized is a fixed-point computation (such as an optimization), you can compute its gradient without needing to store any activations by using the implicit function theorem.
| This is what my talk was about.
|
| 4) Some other forms of sensitivity analysis (not exactly the same as computing gradients) can be done by just letting a dynamical system run for a little while. Barak Pearlmutter has some work on how he thinks this is what happens in slow-wave sleep to make our brains less prone to seizures when we're awake.
|
| I'm missing a lot of relevant work, and again I don't even know all the work that was presented at this one workshop. But I hope this helps.
|
| pretty_dumm_guy wrote:
| Thank you for your answer. It appears to me that we are trying to achieve an algorithm that has better time complexity than the one that we have right now (reverse-mode differentiation with gradient descent).
|
| Is it possible to combine these methods in a straightforward manner with methods that try to reduce the space complexity? For example, the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635) seems to reduce space complexity (please do correct me if I am wrong).
|
| Also, based on my rather poor and limited knowledge, it appears to me that the set of proposed methods that reduce space complexity and the set of proposed methods that reduce time complexity are disjoint. Is that the case?
|
| sdenton4 wrote:
| (Lottery Ticket, to date, produces small networks ex post facto... You still have to train the giant network. There's also some indication that it's chancy on 'large' datasets+problems. https://arxiv.org/abs/1902.09574 )
|
| bmc7505 wrote:
| > Barak Pearlmutter has some work on how he thinks this is what happens in slow-wave sleep to make our brains less prone to seizures when we're awake.
|
| Interesting! I am more familiar with Pearlmutter's work on automatic differentiation, but was unaware of this work with Houghton.
|
| A new hypothesis for sleep: tuning for criticality: https://zero.sci-hub.se/2153/6c1cfbc1b78d23ef2e1cb7102dd8339...
|
| There is also a related paper on wake-sleep learning from UofT, of which I am sure you are aware:
|
| The wake-sleep algorithm for unsupervised neural networks: https://www.cs.toronto.edu/~hinton/absps/ws.pdf
|
| Are you aware of any recent work investigating the role of sleep in biological and statistical learning?
|
| im3w1l wrote:
| One approach I've been thinking about is optimizing each neuron using only the global loss and information about the neighboring neurons.
|
| Basically, if the network made the correct prediction, tell each neuron to do a little bit more of what it just did. If it sent a high output, change the weights so it sends an even higher output. Weaken connections that were inhibitory and strengthen connections that were excitatory. And for a neuron with a low output, make it even lower by doing the opposite.
|
| If on the other hand the prediction was wrong, then try to make the neuron do less of what it did.
|
| Do you know if something like this has been tried?
|
| sdenton4 wrote:
| Backprop is good at giving credit where credit is due: you're looking at the impact of each weight on loss, which allows changing each weight to improve the loss, by an appropriate amount proportional to the other weights. You can even have some negative weight gradients and some positive; i.e., it may be that even with a 'good' overall result it's best to turn down a particular weight.
|
| So my guess is that this approach would either take a much longer time to converge (as there's less information transmitted back for the neuron updates) or stall out completely.
|
| Probably not too hard to code up, if you want to try it. But I would also be pretty surprised if it hadn't been tried before.
|
| bra-ket wrote:
| The best part of the article is a quote by G. Hinton, the father of "deep learning", at the end: "My view is throw it all away and start again, I don't think that's how the brain works."
|
| officehero wrote:
| The main problem is not backpropagation though, but the fixation of resources on DL projects (that's what I call a local minimum!). In my department, for example, they don't seem to care about the application, integration, deployment etc, as long as it's DL or DRL.
|
| faitswulff wrote:
| Didn't he recently publish an article about a drastically different way to approach machine learning called capsule networks?
|
| deehouie wrote:
| While Hinton's view needs to be noted, I heard a quote attributed to Yann LeCun, something like,
|
| "If you want to learn flying by modeling the biology of birds, you're doing it wrong. Just look at today's airplanes. They have no resemblance to birds at all. Yet they're million times better and faster than any bird."
|
| justicezyx wrote:
| "They have no resemblance to birds at all. Yet they're million times better and faster than any bird."
|
| LMAO
|
| A 6-year-old kid can see the fundamental resemblance between a bird and a modern passenger airplane: the wings, tail stabilizer, slender body.
|
| Planes are faster, bigger.
|
| Are they better?
|
| Not necessarily. For example, a hummingbird can fly in a way that is far beyond any human-made machine in terms of efficiency and flexibility.
|
| Of course man should not imitate birds, because human flight is a fundamentally different activity than bird flying. But to say human aviation did not start by mimicking birds is like saying ANNs were not inspired by the human brain...
|
| nightski wrote:
| I think the point was that aviation at the time did start by mimicking birds and that was why there was so much failure. It was not until they let go of mimicking birds and took a different approach that they found success.
|
| bbarnett wrote:
| I get what the author was trying to say, but it's still a very limited view. Mostly because of the last bit (better/faster).
|
| Birds are to planes, as humans are to cars.
| Yet can a car leap over barricades, climb mountains, trees, self-repair, turn on a dime, stop instantly, etc, etc?
|
| A plane cannot maneuver like a bird, take off in crazy weather conditions, land on a dime in a tree, stop almost instantly in flight, and change direction, etc.
|
| I think what you've quoted has a lot of value here, for, what we should expect from an artificial brain, isn't a human brain. This is truth. However, while it may be faster in a specific capacity, it won't have the same characteristics.
|
| So yes, expecting it to be like a human brain doesn't make sense.
|
| Yet better/faster? I don't think we can compare this, they're too different.
|
| (which is really the quote's point, but I just didn't like the better/faster bit at the end...)
|
| 2-tpg wrote:
| > The question of whether a computer can think is no more interesting than the question of whether a submarine can swim. --- Dijkstra
|
| Better/faster we would not directly compare to humans, but to benchmarks and timed experiments.
|
| LeCun is saying to treat "intelligence" the same as "flight" or "swimming". It is a matter of function, not a matter of a specific instantiation on a biological substrate. You don't need to recreate flapping wings to gain "flight", you can strap a combustion engine on a cylinder and beat all birds on earth in regards to speed. You don't say "we don't have flight yet", because an airplane is not able to land on a tree branch. Maybe we don't have yet all the components and aspects of "flight", but this is not a show stopper, and drones have come a long way.
|
| Now the more interesting question becomes: What are the laws of aerodynamics for intelligence?
|
| Aside: I think it is _absolutely insane_ that a conference workshop with papers yet to go through peer review is highlighted as a popsci article on VentureBeat.
| That's such a narrow workshop that even researchers in the field may be unaware of it. And now these get to read the paper summaries from a HN-story. "the centre cannot hold".
|
| Aside II: Yann LeCun talk from 2019 about this subject (better to debate the source ;)):
|
| > _Clearly, Deep Learning research would greatly benefit from better theoretical understanding. DL is partly engineering science in which we create new artifacts through theoretical insight, intuition, biological inspiration, and empirical exploration. But understanding DL is a kind of "physical science" in which the general properties of this artifact is to be understood. The history of science and technology is replete with examples where the technological artifact preceded (not followed) the theoretical understanding: the theory of optics followed the invention of the lens, thermodynamics followed the steam engine, aerodynamics largely followed the airplane, information theory followed radio communication, and computer science followed the programmable calculator. My two main points are that (1) empiricism is a perfectly legitimate method of investigation, albeit an inefficient one, and (2) our challenge is to develop the equivalent of thermodynamics for learning and intelligence. While a theoretical underpinning, even if only conceptual, would greatly accelerate progress, one must be conscious of the limited practical implications of general theories._ --- https://www.ias.edu/video/DeepLearningConf/2019-0222-YannLeC...
|
| nn3 wrote:
| Also, bird (and insect/bat/pterosaur) flight is a lot more energy efficient than any plane. Today's deep learning is essentially brute force, burning thousands of watts for anything more complicated, which a single human brain can often do in ~15 watts.
|
| The advanced models like GPT-3 are burning millions of watts in the cloud but they're not that much better than what a brain can do (and in many ways worse, as in often requiring supervised learning)
|
| That's the key point. The algorithms need to become more energy efficient to make significant leaps, thus become more like brains.
|
| starpilot wrote:
| No, it's not. What's your comparison? Are there birds that can carry 80,000 lb of passenger + cargo weight? Condors fly like fixed-wing aircraft for 99% of their flight, hummingbirds fly more like insects. There isn't one type of bird flight.
|
| This whole HN discussion of bird flight is a trainwreck and reflects massive gaps in understanding of aerodynamics. This is '00s "computer virus news report" level competence in this subject.
|
| sterlind wrote:
| We understand the aerodynamics of bird flight, and used it to make fixed-wing planes optimized for carrying lots of cargo. Once we understand the principles behind intelligence, we can make very efficient AI optimized for our usage. But we're still at the point where we don't understand intelligence as well as we understood aerodynamics when building the first planes, so we still have a lot to learn from "birds" - animal brains.
|
| frongpik wrote:
| I believe AI will start as a basic principle or idea that can be applied to any sufficiently big state machine that controls e.g. an RC airplane or traffic lights. That idea will be obvious in hindsight. I'd even make a guess that it will be like a "stateful" state machine that accumulates state in a particular manner and uses that to control the underlying state machine. We still will be nowhere near understanding intelligence, but that clever trick will be enough in most cases.
|
| ant6n wrote:
| Also, birds produce themselves out of an egg, with only food, water and air as production input. They also can produce more of themselves with minimal input.
| They are also self-repairing/maintaining, something planes can't do.
|
| bra-ket wrote:
| the problem with that analogy is trying to build an airplane before you figured out the laws of physics.
|
| hyko wrote:
| People build things without understanding the underlying principles all the time, e.g. the steam engine. You could probably make the case that building things has helped our understanding more than our understanding has helped us build things.
|
| Having said that, you can certainly improve a design when you better understand the fundamentals (vs intuition + trial & error).
|
| sdenton4 wrote:
| The physics of lift and aerodynamics were faaaaar from well understood at the time of the first airplanes, though. New areas tend to run a bit ahead of the underlying science; the fundamentals expand to support and improve the applications over time.
|
| bra-ket wrote:
| but we did have quite a few advances at the time of the first airplane, for example by that time steam & combustion engines were already invented, which required non-trivial understanding of physics, chemistry and material science was very advanced.
|
| I hold a pessimistic view that we are still in hunter-gatherer mode when it comes to understanding cognition.
|
| sdenton4 wrote:
| Well, it's your right to be a pessimist... I tend to think that the current hardware specialized for fast, parallelized linear algebra is at least as good as the wheels available at the start of the industrial revolution, though. We have learning algorithms that can match human/animal performance in a wide - but still constrained - set of tasks, which previous non-learned algorithms hadn't been able to crack. It's a start!
|
| At some point you have to strike rocks to make fire, because the butane lighter hasn't been invented yet. You make do with what's available, and progressively get better at it.
| I tend to think that we're a couple-few perspective shifts away from getting it 'right,' and that the hardware side likely barely matters. But, I'm an optimist.
___________________________________________________________________
(page generated 2020-12-20 23:00 UTC)