[HN Gopher] Researchers propose faster, more efficient alternati...
       Researchers propose faster, more efficient alternatives to
       Author : webmaven
       Score  : 145 points
       Date   : 2020-12-20 17:12 UTC (5 hours ago)
 (HTM) web link (venturebeat.com)
 (TXT) w3m dump (venturebeat.com)
       | dr_j_ wrote:
       | So assign random values to connection weights and then 'spin'
       | those weights to a combination of other random values that
       | hopefully perform a bit more favourably.. isn't this just random
       | search?
         | quotemstr wrote:
         | It's not a random search through the parameter space:
         | "But how do we select a good network from these K^n different
         | networks? Brute-force evaluation of all possible configurations
         | is clearly not feasible due to the massive number of different
         | hypotheses. Instead, we present an algorithm, shown in Figure
         | 1, that iteratively searches the best combination of connection
         | values for the entire network by optimizing the given loss. To
         | do this, the method learns a real-valued quality score for each
         | weight option. These scores are used to select the weight value
         | of each connection during the forward pass. The scores are then
         | updated in the backward pass based on the loss value in order
         | to improve training performance over iterations."
         | It's actually pretty clever.
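         | A minimal sketch of the score-based selection the quote
         | describes, assuming a single linear layer, squared loss, and a
         | straight-through style score update. The names and sizes
         | (options, scores, K = 8) are illustrative assumptions, not
         | taken from the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out, K = 4, 1, 8          # K candidate values per connection (assumed sizes)
# Fixed random candidates: these are never updated during training.
options = rng.uniform(-1.0, 1.0, size=(n_in, n_out, K))
# One learnable quality score per candidate value.
scores = rng.uniform(0.0, 0.1, size=(n_in, n_out, K))

def select_weights(options, scores):
    """Pick, for each connection, the candidate with the highest score."""
    best = scores.argmax(axis=-1)                       # (n_in, n_out)
    return np.take_along_axis(options, best[..., None], axis=-1)[..., 0]

def train_step(x, y, options, scores, lr=0.1):
    """One forward/backward pass for a single linear layer with squared loss."""
    w = select_weights(options, scores)                 # forward: use chosen values
    pred = x @ w                                        # (batch, n_out)
    err = pred - y
    loss = float(np.mean(err ** 2))
    grad_w = 2.0 * x.T @ err / err.size                 # dLoss/dw, shape (n_in, n_out)
    # Assumed update rule: each candidate's score moves according to how the
    # loss would change if that candidate's value were used as the weight.
    scores -= lr * grad_w[..., None] * options          # updates scores in place
    return loss

x = rng.normal(size=(64, n_in))
y = x @ np.array([[0.5], [-0.3], [0.8], [0.1]])         # toy regression target
losses = [train_step(x, y, options, scores) for _ in range(200)]
w_final = select_weights(options, scores)
```
         | Only the scores change here; every entry of w_final is still
         | one of the original random candidates.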
           | ssivark wrote:
           | Random search is a technical term in optimization with a very
           | specific meaning (which unfortunately does not mean searching
           | random locations in parameter space a la brute force). It's
           | more in the spirit of randomly deciding the direction in
           | which to try to take the next step, thereby implicitly
           | deriving a gradient component by sampling.
           | https://en.m.wikipedia.org/wiki/Random_search
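           | A rough sketch of that kind of random search, assuming a fixed
           | step length and a toy objective (sphere); both are
           | illustrative choices, not something from the article.
```python
import numpy as np

def random_search(f, x0, step=0.5, iters=500, seed=0):
    """Minimize f by trying random steps of fixed length and keeping improvements."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(iters):
        direction = rng.normal(size=x.shape)
        direction /= np.linalg.norm(direction)   # random unit vector
        candidate = x + step * direction         # sample on a sphere around x
        f_candidate = f(candidate)
        if f_candidate < fx:                     # move only if the sample is better
            x, fx = candidate, f_candidate
    return x, fx

def sphere(x):
    """Toy objective with its minimum at the origin."""
    return float(np.sum(x ** 2))

x_best, f_best = random_search(sphere, x0=[3.0, -2.0])
```
           | No gradient is ever computed; accepted steps trace out a
           | descent direction purely from sampled function values.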
           | sdenton4 wrote:
           | It reminds me of Bayesian model sampling, where you have a
           | distribution over possible weights and 'draw' a model from
           | the distribution for each evaluation... A problem is that
           | there may be interesting co-dependencies amongst the weights
           | which independent sampling will have a hard time getting
           | right.
       | thweroiuorier wrote:
       | Meh. Seems like a lot of hot air over nothing significant.
       | sesuximo wrote:
       | Can someone ELI am an undergrad? I don't see how gradient descent
       | "forgets" anything
         | saiojd wrote:
         | Gradient descent doesn't per se, but retraining ("fine-tuning")
         | on another dataset forgets most of the training done on the
         | first dataset.
           | joshgel wrote:
           | Is this true? My understanding was that in fine tuning, you'd
           | only retrain some of the layers. And even if you retrain
           | all the layers, the starting point for the layers is not
           | random. If it really was all forgotten then fine tuning would
           | not be orders of magnitude faster...
             | thunderbird120 wrote:
             | Gradient descent optimizes performance of a model on a given
             | dataset. If you stop training on one dataset and start
             | training on another one your model will become more
             | optimized for the second dataset and less optimized for the
             | first. This will usually result in degraded performance on
             | classes of data found more commonly in the first dataset
             | but not the second. This is what people mean by
             | "forgetting". It doesn't matter how much of the model you
             | fine-tune, the effect is still present though the effect
             | size varies.
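             | A toy illustration of the effect, assuming a one-parameter
             | model y = w * x and two made-up "datasets" with different
             | slopes. With a single shared parameter the forgetting is
             | total; in a real network it is usually partial.
```python
import numpy as np

def mse(w, x, y):
    """Mean squared error of the model y = w * x."""
    return float(np.mean((w * x - y) ** 2))

def gradient_descent(w, x, y, lr=0.1, steps=200):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = np.mean(2.0 * (w * x - y) * x)
        w -= lr * grad
    return w

x = np.linspace(-1.0, 1.0, 21)
y_first = 2.0 * x       # "first dataset": slope 2 (made-up toy task)
y_second = -1.0 * x     # "second dataset": slope -1 (made-up toy task)

w = gradient_descent(0.0, x, y_first)
loss_first_before = mse(w, x, y_first)     # near zero after training on it

w = gradient_descent(w, x, y_second)       # "fine-tune" on the second dataset
loss_first_after = mse(w, x, y_first)      # much worse: the first fit is undone
```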
         | unishark wrote:
         | It's not a complicated concept, just a stretch of the concept of
         | memory. Training in deep learning is done in batches. So
         | "learning" (i.e. the gradient updates to weights) that happens
         | due to your early batches of data can be undone by the gradient
         | updates for later batches.
         | The gradient in machine learning is based on the loss.
         | Specifically, the update follows the negative gradient, the
         | direction that reduces the loss the fastest. So the weights are
         | driven not only by the most recent batches, but specifically
         | by the recent data that is predicted incorrectly. The model
         | doesn't keep any "confidence" from the memory of what was
         | predicted correctly before, for example; it only cares about
         | changing to suit the most recent batches.
           | sdenton4 wrote:
           | Seems like you could just use better active learning
           | strategies to get around the issue, though... Keep your usual
           | dataset, but progressively build a reservoir of 'important'
           | examples while training. (where important == high loss or
           | near decision boundary, for example.) Then when building
           | batches, mix in some examples from the broad training set and
           | some from the reservoir.
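           | A sketch of what that batching could look like, assuming
           | "important" means highest loss seen so far. LossReservoir and
           | build_batch are hypothetical helpers, and the loss used below
           | is a stand-in, not a real model's loss.
```python
import heapq
import random

class LossReservoir:
    """Keeps the `capacity` highest-loss examples seen so far (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []          # min-heap of (loss, insertion_index, example)
        self._count = 0

    def offer(self, example, loss):
        item = (loss, self._count, example)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif loss > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)   # evict the lowest-loss entry

    def examples(self):
        return [example for _, _, example in self._heap]

def build_batch(dataset, reservoir, batch_size, reservoir_fraction, rng):
    """Mix examples from the reservoir with uniform draws from the full dataset."""
    stored = reservoir.examples()
    n_hard = min(int(batch_size * reservoir_fraction), len(stored))
    batch = rng.sample(stored, n_hard)                      # hard examples first
    batch += rng.choices(dataset, k=batch_size - n_hard)    # then the broad set
    return batch

rng = random.Random(0)
dataset = list(range(100))
reservoir = LossReservoir(capacity=10)
for example in dataset:
    # Stand-in for a real per-example loss: here, larger ids are "harder".
    reservoir.offer(example, loss=float(example))
batch = build_batch(dataset, reservoir, batch_size=16, reservoir_fraction=0.25, rng=rng)
```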
       | cl3misch wrote:
       | I find the nomenclature in this article a bit weird.
       | > Another disadvantage of backpropagation is its tendency to
       | become stuck in the local minima of the loss function.
       | Mathematically, the goal in training a model is converging on the
       | global minimum, the point in the loss function where the model
       | has optimized its ability to make predictions.
       | "Backpropagation" is the method for computing the gradient of
       | the loss function with respect to the weights. But the article
       | repeatedly uses the term as if it were the whole optimization
       | algorithm, running into local minima.
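       | To make the distinction concrete, here is a small sketch with a
       | hand-written two-layer network (an assumed toy setup): backprop()
       | only produces gradients, and sgd_step() is a separate function
       | that decides what to do with them.
```python
import numpy as np

def forward(params, x):
    """Tiny two-layer network; returns the output and the hidden activations."""
    w1, w2 = params
    h = np.tanh(x @ w1)
    return h @ w2, h

def loss_fn(params, x, y):
    pred, _ = forward(params, x)
    return float(np.mean((pred - y) ** 2))

def backprop(params, x, y):
    """Backpropagation: computes dLoss/dparams and nothing else."""
    w1, w2 = params
    pred, h = forward(params, x)
    d_pred = 2.0 * (pred - y) / pred.size
    grad_w2 = h.T @ d_pred
    d_h = d_pred @ w2.T
    grad_w1 = x.T @ (d_h * (1.0 - h ** 2))     # tanh'(a) = 1 - tanh(a)^2
    return [grad_w1, grad_w2]

def sgd_step(params, grads, lr=0.1):
    """The optimizer: uses the gradient. Could be swapped for another update rule."""
    return [p - lr * g for p, g in zip(params, grads)]

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3))
y = rng.normal(size=(32, 1))
params = [rng.normal(size=(3, 4)) * 0.5, rng.normal(size=(4, 1)) * 0.5]

loss_before = loss_fn(params, x, y)
params_after = sgd_step(params, backprop(params, x, y))
loss_after = loss_fn(params_after, x, y)
```
       | Any local-minimum behaviour comes from how sgd_step (or whatever
       | replaces it) moves through the loss surface, not from backprop().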
         | joe_the_user wrote:
         | Wikipedia: "The term backpropagation strictly refers only to
         | the algorithm for computing the gradient, not how the gradient
         | is used; however, the term is often used loosely to refer to
         | the entire learning algorithm, including how the gradient is
         | used, such as by stochastic gradient descent."
         | -- Meaning follows usage.
         | https://en.wikipedia.org/wiki/Backpropagation
           | enriquto wrote:
           | This is wrong, stupid and extremely confusing.
             | edoceo wrote:
             | Could you provide a definition that is correct, smart and
             | sorta helpful?
             | atty wrote:
             | As someone who uses this in their day job, I have no
             | problem using loose terminology to describe a well known
             | procedure. Most of the time when I am referring to the act
             | of optimization, it doesn't matter exactly what method I am
             | using, and I can use backprop as a stand in. If I'm talking
             | about the technical details of my work, I will state the
             | specific optimization strategy. Everyone on my team does
             | similar things, and no one is confused or misled. Use
             | rigorous language when necessary, and use colloquial
             | language when appropriate.
               | enriquto wrote:
               | > Use rigorous language when necessary, and use
               | colloquial language when appropriate.
               | Do you think a peer-reviewed publication is formal enough
               | to warrant precise language? These publications are not
               | only read by specialists in the field. I use automatic
               | differentiation in my daily work, but I'm not familiar
               | with machine learning. Thus I am very confused when
               | "backpropagation" is used to mean an optimization
               | algorithm.
               | EDIT: It is as if physicists used the term "special
               | relativity" to talk about "quantum mechanics" because,
               | after all, quantum mechanics happens in Lorentzian
               | spacetime. Now for specialists of quantum physics it may
               | make sense, since they are using "special relativity" to
               | distinguish it from fancier quantum theories that combine
               | field theory with GR. But for normal people it would be
               | certainly misleading. Using "backpropagation" to include
               | optimization has the same feeling.
               | miemo23 wrote:
               | QM doesn't happen in Lorentzian spacetime, QFT does
               | though
               | enriquto wrote:
               | yes, sorry, I meant quantum field theory
         | savant_penguin wrote:
         | Some people use backpropagation and gradient descent
         | interchangeably. It's really confusing though
           | jhrmnn wrote:
           | I wouldn't say it's confusing--it's wrong and suggests at the
           | least sloppiness in terminology
             | ethbr0 wrote:
             | > _sloppiness in terminology_
             | venturebeat.com
             | Not the worst, but we're not talking _Nature_ or _Spectrum_
             | here.
           | option wrote:
           | of course it is wrong. One nice thing about math is that
           | things are defined precisely, and backpropagation and, say,
           | SGD or Adam are different things
       | mlthoughts2018 wrote:
       | I found it very weird that the SLIDE algorithm from early 2019
       | isn't mentioned. Maybe I missed it, or maybe the comparison is
       | just buried deeper in the referenced publications?
       | SLIDE seems way, way superior to any of the listed solutions or
       | approaches, as far as I could tell on a first read through.
       | https://arxiv.org/abs/1903.03129
         | quotemstr wrote:
         | AIUI, that only works on sparse networks
           | mlthoughts2018 wrote:
           | But there's also been a lot of research suggesting most SOTA
           | dense networks are arbitrarily replicable with sparse
           | networks, and may even be better in the sense of less
           | overfitting. Perhaps things like GPT are still an exception,
           | but for most applications SLIDE should work to train networks
           | just as effective as naively specified dense architectures.
             | quotemstr wrote:
             | Yeah. I think part of the problem is just that SLIDE
             | represents a Kuhnesque paradigm shift and these things take
             | time. I really want to play with SLIDE myself but just
             | haven't had a chance.
       | manjunaths wrote:
       | https://beyondbackprop.github.io/
       | This is the NeurIPS workshop that the article is talking about.
       | webmaven wrote:
       | Aside from being an excellent overview of NeurIPS 2020 papers on
       | this topic, I found it curious that several of them were
       | anonymous.
       | Are anonymously submitted papers becoming (more) common? If so,
       | what's driving this?
         | gwern wrote:
         | I'd assume they just haven't been unblinded yet.
           | webmaven wrote:
           | Hmm. Shouldn't papers all be unblinded at once when
           | acceptances/rejections are sent out?
             | dkislyuk wrote:
             | These are workshop submissions (which typically implies a
             | more lightweight review process, for more exploratory
             | work), and it is possible the same submissions are
             | currently in blind review for other conferences in their
             | final form.
       | zipotm wrote:
       | Bullshit, backpropagation was discovered by Rosenblatt...
         | bra-ket wrote:
         | I'd say Gottfried W. Leibniz is the true author, as it all
         | comes down to calculus. The particular implementation for
         | "neural nets" is just a special case of function minimization
         | by taking derivatives.
           | contingencies wrote:
           | I like zoom-out views. To push what you describe further, it
           | is essentially what ancient humans or their non-hominid
           | forebears did subconsciously when calculating optimum motion
           | trajectories to catch or spear prey while hunting... merely a
           | version in formal notation ... we can thank the zero of India
           | (https://en.wikipedia.org/wiki/0#History), the Persians
           | (https://en.wikipedia.org/wiki/Algorithm#Etymology), the
           | Islamic renaissance in Europe
           | (https://mitpress.mit.edu/books/islamic-science-and-making-eu...)
           | and numerous others for the slow development of the
           | requisite formal maths. But a rose by any other name would
           | smell as sweet. And perhaps, in the context of the
           | stupefyingly deferred emergence of zero, even nameless!
         | lock-free wrote:
         | I mean if we want to get pedantic I'm pretty sure Shannon used
         | "backpropagation" for machine learning before either was called
         | such.
         | Feedback for the purpose of regulating the state of a machine
         | in response to input dates to antiquity, if we're really
         | getting absurd. The formal definition is also debatable, I
         | think Maxwell has the strongest claim.
       | duvenaud wrote:
       | I'm one of the speakers at the workshop mentioned [1]. The
       | article is a bit of a concept salad. I'm not familiar with all of
       | the papers mentioned but am happy to try to answer questions.
       | [1] https://beyondbackprop.github.io/
         | pretty_dumm_guy wrote:
         | Hi Professor,
         | Good day,
         | I was wondering whether it would be possible for you to provide
         | an overview of the different methods that you think might have
         | a better shot at replacing the backpropagation algorithm?
           | duvenaud wrote:
           | Sure. First of all, I want to say that backprop, by which I
           | mean reverse-mode differentiation for computing gradients,
           | combined with gradient descent for updating parameters, is
           | pretty great. In a sense it's the last thing we should be
           | trying to replace, since pretty much the whole deep learning
           | revolution was about replacing hand-designed functions with
           | ones that can be optimized in this way.
           | Reverse-mode differentiation has about the same time cost as
           | whatever function you're optimizing, no matter how many
           | parameters you need gradients for. This is about as
           | good as one could hope for, and is what lets it scale to
           | billions of parameters.
           | The main downside of reverse-mode differentiation (and one of
           | the reasons it's biologically implausible) is that it
           | requires storing all the intermediate numbers that were
           | computed when evaluating the function on the forward pass. So
           | its memory cost grows with the complexity of the function
           | being optimized.
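           | A minimal sketch of that trade-off, assuming a chain of scalar
           | tanh "layers" (an illustrative toy, not a real network): one
           | backward pass yields every gradient, but only because the
           | forward pass saved every intermediate value on a tape.
```python
import numpy as np

def forward_with_tape(weights, x):
    """Run a chain of tanh layers, saving every intermediate value for later."""
    tape = [x]
    for w in weights:
        tape.append(np.tanh(w * tape[-1]))
    return tape[-1], tape

def reverse_mode_grads(weights, tape):
    """Walk the tape backwards to get d(output)/d(w) for every layer in one pass."""
    grads = [0.0] * len(weights)
    upstream = 1.0                                   # d(output)/d(output)
    for i in reversed(range(len(weights))):
        local = 1.0 - tape[i + 1] ** 2               # tanh'(a) = 1 - tanh(a)^2
        grads[i] = upstream * local * tape[i]        # needs the stored input to layer i
        upstream = upstream * local * weights[i]     # sensitivity w.r.t. that input
    return grads

weights = [0.7, -1.2, 0.4, 1.5]      # arbitrary scalar "layers" for illustration
output, tape = forward_with_tape(weights, x=0.3)
grads = reverse_mode_grads(weights, tape)
```
           | The tape holds one entry per layer plus the input, so memory
           | grows with depth even though the backward pass is a single
           | sweep.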
           | So the main practical problem with reverse-mode
           | differentiation + gradient descent is the memory requirement,
           | and much of the research presented in the workshop is about
           | ways to get around this. A few of the major approaches are:
           | 1) Only storing a subset of the forward activations, to get
           | noisier gradients at less memory cost. This is what the
           | "Randomized Automatic Differentiation" paper does. You can
           | also save memory and get exact gradients if you re-construct
           | the activations as you need them (called checkpointing), but
           | this is slower.
           | 2) Only training one layer at a time. This is what the
           | "Layer-wise Learning" papers are doing. I suppose you could
           | also say that this is what the "feedback alignment" papers
           | are doing.
           | 3) If the function being optimized is a fixed-point
           | computation (such as an optimization), you can compute its
           | gradient without needing to store any activations by using
           | the implicit function theorem. This is what my talk was
           | about.
           | 4) Some other forms of sensitivity analysis (not exactly the
           | same as computing gradients) can be done by just letting a
           | dynamical system run for a little while. Barak Pearlmutter
           | has some work on how he thinks this is what happens in slow-
           | wave sleep to make our brains less prone to seizures when
           | we're awake.
           | I'm missing a lot of relevant work, and again I don't even
           | know all the work that was presented at this one workshop.
           | But I hope this helps.
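           | As a rough illustration of approach 3, assume a scalar fixed
           | point z* = tanh(w * z* + x) (a made-up example). Writing
           | f(z, w) = tanh(w * z + x), the implicit function theorem
           | gives dz*/dw = (df/dw) / (1 - df/dz) evaluated at z*, so no
           | intermediate iterates need to be stored.
```python
import math

def solve_fixed_point(w, x, iters=200):
    """Iterate z <- tanh(w * z + x) to convergence, keeping only the latest z."""
    z = 0.0
    for _ in range(iters):
        z = math.tanh(w * z + x)
    return z

def implicit_grad_w(w, x, z_star):
    """dz*/dw from the implicit function theorem, using only the final z*."""
    sech2 = 1.0 - z_star ** 2            # tanh'(w z* + x), since z* = tanh(w z* + x)
    df_dz = w * sech2                    # partial derivative of f w.r.t. z at z*
    df_dw = z_star * sech2               # partial derivative of f w.r.t. w at z*
    return df_dw / (1.0 - df_dz)

w, x = 0.5, 1.0                           # |w| < 1 keeps the iteration a contraction
z_star = solve_fixed_point(w, x)
grad = implicit_grad_w(w, x, z_star)
```
           | The gradient depends only on z*, not on how many iterations
           | the solver took to reach it.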
             | pretty_dumm_guy wrote:
             | Thank you for your answer. It appears to me that we are
             | trying to achieve an algorithm that has better time
             | complexity than the one that we have right now (reverse-mode
             | differentiation with gradient descent).
             | Is it possible to combine these methods in a straightforward
             | manner with methods that try to reduce the space
             | complexity? For example, the Lottery Ticket
             | Hypothesis (https://arxiv.org/abs/1803.03635) seems to
             | reduce space complexity (please do correct me if I am
             | wrong).
             | Also, based on my rather poor and limited knowledge, it
             | appears to me that the set of proposed methods that reduce
             | space complexity and the set of proposed methods that reduce
             | time complexity are disjoint. Is that the case?
               | sdenton4 wrote:
               | (Lottery Ticket, to date, produces small networks ex post
               | facto... You still have to train the giant network.
               | There's also some indication that it's chancy on 'large'
               | datasets+problems. https://arxiv.org/abs/1902.09574 )
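               | To show what "ex post facto" means here, a sketch of a
               | single magnitude-pruning step, assuming random numbers as
               | a stand-in for already trained weights. This is only the
               | pruning step, not the full Lottery Ticket procedure, and
               | magnitude_prune_mask is a hypothetical helper.
```python
import numpy as np

def magnitude_prune_mask(weights, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude entries of `weights`."""
    flat = np.abs(weights).ravel()
    n_keep = max(1, int(round(keep_fraction * flat.size)))
    threshold = np.sort(flat)[-n_keep]               # smallest magnitude that survives
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
trained = rng.normal(size=(8, 8))     # stand-in for weights of an already trained layer
mask = magnitude_prune_mask(trained, keep_fraction=0.25)
sparse = trained * mask               # the small network exists only after full training
```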
             | bmc7505 wrote:
             | > Barak Pearlmutter has some work on how he thinks this is
             | what happens in slow-wave sleep to make our brains less
             | prone to seizures when we're awake.
             | Interesting! I am more familiar with Pearlmutter's work on
             | automatic differentiation, but was unaware of this work
             | with Houghton.
             | A new hypothesis for sleep: tuning for criticality:
             | https://zero.sci-hub.se/2153/6c1cfbc1b78d23ef2e1cb7102dd8339...
             | There is also a related paper on wake-sleep learning from
             | UofT, of which I am sure you are aware:
             | The wake-sleep algorithm for unsupervised neural networks:
             | https://www.cs.toronto.edu/~hinton/absps/ws.pdf
             | Are you aware of any recent work investigating the role of
             | sleep in biological and statistical learning?
         | im3w1l wrote:
         | One approach I've been thinking about is optimizing each neuron
         | using only the global loss and information about the
         | neighboring neurons.
         | Basically if the network made the correct prediction tell each
         | neuron to do a little bit more of what it just did. If it sent
         | a high output, change the weights so it sends an even higher
         | output. Weaken connections that were inhibitory and strengthen
         | connections that were excitatory. And for a neuron with a low
         | output, make it even lower by doing the opposite.
         | If on the other hand prediction was wrong, then try to make the
         | neuron do less of what it did.
         | Do you know if something like this has been tried?
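         | A sketch of the rule for a single neuron, assuming a tanh unit
         | and a +1/-1 global signal; reinforce_neuron is a hypothetical
         | name. It only shows the update itself and says nothing about
         | whether a whole network trained this way would converge.
```python
import numpy as np

def reinforce_neuron(w, x, correct, lr=0.05):
    """Nudge one tanh neuron to do more (or less) of what it just did."""
    out = np.tanh(w @ x)
    direction = 1.0 if out >= 0.0 else -1.0     # which way the neuron "went"
    reward = 1.0 if correct else -1.0           # global signal shared by all neurons
    # Moving w along x raises the pre-activation; moving along -x lowers it.
    return w + lr * reward * direction * x

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)

before = np.tanh(w @ x)
after_correct = np.tanh(reinforce_neuron(w, x, correct=True) @ x)
after_wrong = np.tanh(reinforce_neuron(w, x, correct=False) @ x)
```
         | After a "correct" signal the output moves further from zero in
         | the direction it already had; after a "wrong" signal it moves
         | the other way.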
           | sdenton4 wrote:
           | Backprop is good at giving credit where credit is due: you're
           | looking at the impact of each weight on loss, which allows
           | changing each weight to improve the loss, by an appropriate
           | amount proportional to the other weights. You can even have
           | some negative weight gradients and some positive; i.e., it may
           | be that even with a 'good' overall result it's best to
           | turn down a particular weight.
           | So my guess is that this approach would either take a much
           | longer time to converge (as there's less information
           | transmitted back for the neuron updates) or stall out
           | completely.
           | Probably not too hard to code up, if you want to try it. But
           | I would also be pretty surprised if it hadn't been tried
           | before.
       | bra-ket wrote:
       | The best part of the article is a quote by G. Hinton, the father
       | of "deep learning", at the end: "My view is throw it all away and
       | start again, I don't think that's how the brain works."
         | officehero wrote:
         | The main problem is not backpropagation though, but the
         | fixation of resources on DL projects (that's what I call local
         | minimum!). In my department, for example, they don't seem to
         | care about the application, integration, deployment etc, as
         | long as it's DL or DRL.
         | faitswulff wrote:
         | Didn't he recently publish an article about a drastically
         | different way to approach machine learning called capsule
         | networks?
         | deehouie wrote:
         | While Hinton's view needs to be noted, I heard a quote
         | attributed to Yann LeCun, something like,
         | "If you want to learn flying by modeling the biology of birds,
         | you're doing it wrong. Just look at today's airplanes. They
         | have no resemblance to birds at all. Yet they're million times
         | better and faster than any bird."
           | justicezyx wrote:
           | "They have no resemblance to birds at all. Yet they're
           | million times better and faster than any bird."
           | LMAO
           | A 6-year-old kid can see the fundamental resemblance between
           | a bird and a modern passenger airplane: the wings, the tail
           | stabilizer, the slender body.
           | Planes are faster and bigger.
           | Are they better?
           | Not necessarily. For example, a hummingbird can fly in a way
           | that is far beyond any human machine in terms of efficiency
           | and flexibility.
           | Of course man should not imitate birds, because human flight
           | is a fundamentally different activity from bird flight. But
           | to say human aviation did not start by mimicking birds is
           | like saying ANNs were not inspired by the human brain...
             | nightski wrote:
             | I think the point was that aviation at the time did start
             | mimicking birds and that was why there was so much failure.
             | It was not until they let go of mimicking birds and took a
             | different approach that they found success.
           | bbarnett wrote:
           | I get what the author was trying to say, but it's still -- a
           | very limited view. Mostly because of the last bit
           | (better/faster).
           | Birds are to planes, as humans are to cars. Yet can a car
           | leap over barricades, climb mountains, trees, self-repair,
           | turn on a dime, stop instantly, etc, etc?
           | A plane cannot maneuver like a bird, take off in crazy
           | weather conditions, land on a dime in a tree, stop almost
           | instantly in flight, and change direction, etc.
           | I think what you've quoted has a lot of value here, for, what
           | we should expect from an artificial brain, isn't a human
           | brain. This is truth. However, while it may be faster in a
           | specific capacity, it won't have the same characteristics.
           | So yes, expecting it to be like a human brain doesn't make
           | sense.
           | Yet better/faster? I don't think we can compare this, they're
           | too different.
           | (which is really the quote's point, but I just didn't like
           | the better/faster bit at the end...)
             | 2-tpg wrote:
             | > The question of whether a computer can think is no more
             | interesting than the question of whether a submarine can
             | swim. --- Dijkstra
             | Better/faster we would not directly compare to humans, but
             | to benchmarks and timed experiments.
             | LeCun is saying to treat "intelligence" the same as
             | "flight" or "swimming". It is a matter of function, not a
             | matter of a specific instantiation on a biological
             | substrate. You don't need to recreate flapping wings to
             | gain "flight", you can strap a combustion engine on a
             | cylinder and beat all birds on earth in regards to speed.
             | You don't say "we don't have flight yet", because an
             | airplane is not able to land on a tree branch. Maybe we
             | don't have yet all the components and aspects of "flight",
             | but this is not a show stopper, and drones have come a long
             | way.
             | Now the more interesting question becomes: What are the
             | laws of aerodynamics for intelligence?
             | Aside: I think it is _absolutely insane_ that a conference
             | workshop with papers yet to go through peer-review, is
             | highlighted as a popsci article on VentureBeat. That's
             | such a narrow workshop, that even researchers in the field
             | may be unaware of it. And now these get to read the paper
             | summaries from a HN-story. "the centre cannot hold".
             | Aside II: Yann LeCun talk from 2019 about this subject
             | (better to debate the source ;)):
             | > _Clearly, Deep Learning research would greatly benefit
             | from better theoretical understanding. DL is partly
             | engineering science in which we create new artifacts
             | through theoretical insight, intuition, biological
             | inspiration, and empirical exploration. But understanding
             | DL is a kind of "physical science" in which the general
             | properties of this artifact is to be understood. The
             | history of science and technology is replete with examples
             | where the technological artifact preceded (not followed)
             | the theoretical understanding: the theory of optics
             | followed the invention of the lens, thermodynamics followed
             | the steam engine, aerodynamics largely followed the
             | airplane, information theory followed radio communication,
             | and computer science followed the programmable calculator.
             | My two main points are that (1) empiricism is a perfectly
             | legitimate method of investigation, albeit an inefficient
             | one, and (2) our challenge is to develop the equivalent of
             | thermodynamics for learning and intelligence. While a
             | theoretical underpinning, even if only conceptual, would
             | greatly accelerate progress, one must be conscious of the
             | limited practical implications of general theories._ ---
             | https://www.ias.edu/video/DeepLearningConf/2019-0222-YannLeC...
             | nn3 wrote:
             | Also, bird (and insect/bat/pterosaur) flight is a lot more
             | energy efficient than any plane. Today's deep learning is
             | essentially brute force, burning thousands of watts for
             | anything more complicated, which a single human brain can
             | often do in ~15 watts.
             | The advanced models like GPT-3 are burning millions of
             | watts in the cloud but they're not that much better than
             | what a brain can do (and in many ways worse, as in often
             | requiring supervised learning)
             | That's the key point. The algorithms need to become more
             | energy efficient to make significant leaps, thus become
             | more like brains.
               | starpilot wrote:
               | No, it's not. What's your comparison? Are there birds
               | that can carry 80,000 lb of passenger + cargo weight?
               | Condors fly like fixed-wing aircraft for 99% of their
               | flight, hummingbirds fly more like insects. There isn't
               | one type of bird flight.
               | This whole HN discussion of bird flight is a trainwreck
               | and reflects massive gaps in understanding of
               | aerodynamics. This is '00s "computer virus news report"
               | level competence in this subject.
               | sterlind wrote:
               | We understand the aerodynamics of bird flight, and used
               | it to make fixed-wing planes optimized for carrying lots
               | of cargo. Once we understand the principles behind
               | intelligence, we can make very efficient AI optimized for
               | our usage. But we're still at the point where we don't
               | understand intelligence as well as we understood
               | aerodynamics when building the first planes, so we still
               | have a lot to learn from "birds" - animal brains.
               | frongpik wrote:
               | I believe AI will start as a basic principle or idea that
               | can be applied to any sufficiently big state machine that
               | controls e.g. an RC airplane or traffic lights. That idea
               | will be obvious in hindsight. I'd even make a guess
               | that it will be like a "stateful" state machine that
               | accumulates state in a particular manner and uses that to
               | control the underlying state machine. We still will be
               | nowhere near understanding intelligence, but that clever
               | trick will be enough in most cases.
               | ant6n wrote:
               | Also, birds produce themselves out of an egg, with only
               | food, water and air as production input. They also can
               | produce more of themselves with minimal input. They are
               | also self-repairing/maintaining, something planes can't
               | do.
           | bra-ket wrote:
           | the problem with that analogy is trying to build an airplane
           | before you figured out the laws of physics.
             | hyko wrote:
             | People build things without understanding the underlying
             | principles all the time, e.g. the steam engine. You could
             | probably make the case that building things has helped our
             | understanding more than our understanding has helped us
             | build things.
             | Having said that, you can certainly improve a design when
             | you better understand the fundamentals (vs intuition +
             | trial & error).
             | sdenton4 wrote:
             | The physics of lift and aerodynamics were faaaaar from well
             | understood at the time of the first airplanes, though. New
             | areas tend to run a bit ahead of the underlying science;
             | the fundamentals expand to support and improve the
             | applications over time.
               | bra-ket wrote:
               | but we did have quite a few advances at the time of the
               | first airplane. For example, by that time steam &
               | combustion engines were already invented, which required
               | non-trivial understanding of physics and chemistry, and
               | materials science was very advanced.
               | I hold a pessimistic view that we are still in hunter-gatherer
               | mode when it comes to understanding cognition.
               | sdenton4 wrote:
               | Well, it's your right to be a pessimist... I tend to
               | think that the current hardware specialized for fast,
               | parallelized linear algebra is at least as good as the
               | wheels available at the start of the industrial
               | revolution, though. We have learning algorithms that can
               | match human/animal performance in a wide - but still
               | constrained - set of tasks, which previous non-learned
               | algorithms hadn't been able to crack. It's a start!
               | At some point you have to strike rocks to make fire,
               | because the butane lighter hasn't been invented yet. You
               | make do with what's available, and progressively get
               | better at it. I tend to think that we're a couple-few
               | perspective shifts away from getting it 'right,' and that
               | the hardware side likely barely matters. But, I'm an
               | optimist.
       (page generated 2020-12-20 23:00 UTC)