[HN Gopher] Deep Learning's Diminishing Returns
       ___________________________________________________________________
        
       Deep Learning's Diminishing Returns
        
       Author : RageoftheRobots
       Score  : 82 points
       Date   : 2021-09-24 18:41 UTC (4 hours ago)
        
 (HTM) web link (spectrum.ieee.org)
 (TXT) w3m dump (spectrum.ieee.org)
        
       | lvl100 wrote:
       | I've had similar thoughts in the past but I started playing
       | around with some newer models recently and I have about 25-30
       | projects that I can think of right off that could be considered
       | commercially viable. And certainly VC fundable in this investment
       | environment.
        
         | phyalow wrote:
         | Email me (its on my profile) if you want to shoot some project
         | ideas.
        
         | selimthegrim wrote:
         | I'd love to hear some of them too (profile at gmail)
        
       | CodeGlitch wrote:
        | I do wonder if we'll see the rise of symbolic AI to give deep
        | learning some common sense? I've been thinking a lot about
       | this interview on Cyc:
       | 
       | https://www.youtube.com/watch?v=3wMKoSRbGVs
        
         | drdeca wrote:
         | And here I thought one of the big benefits of DL was that it
          | could handle the complexities that would be too hard to
          | specify symbolically, which is what you would have to do to
          | give symbolic AI "common sense".
         | 
         | The following argument comes to mind, but I don't really buy it
         | (it just came to mind as something that one might say next):
         | 
         | ' Perhaps there is an analogy between the solutions of "we just
         | need to get better (more varied, better fitting the desired
         | behavior, etc.) training data, and maybe better training
         | procedures" and "we just need to add more/better inference
         | rules and symbolic ways to encode statements, and add more
         | facts about the world". Similar in that both will produce the
         | specific improvements they target, but where solving "the
         | real/whole/big problem" that way is infeasible. If so, then
         | maybe this indicates that a practical full-solution to
         | artificial "common sense" would require something fundamentally
         | different than both of them, if it is even possible at all. '
         | 
         | Again, I don't really buy that line of reasoning, just
         | expressing my inner GPT2 I guess, haha.
         | 
         | Ok, but I presented an argument (or something like an argument)
         | which I made up, and said that I don't buy it. So, I should say
         | why I don't buy it, right? Like many of the things I write, it
         | is chock-full of qualifiers like "perhaps" and "maybe", to the
         | point that one might say that it hardly makes any claims at
         | all. But ignoring that part of it, one major difference is that
          | the DL-style architectures seem to be working? And it isn't
          | clear what kinds of (practically speaking) hard limits they
          | could run into. Now, on the other hand, perhaps at the time
          | that symbolic AI was all the rage, it appeared the same way.
          | (Is this what people mean when they talk about inside view
          | vs. outside view?).
         | 
         | Why should these two things not be especially analogous? Well,
         | saying "proposed solution X to the problem says to just [do
         | more of what X is/do X better], and that is just like how
         | proposed solution Y says to just [do more of what Y is/do Y
         | better]" is kind of a fully generalize argument for dismissing
         | any proposed type of solution where partial solutions of that
         | type have been tried, but the whole problem hasn't been solved
         | that way yet, and another proposed kind of solution has already
         | lost favor. This doesn't seem like a generally valid line of
         | reasoning. Sometimes you really do just need more dakka
         | (spelling? I mean "more of the thing you already tried some
         | of").
         | 
         | Of course, if one is convinced that it really was right for the
         | older proposed kind of solution to be discarded, that probably
         | should say something about the currently popular kind of
         | solution. Especially if there have been many proposed kinds of
         | solutions which have been discarded. But, it seems like much of
         | what it says is just that the problem is hard. And, sure, that
         | may mean an increased probability that the currently popular
         | proposed kind of solution also doesn't end up being
          | satisfactory, but that doesn't mean one should be too quick to
         | discard it. Tautologically: if no known alternative is
         | currently at least as promising as the type of solution
         | currently being considered, then, the current one is the most
         | promising of the currently known options. Whether it is
         | promising enough to actively pursue may be a different
         | question, but it shouldn't be marked as discarded until
         | something else (perhaps something previously discarded, or
         | something novel) becomes more promising.
        
         | airstrike wrote:
         | From Wikipedia (https://en.wikipedia.org/wiki/Cyc#Criticisms)
         | 
         | > ... A similar sentiment was expressed by Marvin Minsky:
         | "Unfortunately, the strategies most popular among AI
         | researchers in the 1980s have come to a dead end," said Minsky.
         | So-called "expert systems," which emulated human expertise
         | within tightly defined subject areas like law and medicine,
         | could match users' queries to relevant diagnoses, papers and
         | abstracts, yet they could not learn concepts that most children
         | know by the time they are 3 years old. "For each different kind
         | of problem," said Minsky, "the construction of expert systems
         | had to start all over again, because they didn't accumulate
         | common-sense knowledge." Only one researcher has committed
         | himself to the colossal task of building a comprehensive
         | common-sense reasoning system, according to Minsky. Douglas
         | Lenat, through his Cyc project, has directed the line-by-line
         | entry of more than 1 million rules into a commonsense knowledge
         | base."
        
           | ypcx wrote:
           | And then, GPT-3 came along and rendered Cyc a wasted effort.
        
             | CodeGlitch wrote:
             | > And then, GPT-3 came along and rendered Cyc a wasted
             | effort.
             | 
              | I've yet to see GPT-3 do anything commercially important?
             | Cyc on the other hand seems to have been used in a number
             | of sectors. Not to downplay GPT-3 - it's cool tech that
             | produces cool demos - Cyc just seems more like a tool
             | rather than a toy.
        
               | montenegrohugo wrote:
               | GPT-3 has been used in a ton of commercially important
               | applications.
               | 
               | To name a few:
               | 
               | - GitHub CoPilot (transformative imo)
               | 
               | - Markcopy.ai, jenni.ai, etc.... Tons of content
               | generation and SEO tools startups
               | 
               | - AI Dungeon and such
               | 
               | - Plenty of chatbots
               | 
               | - It's super useful for all kinds of classification tasks
               | too (as are all transformer models)
        
               | CodeGlitch wrote:
                | Thanks for the response. I'm not familiar with
               | Markcopy.ai, jenni.ai, but the others are "toys" if we're
               | being honest (although as I said - very cool toys). You
               | wouldn't want to use GPT-3 to recommend drugs for a
               | condition you feed it...would you? As far as I
               | understand, this is the kind of problem Cyc is trying to
                | solve with its domain-specific rules.
               | 
               | edit: Cyc can also tell you _why_ it gave the response it
               | did - something that deep nets cannot do. This is
               | important in many fields, otherwise you cannot trust any
               | response it produces.
        
             | goatlover wrote:
             | Cyc was trying to encode common knowledge about the world
             | in a bunch of rules. That goes well beyond what GPT-3 does
             | with text.
        
               | ypcx wrote:
               | GPT-3 learns these rules by itself.
        
         | williamtrask wrote:
         | From what I can tell, many of the best thinkers agree with this
         | idea but we haven't cracked it yet.
        
       | d_burfoot wrote:
       | > it does so using a network with 480 million parameters. The
       | training to ascertain the values of such a large number of
       | parameters is even more remarkable because it was done with only
       | 1.2 million labeled images--which may understandably confuse
       | those of us who remember from high school algebra that we are
       | supposed to have more equations than unknowns.
       | 
        | This is one of the key misunderstandings still deeply rooted
       | in people's minds. For modern DL, a large part of the learning
       | comes from "internal" data points, in this case the pixels of the
       | image, as opposed to the labels. If you count the number of
       | pixels, you will likely get something like 1.2 trillion, more
       | than enough to justify the 4.8e8 parameters. It's the usage of
       | internal data that prevents overfitting, NOT the random
       | initialization and SGD as claimed in the article.
       | 
       | Another way to see this is: if you need more labels than
       | parameters, how can GPT3 have ANY parameters at all? It is
       | trained purely on raw text data.
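        | 
        | A rough back-of-the-envelope in Python (the per-image value
        | count here is an assumption; ImageNet resolutions vary):
        | 
        |     images = 1_200_000
        |     values_per_image = 1_000_000   # assumed ~1 MP of RGB values
        |     params = 480_000_000           # 4.8e8
        |     print(images * values_per_image)            # 1.2e12
        |     print(images * values_per_image // params)  # ~2500x params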
        
         | albertzeyer wrote:
         | For modern DL, a large part also comes from regularization. And
         | then also data augmentation. And self-supervision in whatever
         | way, either prediction, masked prediction, contrastive losses,
         | etc.
         | 
         | Which all adds to the number of constraints / equations.
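          | 
          | For instance, a standard ImageNet-style augmentation pipeline
          | (a sketch with torchvision; these particular transforms are
          | just a common choice):
          | 
          |     import torchvision.transforms as T
          | 
          |     # Each random transform yields a new view of the same
          |     # labeled image, adding constraints beyond the raw labels.
          |     train_tf = T.Compose([
          |         T.RandomResizedCrop(224),
          |         T.RandomHorizontalFlip(),
          |         T.ColorJitter(0.4, 0.4, 0.4),
          |         T.ToTensor(),
          |         T.Normalize([0.485, 0.456, 0.406],
          |                     [0.229, 0.224, 0.225]),
          |     ])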
        
         | sendtown_expwy wrote:
         | You are incorrect about the input dimensionality mattering.
         | Let's say you have 100 high-res images with yes/no labels. If
         | you hash the images and put their labels in a hashmap, you can
         | say this is a "learned" function of 100 parameters which
         | achieves zero training error on the dataset. This parameter
         | count is independent of input dimension. Why do you think this
         | would change when this mapping is replaced by a smooth neural
         | network mapping?
         | 
         | GPT is trained to predict the input (estimating p(x)), versus
         | predicting a label given an input (p(y|x)). So in the case of
         | GPT you can use the input dimensionality as a "label", as
         | another responder has mentioned. ImageNet classification is
         | different (excepting recent semi-supervised or unsupervised
         | approaches to image recognition).
         | 
         | The ability to generalize in the typical imagenet setting is,
         | as the article says, a byproduct of SGD with early stopping,
         | which in practice limits the number of functions a deep neural
         | network can express (something not considered in an analysis
         | which only considers parameter count).
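          | 
          | Something like this sketch (hypothetical data, just to make
          | the parameter-counting argument concrete; images are assumed
          | to be raw bytes):
          | 
          |     import hashlib
          | 
          |     # Memorize 100 labeled images as a hash -> label table.
          |     # Its size is 100 entries whether the images are 32x32
          |     # or 4000x3000: zero training error, no generalization.
          |     def fit(images, labels):
          |         return {hashlib.sha256(x).hexdigest(): y
          |                 for x, y in zip(images, labels)}
          | 
          |     def predict(table, x):
          |         return table.get(hashlib.sha256(x).hexdigest())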
        
           | montenegrohugo wrote:
           | The point is your simple mapping with zero error on the
           | training dataset also has zero prediction power in both the
            | test dataset and in real life. It's learned nothing; it's
            | the extreme case of overfitting.
           | 
           | Input dimensionality is absolutely important when determining
           | net size.
        
             | sendtown_expwy wrote:
             | That's the point. 100 parameters is sufficient to overfit,
             | and it's a number that's independent of the input size. Do
             | you have a reference for your statement?
        
               | montenegrohugo wrote:
               | Reference for what exactly? That input dimensionality is
               | important when determining net size? That seems quite
                | self-explanatory; try training an image classifier with
               | only 100 parameters.
               | 
               | Maybe I understood that question wrong, but regardless,
               | even if early stopping wasn't implemented, a NN would
               | have more predictive power than the hash mapping. Both
               | would be completely overfit on the training data set, yet
               | the NN would most likely be able to make some okay
               | guesses with OOD data.
        
         | williamtrask wrote:
         | GPT3 has millions of labels. Every vocabulary term is a label.
         | It's equivalent to supervised learning in architecture. The
         | "self-supervised" business is mostly spin to make it sound a
         | bit more novel. People have been predicting the next word for
         | ages (Turing did this).
         | 
         | Input: <previous words of article>
         | 
         | Label: <next word>
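          | 
          | As a minimal sketch (whitespace "tokenization" here is just
          | for illustration; real models use subword vocabularies):
          | 
          |     # every position in the corpus contributes one label
          |     text = "the cat sat on the mat".split()
          |     pairs = [(text[:i], text[i])
          |              for i in range(1, len(text))]
          |     # (['the'], 'cat'), (['the', 'cat'], 'sat'), ...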
         | 
         | Your point is well taken that the number of input data points
         | is also important when considering the complexity of the
         | problem. In this case however the number of data points more or
         | less exactly equals the number of labels.
         | 
         | (About Me: the first year+ of my PhD was focused on large scale
         | language modelling, during which transformers came out.)
        
         | axiosgunnar wrote:
          | This is such a basic error for the author to make that I am
          | not sure he can use "we" when referring to researchers...
        
           | commandlinefan wrote:
            | Well, that _is_ something that you're taught in high school
           | algebra, which you end up "unlearning" when you study linear
           | algebra.
        
         | contravariant wrote:
          | That doesn't quite work out that way, I think. If you compare
          | it to solving a system of equations, then the size of the
          | input is irrelevant. Indeed, a very large input is often the
          | main reason for a problem to be under-specified.
         | 
         | What you should look at is the number of outputs times the
         | number of data points for each output. If this number is lower
         | than the number of parameters then it should be possible to
         | find multiple solutions.
         | 
         | Of course in this case you're not looking for a solution, but
         | an optimum, and not even a global one, so it's not too
         | troubling per se that you don't get a unique answer. Though it
         | does somewhat suggest you should be able to get an equivalent
         | fit with far fewer parameters, but finding it could be quite
         | tricky.
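          | 
          | A toy illustration with made-up numbers: 3 equations and 5
          | unknowns admit infinitely many exact "fits":
          | 
          |     import numpy as np
          | 
          |     A = np.random.randn(3, 5)   # 3 data points, 5 parameters
          |     y = np.random.randn(3)
          |     w, *_ = np.linalg.lstsq(A, y, rcond=None)
          |     print(np.allclose(A @ w, y))   # True: zero residual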
        
       | culi wrote:
       | > At OpenAI, an important machine-learning think tank,
       | researchers recently designed and trained a much-lauded deep-
       | learning language system called GPT-3 at the cost of more than $4
       | million. Even though they made a mistake when they implemented
       | the system, they didn't fix it, explaining simply in a supplement
       | to their scholarly publication that "due to the cost of training,
       | it wasn't feasible to retrain the model."
        
       | djoldman wrote:
       | > Deep-learning models are overparameterized, which is to say
       | they have more parameters than there are data points available
       | for training.
       | 
       | Is this true for all deep learning models?
        
         | dekhn wrote:
          | depends on the model, but most systems I've worked with had
         | millions to billions of parameters, and trillions of (sparsely
         | populated) data points.
        
         | lvl100 wrote:
         | DL cannot be over-specified. However you do need to mind your
         | endogenous and exogenous variables.
        
         | jjcon wrote:
          | Not even close. Most of my work has been with naturally
          | occurring data, and there is way, way more data available
          | than can possibly be used (petabytes). Where they got the
          | idea that this is the rule and not the exception is beyond
          | me.
        
         | armoredkitten wrote:
         | It's not _inherently_ true. Technically, deep learning is
          | essentially any neural network model with hidden layers (i.e.,
          | at least one layer between the input layer and the output
          | layer). You could have a "deep learning" model with a couple
          | dozen parameters, perhaps. But at that end of the scale, most
          | people would probably reach for other approaches that are more
          | easily interpretable (e.g., logistic regression, random
          | forest). So in practice, yes, virtually any deep learning
          | model you see out there in the wild, even most "toy examples"
          | used to teach machine learning, is going to be
          | overparameterized.
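          | 
          | For instance, a technically "deep" network with only a couple
          | dozen parameters (a PyTorch sketch):
          | 
          |     import torch.nn as nn
          | 
          |     net = nn.Sequential(
          |         nn.Linear(2, 3), nn.ReLU(),   # 2*3 + 3 = 9
          |         nn.Linear(3, 3), nn.ReLU(),   # 3*3 + 3 = 12
          |         nn.Linear(3, 1),              # 3*1 + 1 = 4
          |     )
          |     print(sum(p.numel() for p in net.parameters()))  # 25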
        
       | bjornsing wrote:
       | > Training such a model would cost US $100 billion and would
       | produce as much carbon emissions as New York City does in a
       | month.
       | 
       | This is what's called infeasible.
        
       | abecedarius wrote:
       | Longish article about the cost of training increasingly big
       | neural nets. Worried about carbon. "Training such a model would
       | cost US $100 billion and would produce as much carbon emissions
       | as New York City does in a month. And if we estimate the
       | computational burden of a 1 percent error rate, the results are
       | considerably worse."
        
       | sayonaraman wrote:
       | I'm wondering if there is a way to combine optimization of model
       | weights in a neural net with a set of heuristics limiting the
       | search space, as a sort of rules engine/decision tree integrated
       | within ANN backprop training. Basically pruning irrelevant and
       | redundant features early and focusing on more informative ones.
        
         | visarga wrote:
          | Yes, there are many approaches like that. In one, you train a
          | network and prune it, then mask out the pruned weights and
          | retrain the resulting sparse network from scratch, starting
          | from the original untrained weights.
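          | 
          | A hand-rolled sketch of that idea (magnitude pruning with a
          | fixed mask and re-init from the saved initial weights; the
          | published variants differ in the details):
          | 
          |     import torch
          | 
          |     # keep the largest trained weights, reset them to their
          |     # initial values, hold the mask fixed while retraining
          |     def lottery_prune(init_ws, trained_ws, keep=0.2):
          |         masks, sparse_init = [], []
          |         for w0, w1 in zip(init_ws, trained_ws):
          |             k = max(1, int(keep * w1.numel()))
          |             kth = w1.numel() - k + 1
          |             t = w1.abs().flatten().kthvalue(kth).values
          |             m = (w1.abs() >= t).float()
          |             masks.append(m)
          |             sparse_init.append(w0 * m)  # re-init survivors
          |         return sparse_init, masks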
        
       | gibolt wrote:
       | This assumes that our processes and algorithms don't get more
       | targeted or improve. The rate of new approach discovery is
       | staggering. For every problem, some combination of approaches
       | will more efficiently pre-process and understand the training
       | data.
       | 
       | The article also ignores training vs running tradeoffs. Training
       | a model once may be extremely resource intensive, but running the
       | resulting model on millions of devices can be negligible while
       | having huge value add.
        
         | et1337 wrote:
         | Keep reading, the article includes a whole section about
         | training vs running tradeoffs.
        
         | sayonaraman wrote:
         | > new approach discovery
         | 
          | A good example is the discovery of attention
          | mechanisms/transformers, which replaced more cumbersome and
          | computationally expensive RNNs and LSTMs in NLP and have more
          | recently outperformed more expensive models in computer
          | vision.
        
           | visarga wrote:
           | Transformers are pretty huge and expensive to run, LSTMs are
            | lightweight by comparison.
        
         | culi wrote:
         | Keep reading, the article directly addresses this point
        
       | dekhn wrote:
       | Anybody who argues against deep learning based on energy
       | consumption immediately fails to impress me. This article is
        | particularly bad: claiming you need k^2 more data points to
       | improve a model and using that to extrapolate unrealistic energy
       | consumption targets for DL training.
       | 
        | The sum of all DL training in the world is noise compared to the
       | other big consumers of energy in computing. That's because the
       | main players all invested in energy-efficient architectures. DL
       | training energy is not something to optimize if your goal is to
       | have a measurable impact on total power consumption.
        
         | josefx wrote:
         | > That's because the main players all invested in energy-
         | efficient architectures.
         | 
          | If the cost was gigantic enough to make the investment worth it,
         | they must have found some really great improvements for it to
         | end up being just noise. Improvements that somehow didn't have
         | a noteworthy impact on general computing.
        
         | Spooky23 wrote:
          | Open-ended crying about electricity doesn't make sense in the
         | absence of specifics.
         | 
         | A big company like Microsoft probably wasted more money on
          | Pentium 4s 15 years ago. Electricity is just another resource -
         | if the numbers work, burn away.
        
           | ypcx wrote:
           | Especially if the result is the cure for cancer, or similar.
        
           | ben_w wrote:
           | Perhaps for now, but not necessarily in general.
           | 
            | I know we're nowhere near the following scenario; this is
           | just to illustrate how things can go wrong even if the
           | numbers tell you to "burn away":
           | 
            | Imagine we have computronium with negligible manufacture cost,
           | the only important thing is the power cost to use it.
           | 
           | Imagine you're using it to run an uploaded mind, spending
           | $35,805/year on energy.
           | 
           | The 50% of Americans earning more than this [0] are no longer
           | economically viable, because their productivity can now be
           | done at the same cost by a computer program.
           | 
            | Doing this with the current power mixture would be
            | disastrous; doing it with PV needs about 1400 m^2 per
            | simultaneous real-time mind-upload instance (depending on
            | your assumptions about energy costs and cell efficiency,
            | naturally).
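            | 
            | (Back-of-the-envelope behind that figure; the electricity
            | price, panel efficiency and capacity factor are my own
            | assumptions:)
            | 
            |     cost_per_year = 35_805            # USD
            |     kwh = cost_per_year / 0.10        # at $0.10/kWh
            |     avg_kw = kwh / 8760               # ~41 kW continuous
            |     w_per_m2 = 1000 * 0.20 * 0.15     # sun * eff * capacity
            |     print(avg_kw * 1000 / w_per_m2)   # ~1360 m^2 of PV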
           | 
           | In a more near-term sense, there are plenty of examples where
           | the Nash equilibrium tells each of us to benefit ourselves at
           | the expense of all of us. Not saying that is the case for
           | Deep Learning right now, but can (and frequently does)
           | happen.
           | 
           | [0] https://fred.stlouisfed.org/series/MEPAINUSA672N
        
           | user-the-name wrote:
           | > Electricity is just another resource
           | 
           | I hate to be the one to tell you, but, it turns out we are
           | living in the middle of an ecological catastrophe, and it
           | also turns out that means that electricity is a resource we
           | are going to have to conserve.
        
         | wanderingmind wrote:
          | And yet people here have no trouble crying about the
          | electricity wastage of crypto. Also, from my limited
          | knowledge, I think DNN models are not very transferable in
          | real-world settings, requiring constant retraining even for a
          | small drift in the signal or a change in noise modes.
        
           | nerdponx wrote:
           | > And yet people here have no trouble crying about
           | electricity wastage of crypto
           | 
           | Which is many orders of magnitude more energy-intensive, on
           | the scale of a small nation-state, and in most cases
           | fundamentally wasteful by design. A very large pre-trained
           | model can be reused very cheaply once it's finished.
           | 
           | > Also from my limited knowledge I think DNN models are not
           | very transferable in real world setting requiring constant
           | retraining even for a small drift in signal or change in
           | noise modes.
           | 
           | This is FUD, promulgated by people who expected deep learning
           | to solve all their problems overnight. All models will suffer
           | from "drift" whenever the underlying data changes.
           | 
           | Part of what made deep learning so good was that it was able
           | to generalize exceptionally well from exceptionally
           | complicated input data.
           | 
           | It is unreasonable to expect that a model pre-trained on a
           | huge generic corpus will be a perfect match for your very
           | specific business problem. However it is _not_ unreasonable
           | to expect that said model will be a useful baseline and
           | starting point for your very specific business problem.
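            | 
            | The usual workflow, as a sketch (the model choice and layer
            | names here are just illustrative): reuse the pretrained
            | backbone and fit a small head on your own data.
            | 
            |     import torch.nn as nn
            |     import torchvision.models as models
            | 
            |     model = models.resnet18(pretrained=True)
            |     for p in model.parameters():
            |         p.requires_grad = False        # freeze backbone
            |     model.fc = nn.Linear(model.fc.in_features, 5)
            |     # then train only model.fc on the specific dataset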
           | 
           | We are not yet (and might never be) at the point where you
           | can dump a pile of garbage data into an API and get great
           | predictions out the other end on the first try. But nobody
           | ever thought you could do that, except the people selling
           | expensive subscriptions to those kinds of APIs. The fact that
           | they work at all should be taken as evidence of how amazing
           | deep learning is; the fact that they don't work perfectly
           | should not be taken as evidence that deep learning is
           | bad/useless/wasteful/hype/whatever.
           | 
           | Don't let the clueless tech media set your expectations.
           | 
           | Professional data scientists and machine learning
           | practitioners for the most part take their work very
           | seriously and take pride in delivering good outcomes, just
           | like professional software engineers. If deep learning wasn't
           | useful to that end, nobody would be using it.
        
       | culi wrote:
       | Here's the original study[1] that seems to be the primary source
       | for this article. It's an important study from a respectable
       | journal. To be frank, it's pretty disconcerting that the top
       | comments on this thread are those writing off the topic on the
        | premise alone, while the comments actually engaging with it
        | seem to be at the bottom.
       | 
       | [1] https://arxiv.org/pdf/1906.02243.pdf
        
       | SavantIdiot wrote:
        | Putting aside energy costs, object detection is still crappy and
        | has stalled. YOLO/SSDMN were impressive as all get-out, but they
        | stink for general-purpose use. It's been 3 years (?) and general
       | object detection, even with 100 classes, is still unusable off
       | the shelf. Yes, I understand incremental training of pre-trained
       | nets is a thing, but that's not where we all hoped it would go.
        
       | BeatLeJuce wrote:
       | > > The first part is true of all statistical models: To improve
       | performance by a factor of k, at least k^2 more data points must
       | be used to train the model. The second part of the computational
       | cost comes explicitly from overparameterization. Once accounted
       | for, this yields a total computational cost for improvement of at
        | least k^4.
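        | 
        | For scale, taking those exponents at face value (they are the
        | article's claim, not mine):
        | 
        |     for k in (2, 10):
        |         print(k, k**2, k**4)   # k, data factor, compute factor
        |     # 2x better -> 4x data, 16x compute
        |     # 10x better -> 100x data, 10,000x compute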
       | 
       | Those claims are entirely new to me, and I've been a researcher
       | in the field for almost 10 years. Where do they come from/what
       | theorems are they based on? It's unfortunate this article doesn't
       | have any citations.
        
       ___________________________________________________________________
       (page generated 2021-09-24 23:01 UTC)