[HN Gopher] Deep Learning's Diminishing Returns
___________________________________________________________________
 
Deep Learning's Diminishing Returns
 
Author : RageoftheRobots
Score  : 82 points
Date   : 2021-09-24 18:41 UTC (4 hours ago)
 
(HTM) web link (spectrum.ieee.org)
(TXT) w3m dump (spectrum.ieee.org)
 
  | lvl100 wrote:
  | I've had similar thoughts in the past, but I started playing
  | around with some newer models recently, and I have about 25-30
  | projects I can think of right off that could be considered
  | commercially viable. And certainly VC-fundable in this investment
  | environment.
 
  | phyalow wrote:
  | Email me (it's on my profile) if you want to shoot some project
  | ideas.
 
  | selimthegrim wrote:
  | I'd love to hear some of them too (profile at gmail).
 
  | CodeGlitch wrote:
  | I do wonder if we'll see the rise of symbolic AI to give deep
  | learning some common sense. I've been thinking a lot about this
  | interview on Cyc:
  |
  | https://www.youtube.com/watch?v=3wMKoSRbGVs
 
  | drdeca wrote:
  | And here I thought one of the big benefits of DL was that it
  | could handle the complexities which would be too hard to specify
  | symbolically in order to give symbolic AI "common sense".
  |
  | The following argument comes to mind, but I don't really buy it
  | (it just came to mind as something that one might say next):
  |
  | "Perhaps there is an analogy between the solutions of 'we just
  | need to get better (more varied, better fitting the desired
  | behavior, etc.) training data, and maybe better training
  | procedures' and 'we just need to add more/better inference rules
  | and symbolic ways to encode statements, and add more facts about
  | the world'. They are similar in that both will produce the
  | specific improvements they target, but solving 'the real/whole/
  | big problem' that way is infeasible. If so, then maybe this
  | indicates that a practical full solution to artificial 'common
  | sense' would require something fundamentally different from both
  | of them, if it is even possible at all."
  |
  | Again, I don't really buy that line of reasoning; just expressing
  | my inner GPT-2, I guess, haha.
  |
  | Ok, but I presented an argument (or something like an argument)
  | which I made up, and said that I don't buy it. So I should say
  | why I don't buy it, right? Like many of the things I write, it is
  | chock-full of qualifiers like "perhaps" and "maybe", to the point
  | that one might say it hardly makes any claims at all. But
  | ignoring that part of it: one major difference is that the
  | DL-style architectures seem to be working? And it isn't clear
  | what kinds of (practically speaking) hard limits they could run
  | into. Now, on the other hand, perhaps at the time that symbolic
  | AI was all the rage, it appeared the same way. (Is this what
  | people mean when they talk about inside view vs. outside view?)
  |
  | Why should these two things not be especially analogous? Well,
  | saying "proposed solution X to the problem says to just [do more
  | of what X is / do X better], and that is just like how proposed
  | solution Y says to just [do more of what Y is / do Y better]" is
  | kind of a fully generalized argument for dismissing any proposed
  | type of solution where partial solutions of that type have been
  | tried, the whole problem hasn't been solved that way yet, and
  | another proposed kind of solution has already lost favor. This
  | doesn't seem like a generally valid line of reasoning. Sometimes
  | you really do just need more dakka (spelling? I mean "more of the
  | thing you already tried some of").
  |
  | Of course, if one is convinced that it really was right for the
  | older proposed kind of solution to be discarded, that probably
  | should say something about the currently popular kind of
  | solution, especially if there have been many proposed kinds of
  | solutions which have been discarded. But it seems like much of
  | what that says is just that the problem is hard. And sure, that
  | may mean an increased probability that the currently popular
  | proposed kind of solution also won't end up being satisfactory,
  | but that doesn't mean one should be too quick to discard it.
  | Tautologically: if no known alternative is currently at least as
  | promising as the type of solution currently being considered,
  | then the current one is the most promising of the currently known
  | options. Whether it is promising enough to actively pursue may be
  | a different question, but it shouldn't be marked as discarded
  | until something else (perhaps something previously discarded, or
  | something novel) becomes more promising.
  | airstrike wrote:
  | From Wikipedia (https://en.wikipedia.org/wiki/Cyc#Criticisms):
  |
  | > A similar sentiment was expressed by Marvin Minsky:
  | > "Unfortunately, the strategies most popular among AI
  | > researchers in the 1980s have come to a dead end," said Minsky.
  | > So-called "expert systems," which emulated human expertise
  | > within tightly defined subject areas like law and medicine,
  | > could match users' queries to relevant diagnoses, papers and
  | > abstracts, yet they could not learn concepts that most children
  | > know by the time they are 3 years old. "For each different kind
  | > of problem," said Minsky, "the construction of expert systems
  | > had to start all over again, because they didn't accumulate
  | > common-sense knowledge." Only one researcher has committed
  | > himself to the colossal task of building a comprehensive
  | > common-sense reasoning system, according to Minsky. Douglas
  | > Lenat, through his Cyc project, has directed the line-by-line
  | > entry of more than 1 million rules into a commonsense knowledge
  | > base.
 
  | ypcx wrote:
  | And then GPT-3 came along and rendered Cyc a wasted effort.
 
  | CodeGlitch wrote:
  | > And then GPT-3 came along and rendered Cyc a wasted effort.
  |
  | I've yet to see GPT-3 do anything commercially important. Cyc, on
  | the other hand, seems to have been used in a number of sectors.
  | Not to downplay GPT-3 - it's cool tech that produces cool demos -
  | Cyc just seems more like a tool rather than a toy.
 
  | montenegrohugo wrote:
  | GPT-3 has been used in a ton of commercially important
  | applications.
  |
  | To name a few:
  |
  | - GitHub Copilot (transformative, imo)
  | - Markcopy.ai, jenni.ai, etc. - tons of content-generation and
  |   SEO tool startups
  | - AI Dungeon and such
  | - Plenty of chatbots
  | - It's super useful for all kinds of classification tasks too
  |   (as are all transformer models)
 
  | CodeGlitch wrote:
  | Thanks for the response. I'm not familiar with Markcopy.ai or
  | jenni.ai, but the others are "toys" if we're being honest
  | (although, as I said, very cool toys). You wouldn't want to use
  | GPT-3 to recommend drugs for a condition you feed it... would
  | you? As far as I understand, this is the kind of problem Cyc is
  | trying to solve with its domain-specific rules.
  |
  | edit: Cyc can also tell you _why_ it gave the response it did -
  | something that deep nets cannot do. This is important in many
  | fields; otherwise you cannot trust any response it produces.
 
  | goatlover wrote:
  | Cyc was trying to encode common knowledge about the world in a
  | bunch of rules. That goes well beyond what GPT-3 does with text.
 
  | ypcx wrote:
  | GPT-3 learns these rules by itself.
 
  | williamtrask wrote:
  | From what I can tell, many of the best thinkers agree with this
  | idea, but we haven't cracked it yet.
 
  | d_burfoot wrote:
  | > it does so using a network with 480 million parameters. The
  | > training to ascertain the values of such a large number of
  | > parameters is even more remarkable because it was done with
  | > only 1.2 million labeled images--which may understandably
  | > confuse those of us who remember from high school algebra that
  | > we are supposed to have more equations than unknowns.
  |
  | This is one of the key misunderstandings that are still deeply
  | rooted in people's minds. For modern DL, a large part of the
  | learning comes from "internal" data points - in this case the
  | pixels of the image - as opposed to the labels. If you count the
  | number of pixels, you will likely get something like 1.2
  | trillion, more than enough to justify the 4.8e8 parameters. It's
  | the usage of internal data that prevents overfitting, NOT the
  | random initialization and SGD as claimed in the article.
  |
  | Another way to see this is: if you need more labels than
  | parameters, how can GPT-3 have ANY parameters at all? It is
  | trained purely on raw text data.
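The pixel arithmetic above is easy to sanity-check. A back-of-the-
envelope sketch in Python, assuming the common 224x224x3 ImageNet
crop size (the raw source images are larger and vary in resolution,
so this is only an order-of-magnitude estimate; it lands somewhat
below the 1.2 trillion quoted above, but the conclusion - far more
input values than parameters - holds either way):

    # Rough check of the "internal data points" arithmetic above.
    # 224x224x3 is an assumed input size; real ImageNet images vary.
    num_images = 1_200_000            # labeled ImageNet training images
    values_per_image = 224 * 224 * 3  # ~150k pixel values per image
    num_params = 4.8e8                # parameter count cited above

    total_values = num_images * values_per_image
    print(f"total input values:   {total_values:.1e}")               # ~1.8e11
    print(f"values per parameter: {total_values / num_params:.0f}")  # ~376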
  | albertzeyer wrote:
  | For modern DL, a large part also comes from regularization, and
  | then also data augmentation, and self-supervision in whatever
  | way - either prediction, masked prediction, contrastive losses,
  | etc.
  |
  | All of which adds to the number of constraints / equations.
 
  | sendtown_expwy wrote:
  | You are incorrect about the input dimensionality mattering. Let's
  | say you have 100 high-res images with yes/no labels. If you hash
  | the images and put their labels in a hashmap, you can say this is
  | a "learned" function of 100 parameters which achieves zero
  | training error on the dataset. This parameter count is
  | independent of input dimension. Why do you think this would
  | change when this mapping is replaced by a smooth neural-network
  | mapping?
  |
  | GPT is trained to predict the input (estimating p(x)), versus
  | predicting a label given an input (p(y|x)). So in the case of GPT
  | you can use the input dimensionality as a "label", as another
  | responder has mentioned. ImageNet classification is different
  | (excepting recent semi-supervised or unsupervised approaches to
  | image recognition).
  |
  | The ability to generalize in the typical ImageNet setting is, as
  | the article says, a byproduct of SGD with early stopping, which
  | in practice limits the number of functions a deep neural network
  | can express (something not considered in an analysis which only
  | looks at parameter count).
 
  | montenegrohugo wrote:
  | The point is that your simple mapping with zero error on the
  | training dataset also has zero predictive power, both on the test
  | dataset and in real life. It has learned nothing; it is
  | overfitting taken to the extreme.
  |
  | Input dimensionality is absolutely important when determining net
  | size.
 
  | sendtown_expwy wrote:
  | That's the point. 100 parameters is sufficient to overfit, and
  | it's a number that's independent of the input size. Do you have a
  | reference for your statement?
 
  | montenegrohugo wrote:
  | Reference for what exactly? That input dimensionality is
  | important when determining net size? That seems quite
  | self-explanatory; try training an image classifier with only 100
  | parameters.
  |
  | Maybe I understood the question wrong, but regardless: even if
  | early stopping weren't implemented, an NN would have more
  | predictive power than the hash mapping. Both would be completely
  | overfit on the training data set, yet the NN would most likely be
  | able to make some okay guesses on OOD data.
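For concreteness, here is a minimal sketch of the lookup-table
"model" described in the sub-thread above: zero training error, zero
generalization, and a parameter count (one table entry per example)
that is independent of input dimensionality. The images are random
data, purely for illustration:

    import hashlib

    import numpy as np

    rng = np.random.default_rng(0)

    # 100 "high-res images" with yes/no labels (random, for illustration).
    images = [rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
              for _ in range(100)]
    labels = [int(y) for y in rng.integers(0, 2, size=100)]

    # "Training": one table entry per example -- 100 parameters, no
    # matter whether the inputs are 512x512 or 4096x4096.
    table = {hashlib.sha256(img.tobytes()).hexdigest(): y
             for img, y in zip(images, labels)}

    def predict(img):
        # Perfect recall on training data; pure guessing on anything unseen.
        return table.get(hashlib.sha256(img.tobytes()).hexdigest(), 0)

    train_acc = sum(predict(img) == y for img, y in zip(images, labels)) / 100
    print(f"training accuracy: {train_acc:.0%}")  # 100%, yet nothing learned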
  | williamtrask wrote:
  | GPT-3 has millions of labels. Every vocabulary term is a label.
  | It's equivalent to supervised learning in architecture. The
  | "self-supervised" business is mostly spin to make it sound a bit
  | more novel. People have been predicting the next word for ages
  | (Turing did this).
  |
  | Input: <previous words of article>
  |
  | Label: <next word>
  |
  | Your point is well taken that the number of input data points is
  | also important when considering the complexity of the problem. In
  | this case, however, the number of data points more or less
  | exactly equals the number of labels.
  |
  | (About me: the first year+ of my PhD was focused on large-scale
  | language modelling, during which transformers came out.)
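A small sketch of the framing described above: a raw text stream
yields one (context, next-word) pair per position, so "unlabeled"
text supplies as many labels as data points. Whitespace tokenization
stands in for a real tokenizer here, purely for illustration:

    # Turn raw text into supervised (input, label) pairs for next-word
    # prediction. Naive whitespace tokenization, for illustration only.
    text = "deep learning scales with data and compute"
    tokens = text.split()

    pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

    for context, label in pairs:
        print(f"Input: {' '.join(context):38}  Label: {label}")
    # n tokens yield n-1 labeled examples: one per token position.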
  | axiosgunnar wrote:
  | This is such a basic error for the author to have made that I am
  | not sure he can use "we" when referring to researchers...
 
  | commandlinefan wrote:
  | Well, that _is_ something that you're taught in high-school
  | algebra, which you end up "unlearning" when you study linear
  | algebra.
 
  | contravariant wrote:
  | I don't think it quite works out that way: if you compare it to
  | solving a system of equations, then the size of the input is
  | irrelevant. Indeed, a very large input is often the main reason
  | for a problem to be under-specified.
  |
  | What you should look at is the number of outputs times the number
  | of data points for each output. If this number is lower than the
  | number of parameters, then it should be possible to find multiple
  | solutions.
  |
  | Of course, in this case you're not looking for a solution but an
  | optimum - and not even a global one - so it's not too troubling
  | per se that you don't get a unique answer. Though it does
  | somewhat suggest you should be able to get an equivalent fit with
  | far fewer parameters, but finding it could be quite tricky.
 
  | culi wrote:
  | > At OpenAI, an important machine-learning think tank,
  | > researchers recently designed and trained a much-lauded
  | > deep-learning language system called GPT-3 at the cost of more
  | > than $4 million. Even though they made a mistake when they
  | > implemented the system, they didn't fix it, explaining simply
  | > in a supplement to their scholarly publication that "due to the
  | > cost of training, it wasn't feasible to retrain the model."
 
  | djoldman wrote:
  | > Deep-learning models are overparameterized, which is to say
  | > they have more parameters than there are data points available
  | > for training.
  |
  | Is this true for all deep learning models?
 
  | dekhn wrote:
  | Depends on the model, but most systems I've worked with had
  | millions to billions of parameters, and trillions of (sparsely
  | populated) data points.
 
  | lvl100 wrote:
  | DL cannot be over-specified. However, you do need to mind your
  | endogenous and exogenous variables.
 
  | jjcon wrote:
  | Not even close. Most of my work has been on naturally occurring
  | data, and there is way, waaay more data available than can
  | possibly be used (petabytes). Where they get the idea that this
  | is the rule and not the exception is beyond me.
 
  | armoredkitten wrote:
  | It's not _inherently_ true. Technically, deep learning is
  | essentially any neural-network model with hidden layers (i.e., at
  | least one layer between the input layer and the output layer).
  | You could have a "deep learning" model with a couple dozen
  | parameters, perhaps. But at that end of the scale, most people
  | would probably reach for other approaches that are more easily
  | interpretable (e.g., logistic regression, random forests). So in
  | practice, yes, virtually any deep-learning model you see out
  | there in the wild - even most "toy examples" used to teach
  | machine learning - is going to be overparameterized.
 
  | bjornsing wrote:
  | > Training such a model would cost US $100 billion and would
  | > produce as much carbon emissions as New York City does in a
  | > month.
  |
  | This is what's called infeasible.
 
  | abecedarius wrote:
  | Longish article about the cost of training increasingly big
  | neural nets. Worried about carbon. "Training such a model would
  | cost US $100 billion and would produce as much carbon emissions
  | as New York City does in a month. And if we estimate the
  | computational burden of a 1 percent error rate, the results are
  | considerably worse."
 
  | sayonaraman wrote:
  | I'm wondering if there is a way to combine optimization of model
  | weights in a neural net with a set of heuristics limiting the
  | search space - a sort of rules engine / decision tree integrated
  | within ANN backprop training. Basically, pruning irrelevant and
  | redundant features early and focusing on the more informative
  | ones.
 
  | visarga wrote:
  | Yes, there are many approaches like that. In one approach, they
  | train a network and prune it, then mask the pruned weights and
  | retrain the sparse network from scratch, starting from the
  | original untrained weights.
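The prune-mask-retrain recipe described above matches "lottery
ticket"-style magnitude pruning. A minimal NumPy sketch of the
mechanics on a single weight matrix (the actual training loop is
stubbed out, so the trained weights here are simulated):

    import numpy as np

    rng = np.random.default_rng(0)

    # Initial (untrained) weights for one layer, saved before training.
    w_init = rng.normal(size=(256, 256)).astype(np.float32)

    # Stand-in for training: in practice w_trained comes out of SGD.
    w_trained = w_init + rng.normal(scale=0.1, size=w_init.shape)

    # 1. Prune: keep only the largest-magnitude 10% of trained weights.
    threshold = np.quantile(np.abs(w_trained), 0.90)
    mask = np.abs(w_trained) >= threshold

    # 2. Rewind: reset surviving weights to their *initial* values and
    #    zero out the rest, giving a sparse network to retrain.
    w_sparse = np.where(mask, w_init, 0.0)

    # 3. Retrain from w_sparse, multiplying each gradient update by
    #    `mask` so that pruned weights stay at zero.
    print(f"surviving weights: {mask.mean():.1%}")  # ~10.0%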
  | gibolt wrote:
  | This assumes that our processes and algorithms don't get more
  | targeted or improve. The rate of new-approach discovery is
  | staggering. For every problem, some combination of approaches
  | will more efficiently pre-process and understand the training
  | data.
  |
  | The article also ignores training-vs-running tradeoffs. Training
  | a model once may be extremely resource-intensive, but running the
  | resulting model on millions of devices can be negligible while
  | having huge value-add.
 
  | et1337 wrote:
  | Keep reading - the article includes a whole section about
  | training-vs-running tradeoffs.
 
  | sayonaraman wrote:
  | > new approach discovery
  |
  | A good example is the discovery of attention mechanisms/
  | transformers replacing more cumbersome and computationally
  | expensive RNNs and LSTMs in NLP, and more recently outperforming
  | more expensive models in computer vision.
 
  | visarga wrote:
  | Transformers are pretty huge and expensive to run; LSTMs are
  | lightweight by comparison.
 
  | culi wrote:
  | Keep reading - the article directly addresses this point.
 
  | dekhn wrote:
  | Anybody who argues against deep learning based on energy
  | consumption immediately fails to impress me. This article is
  | particularly bad - claiming you need k^2 more data points to
  | improve a model by a factor of k, and using that to extrapolate
  | unrealistic energy-consumption targets for DL training.
  |
  | The sum of all DL training in the world is noise compared to the
  | other big consumers of energy in computing. That's because the
  | main players all invested in energy-efficient architectures. DL
  | training energy is not something to optimize if your goal is to
  | have a measurable impact on total power consumption.
 
  | josefx wrote:
  | > That's because the main players all invested in
  | > energy-efficient architectures.
  |
  | If the cost was gigantic enough to make the investment worth it,
  | they must have found some really great improvements for it to end
  | up being just noise - improvements that somehow didn't have a
  | noteworthy impact on general computing.
 
  | Spooky23 wrote:
  | Open-ended crying about electricity doesn't make sense in the
  | absence of specifics.
  |
  | A big company like Microsoft probably wasted more money on
  | Pentium 4s 15 years ago. Electricity is just another resource -
  | if the numbers work, burn away.
 
  | ypcx wrote:
  | Especially if the result is the cure for cancer, or similar.
 
  | ben_w wrote:
  | Perhaps for now, but not necessarily in general.
  |
  | I know we're nowhere near the following scenario; this is just to
  | illustrate how things can go wrong even if the numbers tell you
  | to "burn away":
  |
  | Imagine we have computronium with negligible manufacturing cost;
  | the only important thing is the power cost to use it.
  |
  | Imagine you're using it to run an uploaded mind, spending
  | $35,805/year on energy.
  |
  | The 50% of Americans earning more than this [0] are no longer
  | economically viable, because their productivity can now be had at
  | the same cost from a computer program.
  |
  | Doing this with the current power mixture would be disastrous;
  | doing it with PV needs about 1400 m^2 per simultaneous real-time
  | mind-upload instance (depending on your assumptions about energy
  | costs and cell efficiency, naturally).
  |
  | In a more near-term sense, there are plenty of examples where the
  | Nash equilibrium tells each of us to benefit ourselves at the
  | expense of all of us. Not saying that is the case for deep
  | learning right now, but it can (and frequently does) happen.
  |
  | [0] https://fred.stlouisfed.org/series/MEPAINUSA672N
 
  | user-the-name wrote:
  | > Electricity is just another resource
  |
  | I hate to be the one to tell you, but it turns out we are living
  | in the middle of an ecological catastrophe, and it also turns out
  | that means electricity is a resource we are going to have to
  | conserve.
 
  | wanderingmind wrote:
  | And yet people here have no trouble crying about the electricity
  | wastage of crypto. Also, from my limited knowledge, I think DNN
  | models are not very transferable in real-world settings,
  | requiring constant retraining even for a small drift in signal or
  | a change in noise modes.
 
  | nerdponx wrote:
  | > And yet people here have no trouble crying about the
  | > electricity wastage of crypto
  |
  | Which is many orders of magnitude more energy-intensive - on the
  | scale of a small nation-state - and in most cases fundamentally
  | wasteful by design. A very large pre-trained model can be reused
  | very cheaply once it's finished.
  |
  | > Also, from my limited knowledge, I think DNN models are not
  | > very transferable in real-world settings, requiring constant
  | > retraining even for a small drift in signal or a change in
  | > noise modes.
  |
  | This is FUD, promulgated by people who expected deep learning to
  | solve all their problems overnight. All models suffer from
  | "drift" whenever the underlying data changes.
  |
  | Part of what made deep learning so good was that it was able to
  | generalize exceptionally well from exceptionally complicated
  | input data.
  |
  | It is unreasonable to expect that a model pre-trained on a huge
  | generic corpus will be a perfect match for your very specific
  | business problem. However, it is _not_ unreasonable to expect
  | that said model will be a useful baseline and starting point for
  | your very specific business problem.
  |
  | We are not yet (and might never be) at the point where you can
  | dump a pile of garbage data into an API and get great predictions
  | out the other end on the first try. But nobody ever thought you
  | could do that, except the people selling expensive subscriptions
  | to those kinds of APIs. The fact that they work at all should be
  | taken as evidence of how amazing deep learning is; the fact that
  | they don't work perfectly should not be taken as evidence that
  | deep learning is bad/useless/wasteful/hype/whatever.
  |
  | Don't let the clueless tech media set your expectations.
  |
  | Professional data scientists and machine-learning practitioners
  | for the most part take their work very seriously and take pride
  | in delivering good outcomes, just like professional software
  | engineers. If deep learning weren't useful to that end, nobody
  | would be using it.
 
  | culi wrote:
  | Here's the original study [1] that seems to be the primary source
  | for this article. It's an important study from a respectable
  | journal. To be frank, it's pretty disconcerting that the top
  | comments on this thread are writing off the topic on the premise
  | alone, while the comments actually engaging with it sit at the
  | bottom.
  |
  | [1] https://arxiv.org/pdf/1906.02243.pdf
 
  | SavantIdiot wrote:
  | Putting aside energy costs: object detection is still crappy and
  | has stalled. YOLO/SSDMN were impressive as all get-out, but they
  | stink for general-purpose use. It's been 3 years (?) and general
  | object detection, even with 100 classes, is still unusable off
  | the shelf. Yes, I understand incremental training of pre-trained
  | nets is a thing, but that's not where we all hoped it would go.
 
  | BeatLeJuce wrote:
  | > The first part is true of all statistical models: To improve
  | > performance by a factor of k, at least k^2 more data points
  | > must be used to train the model. The second part of the
  | > computational cost comes explicitly from overparameterization.
  | > Once accounted for, this yields a total computational cost for
  | > improvement of at least k^4.
  |
  | Those claims are entirely new to me, and I've been a researcher
  | in the field for almost 10 years. Where do they come from / what
  | theorems are they based on? It's unfortunate this article doesn't
  | have any citations.
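For concreteness, the scaling the article claims (quoted above)
works out as follows; this simply takes the k^2 / k^4 relationship
at face value, the citation questions notwithstanding:

    # The article's claimed scaling, taken at face value: improving a
    # model by a factor of k needs ~k^2 more data and ~k^4 more compute.
    for k in (2, 3, 10):
        print(f"k = {k:>2}: data x{k**2:>4}, compute x{k**4:>6}")
    # k =  2: data x   4, compute x    16
    # k =  3: data x   9, compute x    81
    # k = 10: data x 100, compute x 10000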
___________________________________________________________________
 
(page generated 2021-09-24 23:01 UTC)