[HN Gopher] Gradients are not all you need
___________________________________________________________________
 
  Gradients are not all you need
 
  Author : bundie
  Score  : 97 points
  Date   : 2023-04-23 16:37 UTC (6 hours ago)
 
  (HTM) web link (arxiv.org)
  (TXT) w3m dump (arxiv.org)
 
  | 0xBABAD00C wrote:
  | > Gradients Are Not All You Need
  |
  | Sometimes you need to peek at the Hessian.
  |
  | Seriously though, what is intelligence if not creative unrolling of
  | the first few terms of the Taylor expansion?
  | unlikelymordant wrote:
  | My one wish is that machine learning papers would use titles that
  | actually describe what the paper is about. I suppose there is a
  | certain 'evolutionary pressure' where clever titles 'outcompete'
  | drier, more descriptive ones (or it seems that way). But I don't
  | like it.
  | satvikpendem wrote:
  | The clever titles are more brandable. See the citations for
  | "Attention is all you need" or "Chinchilla limit" versus more
  | mundane titles.
  | rolisz wrote:
  | I'm pretty sure "Attention is all you need" would have gotten a lot
  | of citations even if it had a "boring" title. It was a
  | groundbreaking paper with lots of good ideas.
  | cubefox wrote:
  | The Chinchilla paper was called "Training Compute-Optimal Large
  | Language Models", which is exactly on point.
  | dragonwriter wrote:
  | I mean, it would be _slightly_ more accurate if it had been
  | "Compute-Optimal Training of Large Language Models", since the
  | _models_ so trained aren't the thing that is compute-optimal; the
  | training is compute-optimal.
  |
  | But... yeah, it's hardly a title chosen for marketing rather than
  | description.
  | sroussey wrote:
  | But we all know what you mean when you say "the Chinchilla paper".
  | moritzwarhier wrote:
  | Stupid question: is training considered a part of the model? Or is
  | this only common parlance for GPT (the P)?
  | cubefox wrote:
  | Interestingly, the models themselves also have a compute cost
  | associated with them, namely the cost of running them (inference).
  | The smaller a model is, the less compute it needs for inference.
  | This creates an interesting trade-off: you can overtrain (rather
  | than undertrain, as in the past) a model, i.e. use more tokens,
  | fewer parameters, and ultimately more compute than optimal during
  | training, to get lower inference cost.
  |
  | https://www.harmdevries.com/post/model-size-vs-compute-overh...
  |
  | This website has an interesting graph which visualizes this:
  |
  | > For example, the compute overhead for 75% of the optimal model
  | size is only 2.8%, whereas for half of the optimal model size, the
  | overhead rises to 20%. As we move towards smaller models, we
  | observe an asymptotic trend, and at 25% of the compute-optimal
  | model size, the compute overhead increases rapidly to 188%.
  |
  | So when you train a model you shouldn't just look at compute-
  | optimal training, but try to anticipate how much the model will be
  | used for inference, to minimize total compute cost. Basically, the
  | less inference you expect to do with your model, the closer the
  | training should be to Chinchilla-optimality.
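  A back-of-the-envelope sketch of the trade-off described above (this
  is not code from the linked post): assuming the published Chinchilla
  parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta and training
  compute C ~ 6*N*D, it asks how many extra training FLOPs a smaller-
  than-optimal model needs to reach the compute-optimal loss. The fit
  constants are approximate and the compute budget C is an arbitrary
  illustrative choice, so the exact percentages are indicative only.
 
      # Chinchilla parametric loss fit (Hoffmann et al. 2022); the
      # constants are approximate and this whole sketch is illustrative.
      E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
 
      def loss(N, D):                  # N = parameters, D = training tokens
          return E + A / N**alpha + B / D**beta
 
      C = 5.9e23                       # a fixed training-compute budget, C ~ 6*N*D
      # Compute-optimal allocation from the first-order condition
      # alpha*A*N**-alpha == beta*B*D**-beta with D = C/(6*N):
      N_opt = ((alpha * A) / (beta * B) * (C / 6)**beta) ** (1 / (alpha + beta))
      D_opt = C / (6 * N_opt)
      target = loss(N_opt, D_opt)
 
      for frac in (0.75, 0.50, 0.25):  # smaller, "overtrained" models
          N = frac * N_opt
          # tokens needed for the smaller model to reach the same loss
          D = (B / (target - E - A / N**alpha)) ** (1 / beta)
          print(f"{frac:.0%} of optimal size: {6*N*D/C - 1:+.1%} extra training compute")
 
  With these constants the sketch prints roughly +3%, +21% and +189%,
  in line with the figures quoted from the post (the result does not
  depend on the particular budget C). The smaller model then costs less
  per token at inference time, which is where the overhead can pay off.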
  | sho_hn wrote:
  | Remember the days after Rowhammer and Heartbleed, when every new
  | security vulnerability needed its own catchy name and domain-name
  | website? This is the science version of that.
  |
  | Branding is eating the world.
  | mumblemumble wrote:
  | It's inescapable.
  |
  | 40 years ago, when articles went into print publications, you'd
  | just get your paper into a key print journal and then trust that
  | everyone who got it would at least look through the article
  | headlines and read the abstracts of articles that seemed relevant
  | to them. And it was manageable because you'd only have a few new
  | issues rolling in per month.
  |
  | But arXiv had an average of 167 CS papers submitted per _day_ in
  | 2021. An academic who wants to keep their career alive needs to
  | resort to every trick in the book to be heard above that din.
  | V__ wrote:
  | Isn't this more a problem of insufficient curation? There are more
  | papers now than ever, but the signal-to-noise ratio is probably
  | getting worse and worse.
  | xpe wrote:
  | Is _branding_ the concept in play? I don't see it being a very good
  | metaphorical fit compared to my suggestion below. Branding requires
  | a common theme over multiple things (such as products or events).
  |
  | I think we're seeing something closer to _title optimization_ in
  | the service of _marketing_. But even marketing isn't a great fit.
  |
  | Upon reflecting on the Wikipedia definition of marketing, it
  | strikes me that title optimization is only a tiny part of what
  | "academic research marketing" could be. In the most generous sense,
  | it could mean reaching out to media sources fairly early in the
  | research/writing process to help craft a paper that will be
  | interesting to the desired audience.
  |
  | Wikipedia quotes:
  |
  | > Marketing is the process of exploring, creating, and delivering
  | value to meet the needs of a target market in terms of goods and
  | services; potentially including selection of a target audience;
  | selection of certain attributes or themes to emphasize in
  | advertising; operation of advertising campaigns; attendance at
  | trade shows and public events; design of products and packaging
  | attractive to buyers; defining the terms of sale, such as price,
  | discounts, warranty, and return policy; product placement in media
  | or with people believed to influence the buying habits of others;
  | agreements with retailers, wholesale distributors, or resellers;
  | and attempts to create awareness of, loyalty to, and positive
  | feelings about a brand.
  |
  | > A brand is a name, term, design, symbol or any other feature that
  | distinguishes one seller's good or service from those of other
  | sellers. Brands are used in business, marketing, and advertising
  | for recognition and, importantly, to create and store value as
  | brand equity for the object identified, to the benefit of the
  | brand's customers, its owners and shareholders.
  |
  | > Personal branding is the conscious and intentional effort to
  | create and influence public perception of an individual by
  | positioning them as an authority in their industry, elevating their
  | credibility, and differentiating themselves from the competition,
  | to ultimately advance their career, widen their circle of
  | influence, and have a larger impact.
  | preommr wrote:
  | > Branding is eating the world.
  |
  | Not realizing that it always has been is why it's still effective.
  |
  | History is written by the victors, and victors tend to have the
  | best marketing.
  | [deleted]
  | tough wrote:
  | I asked GPT: > https://arxiv.org/abs/2111.05803 make a good useful
  | title that's not silly for this paper and wht's about
  |
  | I dunno if it even read the article lols
  |
  | ====
  |
  | Title: "Deep Reinforcement Learning for Multi-Agent Navigation in
  | Complex Environments". This paper proposes a deep reinforcement
  | learning approach for multi-agent navigation in complex
  | environments. The proposed method uses a centralized critic and
  | decentralized actor architecture, where each agent has its own
  | policy network and learns from its local observations, while the
  | critic network takes in the global state information to provide a
  | centralized value estimate. The method is evaluated on a variety of
  | benchmark environments and achieves state-of-the-art performance
  | compared to existing methods.
  | Bootvis wrote:
  | ChatGPT doesn't read URLs on the fly, but it will pretend it does.
  | tough wrote:
  | lol
  | abhgh wrote:
  | Your comment reminded me of this talk [1] by David Duvenaud where
  | he discusses his work on Neural ODEs, introduced in the paper [2]
  | titled "Neural Ordinary Differential Equations" (this was one of
  | the Best Papers at NeurIPS 2018). Apparently the original name of
  | the paper was "Training Infinitesimally-Layered Neural Networks by
  | Backpropagating through Black-box ODE Solvers". I found that to be
  | a much better description of the paper. It seems he thinks so too,
  | but eventually the catch-phrase-y title won out. In the video [1],
  | this bit is at 00:48-01:22.
  |
  | PS: [1] is a good talk in general, where he discusses some of the
  | limitations of the paper and things that could have been done
  | better.
  |
  | [1] https://www.youtube.com/watch?v=YZ-_E7A3V2w
  |
  | [2] https://papers.nips.cc/paper_files/paper/2018/file/69386f6bb...
  | dauertewigkeit wrote:
  | At some point you run into the problem where titles become useless
  | because there are 100 papers on the exact same topic with very
  | slight variations in the title. At that point people use surnames
  | and dates to cite papers.
  |
  | But then the title can become something catchy that will give you
  | more visibility.
  | scrubs wrote:
  | Agree -- Owl talk, of Winnie-the-Pooh childhood fame, sucks.
  | optimalsolver wrote:
  | I'd like to imagine quirky titles make papers harder for other
  | researchers to find, leading to lower citation counts, but that may
  | be wishful thinking.
  | skybrian wrote:
  | I don't think it makes much difference. If anything, it might help?
  | It's easier to search for a paper if you remember its name. If not,
  | you can search on author or words in the abstract.
  |
  | The problem isn't quirky titles, it's websites like Hacker News
  | that only display a headline and not the abstract.
  | bundie wrote:
  | I think this trend is happening because of the entire buzz around
  | OpenAI LLMs and that 2017 Google research paper.
  | rsfern wrote:
  | Ironically, attention is what you need to get high citation counts.
  |
  | Whether quirky titles help or hurt with that, I think, depends on
  | name recognition and the publication venue.
  | xpe wrote:
  | Getting cited is probably the largest individual incentive. To
  | various degrees, authors also want to get their ideas and
  | reputation "out there". To that end, authors want their work to be
  | (i) skimmed (the abstract at least); (ii) shared; (iii) read. A
  | catchy title almost always helps (right?); there doesn't seem to be
  | a significant downside to catchy titles.
  |
  | So how do we get more desirable system behavior?
  | It seems we have a collective action problem.
  | ajb wrote:
  | There's quite a long history of these titles, even before ML:
  |
  | "'Sometime' is Sometimes 'Not Never'" - Lamport
  |
  | "Reflections on Trusting Trust" - Thompson
  |
  | "On the Cruelty of Really Teaching Computing Science" - Dijkstra
  |
  | I'm sure there are more.
  | hikingsimulator wrote:
  | There is a lot of jargon in ML. An example is found in the object-
  | detection literature, where you will often find sentences like
  | this: "we present a novel simple non-hierarchical feature network
  | backbone with window-shifted attention based learning accommodating
  | neck+regression head fine-tuning or masked cascade RCNN second
  | stage." I'm half joking. Surveys are often a godsend.
  | alephxyz wrote:
  | You forgot to toss in a few "state-of-the-art".
  | chas wrote:
  | The major ML conferences all have pretty tight page limits, so more
  | expository sentences usually get cut. This also means that papers
  | usually only explain how their work is different from previous
  | work, so they assume you are familiar with the papers they cite or
  | are willing to read them.
  |
  | This means that people who have up-to-date knowledge of a given
  | subfield can quickly get a lot out of a new paper. Unfortunately,
  | it also means that it usually takes a pretty decent stack of papers
  | to get up to speed on a new subfield, since you have to read the
  | important segments of the commonly cited papers in order to gain
  | the common knowledge that new papers are being diffed against.
  |
  | Traditionally, this issue is solved by textbooks, since the base
  | set of ideas in a given field or subfield is pretty stable. ML has
  | been moving pretty fast in recent years, so there is still a
  | sizable gap between the base knowledge required for productive
  | paper reading and what you can get out of a textbook. For example,
  | Goodfellow et al. [1] is a great intro to the core ideas of deep
  | learning, but it was published before transformers were invented,
  | so it doesn't mention them at all.
  |
  | [1] https://www.deeplearningbook.org/
  | visarga wrote:
  | > "Understanding Limitations and Chaos-Based Failures in Gradient-
  | Based Optimization Methods." (gpt4)
  |
  | Fixed with ML.
  | xpe wrote:
  | I've provided the PDF URL to ChatGPT 4.0 and asked it to summarize
  | the article with alternative titles, but for some reason it keeps
  | getting the original title, authors, abstract, and body wrong. What
  | prompt did you use?
  | wcoenen wrote:
  | That's because ChatGPT doesn't have the ability to retrieve PDFs
  | from the internet. (Unless maybe you have early access to the
  | version with plugins?)
  |
  | Bing Chat does have the ability to read opened PDFs when used from
  | the Edge sidebar.
  | visarga wrote:
  | I asked a bunch of things. Initially I posted the abstract and
  | title, then prompted just: "Explain". Then I asked "Give me
  | background knowledge on spectrum of the Jacobian" and "Explain the
  | title of the paper", and in the end "Reformulate the title in a
  | more explicit manner". Maybe you can skip directly to the last
  | prompt.
  | asdfman123 wrote:
  | > chaos based failure mode
  |
  | I studied this in undergrad, but it's not the same thing the paper
  | is talking about.
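  For readers wondering what "chaos-based failures" of gradients means
  here: the paper looks at gradients obtained by backpropagating
  through long unrolled computations (simulators, RL rollouts, unrolled
  optimization), where the overall gradient is a product of per-step
  Jacobians. A minimal, illustrative sketch (not from the paper's code)
  using the logistic map, a textbook chaotic system; the function name
  and parameters below are made up for the example:
 
      # Iterate the logistic map x_{t+1} = r*x_t*(1 - x_t) in its chaotic
      # regime (r = 3.9) and accumulate dx_T/dx_0 by the chain rule, i.e.
      # the product of the one-dimensional per-step Jacobians.
      def unrolled_gradient(x0, r=3.9, steps=100):
          x, grad, history = x0, 1.0, []
          for _ in range(steps):
              grad *= r * (1.0 - 2.0 * x)   # Jacobian of a single step
              x = r * x * (1.0 - x)
              history.append(abs(grad))
          return x, history
 
      x_T, g = unrolled_gradient(0.4)
      print(f"x_100 (still bounded in [0, 1]): {x_T:.3f}")
      for t in (10, 50, 100):
          print(f"|dx_{t}/dx_0| = {g[t - 1]:.3e}")
 
  The iterates stay bounded, but the gradient magnitude grows roughly
  exponentially with the unroll length, so what a gradient-based
  optimizer sees through a long unroll degenerates into high-variance
  noise -- the failure mode the retitled abstract points at.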
  | Der_Einzige wrote:
  | Global optimization techniques which don't rely on gradients seem
  | theoretically superior in all instances, except that we haven't
  | found super fast ways to run these kinds of optimizers.
  |
  | The cartpole demo famously tripped up derivative-based
  | reinforcement learning for a while.
  | wenc wrote:
  | > Global optimization techniques which don't rely on gradients seem
  | theoretically superior in all instances, except that we haven't
  | found super fast ways to run these kinds of optimizers.
  |
  | Did you mean "Global optimization techniques which _do_ rely on
  | gradients..."? Because exact gradient-based global optimization
  | (GBD or branch-and-bound based) methods for general nonconvex
  | problems _are_ theoretically superior (bounding with McCormick
  | relaxations etc.) but also more challenging to deploy in practice
  | than, say, stochastic methods or metaheuristics like local search.
  | modeless wrote:
  | Seems to me like the whole history of neural nets is basically
  | crafting models with well-behaved gradients to make gradient
  | descent work well. That, and models that can achieve high
  | utilization of available hardware. The surprising thing is that
  | models exist where the gradients are _so_ well-behaved that we can
  | learn GPT-4 level stuff.
  | smonn_ wrote:
  | There are plenty of interesting neural network designs out there,
  | but they're being overshadowed by transformers due to their recent
  | success. I personally think that the main reason transformers work
  | so well is that they actually step away from the multi-layer
  | perceptron stuff and introduce some structure and, in a way,
  | sparsity.
  | mumblemumble wrote:
  | Also, multi-head attention strikes me as about as close to how
  | language semantics seems to actually work in human brains as
  | anything I've seen.
  |
  | Lots of caveats there, of course. First off, I don't know much
  | about the neurology; I just have an amateur interest in second-
  | language acquisition research that sometimes brings me into contact
  | with this sort of thing. On the ANN side, which is closer to my
  | actual wheelhouse, we definitely don't have any way of knowing if
  | the actual mechanism is all that close, and I'm guessing it
  | probably isn't, since ANNs don't actually work _that_ similarly to
  | brains. Nor does it need to be, but, intuitively, there's still
  | something promising about an ANN architecture that's vaguely
  | capable of mimicking the behavior of modules in an existing system
  | (human brains) that's well known to be capable of doing the job.
  | I'm not super wild about the bidirectional recurrent layers,
  | either, because they impose some restrictions that clearly aren't
  | great, such as the hard limit on input size. Et cetera. But it
  | still strikes me as another big step in a good direction.
  | smonn_ wrote:
  | I'm currently working on a variation of a spiking neural network
  | that learns by making and purging connections between neurons,
  | which so far has been pretty interesting, though I am having a hard
  | time getting it to output anything more than just the patterns it
  | recognised. I did play around with adding its outputs to the input
  | list, making it sort of recurrent, but it's practically impossible
  | to decode anything that's going on inside the network. I'm thinking
  | of tracking the inputs around to see what it's doing right now;
  | might be interesting to see it generate some sort of tree-like
  | structure.
___________________________________________________________________
(page generated 2023-04-23 23:00 UTC)