[HN Gopher] Gradients are not all you need
       ___________________________________________________________________
        
       Gradients are not all you need
        
       Author : bundie
       Score  : 97 points
       Date   : 2023-04-23 16:37 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | 0xBABAD00C wrote:
       | > Gradients Are Not All You Need
       | 
       | Sometimes you need to peek at the Hessian.
       | 
       | Seriously though, what is intelligence if not creative unrolling
       | of the first few terms of the Taylor expansion?
        
       | unlikelymordant wrote:
       | My one wish is that machine learning papers would use paper
       | titles that actually described what the paper was about. I
       | suppose there is a certain 'evolutionary pressure' where clever
       | titles 'outcompete' drier, more descriptive titles (or it seems
       | that way). But I don't like it.
        
         | satvikpendem wrote:
         | The clever titles are more brandable. See the citations for
         | "Attention is all you need" or "Chinchilla limit" versus more
         | mundane titles.
        
           | rolisz wrote:
           | I'm pretty sure Attention is all you need would have gotten a
           | lot of citations even if it had a "boring" title. It was a
           | groundbreaking paper with lots of good ideas.
        
           | cubefox wrote:
           | The Chinchilla paper was called "Training Compute-Optimal
           | Large Language Models", which is exactly on point.
        
             | dragonwriter wrote:
             | I mean, it would be _slightly_ more accurate if it had been
             | "Compute-Optimal Training of Large Language Models", since
             | the _models_ so trained aren't the thing that is
             | compute-optimal; the training is.
             | 
             | But... yeah, it's hardly a title chosen for marketing rather
             | than description.
        
               | sroussey wrote:
               | But we all know what you mean when you say "the
               | Chinchilla paper"
        
               | moritzwarhier wrote:
               | Stupid question, is training considered a part of the
               | model? Or is this only common parlance for GPT (the P)?
        
               | cubefox wrote:
               | Interestingly, the models themselves also have compute
               | cost associated with them, namely the cost of running
               | them (inference). The smaller a model is, the less
               | compute it needs for inference. Which is an interesting
               | trade-off, because you can overtrain (rather than
               | undertrain, as in the past) a model, i.e. use more
               | tokens, fewer parameters, and ultimately more compute
               | than optimal during training, to get lower inference
               | cost.
               | 
               | https://www.harmdevries.com/post/model-size-vs-compute-
               | overh...
               | 
               | This website has an interesting graph which visualizes
               | this:
               | 
               | > For example, the compute overhead for 75% of the
               | optimal model size is only 2.8%, whereas for half of the
               | optimal model size, the overhead rises to 20%. As we move
               | towards smaller models, we observe an asymptotic trend,
               | and at 25% of the compute-optimal model size, the compute
               | overhead increases rapidly to 188%.
               | 
               | So if you train your model you shouldn't just look at
               | compute-optimal training but try to anticipate how much
               | the model will probably be used for inference, to
               | minimize total compute cost. Basically, the less
               | inference you expect to do with your model, the closer
               | the training should be to Chinchilla-optimality.
        
         | sho_hn wrote:
         | Remember the days after Rowhammer and Heartbleed, when every
         | new security vulnerability needed its own catchy name and
         | domain name website? This is the science version of that.
         | 
         | Branding is eating the world.
        
           | mumblemumble wrote:
           | It's inescapable.
           | 
           | 40 years ago, when articles went into print publications,
           | you'd just get your paper into a key print journal and then
            | trust that everyone who got it would at least look through
           | the article headlines and read the abstracts of articles that
           | seemed relevant to them. And it was manageable because you'd
           | only have a few new issues rolling in per month.
           | 
           | But arXiv had an average of 167 CS papers being submitted per
           | _day_ in 2021. An academic who wants to keep their career
           | alive needs to resort to every trick in the book to be heard
           | above that din.
        
             | V__ wrote:
              | Isn't this more a problem of insufficient curation? There
              | are more papers now than ever, but the signal-to-noise
              | ratio is probably getting worse and worse.
        
           | xpe wrote:
           | Is _branding_ the concept in play? I don't see it being a
           | very good metaphorical fit in comparison to my suggestion
           | below. Branding requires a common theme over multiple things
           | (such as products or events).
           | 
           | I think we're seeing something closer to _title optimization_
           | in the service of _marketing_. But even marketing isn't a
           | great fit.
           | 
           | Upon reflecting on the Wikipedia definition of marketing, it
           | strikes me that title optimization is only a tiny part of
           | what "academic research marketing" could be. In the most
           | generous sense, it could mean reaching out to media sources
           | fairly early in the research/writing process to help craft a
           | paper that will be interesting to the desired audience.
           | 
           | Wikipedia quotes:
           | 
           | > Marketing is the process of exploring, creating, and
           | delivering value to meet the needs of a target market in
           | terms of goods and services; potentially including selection
           | of a target audience; selection of certain attributes or
           | themes to emphasize in advertising; operation of advertising
           | campaigns; attendance at trade shows and public events;
           | design of products and packaging attractive to buyers;
           | defining the terms of sale, such as price, discounts,
           | warranty, and return policy; product placement in media or
           | with people believed to influence the buying habits of
           | others; agreements with retailers, wholesale distributors, or
           | resellers; and attempts to create awareness of, loyalty to,
           | and positive feelings about a brand.
           | 
           | > A brand is a name, term, design, symbol or any other
           | feature that distinguishes one seller's good or service from
           | those of other sellers. Brands are used in business,
           | marketing, and advertising for recognition and, importantly,
           | to create and store value as brand equity for the object
           | identified, to the benefit of the brand's customers, its
           | owners and shareholders.
           | 
           | > Personal branding is the conscious and intentional effort
           | to create and influence public perception of an individual by
           | positioning them as an authority in their industry, elevating
           | their credibility, and differentiating themselves from the
           | competition, to ultimately advance their career, widen their
           | circle of influence, and have a larger impact.
        
           | preommr wrote:
           | > Branding is eating the world.
           | 
           | Not realizing that it always has is why it's still effective.
           | 
            | History is written by the victors, and the victors tend to
            | have the best marketing.
        
             | [deleted]
        
         | tough wrote:
         | I asked GPT > https://arxiv.org/abs/2111.05803 make a good
          | useful title that's not silly for this paper and what it's
          | about
         | 
         | I dunno if it even read the article lols
         | 
         | ====
         | 
         | Title: "Deep Reinforcement Learning for Multi-Agent Navigation
         | in Complex Environments" This paper proposes a deep
         | reinforcement learning approach for multi-agent navigation in
         | complex environments. The proposed method uses a centralized
         | critic and decentralized actor architecture, where each agent
         | has its own policy network and learns from its local
         | observations, while the critic network takes in the global
         | state information to provide a centralized value estimate. The
         | method is evaluated on a variety of benchmark environments and
         | achieves state-of-the-art performance compared to existing
         | methods.
        
           | Bootvis wrote:
            | ChatGPT doesn't read URLs on the fly but it will pretend it
           | does.
        
             | tough wrote:
             | lol
        
         | abhgh wrote:
          | Your comment reminded me of this talk [1] by David Duvenaud
          | where he discusses his work on Neural ODEs, introduced in the
          | paper [2] titled "Neural Ordinary Differential Equations"
          | (this was one of the Best Papers at NeurIPS 2018). Apparently
          | the original name of the paper was "Training Infinitesimally-
          | Layered Neural Networks by Backpropagating through Black-box
          | ODE Solvers". I found that to be a much better description of
          | the paper. It seems he thinks so too, but eventually the
          | catch-phrase-y title won out. In the video [1], this bit is
          | at 00:48-01:22.
         | 
         | PS: [1] is a good talk in general where he discusses some of
         | the limitations of the paper and things that could have been
         | done better.
         | 
         | [1] https://www.youtube.com/watch?v=YZ-_E7A3V2w
         | 
         | [2]
         | https://papers.nips.cc/paper_files/paper/2018/file/69386f6bb...
        
         | dauertewigkeit wrote:
         | At some point you run into the problem where titles become
         | useless because there are 100 papers on the same exact topic
         | with very slight variation in the title. At this point people
         | use surnames and date to cite papers.
         | 
         | But then the title can become something catchy that will give
         | you more visibility.
        
         | scrubs wrote:
          | Agree -- "owl talk", of Winnie-the-Pooh childhood fame, sucks.
        
         | optimalsolver wrote:
         | I'd like to imagine quirky titles make them harder for other
         | researchers to find, leading to lower citation counts, but that
         | may be wishful thinking.
        
           | skybrian wrote:
           | I don't think it makes much difference. If anything, it might
           | help? It's easier to search for a paper if you remember its
           | name. If not, you can search on author or words in the
           | abstract.
           | 
            | The problem isn't quirky titles, it's websites like Hacker
            | News that only display a headline and not the
           | abstract.
        
           | bundie wrote:
           | I think this trend is happening because of the entire buzz
            | around OpenAI LLMs and that 2017 Google research paper.
        
             | rsfern wrote:
             | Ironically, attention is what you need to get high citation
             | counts
             | 
              | Whether quirky titles help or hurt with that, I think,
              | depends on name recognition and the publication venue.
        
         | xpe wrote:
         | Getting cited is probably the largest individual incentive. To
         | various degrees, authors also want to get their ideas and
         | reputation "out there". To that end, authors want their work to
         | be (i) skimmed (the abstract at least); (ii) shared; (iii)
         | read. A catchy title almost always helps (right?); there
          | doesn't seem to be a significant downside to catchy titles.
         | 
         | So how do we get more desirable system behavior? It seems we
         | have a collective action problem.
        
         | ajb wrote:
         | There's quite a long history of these titles, even before ML:
         | 
         | "Sometime' is Sometimes 'Not Never'" - Lamport
         | 
         | "Reflections on trusting trust" - Thompson
         | 
         | "On the cruelty of really teaching computer science" - Dijkstra
         | 
         | I'm sure there are more
        
         | hikingsimulator wrote:
         | There is a lot of jargon in ML, an example is found in the
         | object detection literature where you will often find sentences
          | like this: "we present a novel simple non-hierarchical feature
         | network backbone with window-shifted attention based learning
         | accommodating neck+regression head fine-tuning or masked
         | cascade RCNN second stage." I'm half joking. Surveys are often
         | a godsend.
        
           | alephxyz wrote:
           | You forgot to toss in a few "state-of-the-art"
        
           | chas wrote:
           | The major ML conferences all have pretty tight page limits,
           | so more expository sentences usually get cut. This also means
           | that papers usually only explain how their work is different
           | from previous work, so they assume you are familiar with the
           | papers they cite or are willing to read the cited papers.
           | 
            | This means that people who have up-to-date knowledge of a
            | given subfield can quickly get a lot out of new papers.
           | Unfortunately, it also means that it usually takes a pretty
           | decent stack of papers to get up to speed on a new subfield
           | since you have to read the important segments of the commonly
           | cited papers in order to gain the common knowledge that
           | papers are being diffed against.
           | 
           | Traditionally, this issue is solved by textbooks, since the
           | base set of ideas in a given field or subfield is pretty
           | stable. ML has been moving pretty fast in recent years, so
           | there is still a sizable gap between the base knowledge
           | required for productive paper reading and what you can get
           | out of a textbook. For example, Goodfellow et al [1] is a
           | great intro to the core ideas of deep learning, but it was
           | published before transformers were invented, so it doesn't
           | mention them at all.
           | 
           | [1] https://www.deeplearningbook.org/
        
         | visarga wrote:
         | > "Understanding Limitations and Chaos-Based Failures in
         | Gradient-Based Optimization Methods." (gpt4)
         | 
         | Fixed with ML.
        
           | xpe wrote:
           | I've provided the PDF URL to ChatGPT 4.0 and asked it to
           | summarize the article with alternative titles, but for some
           | reason it keeps getting the original title, authors,
           | abstract, and body wrong. What prompt did you use?
        
             | wcoenen wrote:
              | That's because ChatGPT doesn't have the ability to retrieve
             | PDFs from the internet. (Unless maybe if you have early
             | access to the version with plugins?)
             | 
             | Bing chat does have the ability to read opened PDFs when
             | used from the Edge side bar.
        
             | visarga wrote:
             | I asked a bunch of things. So initially I posted the
             | abstract and title, then prompted just: "Explain". Then I
             | asked "Give me background knowledge on spectrum of the
             | Jacobian" and "Explain the title of the paper" and in the
             | end "Reformulate the title in a more explicit manner".
             | Maybe you can skip directly to the last prompt.
        
       | asdfman123 wrote:
       | > chaos based failure mode
       | 
       | I studied this in undergrad, but it's not the same thing the
       | paper is talking about
        
       | Der_Einzige wrote:
       | Global optimization techniques which don't rely on gradients
        | seem theoretically superior in all instances, except that we
       | haven't found super fast ways to run these kinds of optimizers.
       | 
        | The cartpole demo famously tripped up derivative-based
        | reinforcement learning for a while.
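        | 
        | As a minimal sketch of one of the simplest gradient-free
        | global methods, here is pure random search in Python (the
        | test objective, Himmelblau's function, and the evaluation
        | budget are just made-up illustrative choices): it needs no
        | gradients and converges toward the global optimum in the
        | limit, but only by burning through a huge number of
        | evaluations, which is exactly the speed problem.
        | 
        |   import random
        | 
        |   def f(x, y):  # made-up nonconvex test objective
        |       return ((x**2 + y - 11)**2
        |               + (x + y**2 - 7)**2)
        | 
        |   lo, hi = -5.0, 5.0
        |   best_f, best_xy = float("inf"), None
        | 
        |   # pure random search: no gradients at all
        |   random.seed(0)
        |   for _ in range(100_000):
        |       x = random.uniform(lo, hi)
        |       y = random.uniform(lo, hi)
        |       if f(x, y) < best_f:
        |           best_f, best_xy = f(x, y), (x, y)
        | 
        |   # Himmelblau's global minima all have f = 0
        |   print(best_f, best_xy)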
        
         | wenc wrote:
         | > Global optimization techniques which don't rely on gradients
          | seem theoretically superior in all instances, except that we
         | haven't found super fast ways to run these kinds of optimizers.
         | 
         | Did you mean "Global optimization techniques which _do_ rely on
         | gradients... "? Because exact gradient-based global
         | optimization (GBD or branch-and-bound based) methods for
         | general nonconvex problems _are_ theoretically superior
         | (bounding with McCormick relaxations etc.) but also more
         | challenging to practically deploy than say stochastic methods
         | or metaheuristics like local search.
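          | 
          | To make the branch-and-bound idea concrete, here is a toy
          | sketch in Python. It uses a crude Lipschitz lower bound in
          | place of McCormick relaxations, and the test function,
          | interval, Lipschitz constant, and tolerance are all made up
          | for the demo, but the bound / branch / prune skeleton is
          | the same one the exact methods use:
          | 
          |   import heapq
          | 
          |   def f(x):  # nonconvex test function
          |       return x**4 - 3 * x**3 + 2
          | 
          |   A, B = -2.0, 3.0  # search interval
          |   L = 80.0   # conservative Lipschitz constant on [A, B]
          |   TOL = 1e-3
          | 
          |   def lower_bound(a, b):
          |       # valid lower bound on min f over [a, b]
          |       # for any L-Lipschitz function f
          |       return (f(a) + f(b)) / 2 - L * (b - a) / 2
          | 
          |   best_x, best_f = A, f(A)  # incumbent (upper bound)
          |   heap = [(lower_bound(A, B), A, B)]
          | 
          |   while heap:
          |       lb, a, b = heapq.heappop(heap)
          |       if lb > best_f - TOL:  # prune this interval
          |           continue
          |       m = (a + b) / 2
          |       if f(m) < best_f:
          |           best_x, best_f = m, f(m)
          |       for c, d in ((a, m), (m, b)):  # branch
          |           clb = lower_bound(c, d)
          |           if clb < best_f - TOL:
          |               heapq.heappush(heap, (clb, c, d))
          | 
          |   print(best_x, best_f)  # global min near x = 2.25
          | 
          | Real solvers replace lower_bound with much tighter convex
          | relaxations and have to handle many dimensions and
          | constraints, which is where the practical difficulty comes
          | in.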
        
       | modeless wrote:
       | Seems to me like the whole history of neural nets is basically
       | crafting models with well-behaved gradients to make gradient
       | descent work well. That, and models that can achieve high
       | utilization of available hardware. The surprising thing is that
       | models exist where the gradients are _so_ well-behaved that we
       | can learn GPT-4 level stuff.
        
         | smonn_ wrote:
          | There are plenty of interesting neural network designs out
          | there, but they're being overshadowed by transformers due to
          | their recent success. I personally think that the main reason
          | transformers work so well is that they actually step away
          | from the multi-layer perceptron stuff and introduce some
          | structure and, in a way, sparsity.
        
           | mumblemumble wrote:
           | Also, multi-head attention strikes me as being about as close
           | to how language semantics seems to actually work in human
           | brains as I've seen.
           | 
           | Lots of caveats there, of course. First off, I don't know
           | much about the neurology, I just have an amateur interest in
           | second language acquisition research that sometimes brings me
           | into contact with this sort of thing. On the ANN side, which
           | is closer to my actual wheelhouse, we definitely don't
           | actually have any way of knowing if the actual mechanism is
           | all that close, and I'm guessing it probably isn't even close
           | since ANN's don't actually work _that_ similarly to brains.
           | Nor does it need to be, but, intuitively, there 's still
           | something promising about an ANN architecture that's vaguely
           | capable of mimicking the behavior of modules in an existing
           | system (human brains) that's well known to be capable of
           | doing the job. I'm not super wild about the bidirectional
           | recurrent layers, either, because they impose some
           | restrictions that clearly aren't great, such as the hard
            | limit on input size, et cetera. But it still strikes me as
           | another big step in a good direction.
        
             | smonn_ wrote:
             | I'm currently working on a variation of a spiking neural
             | network that learns by making and purging connections
             | between neurons, which so far has been pretty interesting,
             | though I am having a hard time getting it to output
             | anything more than just the patterns it recognised. I did
             | play around with adding its outputs to the input list,
              | making it sort of recurrent, but it's practically impossible
              | to decode anything that's going on inside of the network.
              | I'm thinking of tracking the inputs around to see what it's
              | doing right now; it might be interesting to see it generate
              | some sort of tree-like structure.
        
       ___________________________________________________________________
       (page generated 2023-04-23 23:00 UTC)