[HN Gopher] Transcending Scaling Laws with 0.1% Extra Compute
       ___________________________________________________________________
        
       Transcending Scaling Laws with 0.1% Extra Compute
        
       Author : ashvardanian
       Score  : 62 points
       Date   : 2023-01-27 19:00 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jxf wrote:
       | This sounds very interesting but I lack the technical depth in
       | language models to understand it. In particular I can't parse the
       | following excerpt:
       | 
       | > The key idea is to continue training a state-of-the-art large
       | language model (e.g., PaLM) on a few more steps with UL2's
       | mixture-of-denoiser objective. We show that, with almost
       | negligible extra computational costs and no new sources of data,
       | we are able to substantially improve the scaling properties of
       | large language models on downstream metrics. In this paper, we
       | continue training PaLM with UL2R, introducing a new set of models
       | at 8B, 62B, and 540B scale which we call U-PaLM.
       | 
       | Things I don't understand:
       | 
       | * PaLM (and its advantages/disadvantages relative to other LMs)
       | 
       | * What a "mixture-of-denoiser objective" is
       | 
       | * How "the scaling properties" are measured
       | 
       | I'd be interested in a more accessible summary of how this works,
       | if HN has any references.
        
         | p1esk wrote:
         | Did you try reading the paper (past the abstract)? It provides
         | the reference to the original PaLM paper and answers the rest
         | of your questions.
        
       | mcint wrote:
       | Linking Papers with Code for its listing of relevant tasks,
       | datasets, and metrics, with global rankings against other
       | models on defined tasks.
       | 
       | https://paperswithcode.com/paper/transcending-scaling-laws-w...
        
       | 6gvONxR4sf7o wrote:
       | > Impressively, at 540B scale, we show an approximately 2x
       | computational savings rate where U-PaLM achieves the same
       | performance as the final PaLM 540B model at around half its
       | computational budget (i.e., saving ~4.4 million TPUv4 hours).
       | 
       | Are you allowed to call your own work impressive in your
       | abstract? Cool work, but that line is "transcendent."
       | 
       | Anyways, aren't scaling laws more like O(whatever) asymptotics?
       | Like if you reduce your sorting algo from 6.4 n^2 seconds to 3.2
       | n^2, you don't say you "transcended the scaling laws," even
       | though you sped it up a very significant amount. Am I
       | misunderstanding?
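       | 
       | Concretely, the kind of sanity check I have in mind, in
       | Python (using just the 6.4 n^2 vs 3.2 n^2 numbers above,
       | nothing from the paper):
       | 
       |   # Halving the constant is a fixed 2x speedup at every n,
       |   # but both curves still grow quadratically -- same O(n^2).
       |   for n in (10, 1_000, 100_000):
       |       slow = 6.4 * n ** 2
       |       fast = 3.2 * n ** 2
       |       print(n, slow / fast)  # ratio stays 2.0 as n grows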
        
         | p1esk wrote:
         | _aren't scaling laws more like O(whatever) asymptotics?_
         | 
         | Not if scaling is linear.
        
         | [deleted]
        
         | hobs wrote:
         | Yes, that's day-one stuff: the coefficient gets thrown
         | away.
        
       | dgreensp wrote:
       | By "transcending scaling laws" and "improve the scaling
       | properties," do they just mean higher-quality output compared to
       | using the same (or smaller) model size with previous methods?
        
       | whatshisface wrote:
       | Here is a summary:
       | 
       | - Training on a mixture of fill-in-the-gaps (a few missing words)
       | and denoising (every word slightly corrupted) produces better
       | LLMs than either one alone.
       | 
       | - This advantage (that of using both objectives at once) can
       | be gained with just a little extra training on a model
       | previously trained with only one of them.
       | 
       | This results in 2-4% improvements on most tasks, with a couple
       | of really big improvements (+20%) and one quite surprising one
       | (+60%) on a few BigBench tasks. The large percentage
       | improvements on the BigBench tasks seem to have more to do with
       | the low initial performance than with the new performance being
       | outstanding; the +60% was going from 7.6% right to 12.5% right.
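       | 
       | To make the mixing concrete, here's a rough Python sketch of
       | the two corruption schemes above being sampled per training
       | example. Everything in it (names, span length, corruption
       | rate, the 50/50 mix) is an illustrative assumption, not the
       | paper's actual UL2R setup, which mixes several denoiser
       | configurations with different span lengths and rates:
       | 
       |   import random
       | 
       |   SENTINEL = "<extra_id_0>"
       | 
       |   def span_infill(tokens, span_len=3):
       |       # Fill-in-the-gaps: hide one contiguous span and ask
       |       # the model to predict it.
       |       i = random.randrange(0, max(1, len(tokens) - span_len))
       |       gap = tokens[i:i + span_len]
       |       noisy = tokens[:i] + [SENTINEL] + tokens[i + span_len:]
       |       return noisy, gap
       | 
       |   def token_denoise(tokens, rate=0.15):
       |       # Denoising: corrupt a fraction of tokens everywhere;
       |       # the target is the clean sequence.
       |       noisy = ["<mask>" if random.random() < rate else t
       |                for t in tokens]
       |       return noisy, tokens
       | 
       |   def sample_example(tokens):
       |       # The mixture: pick one objective per training example.
       |       if random.random() < 0.5:
       |           return span_infill(tokens)
       |       return token_denoise(tokens)
       | 
       |   inp, tgt = sample_example("the cat sat on the mat".split())
       |   print(inp, "->", tgt)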
        
         | mcint wrote:
         | Thank you for the summary!
         | 
         | I quite like this idea: train on a mixture of simple
         | critics, or really, here, simple sources of noise.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-01-27 23:00 UTC)