[HN Gopher] Transcending Scaling Laws with 0.1% Extra Compute
___________________________________________________________________
 
Transcending Scaling Laws with 0.1% Extra Compute
 
Author : ashvardanian
Score  : 62 points
Date   : 2023-01-27 19:00 UTC (4 hours ago)
 
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
 
| jxf wrote:
| This sounds very interesting, but I lack the technical depth in
| language models to understand it. In particular, I can't parse
| the following excerpt:
| 
| > The key idea is to continue training a state-of-the-art large
| language model (e.g., PaLM) on a few more steps with UL2's
| mixture-of-denoiser objective. We show that, with almost
| negligible extra computational costs and no new sources of data,
| we are able to substantially improve the scaling properties of
| large language models on downstream metrics. In this paper, we
| continue training PaLM with UL2R, introducing a new set of
| models at 8B, 62B, and 540B scale which we call U-PaLM.
| 
| Things I don't understand:
| 
| * PaLM (and its advantages/disadvantages relative to other LMs)
| 
| * What a "mixture-of-denoiser objective" is
| 
| * How "the scaling properties" are measured
| 
| I'd be interested in a more accessible summary of how this
| works, if HN has any references.
| p1esk wrote:
| Did you try reading the paper (past the abstract)? It provides
| the reference to the original PaLM paper and answers the rest
| of your questions.
| mcint wrote:
| Linking Papers with Code for its listing of the relevant tasks,
| datasets, and metrics, with a global ranking against other
| models on the defined tasks.
| 
| https://paperswithcode.com/paper/transcending-scaling-laws-w...
| 6gvONxR4sf7o wrote:
| > Impressively, at 540B scale, we show an approximately 2x
| computational savings rate where U-PaLM achieves the same
| performance as the final PaLM 540B model at around half its
| computational budget (i.e., saving ~4.4 million TPUv4 hours).
| 
| Are you allowed to call your own work impressive in your
| abstract? Cool work, but that line is "transcendent."
| 
| Anyways, aren't scaling laws more like O(whatever) asymptotics?
| Like if you reduce your sorting algo from 6.4 n^2 seconds to
| 3.2 n^2 seconds, you don't say you "transcended the scaling
| laws," even though you sped it up by a very significant amount.
| Am I misunderstanding?
| p1esk wrote:
| _aren't scaling laws more like O(whatever) asymptotics?_
| 
| Not if scaling is linear.
| [deleted]
| hobs wrote:
| Yes, that's day-one stuff: the coefficient gets thrown away.
| dgreensp wrote:
| By "transcending scaling laws" and "improve the scaling
| properties," do they just mean higher-quality output compared
| to using the same (or smaller) model size with previous methods?
| whatshisface wrote:
| Here is a summary:
| 
| - Training on a mixture of fill-in-the-gaps (a few missing
| words) and denoising (every word slightly corrupted) produces
| better LLMs than either one alone.
| 
| - This advantage (that of using both objectives at once) can be
| gained with just a little extra training on a model previously
| trained with only one of them.
| 
| This results in 2-4% improvements on most tasks, with a couple
| of really big improvements (+20%) and one quite surprising one
| (+60%) on a few BigBench tasks. The large percentage
| improvements on the BigBench tasks seem to have more to do with
| the low initial performance than with the new performance being
| outstanding; the 60% was from 7.6% right to 12.5% right.
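As an illustration of the mixture-of-denoisers idea summarized above,
here is a minimal, hypothetical Python sketch. It is not the paper's
UL2/UL2R implementation; the function names, corruption schemes, and
parameters such as span_len, noise_rate, and p_span are assumptions
chosen for readability. Each training example is built by sampling
one of two denoising objectives over the same token sequence.

    import random

    def span_infill_example(tokens, span_len=3):
        # Hypothetical "fill in the gaps" objective: mask a contiguous
        # span of tokens; the target is the original span.
        start = random.randrange(max(1, len(tokens) - span_len))
        corrupted = tokens[:start] + ["<mask>"] + tokens[start + span_len:]
        return corrupted, tokens[start:start + span_len]

    def token_denoise_example(tokens, noise_rate=0.15,
                              filler=("the", "a", "of")):
        # Hypothetical token-level denoising objective: randomly replace
        # individual tokens; the target is the full original sequence.
        corrupted = [random.choice(filler) if random.random() < noise_rate
                     else t for t in tokens]
        return corrupted, tokens

    def mixture_of_denoisers_example(tokens, p_span=0.5):
        # The "mixture": sample one denoising objective per example, so
        # a single model sees both kinds of corruption during training.
        if random.random() < p_span:
            return span_infill_example(tokens)
        return token_denoise_example(tokens)

    # Example usage:
    #   mixture_of_denoisers_example("the cat sat on the mat".split())

The actual UL2 objective mixes several span-corruption configurations
(varying span lengths and corruption rates) with a prefix-style
denoiser and marks each with a mode token; the toy version above
keeps only the sample-one-objective-per-example structure.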
| mcint wrote:
| Thank you for the summary!
| 
| I quite like this idea: train on a mixture of simple critics, or
| really, here, simple sources of noise.
| [deleted]
___________________________________________________________________
(page generated 2023-01-27 23:00 UTC)