[HN Gopher] Chinchilla Scaling: A replication attempt
___________________________________________________________________

Chinchilla Scaling: A replication attempt

Author : tosh
Score  : 87 points
Date   : 2024-04-18 15:05 UTC (7 hours ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| cs702 wrote:
| Interesting! If the authors are right, it seems that the number
| of training tokens required per parameter (slowly) _declines_ as
| models become larger (Figure 5).
|
| That's good news. I think it deserves wider dissemination, so
| I'm upvoting your post.
|
| Thank you for sharing this on HN!
  | dzdt wrote:
  | Could it be that the independence of the available training
  | points declines as the dataset size grows? At some point it
  | becomes hard to add data that isn't essentially similar to
  | something you've already added.
    | cs702 wrote:
    | Yes, could be. Not sure how, or even if, anyone could prove
    | it, though.
      | sebzim4500 wrote:
      | I guess you could artificially limit the training data
      | (e.g. by removing languages or categories) and see if the
      | utility of extra tokens drops off as a result.
      | godelski wrote:
      | This should be fairly de facto true. Remember, your dataset
      | is some proxy for a real (but almost surely intractable)
      | distribution.
      |
      | Now let's think about filling the space with p-balls
      | bounded by the nearest points, so that no data point lies
      | inside any ball. Then we've turned this into a sphere-
      | packing problem, and we can talk about the sizes and
      | volumes of those spheres.
      |
      | If we fill our real distribution with data uniformly, the
      | average volume of those spheres decreases. If we fill it
      | non-uniformly, the average ball still shrinks, but the
      | largest ball shrinks more slowly (that case meaning we
      | aren't properly covering the data in that region). Either
      | way, the more data you add, the more the balls shrink;
      | essentially, the difference between data points decreases.
      | The harder question is about the under-represented regions:
      | finding them and determining how to properly sample them.
      |
      | Another quick trick you can use to convince yourself is
      | thinking about basis vectors (this won't be robust, btw,
      | but it's a good starting point). In high dimensions, two
      | randomly sampled vectors are almost certainly close to
      | orthogonal. So think of drawing basis vectors (independent
      | vectors that span our space). As we fill in data, the
      | vectors (data points) we draw are initially very likely to
      | be independent in some way, but as we add more, the
      | likelihood that they are orthogonal decreases. Of course
      | your basis vectors don't need to be orthogonal, but that's
      | semantics, because we can always work in a space where
      | that's true.
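      |
      | You can check that numerically in a few lines (a minimal
      | NumPy sketch; the sample count and dimensions here are
      | arbitrary choices, not from the paper): the average
      | |cosine similarity| between independently sampled Gaussian
      | vectors concentrates toward zero as the dimension grows.
      |
      |       import numpy as np
      |
      |       rng = np.random.default_rng(0)
      |       for d in (2, 10, 100, 10000):
      |           # 1000 random pairs of d-dimensional vectors
      |           a = rng.standard_normal((1000, d))
      |           b = rng.standard_normal((1000, d))
      |           cos = (a * b).sum(axis=1) / (
      |               np.linalg.norm(a, axis=1)
      |               * np.linalg.norm(b, axis=1))
      |           print(f"d={d:>5}: mean |cos| = "
      |                 f"{np.abs(cos).mean():.3f}")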
  | Kronopath wrote:
  | This is not good news: it means we could end up with a
  | dangerously superintelligent AI just by scaling up the number
  | of parameters, without increasing the amount of training data.
    | exe34 wrote:
    | Like a corporation, then. We should ban them until we can
    | figure out how to align them!
      | tehsauce wrote:
      | ASI is nothing like a corporation.
        | wizzwizz4 wrote:
        | No, they're not. Corporations have known, concrete
        | impacts on the world, whereas the dangers of AI are, so
        | far, corporations. ASIs are (as yet) fictional.
        |
        | Another difference: most corporations will avoid doing
        | illegal stuff if the penalties are large enough; the
        | corporation alignment problem is political. Pretty much
        | no extant AI system can be instructed in this way: we
        | don't know how to align AIs even in theory.
        | TeMPOraL wrote:
        | It's very much like a corporation; a corp is effectively
        | an AGI, just running very slowly, at the speed of
        | bureaucracy.
    | kelseyfrog wrote:
    | No, but LLMs require orders of magnitude more language input
    | than humans[1]. It's very reasonable to assume that
    | architectural differences (size among them) are the more
    | likely constraint on performance.
    |
    | 1. Specifically, larger than the upper bound on _lifetime_
    | language input for humans, even assuming 24/7 reading at
    | maximum speed.
      | HeatrayEnjoyer wrote:
      | Do they? What is the total size of all visual, audio,
      | touch, locomotive, scent, and taste data collected between
      | birth and when a human reaches IQ 100? There are multiple
      | high-bandwidth feeds running into the brain 24/7.
        | cubefox wrote:
        | > language input
        | p1esk wrote:
        | How much language input does a human need to become
        | intelligent if he doesn't receive any other input?
      | TeMPOraL wrote:
      | Yes, but LLMs come out of training as experts in
      | approximately any single thing you can think of, and then
      | some, and all that in a dozen languages. Humans don't
      | achieve even a fraction of this kind of breadth.
        | godelski wrote:
        | This is not quite accurate, but it's complicated, because
        | measurement is hard. The things they are being tested on
        | are almost surely within the dataset. Take the bar exam,
        | for instance. Sure, we don't know what's in GPT-4's
        | training data, but we know it includes Reddit, and we
        | know Reddit has many similar if not identical questions
        | on it. We know that the first GPT-4 report did not do
        | good semantic-similarity matching: they just checked
        | three randomly sampled 50-character substrings for exact
        | matches (Appendix C), and they only consider the false-
        | positive side. Then there's this line:
        |
        |       The RLHF post-training dataset is vastly smaller
        |       than the pretraining set and unlikely to have any
        |       particular question contaminated. However we did
        |       not check explicitly.
        |
        | But my favorite is HumanEval. I'll just remind everyone
        | that this was written by 60 authors, mostly from OpenAI:
        |
        |       We evaluate functional correctness on a set of 164
        |       handwritten programming problems, which we call
        |       the HumanEval dataset. ... __It is important for
        |       these tasks to be hand-written, since our models
        |       are trained on a large fraction of GitHub, which
        |       already contains solutions to problems from a
        |       variety of sources.__
        |
        | The problems? Well, they're leetcode style... can you
        | tell me you can write leetcode-style questions that
        | aren't already on GitHub?
        |
        |       # HumanEval/2, prompt:
        |       def truncate_number(number: float) -> float:
        |           """ Given a positive floating point number, it
        |           can be decomposed into an integer part (largest
        |           integer smaller than given number) and decimals
        |           (leftover part always smaller than 1). Return
        |           the decimal part of the number.
        |           >>> truncate_number(3.5)
        |           0.5
        |           """
        |       # Solution:
        |           return number % 1.0
        |
        |       # HumanEval/4, prompt:
        |       from typing import List
        |
        |       def mean_absolute_deviation(numbers: List[float]) -> float:
        |           """ For a given list of input numbers,
        |           calculate Mean Absolute Deviation around the
        |           mean of this dataset. Mean Absolute Deviation
        |           is the average absolute difference between each
        |           element and a centerpoint (mean in this case):
        |           MAD = average | x - x_mean |
        |           >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
        |           1.0
        |           """
        |       # Solution:
        |           mean = sum(numbers) / len(numbers)
        |           return sum(abs(x - mean) for x in numbers) / len(numbers)
        |
        | You really want to bet that those aren't on GitHub?
        | Because I'll bet you any dollar amount you want that
        | solutions in near-exact form were on GitHub prior to
        | their cutoff date. (Don't trust me, you can find them
        | too; they're even searchable.) Hell, I've now poisoned
        | the dataset here!
        |
        | LLMs are (lossy) compression systems, so they're great
        | for information retrieval, and a lot of what we consider
        | intelligence (and possibly even creativity) is based on
        | information retrieval. That doesn't make these things
        | any less impressive, but it is a note on how we should
        | interpret results and understand the limitations of our
        | tools. Measuring intelligence is really difficult, and
        | we need to be aware that the term isn't universally
        | agreed upon, so people are often talking past one
        | another, and some are conflating different notions as if
        | they were the same.
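        |
        | For reference, the check described above amounts to
        | something like this (a rough sketch of the idea as I
        | read Appendix C, not OpenAI's actual code):
        |
        |       import random
        |
        |       def flagged_as_contaminated(eval_item: str,
        |                                   training_text: str,
        |                                   n_samples: int = 3,
        |                                   k: int = 50) -> bool:
        |           # Sample a few random 50-character substrings
        |           # of the eval item; flag it if any occurs
        |           # verbatim in the training data. Exact matching
        |           # like this misses paraphrases, reformatting,
        |           # and near duplicates, which is the point above.
        |           if len(eval_item) <= k:
        |               return eval_item in training_text
        |           for _ in range(n_samples):
        |               start = random.randrange(len(eval_item) - k + 1)
        |               if eval_item[start:start + k] in training_text:
        |                   return True
        |           return False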
| newfocogi wrote:
| Key claims:
|
| "We have found three potential issues with Hoffmann et al.'s
| estimates of the Chinchilla scaling law that rely on Approach 3:
| 1. Their estimated model fits the reconstructed data very
| poorly. These conclusions hold even when accounting for
| potential noise in data reconstruction and excluding outlier
| models. 2. The confidence intervals are implausibly tight given
| the number of data points. Obtaining confidence intervals that
| tight would require many hundreds of thousands of observations,
| while they likely had only ~400. 3. Their estimated model
| implies a scaling policy that is inconsistent with their other
| approaches."
|
| The data point most people are probably looking for: "We find a
| range consistent with the 20 tokens per parameter rule of thumb.
| Indeed, our point estimates imply that 25.6 tokens per parameter
| is optimal."
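|
| For a sense of scale (back-of-envelope arithmetic using the
| standard C ~ 6*N*D approximation for training FLOPs; the 70B
| example is illustrative, not from the paper):
|
|       def chinchilla_budget(params, tokens_per_param=20.0):
|           # Compute-optimal token count for a given parameter
|           # count, plus the usual C ~ 6*N*D FLOPs estimate.
|           tokens = params * tokens_per_param
|           return tokens, 6 * params * tokens
|
|       tokens, flops = chinchilla_budget(70e9)  # a 70B model
|       print(f"{tokens:.1e} tokens, ~{flops:.1e} FLOPs")
|       # -> 1.4e+12 tokens, ~5.9e+23 FLOPs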
  | moffkalast wrote:
  | Their rule of thumb would imply that a 70B model is saturated
  | with 1.7T tokens; that's inconsistent with reality.
    | og_kalu wrote:
    | The Chinchilla laws were _compute optimal_ scaling laws.
    | They're not supposed to tell you what parameter-token
    | combination will saturate a model.
      | moffkalast wrote:
      | Compute optimal for what, training? There's nothing
      | optimal in blowing up model size beyond the absolute
      | minimum needed, or you'll spend the equivalent of a
      | country's electricity trying to scale inference later.
        | rfw300 wrote:
        | Yes, compute-optimal for training only. The purpose of
        | the paper wasn't to determine the most economically
        | practical model one could build, but the most
        | "intelligent" model one could build given some amount of
        | training compute.
          | ijk wrote:
          | Quite. The big question at the time was "how much data
          | do we need to train GPT-3-equivalent models?" Open
          | models had failed to live up to GPT performance, even
          | ones with a massive number of parameters. So getting
          | results that suggested a reason why other models were
          | massively undertrained was important.
          |
          | Meanwhile, people noticed that for deployed models,
          | inference cost often outweighs the initial training
          | cost. It's sometimes better to train a smaller, faster
          | model longer on more data, because it has lower
          | overall cost (including environmental impact) if
          | you're expecting to run the model a few million or
          | billion times (e.g., [1]). So training past the
          | Chinchilla-optimal point became a lot more common,
          | particularly after Llama.
          |
          | [1] https://arxiv.org/abs/2401.00448
          | FeepingCreature wrote:
          | Blow up model size, get lots of space and parameters
          | to do the double-descent grok thing in, then distill
          | it way way down?
        | og_kalu wrote:
        | Training, yes.
        |
        | Doubling your parameter count past that ratio will yield
        | a better model than doubling your data, and it is much
        | easier and cheaper to do.
          | naasking wrote:
          | That suggests it's likely memorizing more special
          | cases rather than distilling general principles. They
          | generalize to some degree, but clearly there's room
          | for improvement.
            | og_kalu wrote:
            | It doesn't really suggest anything. Neither model
            | will be even close to saturation, and, all else
            | equal, bigger models perform better in every way,
            | including generalization.
    | eldenring wrote:
    | No, their claim is that, for a fixed training compute
    | budget, there are diminishing returns to scaling up data
    | past that threshold vs. scaling up params.
    |
    | This doesn't take inference into account either, obviously.
| magnio wrote:
| > To extract the data from the figure, we first downloaded the
| PDF from Hoffmann et al.'s arXiv submission and saved it in SVG
| format. We then parsed the SVG content to navigate and search
| the SVG structure. Within the SVG, we identified the group of
| points representing the scatter plot data and iterated over
| each point to extract its fill color and position (x and y
| coordinates) using the attributes of the corresponding SVG
| elements.
|
| > To map the SVG coordinates to the model size and training
| FLOP values, we used the location of the labels or ticks on the
| respective axes. This allowed us to establish a correspondence
| between the SVG coordinates and the actual data values
| represented in the plot.
|
| They ... reconstructed the data ... from a plot ... using ruler
| and eyes? Why not just email the original authors for the raw
| data? I can't help but feel like it's @yuvaltheterrible
| debunking papers.
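|
| The pipeline they describe boils down to something like this
| sketch (hypothetical code using Python's standard xml.etree;
| real figures often bury points under <g> transforms, so treat
| this as the idea rather than a drop-in tool):
|
|       import math
|       import xml.etree.ElementTree as ET
|
|       SVG = "{http://www.w3.org/2000/svg}"
|
|       def scatter_points(svg_path):
|           # Collect position and fill color of every <circle>.
|           tree = ET.parse(svg_path)
|           return [(float(c.get("cx")), float(c.get("cy")),
|                    c.get("fill"))
|                   for c in tree.iter(SVG + "circle")]
|
|       def pixel_to_value(px, tick_a, tick_b):
|           # Map a pixel coordinate to a data value by linear
|           # interpolation between two known axis ticks, each
|           # given as (pixel, value). Done in log10 space, since
|           # the FLOP and parameter axes are logarithmic.
|           (pa, va), (pb, vb) = tick_a, tick_b
|           la, lb = math.log10(va), math.log10(vb)
|           return 10 ** (la + (px - pa) * (lb - la) / (pb - pa))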
  | mxwsn wrote:
  | Funnily enough, I've done this for a paper I wrote as well.
  | Emailing authors is kind of a crapshoot: it's normal to get
  | no response if it's been several years since the paper came
  | out. In this case, a PDF plot is essentially lossless, and
  | it's much faster than waiting for authors to maybe respond.
    | V1ndaar wrote:
    | And not only that: in many cases they will tell you (if
    | they reply) "oh, we can't find the source of that plot
    | anymore". Happened to me quite a few times (although in
    | physics).
    |
    | I'm pretty sure I'm not the only one who's written
    | themselves a mini tool to extract data even from a bitmap
    | plot based on the axes. It involves some manual steps
    | (cropping, mainly), but it is very convenient for the cases
    | where people don't even use vector graphics, but sometimes
    | just screenshots of plots... Do I like it? Hell no! It's
    | why I've put quite some effort into doing it better for my
    | PhD thesis.
      | godelski wrote:
      | Yeah, it's very annoying, especially these days when
      | there's no real excuse not to have a copy. You can easily
      | store all code and data for free and in an accessible
      | manner. Even just GitHub is good enough for 90+% of
      | cases. Hugging Face helps, and there are many other ways
      | too.
      |
      | I remember in my first year of grad school I was trying
      | to replicate a work by a very prestigious university. It
      | definitely wasn't reproducible from the text, but I did
      | my best. I couldn't get close to their claims, so I
      | emailed the lead author (another grad student). No
      | response. Luckily my advisor knew their advisor. Got a
      | meeting, and then I got sent code. It was nothing like
      | what they claimed in the paper, so I have no idea what
      | they gave me. Anyway, my paper never got published
      | because I couldn't beat them. It is what it is.
  | Ajoo wrote:
  | They said in one of the replies that they did ask, several
  | times.
  | polygamous_bat wrote:
  | > Why not just email the original authors for the raw data?
  |
  | Industry research labs, especially Google DeepMind, are
  | notoriously closed up about their "proprietary" data. I've
  | hit this wall multiple times in my own work in AI.
    | sp332 wrote:
    | https://twitter.com/borgeaud_s/status/1780988694163321250
    | says they're going to open the data from the paper. Not
    | sure why they didn't do it before, but good news.
  | acc_297 wrote:
  | In fairness, they did not use a ruler or eyes. Based on the
  | excerpts you quote, they extracted exact coordinates of the
  | data from an SVG, which, if the SVG was created correctly,
  | should at least give an unbiased dataset, perhaps with less
  | precision than the source.
  | levocardia wrote:
  | I do that all the time using WebPlotDigitizer [1]. Works
  | great.
  |
  | [1] https://apps.automeris.io/wpd/
    | dynm wrote:
    | Seconded. When I first saw this, I thought it looked
    | unintuitive and difficult to use, but when I tried it, it
    | was very easy and I had the extracted data in a few
    | minutes.
  | williamdclt wrote:
  | I particularly like the second quote. I appreciate them
  | taking the time to explain "what is a graph" in a scientific
  | paper!
  | ege_erdil wrote:
  | we did, and gave them a two-week grace period to respond, but
  | they only responded to us after we published on arxiv
  |
  | also, we didn't reconstruct the data using a ruler; you can
  | automate that entire process so that it's much more reliable
  | than that
| cgearhart wrote:
| TL;DR: they couldn't exactly replicate the original results,
| but broadly confirmed the findings. They agree that the optimal
| range is 5-40 tokens per parameter, and close to 20 for the
| "Chinchilla" model from the original paper.
|
| Very unusual choice to reconstruct the dataset by eyeballing
| the graph in the source paper (why not just ask for it...?),
| and it's not really clear why the result is dressed up behind
| the salacious-seeming abstract.
  | ege_erdil wrote:
  | we didn't eyeball the graph; there are more accurate ways of
  | extracting the data from a pdf file than that
  |
  | we did ask for the data but got no response until we
  | published on arxiv
  |
  | what is supposed to be "salacious" about the abstract?
| warbaker wrote:
| Calling this a "replication attempt" implied to me that they
| tried to replicate the Chinchilla Scaling paper and found that
| it did not replicate, which would be a very big deal!
|
| Instead, they just redid the analysis based on a figure in the
| paper and found that the old model with slightly different
| parameters gave a better fit to the data. This is a valuable
| contribution, but it's a bit overstated by the paper title, and
| the confrontational, "gotcha" tone of the paper is unwarranted.
|
| A better framing would have been something like "Chinchilla
| Scaling: Reanalyzed".
  | ege_erdil wrote:
  | one of their three approaches does not replicate, and it's
  | because of a software bug in the optimizer they used; i don't
  | know what else we were supposed to say
| gwern wrote:
| The original Chinchilla authors have now identified the bug,
| apparently:
| https://twitter.com/borgeaud_s/status/1780988694163321250
___________________________________________________________________
(page generated 2024-04-18 23:00 UTC)