[HN Gopher] Chinchilla Scaling: A replication attempt
       ___________________________________________________________________
        
       Chinchilla Scaling: A replication attempt
        
       Author : tosh
       Score  : 87 points
       Date   : 2024-04-18 15:05 UTC (7 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | cs702 wrote:
       | Interesting! If the authors are right, it seems that the number
       | of training tokens required per parameter (slowly) _declines_ as
       | models become larger (Figure 5).
       | 
       | That's good news. I think it deserves wider dissemination, so I'm
       | upvoting your post.
       | 
       | Thank you for sharing this on HN!
        
         | dzdt wrote:
          | Could it be that the independence of available training
          | points declines as the dataset size grows? At some point it
          | becomes hard to add data that isn't essentially similar to
          | something you've already added.
        
           | cs702 wrote:
           | Yes, could be. Not sure how or even if anyone could prove it,
           | though.
        
             | sebzim4500 wrote:
              | I guess you could artificially limit the training data
              | (e.g. by removing languages or categories) and see if the
              | utility of extra tokens drops off as a result.
        
             | godelski wrote:
              | This should be fairly de facto true. Remember that your
              | dataset is a proxy for some real (but almost surely
              | intractable) distribution.
              | 
              | Now think about filling the space with p-balls bounded by
              | the nearest points, so that no data point lies inside a
              | ball. We've turned this into a sphere-packing problem and
              | can talk about the sizes and volumes of those spheres.
              | 
              | If we fill the real distribution with data uniformly, the
              | average volume of those spheres decreases. If we fill it
              | non-uniformly, the average ball still shrinks, but the
              | largest ball shrinks more slowly (that's the case where we
              | aren't properly covering some region). Either way, the
              | more data you add, the more the balls shrink, which means
              | the difference between data points decreases. The harder
              | question concerns the under-represented regions: finding
              | them and figuring out how to sample them properly.
              | 
              | Another quick way to convince yourself is to think about
              | basis vectors (this won't be robust, but it's a good
              | starting point). In high dimensions, two randomly sampled
              | vectors are almost certainly nearly orthogonal. So think
              | of drawing basis vectors (independent vectors that span
              | our space): as we fill in data, the first vectors (or data
              | points) are very likely to be independent in some way, but
              | as we add more, the likelihood that a new one is
              | orthogonal to the rest decreases. Of course your basis
              | vectors don't need to be orthogonal, but that's mostly
              | semantics, because we can always work in a space where
              | that's true.
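              | 
              | A quick numpy sketch of that concentration effect (sample
              | counts and dimensions are arbitrary, just for
              | illustration):
              | 
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(0)
              |     for dim in (2, 10, 100, 1000, 10000):
              |         # Cosine similarity of random Gaussian vector pairs
              |         # concentrates near 0 as the dimension grows,
              |         # i.e. they become nearly orthogonal.
              |         a = rng.standard_normal((1000, dim))
              |         b = rng.standard_normal((1000, dim))
              |         cos = (a * b).sum(1) / (
              |             np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
              |         print(f"dim={dim:6d}  mean |cos| = {abs(cos).mean():.3f}")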
        
         | Kronopath wrote:
          | This is not good news: it means we could end up with a
          | dangerously superintelligent AI just by scaling up the number
          | of parameters, without increasing the amount of training
          | data.
        
           | exe34 wrote:
           | Like a corporation then. We should ban them until we can
           | figure out how to align them!
        
             | tehsauce wrote:
             | ASI is nothing like a corporation
        
               | wizzwizz4 wrote:
                | No, they're not. Corporations have known, concrete
                | impacts on the world, whereas the dangers of AI are, so
                | far, the dangers of corporations. ASIs are (as yet)
                | fictional.
               | 
               | Another difference: most corporations will avoid doing
               | illegal stuff if the penalties are large enough: the
               | corporation alignment problem is political. Pretty much
               | no extant AI systems can be instructed in this way: we
               | don't know how to align AIs even in theory.
        
               | TeMPOraL wrote:
                | It is very much like a corporation; a corp is
                | effectively an AGI, just running very slowly, at the
                | speed of bureaucracy.
        
           | kelseyfrog wrote:
            | No, but LLMs require orders of magnitude more language
            | input than humans[1]. It's very reasonable to assume that
            | architectural differences (size among them) are the more
            | likely constraint on performance.
            | 
            | 1. Specifically, larger than the upper bound on _lifetime_
            | language input for humans, even assuming 24/7 reading at
            | max speed.
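            | 
            | Roughly, assuming ~250 words/minute, ~1.3 tokens per word,
            | and 80 years of nonstop reading (all of these are generous
            | round numbers, not measurements):
            | 
            |     # Upper bound on lifetime human language input.
            |     wpm, tokens_per_word, years = 250, 1.3, 80
            |     minutes = years * 365 * 24 * 60
            |     lifetime_tokens = wpm * tokens_per_word * minutes
            |     print(f"~{lifetime_tokens / 1e9:.0f}B tokens")  # ~14B
            |     # versus the trillions (10^12-10^13) of tokens used to
            |     # pretrain current LLMs: a gap of roughly two to three
            |     # orders of magnitude.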
        
             | HeatrayEnjoyer wrote:
             | Do they? What is the total size of all visual, audio,
             | touch, locomotive, scent, and taste data collected between
             | birth and when a human reaches IQ 100? There are multiple
             | high-bandwidth feeds running into the brain 24/7.
        
               | cubefox wrote:
               | > language input
        
             | p1esk wrote:
             | How much language input does a human need to become
             | intelligent if he doesn't receive any other input?
        
             | TeMPOraL wrote:
              | Yes, but LLMs come out of training as experts in
              | approximately any single thing you can think of, and then
              | some, and all that in dozens of languages. Humans don't
              | achieve even a fraction of this kind of breadth.
        
               | godelski wrote:
                | This is not quite accurate, but it's complicated,
                | because measurement is hard. The things they are tested
                | on are almost surely in the training data. Take the bar
                | exam, for instance. Sure, we don't know what's in GPT's
                | data, but we know it includes Reddit, and we know
                | Reddit has many similar if not exact questions on it.
                | We also know the first GPT-4 report did not do good
                | semantic similarity matching for contamination: they
                | just matched three random 50-character substrings
                | (Appendix C), and they only considered the
                | false-positive side. Then there's this line:
                | 
                |     The RLHF post-training dataset is vastly smaller
                |     than the pretraining set and unlikely to have any
                |     particular question contaminated. However we did
                |     not check explicitly.
                | 
                | But my favorite is HumanEval. I'll just remind everyone
                | that this was written by 60 authors, mostly from
                | OpenAI:
                | 
                |     We evaluate functional correctness on a set of 164
                |     handwritten programming problems, which we call the
                |     HumanEval dataset. ... It is important for these
                |     tasks to be hand-written, since our models are
                |     trained on a large fraction of GitHub, which
                |     already contains solutions to problems from a
                |     variety of sources.
                | 
                | The problems? Well, they're leetcode style... Can you
                | really write leetcode-style questions that aren't
                | already on GitHub? For example:
                | 
                |     # HumanEval/2
                |     def truncate_number(number: float) -> float:
                |         """Given a positive floating point number, it
                |         can be decomposed into an integer part (largest
                |         integer smaller than given number) and decimals
                |         (leftover part always smaller than 1).
                |         Return the decimal part of the number.
                |         >>> truncate_number(3.5)
                |         0.5
                |         """
                |         # Canonical solution:
                |         return number % 1.0
                | 
                |     # HumanEval/4
                |     from typing import List
                | 
                |     def mean_absolute_deviation(
                |             numbers: List[float]) -> float:
                |         """For a given list of input numbers, calculate
                |         Mean Absolute Deviation around the mean of this
                |         dataset. Mean Absolute Deviation is the average
                |         absolute difference between each element and a
                |         centerpoint (mean in this case):
                |         MAD = average | x - x_mean |
                |         >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
                |         1.0
                |         """
                |         # Canonical solution:
                |         mean = sum(numbers) / len(numbers)
                |         return sum(abs(x - mean)
                |                    for x in numbers) / len(numbers)
                | 
                | Do you really want to bet that these aren't on GitHub?
                | Because I'll bet you any dollar amount you want that
                | near-exact solutions were on GitHub prior to the cutoff
                | date (don't trust me, you can find them too; they're
                | even searchable). Hell, I've poisoned the dataset here!
                | 
                | LLMs are (lossy) compression systems, so they're great
                | at information retrieval, and a lot of what we consider
                | intelligence (and possibly even creativity) is based on
                | information retrieval. That doesn't make these systems
                | any less impressive, but it is a note on how we should
                | interpret results and understand the limitations of our
                | tools. Measuring intelligence is really hard, the term
                | isn't universally agreed upon, so people often talk
                | past one another, and some conflate different things as
                | if they were the same.
        
       | newfocogi wrote:
        | Key claims:
        | 
        | "We have found three potential issues with Hoffmann et al.'s
        | estimates of the Chinchilla scaling law that rely on Approach
        | 3:
        | 
        | 1. Their estimated model fits the reconstructed data very
        | poorly. These conclusions hold even when accounting for
        | potential noise in data reconstruction and excluding outlier
        | models.
        | 
        | 2. The confidence intervals are implausibly tight given the
        | number of data points. Obtaining confidence intervals that
        | tight would require many hundreds of thousands of
        | observations, while they likely had only ~400.
        | 
        | 3. Their estimated model implies a scaling policy that is
        | inconsistent with their other approach."
        | 
        | Data point most people are probably looking for: "We find a
        | range consistent with the 20 tokens per parameter rule of
        | thumb. Indeed, our point estimates imply that 25.6 tokens per
        | parameter is optimal."
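        | 
        | As a rough illustration of what that ratio implies (the 6*N*D
        | training-FLOPs rule and the model sizes below are the usual
        | back-of-the-envelope assumptions, not numbers from the paper):
        | 
        |     RATIO = 25.6  # tokens per parameter (point estimate above)
        |     for n_params in (1e9, 7e9, 70e9):
        |         d_tokens = RATIO * n_params      # "optimal" data budget
        |         flops = 6 * n_params * d_tokens  # ~6ND heuristic
        |         print(f"{n_params/1e9:5.0f}B params -> "
        |               f"{d_tokens/1e12:5.2f}T tokens, ~{flops:.1e} FLOPs")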
        
         | moffkalast wrote:
          | Their rule of thumb would imply that a 70B model is
          | saturated at 1.7T tokens, which is inconsistent with reality.
        
           | og_kalu wrote:
            | The Chinchilla laws were _compute optimal_ scaling laws.
            | They're not supposed to tell you what parameter-token
            | combination will saturate a model.
        
             | moffkalast wrote:
              | Compute optimal for what, training? There's nothing
              | optimal in blowing up model size beyond the absolute
              | minimum needed, or you'll spend a country's worth of
              | electricity trying to scale inference later.
        
               | rfw300 wrote:
                | Yes, compute-optimal for training only. The purpose of
                | the paper wasn't to determine the most economically
                | practical model one could build, but the most
                | "intelligent" model one could build given some amount
                | of compute.
        
               | ijk wrote:
               | Quite. The big question at the time was "how much data do
               | we need to train GPT-3 equivalent models". Open models
               | had failed to live up to GPT performance, even ones with
               | a massive number of parameters. So getting results that
               | suggested a reason why other models were massively
               | undertrained was important.
               | 
               | Meanwhile, people noticed that for deployed models,
               | inference cost often outweighs the initial training
               | costs. It's sometimes better to train a smaller, faster
               | model longer on more data, because it has lower overall
               | cost (including environmental impact) if you're expecting
               | to run the model a few million or billion times (e.g.,
               | [1]). So training past the Chinchilla optimum point
               | became a lot more common, particularly after Llama.
               | 
               | [1] https://arxiv.org/abs/2401.00448
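                | 
                | A toy comparison of that tradeoff (the 6ND training and
                | 2ND inference FLOP approximations and all the numbers
                | here are illustrative assumptions, not figures from
                | [1]):
                | 
                |     def total_flops(n_params, train_tok, served_tok):
                |         # ~6ND to train, ~2ND to serve each token
                |         return (6 * n_params * train_tok
                |                 + 2 * n_params * served_tok)
                | 
                |     served = 2e12  # tokens served over deployment
                |     big   = total_flops(70e9, 1.4e12, served)
                |     small = total_flops(13e9, 8.0e12, served)
                |     print(f"70B, 1.4T train tokens: {big:.2e}")
                |     print(f"13B, 8.0T train tokens: {small:.2e}")
                |     # The smaller, longer-trained model ends up
                |     # cheaper overall once inference is included.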
        
               | FeepingCreature wrote:
               | Blow up model size, get lots of space and parameters to
               | do the double-descent grok thing in, then distill it way
               | way down?
        
               | og_kalu wrote:
               | Training yes.
               | 
               | Doubling your parameter count past that ratio will yield
               | a better model than doubling your data and is much easier
               | and cheaper to do.
        
               | naasking wrote:
               | That suggests that it's likely memorizing more special
               | cases rather than distilling general principles. They
               | generalize to some degree but clearly there's room for
               | improvement.
        
               | og_kalu wrote:
                | It doesn't really suggest anything. Neither model will
                | be anywhere close to saturation, and all else equal,
                | bigger models perform better in every way, including
                | generalization.
        
           | eldenring wrote:
            | No, their claim is that, for a fixed training compute
            | budget, there are diminishing returns to scaling up data
            | past that threshold versus scaling up parameters.
            | 
            | This doesn't take inference into account either, obviously.
        
       | magnio wrote:
       | > To extract the data from the figure, we first downloaded the
       | PDF from Hoffmann et al.'s arXiv submission and saved it in SVG
       | format. We then parsed the SVG content to navigate and search the
       | SVG structure. Within the SVG, we identified the group of points
       | representing the scatter plot data and iterated over each point
       | to extract its fill color and position (x and y coordinates)
       | using the attributes of the corresponding SVG elements.
       | 
       | > To map the SVG coordinates to the model size and training FLOP
       | values, we used the location of the labels or ticks on the
       | respective axes. This allowed us to establish a correspondence
       | between the SVG coordinates and the actual data values
       | represented in the plot.
       | 
        | They ... reconstructed the data ... from a plot ... using a
        | ruler and their eyes? Why not just email the original authors
        | for the raw data? I can't help but feel like this is
        | @yuvaltheterrible debunking papers.
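        | 
        | For what it's worth, the procedure they describe can be done
        | with the standard library alone; a minimal sketch (the element
        | names, attributes, and tick calibration below are hypothetical
        | placeholders, not their code):
        | 
        |     import xml.etree.ElementTree as ET
        | 
        |     SVG = "{http://www.w3.org/2000/svg}"
        | 
        |     def linmap(v, s0, s1, d0, d1):
        |         # Map an SVG coordinate onto a data coordinate.
        |         return d0 + (v - s0) * (d1 - d0) / (s1 - s0)
        | 
        |     tree = ET.parse("figure.svg")
        |     points = []
        |     for c in tree.iter(SVG + "circle"):  # scatter markers
        |         x, y = float(c.get("cx")), float(c.get("cy"))
        |         # Calibrate with two known axis ticks, e.g. x=100px at
        |         # 1e18 FLOP and x=500px at 1e22 FLOP on a log axis.
        |         flop = 10 ** linmap(x, 100.0, 500.0, 18.0, 22.0)
        |         points.append((c.get("fill"), flop, y))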
        
         | mxwsn wrote:
         | Funnily enough, I've done this for a paper I wrote as well.
         | Emailing authors is kind of a crapshoot. It's normal to get no
         | response if it's been several years since the paper came out.
         | In this case, a pdf plot is essentially lossless, and it's much
         | faster than waiting for authors to maybe respond.
        
           | V1ndaar wrote:
           | And not only that, in many cases they will tell you (if they
           | reply) "oh, we can't find the source of that plot anymore".
           | Happened to me quite a few times (although in physics).
           | 
            | I'm pretty sure I'm not the only one who's written
            | themselves a mini tool to extract data even from a bitmap
            | plot based on the axes. It involves some manual steps
            | (cropping, mainly), but it's very convenient for the cases
            | where people don't even use vector graphics, and sometimes
            | just post screenshots of plots... Do I like it? Hell no!
            | That's why I've put quite some effort into doing it better
            | for my PhD thesis.
        
             | godelski wrote:
              | Yeah, it's very annoying, especially these days when
              | there's no real excuse not to keep a copy. You can easily
              | store all code and data for free and in an accessible
              | manner. Even just GitHub is good enough for 90+% of
              | cases. Hugging Face helps, and there are many other ways
              | too.
              | 
              | I remember my first year in grad school I was trying to
              | replicate a work by a very prestigious university. It
              | definitely wasn't reproducible from the text, but I did
              | my best. I couldn't get close to their claims, so I
              | emailed the lead author (another grad student). No
              | response. Luckily my advisor knew their advisor. Got a
              | meeting, and then I got sent code. It was nothing like
              | what they claimed in the paper, so I have no idea what
              | they gave me. Anyway, my paper never got published
              | because I couldn't beat them. It is what it is.
        
         | Ajoo wrote:
         | They claimed that they did ask several times in one of the
         | replies.
        
         | polygamous_bat wrote:
         | > Why not just emailed the original authors for the raw data?
         | 
         | Industry research labs, especially Google deepmind, are
         | notoriously closed up about their "proprietary" data. I've hit
         | this wall multiple times in my own work in AI.
        
           | sp332 wrote:
           | https://twitter.com/borgeaud_s/status/1780988694163321250
           | says they're going to open the data from the paper. Not sure
           | why they didn't do it before, but good news.
        
         | acc_297 wrote:
          | In fairness, they did not use a ruler or eyes. Based on the
          | excerpts you quote, they extracted exact coordinates of the
          | data from an SVG, which, if the SVG was created correctly,
          | should at least give an unbiased dataset, maybe with less
          | precision than the source.
        
         | levocardia wrote:
         | I do that all the time using WebPlotDigitizer [1]. Works great.
         | 
         | [1] https://apps.automeris.io/wpd/
        
           | dynm wrote:
           | Seconded. When I first saw this, I thought it looked
           | unintuitive and difficult to use, but when I tried it, it was
           | very easy and I had the extracted data in a few minutes.
        
         | williamdclt wrote:
         | I particularly like this second quote, I appreciate them taking
         | the time to explain "what is a graph" in a scientific paper!
        
         | ege_erdil wrote:
         | we did and gave them a two week grace period to respond, but
         | they only responded to us after we published on arxiv
         | 
         | also, we didn't reconstruct the data using a ruler, you can
         | automate that entire process so that it's much more reliable
         | than that
        
       | cgearhart wrote:
       | TL;DR--couldn't exactly replicate their results, but broadly
       | confirmed their findings. They agree that the optimal range is
       | 5-40 tokens per parameter, and close to 20 for the "chinchilla"
       | model from the original paper.
       | 
       | Very unusual choice to reconstruct the dataset by eyeballing the
       | graph in the source paper (why not just ask for it...?) and it's
       | not really clear why the result is dressed up behind the
       | salacious-seeming abstract.
        
         | ege_erdil wrote:
         | we didn't eyeball the graph, there are more accurate ways of
         | extracting the data from a pdf file than that
         | 
         | we did ask for the data but got no response until we published
         | on arxiv
         | 
         | what is supposed to be "salacious" about the abstract?
        
       | warbaker wrote:
       | Calling this a "replication attempt" implied to me that they
       | tried to replicate the Chinchilla Scaling paper and found that it
       | did not replicate, which would be a very big deal!
       | 
        | Instead, they just redid the analysis based on a figure in the
        | paper and found that the old model, with slightly different
        | parameters, gave a better fit to the data. This is a valuable
        | contribution, but it's a bit overstated by the paper's title,
        | and the confrontational, "gotcha" tone of the paper is
        | unwarranted.
       | 
       | A better framing would have been something like "Chinchilla
       | Scaling: Reanalyzed".
        
         | ege_erdil wrote:
         | one of their three approaches does not replicate and it's
         | because of a software bug in the optimizer they used, i don't
         | know what else we were supposed to say
        
       | gwern wrote:
       | The original Chinchilla authors have now identified the original
       | bug, apparently:
       | https://twitter.com/borgeaud_s/status/1780988694163321250
        
       ___________________________________________________________________
       (page generated 2024-04-18 23:00 UTC)