[HN Gopher] 100 Pages of raw notes released with the language mo...
___________________________________________________________________
 100 Pages of raw notes released with the language model OPT-175

 Author : mfiguiere
 Score  : 66 points
 Date   : 2022-05-04 14:10 UTC (8 hours ago)

 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)

| flakiness wrote:
| From the note: > "AKA: Help! I'm oncall, it's 3am, and everything
| is on fire!"
|
| I didn't think ML model training would ever need an on-call
| rotation, especially for research-oriented runs like this one. But
| apparently it's a thing. So is this what MLOps is about?
| Nzen wrote:
| This Twitter post points at the pdf rendering [0] of the communal
| logbook that Facebook researchers kept while training OPT-175.
|
| [0] https://github.com/facebookresearch/metaseq/tree/main/projec...
| humanistbot wrote:
| SemanticStrengh wrote:
| Did they leverage DeepSpeed? Also, where are the accuracy results
| vs. popular datasets?
| tomcam wrote:
| Mad props to Meta for releasing these raw notes. Love seeing into
| their work process as well.
| learndeeply wrote:
| Skimming through this, a lot of it has to do with bad GPU hosts.
|
| > CSP fat fingered and deleted our entire cluster when trying to
| replenish our buffer nodes.
|
| Ouch.
| ensan wrote:
| "The paper mentions 35 (!) manual restarts to train OPT-175B due
| to hardware failure (and 70+ automatic restarts)."
|
| https://twitter.com/awnihannun/status/1521572873449533440
| Ameo wrote:
| Wow, they hot-swapped activation functions (GELU -> ReLU) during
| training. They are indeed very similar activation functions, but
| it's kind of crazy to me that you can make that sort of change to
| a model while it's training, preserve all weights and other state,
| and just keep going. They changed weight clipping thresholds on
| the fly too.
|
| They also swapped out the optimizer several times from what I can
| tell, switching between Adam, "Fake SGD", and "Vanilla SGD"
| multiple times.
|
| Even without the huge number of hardware/driver issues they seemed
| to be having with the GPUs in their big training cluster(s), this
| puts into perspective how hard it is to train enormous models like
| this. Many of the failures don't have an immediately obvious
| cause. Plus, there aren't all that many places out there doing
| training at this scale, so I imagine many of these things have to
| be figured out on their own.
| ackbar03 wrote:
| I'm surprised by how hacky the whole process is and how it's
| mostly just about tuning different hyperparameters.
| sbierwagen wrote:
| Welcome to ML.
| daenz wrote:
| Can you say more about why you see the process as hacky?
| ackbar03 wrote:
| I've started reading from the bottom and haven't read the whole
| thing yet. But their default action, as stated in their log, when
| facing exploding gradients or unstable training is to just roll
| back to a checkpoint and lower the learning rate. Other proposed
| actions, such as clamping activations, are also pretty standard
| things to try.
|
| I guess since their goal is just to end up with a trained model,
| it doesn't really matter. But it doesn't seem to be an easily
| reproducible process, and like I said, a bit hacky in my opinion.
| gnulinux wrote:
| They hot-swapped all kinds of model hyperparameters, such as
| changing the activation function and the optimizer. It doesn't
| look like there was a principled reason why they kept switching
| optimizer or activation function. Maybe as they were training the
| model their data scientists kept finding ways to improve it? Not
| sure, but it looks extremely hacky to me. Not something a team ran
| one day and forgot about until it finished training.
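A note on why that particular hot swap is mechanically possible: GELU
and ReLU carry no learnable parameters, so replacing one module with
the other leaves every weight, and the optimizer's state, untouched.
The sketch below is illustrative PyTorch, not the actual metaseq code;
the Block module and the swap loop are assumptions made for the
example.

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        """Toy feed-forward block standing in for a transformer MLP."""
        def __init__(self, dim, act=nn.GELU):
            super().__init__()
            self.fc1 = nn.Linear(dim, 4 * dim)
            self.act = act()                   # parameter-free module
            self.fc2 = nn.Linear(4 * dim, dim)

        def forward(self, x):
            return self.fc2(self.act(self.fc1(x)))

    model = Block(dim=512)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    # ... training runs for a while, then a restart switches GELU -> ReLU.
    # fc1/fc2 weights and Adam's moment buffers are untouched because the
    # activation owns no parameters.
    for module in model.modules():
        if isinstance(module, Block):
            module.act = nn.ReLU()

    # Training simply continues from the same weights and optimizer state.
    x = torch.randn(8, 512)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()

The optimizer swaps mentioned in the same comment are less free: a
fresh SGD or Adam instance starts with empty moment buffers, so that
change does reset part of the training state even though the weights
carry over.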
| joshvm wrote:
| Not sure if your comment is meant as a disagreement or a question.
|
| Generally the way hyperparameters are adjusted is some mix of
| intuition/experience and random/grid searching. Plus, most people
| don't have the resources/infra to do a large-scale grid search on
| a model that might take a day or more to train. It's somewhat
| principled, but often a random search is just as good as fiddling
| with numbers by hand, and often you have to figure out why
| something worked post hoc. You also accept that you might never
| have a good explanation - for all you know it's dataset dependent
| - and trust that your results are good enough to convince peer
| review (and you can show that this other parameter set was worse,
| so you didn't use it). It's hacky in the sense that a lot of the
| work in getting to state of the art (moving the needle on a
| benchmark by less than 1%) involves playing with the numbers until
| you get the best results. For example, here the engineers modify
| the learning rate between various runs. I don't think they really
| had any theoretical reason behind the step changes apart from
| "this will probably work better because we've seen that effect
| when training similar-sized models".
|
| Adjusting the learning rate schedule is one of the simplest knobs
| to tweak. When you're working with huge models, you generally want
| to use as big a batch size as you can get away with to reduce
| training time. A bit counter to the earlier thinking, where LeCun
| said something like "friends don't let friends use batch sizes >
| 32".
|
| There may be some guided methods, like exploring the parameter
| space in a Bayesian way (e.g. trying to efficiently explore which
| knobs make the most difference).
| ackbar03 wrote:
| They seem to be adjusting the lr between epochs as well when the
| loss explodes, not just between runs. But I haven't read through
| the whole thing yet; maybe they trained the whole thing properly
| from start to finish at the end. Otherwise that would be extremely
| hacky and irreproducible.
| dotnet00 wrote:
| Yeah, I think for now they were just trying to get any comparable
| results due to a near-complete lack of details on GPT-3. They
| seemed to have a hard deadline for the task.
| lumost wrote:
| The time and expense of training a model at this size doesn't
| leave much room for trial and error. It's simply impractical to
| iteratively try ~20 different learning schedules.
|
| It's hideously inefficient and hacky to have someone manually
| tweaking things, but not terribly different from the state of the
| art for scientific research. As long as they state the objectives
| of their manual control and produce a log of what they did,
| someone else could _try_ to replicate it.
| mhh__ wrote:
| Currently I think we are still gluing transistors (networks)
| together (spiritually), like in the very early days of the modern
| computer; it is hacky.
| dotnet00 wrote:
| It's pretty reassuring to see that constantly fiddling with the
| model and trying to adjust learning rates on the fly is also
| normal at leading research labs. Although, on the other hand, it
| only makes the replication crisis even worse.
|
| After a quick look through, I really hope releasing raw notes like
| this becomes more of a trend!
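The "roll back to a checkpoint and lower the learning rate" recipe
that comes up repeatedly above is simple to automate. A hedged
sketch, again in PyTorch; the thresholds, the checkpoint path, and
the helper names are illustrative assumptions, not taken from the OPT
logbook or metaseq.

    import torch

    CKPT_PATH = "last_good.pt"   # hypothetical checkpoint file
    LR_SHRINK = 0.5              # halve the learning rate on each rollback

    def save_checkpoint(model, opt, step):
        """Persist model and optimizer state as the last known-good point."""
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, CKPT_PATH)

    def looks_divergent(loss, grad_norm, loss_limit=10.0, grad_limit=1e3):
        """Crude instability check: NaN/inf or a blown-up loss/grad norm."""
        return (not torch.isfinite(loss).item()) or \
               loss.item() > loss_limit or grad_norm > grad_limit

    def rollback_and_lower_lr(model, opt):
        """Restore the last good checkpoint and shrink the learning rate."""
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        for group in opt.param_groups:
            group["lr"] *= LR_SHRINK
        return state["step"]

In a training loop the divergence check would run after every step;
when it fires, the run resumes from the last good checkpoint with a
smaller learning rate, which is essentially the manual procedure the
logbook records, minus the 3am human.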
___________________________________________________________________
(page generated 2022-05-04 23:01 UTC)