       | flakiness wrote:
       | From the note: > "AKA: Help! I'm oncall, it's 3am, and everything
       | is on fire!"
       | I didn't think ML model training ever needed on-call, especially
       | for research-oriented models like this one. But apparently it's a
       | thing. So is this what MLOps is about?
       | Nzen wrote:
       | This Twitter post points at the PDF rendering [0] of the communal
       | log book that Facebook researchers kept while training OPT-175B.
       | [0]
       | https://github.com/facebookresearch/metaseq/tree/main/projec...
       | SemanticStrengh wrote:
       | Did they leverage deepspeed? Also where are the accuracy results
       | vs popular datasets?
       | tomcam wrote:
       | MAD props to Meta for releasing these raw notes. Love seeing
       | into their work process as well.
       | learndeeply wrote:
       | Skimming through this, a lot of it has to do with bad GPU
       | hosts.
       | > CSP fat fingered and deleted our entire cluster when trying to
       | replenish our buffer nodes.
       | Ouch.
       | ensan wrote:
       | "The paper mentions 35 (!) manual restarts to train OPT-175B due
       | to hardware failure (and 70+ automatic restarts)."
       | https://twitter.com/awnihannun/status/1521572873449533440
       | Ameo wrote:
       | Wow, they hot-swapped activation functions (GELU -> RELU) during
       | training. They are indeed very similar activation functions, but
       | it's kinda crazy to me that you can make that kind of a change to
       | a model while it's training, preserving all weights and other
       | state, and just keep going. They changed weight clipping
       | thresholds on the fly too.
       | They also swapped out the optimizer several times from what I can
       | tell, switching between Adam, "Fake SGD", and "Vanilla SGD"
       | multiple times.
       | Even without the huge amounts of hardware/driver issues they
       | seemed to be having with the GPUs in their big training
       | cluster(s), this puts into perspective how hard it is to train
       | enormous models like this. Many of the failures don't have an
       | immediately obvious cause. Plus, there aren't all that many
       | places out there doing training at this scale so I imagine many
       | of these things need to get figured out on their own.
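A rough sketch of why swapping GELU for ReLU mid-training can work at all: neither activation has learnable parameters, so the swap leaves every weight untouched, and the two functions differ by less than 0.2 at any input. The `TinyMLP` class and its weights below are hypothetical toy stand-ins, not code from metaseq or the OPT logbook.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


class TinyMLP:
    """Hypothetical two-layer network; the activation is a plain attribute."""

    def __init__(self, w1, w2, activation):
        self.w1 = w1                  # one row of input weights per hidden unit
        self.w2 = w2                  # one output weight per hidden unit
        self.activation = activation  # holds no parameters of its own

    def forward(self, x):
        hidden = [
            self.activation(sum(w * xi for w, xi in zip(row, x)))
            for row in self.w1
        ]
        return sum(w * h for w, h in zip(self.w2, hidden))


model = TinyMLP(w1=[[0.5, -1.0], [1.5, 0.25]], w2=[1.0, -2.0], activation=gelu)
out_gelu = model.forward([1.0, 2.0])

# The "hot swap": only the function changes, the weights stay as they were.
model.activation = relu
out_relu = model.forward([1.0, 2.0])

# Largest pointwise gap between the two activations on a grid over [-5, 5].
max_gap = max(abs(gelu(i / 10.0) - relu(i / 10.0)) for i in range(-50, 51))
```

The outputs do shift a little after the swap, which is presumably why a change like this still has to be watched in the loss curve rather than assumed harmless.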
       | ackbar03 wrote:
       | I'm surprised by how hacky the whole process is and how it's
       | mostly just about tuning different hyperparameters
         | sbierwagen wrote:
         | Welcome to ML.
         | daenz wrote:
         | Can you say more about why you are seeing the process as hacky?
           | ackbar03 wrote:
           | I've started reading from bottom and haven't read the whole
           | thing yet. But their default action as stated in their log
           | when facing exploding gradients or unstable training is to just
           | roll back a checkpoint and lower lr. Other proposed actions
           | such as clamping activations are also just pretty standard
           | things to try.
           | I guess since their goal is to just be able to have a trained
           | model it doesn't really matter. But it doesn't seem to be an
           | easily reproducible process, and like I said, a bit hacky in
           | my opinion.
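For readers unfamiliar with the "roll back a checkpoint and lower lr" routine, here is a minimal sketch on a toy quadratic loss instead of a real model. All names (`loss_and_grad`, `train_with_rollback`, `spike_factor`, `ckpt_every`) are hypothetical, and the spike test and the halving factor are assumptions picked for illustration, not values from the OPT logbook.

```python
def loss_and_grad(w, curvature=10.0):
    """Toy loss 0.5 * c * w^2 and its gradient c * w."""
    return 0.5 * curvature * w * w, curvature * w


def train_with_rollback(w0=1.0, lr=0.25, steps=50, ckpt_every=5,
                        spike_factor=2.0, lr_decay=0.5):
    state = {"w": w0, "lr": lr}
    checkpoint = dict(state)            # last known-good state
    ckpt_loss, _ = loss_and_grad(w0)    # loss recorded at that checkpoint
    rollbacks = 0

    for step in range(1, steps + 1):
        loss, grad = loss_and_grad(state["w"])

        # Assumed divergence test: loss well above the checkpointed loss.
        if loss > spike_factor * ckpt_loss:
            lowered_lr = state["lr"] * lr_decay
            state = dict(checkpoint)    # restore the saved weights
            state["lr"] = lowered_lr    # resume with a smaller learning rate
            checkpoint = dict(state)
            rollbacks += 1
            continue

        state["w"] -= state["lr"] * grad  # plain gradient descent step

        if step % ckpt_every == 0:
            checkpoint = dict(state)
            ckpt_loss, _ = loss_and_grad(state["w"])

    return state["w"], state["lr"], rollbacks


final_w, final_lr, n_rollbacks = train_with_rollback()
```

With the defaults, the initial learning rate is deliberately too high for this loss, so the run diverges, rolls back, and then converges at the lowered rate. A real run would checkpoint optimizer state and the data-loader position as well, which this sketch leaves out.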
           | gnulinux wrote:
           | They hot-swapped all kinds of model hyperparameters such as
           | changing activation function and optimizer. It doesn't look
           | like there was a principled reason why they kept switching
           | optimizer or activation function. Maybe as they were training
           | the model their data scientists kept finding ways to improve
           | the model? Not sure, but it looks extremely hacky to me. Not
           | something some team ran one day and forgot until it trained.
           | joshvm wrote:
           | Not sure if your comment is meant as a disagreement or a
           | question.
           | Generally the way hyperparameters are adjusted is some mix of
           | intuition/experience and random/grid searching. Plus most
           | people don't have the resources/infra to do a large scale
           | grid search on a model that might take a day or more to
           | train. It's somewhat principled, but often a random search is
           | just as good as fiddling with numbers by hand, and often you have
           | to figure out why something worked post-hoc. You also accept
           | that you might never have a good explanation - for all you
           | know it's dataset dependent - and trust that your results are
           | good enough to convince peer review (and you can show that
           | this other parameter set was worse, so you didn't use it).
           | It's hacky in the sense that a lot of the work in getting to
           | state of the art (moving the needle on a benchmark by less
           | than 1%) involves playing with the numbers until you get the
           | best results. For example here the engineers modify the
           | learning rate between various runs. I don't think they really
           | had any theoretical reason behind the step changes apart from
           | "this will probably work better because we've seen that
           | effect when training similar sized models".
           | Adjusting learning rate schedules is one of the simplest
           | knobs to tweak. When you're working with huge models
           | generally you want to use as big a batch size as you can get
           | away with to reduce training time. That runs a bit counter to
           | the earlier thinking, where LeCun said something like "friends
           | don't let friends use batch sizes > 32".
           | There may be some guided methods like exploring the parameter
           | space in a Bayesian way (eg try to efficiently explore which
           | knobs make the most difference).
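As a concrete picture of the random search mentioned above, here is a small sketch that samples learning rates log-uniformly and keeps the one with the lowest final loss on a toy quadratic problem. The objective, the search range, and the names `final_loss` and `random_search` are all assumptions for illustration; a real search would train an actual model for each trial.

```python
import math
import random


def final_loss(lr, steps=20, curvature=10.0):
    """Run a few gradient descent steps on 0.5 * c * w^2, return the end loss."""
    w = 1.0
    for _ in range(steps):
        w -= lr * curvature * w
    return 0.5 * curvature * w * w


def random_search(trials=20, low=1e-4, high=1.0, seed=0):
    """Sample learning rates log-uniformly in [low, high] and rank them."""
    rng = random.Random(seed)  # fixed seed so the search is repeatable
    results = []
    for _ in range(trials):
        exponent = rng.uniform(math.log10(low), math.log10(high))
        lr = 10.0 ** exponent
        results.append((final_loss(lr), lr))
    best_loss, best_lr = min(results)
    return best_lr, best_loss, results


best_lr, best_loss, results = random_search()
```

Sampling the exponent rather than the learning rate itself spreads trials evenly across orders of magnitude, which is usually what you want for a knob like lr. At OPT scale each trial would be far too expensive to repeat like this, which is part of why the adjustments in the logbook were made by hand on a single run.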
             | ackbar03 wrote:
             | They seem to be adjusting lr between epochs as well when
             | the loss explodes, not just between runs. But I haven't
             | read through the whole thing yet, maybe they trained the
             | whole thing properly from start to finish at the end.
             | Otherwise that would be extremely hacky and irreproducible.
               | dotnet00 wrote:
               | Yeah I think for now they were just trying to get any
               | comparable results due to a near complete lack of details
               | on GPT-3. They seemed to have a hard deadline for the
               | task.
               | lumost wrote:
               | The time and expense of training a model at this size
               | do not lend themselves to trial and error. It's simply
               | impractical to iteratively try ~20 different learning
               | schedules.
               | Hideously inefficient and hacky to have someone manually
               | tweaking things, but not terribly different from the
               | state of the art for scientific research. As long as they
               | state the objectives of their manual control and produce
               | a log of what they did, someone else could _try_ to
               | replicate it.
           | mhh__ wrote:
           | Currently I think we are still gluing transistors (networks)
           | together (spiritually), like in the very early days of the
           | modern computer. It is hacky.
       | dotnet00 wrote:
       | It's pretty reassuring to see that constantly fiddling with the
       | model and trying to adjust learning rates on the fly is also
       | normal at leading research labs. Although on the other hand it
       | only makes the replication crisis even worse.
       | After a quick look through, I really hope releasing raw notes
       | like this becomes more of a trend!
