[HN Gopher] Zero-3 Offload: Scale DL models to trillion paramete...
       ___________________________________________________________________
        
       Zero-3 Offload: Scale DL models to trillion parameters without code
       changes
        
       Author : ghosthamlet
       Score  : 81 points
       Date   : 2021-03-13 15:06 UTC (7 hours ago)
        
 (HTM) web link (www.deepspeed.ai)
 (TXT) w3m dump (www.deepspeed.ai)
        
       | ansk wrote:
        | Question for someone knowledgeable about this: if I have a model
       | which is large -- but small enough that I can fit a single
       | training example on GPU -- does this approach offer speedups
       | compared to simple gradient accumulation? Or is this only useful
       | for models which are so large that the model parameters
       | themselves are overwhelming GPU memory?
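        A minimal sketch of the gradient-accumulation baseline the question
        compares against (plain PyTorch; the model, loader and optimizer are
        placeholders, not anything from the post):
        
            import torch
            import torch.nn.functional as F
        
            def train_epoch(model, loader, opt, accum_steps=8):
                """One epoch with gradient accumulation: several small
                micro-batches share a single optimizer update."""
                model.train()
                opt.zero_grad()
                for step, (x, y) in enumerate(loader):
                    loss = F.cross_entropy(model(x), y)
                    (loss / accum_steps).backward()  # grads sum across micro-batches
                    if (step + 1) % accum_steps == 0:
                        opt.step()       # one update per accum_steps micro-batches
                        opt.zero_grad()
        
        Note that accumulation only shrinks the per-step activation footprint;
        it does nothing about the parameter, gradient and optimizer-state
        memory that ZeRO-3 partitions and offloads.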
        
       | singhrac wrote:
       | For those searching, DeepSpeed is implemented as a set of
       | C++/CUDA extensions on top of PyTorch (compiled using their JIT).
        
       | alphagrep12345 wrote:
       | Simple 10 min overview/tutorial (official) if someone is
       | interested - https://www.youtube.com/watch?v=ovQC7FqXHXk
        
       | vladf wrote:
       | Alternatively, one could get rid of the memory used by optimizers
       | entirely by switching to vanilla SGD.
       | 
        | I haven't tried this on transformers, and maybe that's what breaks
        | down here, but in "classic" supervised settings I've found SGD
        | with schedule tuning just as fast as Adam.
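        For reference, a toy measurement of the optimizer-state footprint
        being discussed (plain PyTorch; the layer size is arbitrary). SGD
        without momentum keeps no per-parameter state, while Adam keeps two
        extra fp32 tensors per parameter:
        
            import torch
        
            model = torch.nn.Linear(4096, 4096)
            x, y = torch.randn(8, 4096), torch.randn(8, 4096)
        
            for opt in (torch.optim.SGD(model.parameters(), lr=0.1),
                        torch.optim.Adam(model.parameters(), lr=1e-3)):
                opt.zero_grad()
                torch.nn.functional.mse_loss(model(x), y).backward()
                opt.step()
                state = sum(t.numel() for s in opt.state.values()
                            for t in s.values() if torch.is_tensor(t))
                print(type(opt).__name__, "state elements:", state)
        
            # SGD prints 0; Adam prints roughly twice the parameter count
            # (exp_avg and exp_avg_sq) -- the memory vanilla SGD avoids.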
        
         | gwern wrote:
         | SGD doesn't work on large Transformers, no. You need something
         | like AdamW.
        
           | The_rationalist wrote:
           | Mish is generally superior to RadamW
           | https://lessw.medium.com/meet-mish-new-state-of-the-art-
           | ai-a...
        
       | andrewprock wrote:
       | How much data do you need to mitigate the risk of over fitting a
       | trillion parameter model?
        
         | gwern wrote:
         | You ideally need ~500GB of text, or so. EleutherAI's The Pile
         | was designed to be just big enough to fit a 1t GPT efficiently,
         | and you can get the various scaling curves out of the OA-
         | related scaling papers. (You want the amount of data that fits
         | into a single epoch, because if you reuse data, you get less
          | bang for the FLOPs buck, and FLOPs constraints are right now
         | much more binding than data or model size.)
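          Rough arithmetic behind those figures, assuming ~4 bytes of raw
          text per BPE token (an assumption, not from the comment):
          
              bytes_of_text = 500e9       # ~500 GB of raw text
              tokens = bytes_of_text / 4  # ~4 bytes of text per BPE token
              params = 1e12               # a 1T-parameter GPT
              print(f"{tokens / 1e9:.0f}B tokens, "
                    f"{tokens / params:.3f} tokens per parameter")
              # -> 125B tokens, 0.125 tokens per parameter: far fewer data
              #    points than parameters, which is what the reply below
              #    finds surprising.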
        
           | andrewprock wrote:
           | This feels off by a couple of orders of magnitude, unless a
           | significant number of the parameters are not independent.
        
             | gwern wrote:
             | It's quite amusing. The standard statistical theory does
             | not work at all in estimating data vs model size, and the
             | bounds are all vacuously large. It's a very active area of
              | research: understanding why overparameterized models behave
              | so simply, and coming up with real measures of model
              | complexity. Lots to read there if you are interested in
             | such things.
        
             | singhrac wrote:
             | Well, that's the "magic" of modern deep learning. You can
             | fit models with p > n somehow without overfitting. In some
             | areas you might find this called "the strong inductive bias
             | of neural networks" or "double descent" but no one has
             | found a convincing explanation (to me).
        
       | The_rationalist wrote:
        | See also zeroth-order backpropagation, which allows 300X faster
        | training while not reducing throughput that much:
        | https://arxiv.org/abs/2011.08895
        | 
        | How much does ZeRO-3 affect accuracy?
       | 
       | See also https://github.com/microsoft/fastformers
        
       | stephenroller wrote:
       | Support for this was also added to
       | [Fairscale](https://fairscale.readthedocs.io/en/latest/) and
       | [Fairseq](https://github.com/pytorch/fairseq) last week. In
       | particular, the Fairscale implementation can be used in any
        | PyTorch project without requiring the use of the DeepSpeed
       | trainer.
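        A hedged sketch of the FairScale route mentioned above, based on its
        FullyShardedDataParallel wrapper (the exact API is the early-2021 one
        and may have changed; it assumes a torch.distributed process group is
        already initialized):
        
            import torch
            from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
        
            def wrap(model: torch.nn.Module) -> torch.nn.Module:
                # Shards parameters, gradients and optimizer state across
                # data-parallel ranks (ZeRO-3 style) while still behaving
                # like a normal nn.Module in an ordinary PyTorch training
                # loop -- no DeepSpeed trainer required. CPU-offload and
                # mixed-precision options are listed in the FairScale docs.
                return FSDP(model)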
        
         | diptanu wrote:
         | What are the relevant commits in Fairseq for this? I couldn't
         | figure out the changes by looking at the commits from last
         | week.
        
       | mchusma wrote:
       | This is super impressive. I could not figure out for a while who
        | exactly was running this project, but it looks like it's
       | Microsoft. Great work!
        
       | bionhoward wrote:
       | please hook this up to Jax!
        
       | joshlk wrote:
       | GPT-NeoX is an example project that is using deepspeed and Zero-3
        | offloading. The wider project intends to train a GPT-3-sized model
       | and release it freely to the world.
       | 
       | https://github.com/EleutherAI/gpt-neox
        
         | ma2rten wrote:
         | It seems like Zero-3 doesn't work for them:
         | 
         | https://github.com/EleutherAI/gpt-neox/issues/171
        
           | joshlk wrote:
           | Looks like they got it working recently
           | https://github.com/EleutherAI/gpt-neox/pull/178
        
           | dqpb wrote:
           | Did you even read through the issue? I don't see anything
           | that indicates it won't work.
        
             | ma2rten wrote:
             | Yes, I did. The last comment is a traceback and an
             | explanation what would have to be done to fix it.
        
               | minimaxir wrote:
               | Your comment implied it's not possible _at all_ for them
                | to use it, not that it's currently not working.
        
       | bevenky wrote:
       | This is also being added to pytorch
       | 
       | https://github.com/pytorch/pytorch/pull/46750
        
         | minimaxir wrote:
         | I don't think that's the Stage 3 announced in this blog post,
         | but it's def a framework for it.
        
       | FL33TW00D wrote:
       | Huggingface has been working on implementing this into their
       | library, and it has some pretty amazing effects on the size of
       | models you can train on a simple Colab.
       | 
       | https://huggingface.co/blog/zero-deepspeed-fairscale
        
       | dataangel wrote:
       | ELI5? All this techno babble just sounds like "it's faster
       | because we optimized it". What are the nontrivial, new
       | fundamental tricks?
        
         | jonbaer wrote:
         | I think there is some explanation (on the previous model?)
         | here, https://www.youtube.com/watch?v=tC01FRB0M7w
        
         | jiofih wrote:
         | Third paragraph or so in the overview:
         | 
         | > ZeRO removes the memory redundancies across data-parallel
         | processes by partitioning the three model states (optimizer
         | states, gradients, and parameters) across data-parallel
         | processes instead of replicating them. By doing this, it boosts
         | memory efficiency compared to classic data-parallelism while
         | retaining its computational granularity and communication
         | efficiency
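          A conceptual toy of what that partitioning means (not DeepSpeed's
          actual code): each data-parallel rank keeps only a 1/world_size
          slice of the parameters, gradients and optimizer state, and full
          parameters are gathered only while a layer is actually running.
          
              import torch
          
              def shard(flat: torch.Tensor, rank: int, world: int):
                  n = flat.numel() // world
                  return flat[rank * n:(rank + 1) * n].clone()
          
              world, rank = 8, 0
              full = torch.randn(1_000_000)      # flattened model weights
              p = shard(full, rank, world)       # parameter shard
              g = torch.zeros_like(p)            # gradient shard
              m, v = torch.zeros_like(p), torch.zeros_like(p)  # Adam state
              # Per-rank cost: four tensors of N/world elements instead of
              # four tensors of N -- the redundancy the quote says is removed.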
        
           | dataangel wrote:
           | Yeah that would be the techno-babble. I've been working on a
           | machine learning pipeline for 6 years and I still have no
           | idea what this means.
        
             | eugenhotaj wrote:
             | If your pipeline uses only "classic" ml models, then this
             | won't make too much sense. It's mostly applicable to NNs.
        
             | cambalache wrote:
             | The product is obviously not for you but for clueless PHBs
             | who want the "latest and best" for the team so those
              | useless ML engineers can finally put their brilliant idea in
             | production with a less than 1% prediction error.
        
             | zachthewf wrote:
             | It doesn't sound like techno-babble to me. They've
             | distributed storage across nodes rather than replicating on
             | each node, hence the model size is now scalable with number
             | of nodes rather than being limited to what could be stored
             | on a single node.
        
               | p1esk wrote:
               | But it's not clear how they managed to improve training
                | on a single GPU: they say they can fit a 40B model on a
               | single V100.
        
               | liuliu wrote:
                | They offload parameters, gradients and optimizer states
                | (such as the momentum and variance terms, the exponential
                | moving averages Adam keeps per parameter) into CPU memory.
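                Back-of-the-envelope numbers for that 40B case, using the
                mixed-precision Adam accounting from the ZeRO paper (2 bytes
                of fp16 weights and 2 bytes of fp16 gradients per parameter,
                plus 12 bytes of fp32 master weights and Adam moments);
                illustrative only:
                
                    params = 40e9
                    fp16_gib = params * (2 + 2) / 2**30  # weights + grads
                    opt_gib = params * 12 / 2**30        # master + moments
                    print(f"{fp16_gib:.0f} GiB fp16, "
                          f"{opt_gib:.0f} GiB optimizer state")
                    # ~149 GiB of fp16 weights+grads alone dwarfs a 32 GB
                    # V100, so ZeRO-3 Offload also pages the parameters
                    # themselves through CPU memory layer by layer, not
                    # just the Adam state.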
        
               | p1esk wrote:
               | They did all that before:
               | https://arxiv.org/abs/2101.06840, but they could only fit
               | a model with 13B weights on a single V100.
        
             | jiofih wrote:
             | You can read the paper here:
             | https://arxiv.org/abs/1910.02054
        
             | liuliu wrote:
              | It is mostly applicable to transformer models; the ideas in
              | the paper will seem alien if you work on computer vision.
              | 
              | In transformer models, a big chunk of the memory goes to the
              | parameters and the optimizer states (because vanilla SGD
              | isn't used there), so a memory optimization that removes the
              | duplication of parameters on each GPU, or offloads them
              | entirely to the CPU, makes sense.
              | 
              | In computer vision, a big chunk of the memory is held by the
              | forward-pass activations, and the memory optimization
              | applicable in that case would be binomial checkpointing.
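              For the computer-vision case, the activation-side counterpart
              looks like this in stock PyTorch (torch.utils.checkpoint's
              uniform-segment variant of the checkpointing idea; the toy CNN
              is just a placeholder):
              
                  import torch
                  from torch.utils.checkpoint import checkpoint_sequential
              
                  blocks = torch.nn.Sequential(*[
                      torch.nn.Sequential(
                          torch.nn.Conv2d(64, 64, 3, padding=1),
                          torch.nn.ReLU())
                      for _ in range(16)])
              
                  x = torch.randn(4, 64, 128, 128, requires_grad=True)
                  # Keep activations only at 4 segment boundaries; recompute
                  # the rest during backward, trading FLOPs for memory.
                  y = checkpoint_sequential(blocks, 4, x)
                  y.sum().backward()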
        
       ___________________________________________________________________
       (page generated 2021-03-13 23:01 UTC)