[HN Gopher] Zero-3 Offload: Scale DL models to trillion paramete...
___________________________________________________________________

Zero-3 Offload: Scale DL models to trillion parameters without code
changes

Author : ghosthamlet
Score  : 81 points
Date   : 2021-03-13 15:06 UTC (7 hours ago)

(HTM) web link (www.deepspeed.ai)
(TXT) w3m dump (www.deepspeed.ai)

| ansk wrote:
| Question for someone knowledgeable about this: if I have a model
| which is large -- but small enough that I can fit a single
| training example on the GPU -- does this approach offer speedups
| compared to simple gradient accumulation? Or is this only useful
| for models which are so large that the model parameters
| themselves are overwhelming GPU memory?

| singhrac wrote:
| For those searching, DeepSpeed is implemented as a set of
| C++/CUDA extensions on top of PyTorch (compiled using their JIT).

| alphagrep12345 wrote:
| Simple 10-minute overview/tutorial (official), if someone is
| interested: https://www.youtube.com/watch?v=ovQC7FqXHXk

| vladf wrote:
| Alternatively, one could get rid of the memory used by optimizers
| entirely by switching to vanilla SGD.
|
| I haven't tried this on transformers, and maybe that's where it
| breaks down, but in "classic" supervised settings I've found SGD
| with schedule tuning just as fast as Adam.

| gwern wrote:
| SGD doesn't work on large Transformers, no. You need something
| like AdamW.

| The_rationalist wrote:
| Mish is generally superior to RadamW:
| https://lessw.medium.com/meet-mish-new-state-of-the-art-ai-a...

| andrewprock wrote:
| How much data do you need to mitigate the risk of overfitting a
| trillion-parameter model?

| gwern wrote:
| You ideally need ~500GB of text, or so. EleutherAI's The Pile was
| designed to be just big enough to fit a 1T GPT efficiently, and
| you can get the various scaling curves out of the OA-related
| scaling papers. (You want the amount of data that fits into a
| single epoch, because if you reuse data, you get less bang for
| the FLOPs buck, and FLOPs constraints are right now much more
| binding than data or model size.)

| andrewprock wrote:
| This feels off by a couple of orders of magnitude, unless a
| significant number of the parameters are not independent.

| gwern wrote:
| It's quite amusing. The standard statistical theory does not work
| at all in estimating data vs model size, and the bounds are all
| vacuously large. It's a very active area of research,
| understanding why models behave so simply when overparameterized
| and coming up with real measures of model complexity. Lots to
| read there if you are interested in such things.

| singhrac wrote:
| Well, that's the "magic" of modern deep learning. You can fit
| models with p > n somehow without overfitting. In some areas you
| might find this called "the strong inductive bias of neural
| networks" or "double descent", but no one has found a convincing
| explanation (to me).

| The_rationalist wrote:
| See also zeroth-order backpropagation, which allows 300x faster
| training while not reducing throughput that much:
| https://arxiv.org/abs/2011.08895
|
| How much does ZeRO-3 affect accuracy?
|
| See also https://github.com/microsoft/fastformers

| stephenroller wrote:
| Support for this was also added to
| [Fairscale](https://fairscale.readthedocs.io/en/latest/) and
| [Fairseq](https://github.com/pytorch/fairseq) last week. In
| particular, the Fairscale implementation can be used in any
| PyTorch project without requiring the use of the DeepSpeed
| trainer.
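For readers curious what the FairScale route mentioned above looks
like in practice, below is a minimal sketch of wrapping an ordinary
PyTorch module in FairScale's FullyShardedDataParallel (its
ZeRO-3-style sharded data-parallel wrapper). The toy model,
dimensions, and hyperparameters are invented for illustration; only
the import path and the wrap-then-build-optimizer pattern follow
FairScale's documented usage.

  # Sketch only: assumes torch and fairscale are installed and that
  # one process per GPU is launched (e.g. via
  # torch.multiprocessing.spawn); the model and sizes are made up.
  import torch
  import torch.distributed as dist
  from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

  def train(rank: int, world_size: int):
      dist.init_process_group("nccl", rank=rank, world_size=world_size)
      torch.cuda.set_device(rank)

      # Any ordinary nn.Module; FSDP shards its parameters, gradients,
      # and optimizer state across the data-parallel workers instead
      # of replicating them on every GPU.
      model = torch.nn.Sequential(
          torch.nn.Linear(1024, 4096),
          torch.nn.ReLU(),
          torch.nn.Linear(4096, 1024),
      ).cuda(rank)
      model = FSDP(model)

      # Build the optimizer after wrapping so it sees the sharded
      # (flattened) parameters rather than the originals.
      opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

      for _ in range(10):
          x = torch.randn(8, 1024, device=rank)
          loss = model(x).pow(2).mean()
          loss.backward()
          opt.step()
          opt.zero_grad()

Each worker would call train(rank, world_size) in its own process;
the usual launchers (torch.multiprocessing.spawn,
torch.distributed.launch) handle that part.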
| diptanu wrote:
| What are the relevant commits in Fairseq for this? I couldn't
| figure out the changes by looking at the commits from last week.

| mchusma wrote:
| This is super impressive. I could not figure out for a while who
| exactly was running this project, but it looks like it's
| Microsoft. Great work!

| bionhoward wrote:
| Please hook this up to JAX!

| joshlk wrote:
| GPT-NeoX is an example of a project that is using DeepSpeed and
| ZeRO-3 offloading. The wider project intends to train a
| GPT-3-sized model and release it freely to the world.
|
| https://github.com/EleutherAI/gpt-neox

| ma2rten wrote:
| It seems like ZeRO-3 doesn't work for them:
|
| https://github.com/EleutherAI/gpt-neox/issues/171

| joshlk wrote:
| Looks like they got it working recently:
| https://github.com/EleutherAI/gpt-neox/pull/178

| dqpb wrote:
| Did you even read through the issue? I don't see anything
| that indicates it won't work.

| ma2rten wrote:
| Yes, I did. The last comment is a traceback and an explanation of
| what would have to be done to fix it.

| minimaxir wrote:
| Your comment implied it's not possible _at all_ for them
| to use it, not that it's currently not working.

| bevenky wrote:
| This is also being added to PyTorch:
|
| https://github.com/pytorch/pytorch/pull/46750

| minimaxir wrote:
| I don't think that's the Stage 3 announced in this blog post, but
| it's definitely a framework for it.

| FL33TW00D wrote:
| Huggingface has been working on implementing this in their
| library, and it has some pretty amazing effects on the size of
| models you can train on a simple Colab.
|
| https://huggingface.co/blog/zero-deepspeed-fairscale

| dataangel wrote:
| ELI5? All this techno-babble just sounds like "it's faster
| because we optimized it". What are the nontrivial, new
| fundamental tricks?

| jonbaer wrote:
| I think there is some explanation (on the previous model?) here:
| https://www.youtube.com/watch?v=tC01FRB0M7w

| jiofih wrote:
| Third paragraph or so in the overview:
|
| > ZeRO removes the memory redundancies across data-parallel
| processes by partitioning the three model states (optimizer
| states, gradients, and parameters) across data-parallel
| processes instead of replicating them. By doing this, it boosts
| memory efficiency compared to classic data-parallelism while
| retaining its computational granularity and communication
| efficiency.

| dataangel wrote:
| Yeah, that would be the techno-babble. I've been working on a
| machine learning pipeline for 6 years and I still have no idea
| what this means.

| eugenhotaj wrote:
| If your pipeline uses only "classic" ML models, then this won't
| make too much sense. It's mostly applicable to NNs.

| cambalache wrote:
| The product is obviously not for you but for clueless PHBs who
| want the "latest and best" for the team, so those useless ML
| engineers can finally put their brilliant idea in production
| with a less than 1% prediction error.

| zachthewf wrote:
| It doesn't sound like techno-babble to me. They've distributed
| storage across nodes rather than replicating it on each node, so
| the model size now scales with the number of nodes rather than
| being limited to what can be stored on a single node.

| p1esk wrote:
| But it's not clear how they managed to improve training on a
| single GPU: they say they can fit a 40B-parameter model on a
| single V100.

| liuliu wrote:
| They offload parameters, gradients and optimizer states (such as
| the moment and velocity exponential averages in Adam) into CPU
| memory.
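To make that offloading description concrete, here is a rough sketch
of what turning on ZeRO stage 3 with CPU offload looks like through
DeepSpeed's JSON-style config and deepspeed.initialize. The toy
model, batch sizes, and learning rate are placeholders invented for
the example, and the exact keyword for passing a dict config
(config vs. config_params) varies between DeepSpeed versions.

  # Sketch only: a made-up model with a ZeRO-3 CPU-offload config.
  import torch
  import deepspeed

  ds_config = {
      "train_micro_batch_size_per_gpu": 4,
      "gradient_accumulation_steps": 8,
      "fp16": {"enabled": True},
      "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
      "zero_optimization": {
          # Stage 3 partitions optimizer states, gradients AND
          # parameters across data-parallel workers; the offload_*
          # sections push the partitioned parameters and optimizer
          # states into CPU memory.
          "stage": 3,
          "offload_param": {"device": "cpu", "pin_memory": True},
          "offload_optimizer": {"device": "cpu", "pin_memory": True},
      },
  }

  model = torch.nn.Sequential(
      torch.nn.Linear(1024, 4096),
      torch.nn.ReLU(),
      torch.nn.Linear(4096, 1024),
  )

  engine, optimizer, _, _ = deepspeed.initialize(
      model=model,
      model_parameters=model.parameters(),
      config=ds_config,  # older releases take this as config_params
  )

  # Training then goes through the engine:
  #   loss = compute_loss(engine(batch))
  #   engine.backward(loss)
  #   engine.step()

The "without code changes" claim in the title refers to this split:
the model definition stays plain PyTorch, and the partitioning and
offloading are driven by the config.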
| p1esk wrote:
| They did all that before: https://arxiv.org/abs/2101.06840, but
| they could only fit a model with 13B weights on a single V100.

| jiofih wrote:
| You can read the paper here: https://arxiv.org/abs/1910.02054

| liuliu wrote:
| It is mostly applicable to transformer models; the ideas in the
| paper will seem alien if you work on computer vision.
|
| In transformer models, a big chunk of the memory is parameters
| and optimizer states (because vanilla SGD is not used there), so
| a memory-optimization technique that removes parameter
| duplication on each GPU, or offloads it entirely to the CPU,
| makes sense.
|
| In computer vision, a big chunk of the memory is held by the
| forward-pass layer activations, and the memory-optimization
| technique applicable in those cases would be binomial
| checkpointing.
___________________________________________________________________
(page generated 2021-03-13 23:01 UTC)