[HN Gopher] How to train large models on many GPUs? (2021)
___________________________________________________________________
 
How to train large models on many GPUs? (2021)
 
Author : eternalban
Score  : 132 points
Date   : 2023-02-11 14:22 UTC (8 hours ago)
 
(HTM) web link (lilianweng.github.io)
(TXT) w3m dump (lilianweng.github.io)
 
| dauertewigkeit wrote:
| Why isn't there a framework that does all this automatically for
| you?
| 
| I tried torch FSDP, but it only managed to increase the available
| memory to something like 150% of a single GPU.
| 
| I eventually ended up sharding my model manually with .cuda() and
| .to(), which works much better, but now I am limited to one
| module per GPU. I would like to scale up further, and that would
| mean spinning up more nodes and manually splitting the model over
| that many GPUs.
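| 
| Roughly, the manual pattern I mean is the following (a minimal
| sketch, assuming two GPUs on one node; the toy two-block model
| and all sizes/names are just illustrative):
| 
|     import torch
|     import torch.nn as nn
| 
|     class TwoStageModel(nn.Module):
|         # Each block is pinned to its own device by hand.
|         def __init__(self):
|             super().__init__()
|             self.block1 = nn.Sequential(
|                 nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
|             self.block2 = nn.Sequential(
|                 nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")
| 
|         def forward(self, x):
|             x = self.block1(x.to("cuda:0"))
|             # Activations are copied across devices explicitly;
|             # autograd routes gradients back over the same copy.
|             return self.block2(x.to("cuda:1"))
| 
|     model = TwoStageModel()
|     out = model(torch.randn(8, 1024))  # output lives on cuda:1
| 
| Every additional device means another hard-coded .to() hop like
| this, which is exactly what I'd like a framework to manage.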
| 
| I would be interested if anyone knows of a framework that manages
| this automatically and just works.
| 
| EDIT: BTW I am talking about model sharding, not data
| parallelism, which works very well with DDP.
| atty wrote:
| Beyond the other answers, I'll point out that PyTorch is
| developing tools that will make doing this work by hand, or
| implementing it in a framework, much easier. They're building a
| native DTensor implementation and testing out SPMD-style
| distributed models with pipelining. DTensor is in
| torch.distributed, and the SPMD code is in the repo called Tau
| under the pytorch org on GitHub.
| amelius wrote:
| > Why isn't there a framework that does all this automatically
| for you?
| 
| Question: could this be implemented in PyTorch in an opaque
| way? Or would it require changes to its API?
| guardiantesla wrote:
| > Why isn't there a framework that does all this automatically
| for you?
| 
| Check whether MosaicML might help in your case. I haven't tried
| it myself, but they have most of the customizations and
| speed-up optimizations I've come across recently.
| 
| https://www.mosaicml.com/blog/supercharge-training-composer
| 
| Also worth checking out their "training from scratch" blog
| posts.
| 
| Training Stable Diffusion:
| https://www.mosaicml.com/blog/training-stable-diffusion-from...
| 
| Training GPT-3: https://www.mosaicml.com/blog/billion-
| parameter-gpt-training...
| NerdyDrone wrote:
| Mosaic's open-source library, Composer, is excellent:
| https://github.com/mosaicml/composer
| 
| * It gives you PyTorch DDP for free, makes FSDP about as easy
| as it can be, and provides best-in-class performance monitoring
| tools. https://docs.mosaicml.com/en/v0.12.1/notes/distributed
| _train...
| 
| Here's a nice intro to using Huggingface models: https://docs
| .mosaicml.com/en/v0.12.1/examples/finetune_huggi...
| 
| I'm just a huge fan of their developer experience. It's up
| there with Transformers and Datasets as the nicest tools to
| use.
| sandkoan wrote:
| Might this be what you're looking for:
| https://github.com/bigscience-workshop/petals ?
| arcanus wrote:
| It's a fair question.
| 
| Nvidia's NCCL and AMD's RCCL provide parallelism constructs
| that really are hidden at the framework level (such as in
| PyTorch).
| 
| However, I don't think that you would want to hide model, data,
| or tensor parallelism. It's too important a consideration for
| performance and training convergence.
| 
| At least in scientific computing, I've never observed an
| effective means of automatic parallelism across many nodes,
| despite decades of research. I'm not optimistic this will be
| solved anytime soon.
| minimaxir wrote:
| DeepSpeed became popular soon after this post was originally
| published and is natively supported by many PyTorch training
| frameworks.
| 
| https://www.deepspeed.ai
| 
| https://www.deepspeed.ai/training/
| dauertewigkeit wrote:
| I tried that as well, but maybe I did not use it correctly. I
| did not see the full sharding that I was hoping for; I only
| saw results similar to FSDP.
| cma wrote:
| How about FlexFlow?
| 
| https://huggingface.co/transformers/v4.9.2/parallelism.html
| #...
| buildbot wrote:
| Any framework that "just works" tends to stop working when some
| small change is needed or a new model with a new data/compute
| roofline comes out.
| option wrote:
| There is - https://docs.nvidia.com/deeplearning/nemo/user-
| guide/docs/en...
| 
| It supports data, tensor, pipeline, and sequence parallelism,
| plus activation checkpointing, distributed optimizers, fused
| kernels, and more.
| amelius wrote:
| I'm waiting for GPU cards that allow the user to plug in memory
| modules.
| saurik wrote:
| Instead of waiting for the future, maybe you could look to the
| past? That's how graphics cards used to work a couple of
| decades ago.
| TheGuyWhoCodes wrote:
| The AMD Radeon Pro SSG had 4 NVMe slots on the card itself, but
| that was 2017. With a DirectStorage-style API, that approach
| might yield some gains for large models.
| buildbot wrote:
| I could never get a solid answer whether that was presented as
| memory to the GPU, or just as a PCIe switch with NVMe drives
| hanging off one side and the GPU on the other.
| TheGuyWhoCodes wrote:
| As far as I remember, it was presented as a drive and was good
| for sequential reads, but you had to use AMD's API to get the
| full benefit.
| dang wrote:
| Discussed (a bit) at the time:
| 
| _How to train large models on many GPUs?_ -
| https://news.ycombinator.com/item?id=28657797 - Sept 2021 (9
| comments)
| eternalban wrote:
| Somewhat amazed, dang, that this topic is not discussed more
| widely here or elsewhere. There is a _lot_ of HPC and DS
| expertise out there which lacks understanding of ML system
| architecture (in the sense of the deployed machinery in toto).
| 
| Her follow-up post [1] is also recommended for those who (like
| me) are experienced but not in ML and finally had things click
| because of the OP writeup:
| 
| _Large Transformer Model Inference Optimization_ (2023)
| 
| https://lilianweng.github.io/posts/2023-01-10-inference-opti...
| 
| A very cool cite from that article is LLM.int8():
| https://arxiv.org/abs/2208.07339
___________________________________________________________________
(page generated 2023-02-11 23:00 UTC)