[HN Gopher] How to train large models on many GPUs? (2021)
       ___________________________________________________________________
        
       How to train large models on many GPUs? (2021)
        
       Author : eternalban
       Score  : 132 points
       Date   : 2023-02-11 14:22 UTC (8 hours ago)
        
 (HTM) web link (lilianweng.github.io)
 (TXT) w3m dump (lilianweng.github.io)
        
       | dauertewigkeit wrote:
       | Why isn't there a framework that does all this automatically for
       | you?
       | 
        | I tried torch FSDP, but it only managed to increase the
        | effective memory to something like 150% of a single GPU.
       | 
        | I eventually ended up sharding my model manually with .cuda()
        | and .to(), which works much better, but now I am limited to
        | one module per GPU. I would like to scale further, and that
        | would mean spinning up more nodes and splitting the model
        | across that many GPUs by hand.
       | 
       | I would be interested if anyone knows of a framework that manages
       | this automatically and just works.
       | 
        | EDIT: BTW, I am talking about model sharding, not data
        | parallelism, which works very well with DDP.
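        | 
        | A simplified sketch of the kind of manual sharding I mean
        | (layer sizes made up, not my actual code):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     class TwoGPUModel(nn.Module):
        |         def __init__(self):
        |             super().__init__()
        |             # each sub-module pinned to its own GPU by hand
        |             self.part1 = nn.Linear(4096, 4096).to("cuda:0")
        |             self.part2 = nn.Linear(4096, 4096).to("cuda:1")
        | 
        |         def forward(self, x):
        |             x = self.part1(x.to("cuda:0"))
        |             # activations moved between devices explicitly
        |             # at every module boundary
        |             return self.part2(x.to("cuda:1"))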
        
         | atty wrote:
         | Beyond the other answers, I'll point out that pytorch is
         | developing tools that will make doing this work by hand or
         | implementing in a framework much easier. They're building a
         | native DTensor implementation and testing out SPMD-style
         | distributed models with pipelining. DTensor is in
         | torch.distributed, and the SPMD code is in the repo called Tau
         | under the pytorch org on github.
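          | 
          | A minimal DTensor sketch (the API is experimental, so the
          | exact module paths may move around):
          | 
          |     import torch
          |     from torch.distributed._tensor import (
          |         DeviceMesh, Shard, distribute_tensor,
          |     )
          | 
          |     # assumes the default process group is already set up,
          |     # e.g. one process per GPU launched with torchrun
          |     mesh = DeviceMesh("cuda", [0, 1])
          |     weight = torch.randn(8192, 8192)
          |     # shard the big tensor along dim 0 across the mesh
          |     sharded = distribute_tensor(weight, mesh, [Shard(0)])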
        
         | amelius wrote:
         | > Why isn't there a framework that does all this automatically
         | for you?
         | 
         | Question: could this be implemented in PyTorch in an opaque
         | way? Or would it require changes to its API?
        
         | guardiantesla wrote:
         | >Why isn't there a framework that does all this automatically
         | for you?
         | 
          | Check whether MosaicML might help in your case. I haven't
          | tried it myself, but they have most of the customizations
          | and speed-up optimizations I've come across recently.
         | 
         | https://www.mosaicml.com/blog/supercharge-training-composer
         | 
         | Also worth checking out their "training from scratch" blog
         | posts.
         | 
         | Training StableDiffusion:
         | https://www.mosaicml.com/blog/training-stable-diffusion-from...
         | 
          | Training GPT-3:
          | https://www.mosaicml.com/blog/billion-parameter-gpt-training...
        
           | NerdyDrone wrote:
           | Mosaic's open source library is excellent: Composer
           | https://github.com/mosaicml/composer.
           | 
            | * It gives you PyTorch DDP for free, makes FSDP about as
            | easy as it can be, and provides best-in-class performance
            | monitoring tools.
            | https://docs.mosaicml.com/en/v0.12.1/notes/distributed_train...
            | 
            | Here's a nice intro to using Huggingface models:
            | https://docs.mosaicml.com/en/v0.12.1/examples/finetune_huggi...
           | 
           | I'm just a huge fan of their developer experience. It's up
           | there with Transformers and Datasets as the nicest tools to
           | use.
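            | 
            | Turning on FSDP through Composer's Trainer is roughly
            | this (kwargs from memory, so double-check the docs;
            | composer_model and train_loader are placeholders):
            | 
            |     from composer import Trainer
            | 
            |     trainer = Trainer(
            |         model=composer_model,          # a ComposerModel
            |         train_dataloader=train_loader,
            |         max_duration="1ep",
            |         # hands the wrapped model off to PyTorch FSDP
            |         fsdp_config={"sharding_strategy": "FULL_SHARD"},
            |     )
            |     trainer.fit()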
        
         | sandkoan wrote:
         | Might this be what you're looking for:
         | https://github.com/bigscience-workshop/petals ?
        
         | arcanus wrote:
         | It's a fair question.
         | 
          | Nvidia's NCCL and AMD's RCCL provide parallelism constructs
          | that really are hidden at the framework level (e.g. inside
          | PyTorch).
          | 
          | However, I don't think you would want to hide model, data,
          | or tensor parallelism. Those choices are too important for
          | performance and for their impact on training convergence.
          | 
          | At least in scientific computing, I've never seen an
          | effective means of automatic parallelism across many nodes,
          | despite decades of research. I'm not optimistic this will
          | be solved anytime soon.
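          | 
          | For plain data parallelism the hiding works fine: with DDP
          | you never touch NCCL directly. A rough sketch:
          | 
          |     import os
          |     import torch
          |     import torch.distributed as dist
          |     from torch.nn.parallel import DistributedDataParallel
          | 
          |     # launched with torchrun, one process per GPU
          |     dist.init_process_group(backend="nccl")
          |     local_rank = int(os.environ["LOCAL_RANK"])
          |     torch.cuda.set_device(local_rank)
          | 
          |     # stand-in for the real model
          |     model = torch.nn.Linear(1024, 1024).cuda(local_rank)
          |     # DDP issues the NCCL all-reduces under the hood
          |     model = DistributedDataParallel(
          |         model, device_ids=[local_rank]
          |     )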
        
         | minimaxir wrote:
         | DeepSpeed became popular soon after this post was originally
         | published and is natively supported by many PyTorch training
         | frameworks.
         | 
         | https://www.deepspeed.ai
         | 
         | https://www.deepspeed.ai/training/
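          | 
          | A minimal ZeRO stage 3 setup looks roughly like this (the
          | model and batch size are just stand-ins):
          | 
          |     import torch
          |     import deepspeed
          | 
          |     model = torch.nn.Linear(4096, 4096)  # stand-in model
          | 
          |     ds_config = {
          |         "train_micro_batch_size_per_gpu": 4,
          |         # stage 3 shards params, grads and optimizer state
          |         "zero_optimization": {"stage": 3},
          |         "bf16": {"enabled": True},
          |     }
          | 
          |     engine, optimizer, _, _ = deepspeed.initialize(
          |         model=model,
          |         model_parameters=model.parameters(),
          |         config=ds_config,
          |     )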
        
           | dauertewigkeit wrote:
           | I tried that as well, but maybe I did not use it correctly. I
           | did not see the full sharding that I was hoping for. I only
            | saw results similar to FSDP.
        
             | cma wrote:
             | How about flexflow?
             | 
              | https://huggingface.co/transformers/v4.9.2/parallelism.html#...
        
         | buildbot wrote:
         | Any framework that "just works" tends to not work when some
         | small change is needed or a new model with new data/compute
         | roofline comes out.
        
         | option wrote:
          | There is:
          | https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en...
         | 
         | Supports data, tensor, pipeline, sequence parallelisms,
         | activation checkpointing, distributed optimizers, fused kernels
         | and more.
        
       | amelius wrote:
       | I'm waiting for GPU cards that allow the user to plug in memory
       | modules.
        
         | saurik wrote:
         | Instead of waiting for the future maybe you could look to the
         | past? That's how graphics cards used to work back a couple
         | decades ago.
        
         | TheGuyWhoCodes wrote:
          | The AMD Radeon Pro SSG had 4 NVMe slots on the card itself,
          | but that was 2017. With a direct-storage API, though, that
          | approach might offer some gains for large models.
        
           | buildbot wrote:
            | I could never get a solid answer on whether that was
            | presented as memory to the GPU or just as a PCIe switch
            | with NVMe drives hanging off one side and the GPU on the
            | other.
        
             | TheGuyWhoCodes wrote:
              | As far as I remember, it was presented as a drive and
              | was good for sequential reads, but you had to use AMD's
              | API to get the full benefit.
        
       | dang wrote:
       | Discussed (a bit) at the time:
       | 
       |  _How to train large models on many GPUs?_ -
       | https://news.ycombinator.com/item?id=28657797 - Sept 2021 (9
       | comments)
        
         | eternalban wrote:
         | Somewhat amazed, dang, that this topic is not discussed more
         | widely here or elsewhere. There is a _lot_ of HPC and DS
         | expertise out there which lacks understanding of ML system
         | architecture (in the sense of the deployed machinery in toto).
         | 
          | Her follow-up post [1] is also recommended for those who
          | (like me) are experienced but not in ML and finally had
          | things click because of the OP's writeup:
         | 
         |  _Large Transformer Model Inference Optimization_ (2023)
         | 
         | https://lilianweng.github.io/posts/2023-01-10-inference-opti...
         | 
         | A very cool cite from that article is LLM.int8():
         | https://arxiv.org/abs/2208.07339
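          | 
          | LLM.int8() is what the transformers load_in_8bit flag uses
          | under the hood (model name is just an example; needs the
          | bitsandbytes and accelerate packages):
          | 
          |     from transformers import AutoModelForCausalLM
          | 
          |     model = AutoModelForCausalLM.from_pretrained(
          |         "bigscience/bloom-7b1",
          |         device_map="auto",   # spread layers across GPUs
          |         load_in_8bit=True,   # 8-bit weights via LLM.int8()
          |     )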
        
       ___________________________________________________________________
       (page generated 2023-02-11 23:00 UTC)