[HN Gopher] Use pytorch2+cu118 with ADA hardware for 50%+ speedup
       ___________________________________________________________________
        
       Use pytorch2+cu118 with ADA hardware for 50%+ speedup
        
       Author : vans554
       Score  : 116 points
       Date   : 2023-07-19 14:55 UTC (8 hours ago)
        
 (HTM) web link (gpux.ai)
 (TXT) w3m dump (gpux.ai)
        
       | mrwizrd wrote:
       | Can the same speedup be obtained on a 3090?
        
       | vans554 wrote:
        | I stumbled upon this by accident and did not expect such a
        | speedup. It seems anything older than cu118 does not properly
        | support the RTX 4090 (or H100).
        | 
        | Bumping to CUDA 12.2 with pytorch 2.0.1+cu118 made my SDXL
        | runs 50% faster and ESRGAN 80% faster on the 4090.
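        | 
        | For anyone wanting to reproduce: it's just a matter of
        | pointing pip at the cu118 wheel index (torchvision 0.15.2 is
        | the matching release for torch 2.0.1):
        | 
        |     pip install torch==2.0.1 torchvision==0.15.2 \
        |         --index-url https://download.pytorch.org/whl/cu118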
        
         | latchkey wrote:
         | Good find!
        
         | brucethemoose2 wrote:
          | You can also run the PyTorch cu121 nightly builds.
         | 
         | These also allow `torch.compile` to function properly with
         | dynamic input, which should net another 30%+ boost to SD.
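          | 
          | Rough sketch of both steps, assuming a diffusers-style
          | pipeline object with a `unet` attribute:
          | 
          |     pip install --pre torch \
          |         --index-url https://download.pytorch.org/whl/nightly/cu121
          | 
          |     # then, in Python:
          |     pipe.unet = torch.compile(pipe.unet, dynamic=True)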
        
           | cheald wrote:
            | Is there a trick to getting pytorch+cu121 and xformers to
            | play nicely together? All the xformers packages I can find
            | are built for torch==2.0.1+cu118.
            | 
            | Edit: After a bit more research, it looks like scaled dot
            | product attention in PyTorch 2 provides much the same
            | benefit as xformers without needing xformers proper. Nice.
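            | 
            | For reference, the fused call is a one-liner (shapes here
            | are made up):
            | 
            |     import torch
            |     import torch.nn.functional as F
            | 
            |     # (batch, heads, tokens, head_dim)
            |     q, k, v = (torch.randn(1, 8, 4096, 64, device="cuda",
            |                            dtype=torch.float16)
            |                for _ in range(3))
            | 
            |     # dispatches to a fused flash / memory-efficient
            |     # kernel when one applies
            |     out = F.scaled_dot_product_attention(q, k, v)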
        
             | brucethemoose2 wrote:
             | xformers has to match the PyTorch build. For PyTorch
             | nightly, you need to build from source.
             | 
             | xformers still has a tiny performance benefit (especially
             | at higher resolutions IIRC), but yeah, PyTorch's SDP is
             | good.
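              | 
              | Building xformers against whatever torch you have
              | installed is roughly this (going from memory of the
              | xformers README, so double-check there):
              | 
              |     pip install ninja
              |     pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers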
        
           | vans554 wrote:
            | Pretty interesting. Using nightly + cu121 I'm getting 8.18
            | it/s, another ~5% improvement over the 7.78 that cu118
            | gave.
        
           | voz_ wrote:
           | This comment brings a tear to my eye.
        
             | doctorpangloss wrote:
             | The underlying problem is the community's decision to make
             | users manage this in the first place.
             | 
             | This is an example of a setup.py that correctly installs
             | the accelerated PyTorch for your platform:
             | 
              | https://github.com/comfyanonymous/ComfyUI/blob/9aeaac4af5e19...
             | 
              | As you can see, it was never merged - for philosophical
              | reasons, I believe. The author wanted to merge it
              | earlier and changed his mind.
             | 
             | Like why make end users deal with this at all? The ROI from
             | a layperson choosing these details is very low.
             | 
              | Python has a packaging problem; this is well known, and
              | fixing setuptools would be the highest-yield change.
              | Other package tooling can't install PyTorch either, for
              | example:
              | https://github.com/python-poetry/poetry/issues/6409#issuecom...
             | 
              | PyTorch itself is wonkily packaged, but I'm sure they
              | have a good reason for this. Anyway, it goes to show
              | that you can put a huge amount of effort into fixing a
              | problem that everyone touching this technology has, and
              | the maintainers will still go nowhere with it. And I
              | don't think this is a "me" problem, because there is so
              | much demand for packaging PyTorch correctly - all the
              | easy UIs, etc.
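              | 
              | To be concrete, the idea is just "detect the platform,
              | pick a wheel index, install". A hypothetical sketch (not
              | the actual ComfyUI code):
              | 
              |     # hypothetical sketch, not ComfyUI's setup.py
              |     import platform
              |     import subprocess
              |     import sys
              | 
              |     def install_torch():
              |         if platform.system() == "Darwin":
              |             # macOS: plain wheels from PyPI
              |             args = ["torch==2.0.1"]
              |         else:
              |             # assume an NVIDIA GPU: CUDA 11.8 wheels
              |             args = ["torch==2.0.1", "--index-url",
              |                     "https://download.pytorch.org/whl/cu118"]
              |         subprocess.check_call(
              |             [sys.executable, "-m", "pip", "install", *args])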
        
               | brucethemoose2 wrote:
               | > But I'm sure they have a good reason for this.
               | 
                | CUDA and ROCm make this an intractable problem.
                | Basically there is no way to sanely package everything
                | users need, and the absolutely enormous cuda/rocm
                | versioned pytorch packages with missing libs are
                | already a compromise.
               | 
                | TBH the whole ecosystem is not meant for end-user
                | inference anyway.
        
               | voz_ wrote:
               | Sorry, no idea what you are talking about.
               | 
               | I am talking about dynamic shapes in torch.compile.
               | 
               | You seem to be talking about software packaging. You also
               | make heavy use of the word "this" without it being clear
               | what "this" is.
        
               | brucethemoose2 wrote:
               | The two most popular stable diffusion UIs (automatic1111
               | and comfy) have longstanding issues with a few known but
               | poorly documented bugs, like the ADA performance issue.
               | 
                | For instance, the torch.compile thing we are talking
                | about is (last I checked) totally irrelevant for those
                | UIs, because they still use the Stability AI
                | implementation rather than the diffusers package that
                | Huggingface checks for graph breaks. This may extend
                | to SDXL.
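                | 
                | Quick way to check whether a given implementation has
                | graph breaks (`unet` here standing in for whatever
                | module the UI actually runs):
                | 
                |     # fullgraph=True raises on the first graph break
                |     compiled_unet = torch.compile(unet, fullgraph=True)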
        
         | hospitalJail wrote:
         | This was one of the reasons I skipped the 4090.
         | 
         | So few people have the technology that I knew I'd be spending
         | significant time figuring out solutions to problems.
         | 
          | The other reason is that I'd rather wait a few years and
          | get some 6090 with 4x the VRAM.
        
       | yakorevivan wrote:
       | [dead]
        
       | VadimPR wrote:
       | I can confirm that it's true on RTX 4080 on Ubuntu 22.04 LTS.
        
       | SekstiNi wrote:
        | Surprised people don't know about this, as it has been common
        | knowledge in the SD community [1] since October of last year.
        | Strictly speaking you don't even need CUDA 11.8+ to get the
        | speedup; it's sufficient to use cuDNN 8.6+, though you should
        | use the newest versions for other reasons.
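        | 
        | Quick way to check what your install is actually running:
        | 
        |     import torch
        |     print(torch.backends.cudnn.version())  # e.g. 8700 = cuDNN 8.7
        |     print(torch.version.cuda)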
       | 
        | [1]: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issu...
        
       | voz_ wrote:
       | Always cool to see :)
       | 
        | If you build from source, it should be even faster than the
        | release builds, if only because we keep landing fixes and
        | speedups regularly.
       | 
       | If anyone tries this and runs into bugs or issues, feel free to
       | respond here and I can take a look.
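        | 
        | The usual flow, per the README (you'll also want the build
        | dependencies it lists):
        | 
        |     git clone --recursive https://github.com/pytorch/pytorch
        |     cd pytorch
        |     python setup.py develop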
        
       | alfalfasprout wrote:
        | Oh man, I deal with CUDA version nuances all the time. ML
        | dependency management in particular is always extra fun.
        | Between all the different CUDA, cuDNN, and NCCL versions,
        | plus the versions of the TF frameworks, numpy dependencies,
        | etc., it can quickly become a mess.
        | 
        | We've started really investing in a better solution. It's
        | always interesting to see just how big a difference getting
        | the right CUDA version for a given build of e.g. torch makes.
        
       | WithinReason wrote:
       | PyTorch has been listing this install option for months, just
       | click the "CUDA 11.8" button:
       | 
       | https://pytorch.org/get-started/locally/
        
         | [deleted]
        
         | baby_souffle wrote:
         | Yes, but 11.7 has been the "stable" release:
         | https://github.com/pytorch/pytorch/blob/main/RELEASE.md#rele...
        
       | lostmsu wrote:
       | Does it apply to Windows?
        
       | boredumb wrote:
          | Wow, if those benchmarks are true, that is amazing to read.
        
         | valine wrote:
          | It's true. I've been installing nightly builds of pytorch
          | for months specifically to access this fix. I've been
          | getting 40 it/s when outputting a 512x512 image on my 4090;
          | prior to the fix I'd get around 19 it/s.
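          | 
          | If you want to sanity-check your own numbers outside the
          | UI, something like this works (`step` is a hypothetical
          | stand-in for one denoising iteration):
          | 
          |     import time, torch
          | 
          |     def its_per_sec(step, n=50):
          |         torch.cuda.synchronize()  # drain queued kernels
          |         t0 = time.perf_counter()
          |         for _ in range(n):
          |             step()
          |         torch.cuda.synchronize()  # wait for the GPU
          |         return n / (time.perf_counter() - t0)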
        
           | photoGrant wrote:
           | Why am I with a 3090 @ 3 it/s?
           | 
           | Am I doing something heavily wrong? All through WSL2
        
             | valine wrote:
              | it/s depends on resolution and other factors like batch
              | size. What are you getting for a 512x512 image?
        
               | photoGrant wrote:
                | Fair, 12.3. My numbers are with the dev branch and
                | 1024x1024 with the XL model.
        
               | valine wrote:
                | Yeah, that'll do it. 3 it/s sounds normal then.
        
               | capableweb wrote:
                | Also the sampler and a bunch of other parameters.
        
       | bilsbie wrote:
        | ELI5?
        
         | thangngoc89 wrote:
          | Using a newer CUDA version with supported hardware and
          | software boosts performance.
        
       ___________________________________________________________________
       (page generated 2023-07-19 23:01 UTC)