[HN Gopher] Use pytorch2+cu118 with ADA hardware for 50%+ speedup
___________________________________________________________________

Use pytorch2+cu118 with ADA hardware for 50%+ speedup

Author : vans554
Score  : 116 points
Date   : 2023-07-19 14:55 UTC (8 hours ago)

(HTM) web link (gpux.ai)
(TXT) w3m dump (gpux.ai)

| mrwizrd wrote:
| Can the same speedup be obtained on a 3090?

| vans554 wrote:
| I accidentally stumbled upon this and did not expect such a
| speedup. It seems anything less than cu118 does not properly
| support the RTX 4090 (or H100).
|
| Bumping to CUDA 12.2 with pytorch2.0.1+cu118 made my SDXL go 50%
| faster and ESRGAN 80% faster on the 4090.

| latchkey wrote:
| Good find!

| brucethemoose2 wrote:
| You can also run PyTorch cu121 nightly builds.
|
| These also allow `torch.compile` to function properly with
| dynamic input, which should net another 30%+ boost to SD.

| cheald wrote:
| Is there a trick to getting pytorch+cu121 and xformers to play
| nicely together? All the xformers packages I can find are
| torch==2.0.1+cu118.
|
| Edit: After a bit more research, it looks like scaled dot
| product attention in PyTorch 2 provides much the same benefit
| as xformers, without the need for xformers proper. Nice.

| brucethemoose2 wrote:
| xformers has to match the PyTorch build. For PyTorch nightly,
| you need to build it from source.
|
| xformers still has a tiny performance benefit (especially at
| higher resolutions, IIRC), but yeah, PyTorch's SDP is good.

| vans554 wrote:
| Pretty interesting. Using nightly + cu121 I'm getting 8.18
| it/s, another ~5% improvement over the 7.78 it/s that cu118
| gave.

| voz_ wrote:
| This comment brings a tear to my eye.

| doctorpangloss wrote:
| The underlying problem is the community's decision to make
| users manage this in the first place.
|
| This is an example of a setup.py that correctly installs the
| accelerated PyTorch for your platform:
|
| https://github.com/comfyanonymous/ComfyUI/blob/9aeaac4af5e19...
|
| As you can see, it was never merged - for philosophical
| reasons, I believe. The author wanted to merge it earlier and
| changed his mind.
|
| Why make end users deal with this at all? The ROI from a
| layperson choosing these details is very low.
|
| Python has a packaging problem; this is well known. Fixing
| setuptools would be the highest-yield change. Other package
| tooling can't install PyTorch, for example:
| https://github.com/python-poetry/poetry/issues/6409#issuecom...
|
| PyTorch itself is wonkily packaged, but I'm sure they have a
| good reason for this. Anyway, it goes to show that you can put
| a huge amount of effort into fixing this particular problem
| that everyone touching this technology has, and the maintainers
| everywhere will go nowhere with it. And I don't think this is a
| "me" problem, because there is so much demand for packaging
| PyTorch correctly - all the easy UIs, etc.

| brucethemoose2 wrote:
| > But I'm sure they have a good reason for this.
|
| CUDA and ROCm make this an intractable problem. Basically there
| is no way to sanely package everything users need, and the
| absolutely enormous, cuda/rocm-versioned PyTorch packages with
| missing libs are already a compromise.
|
| TBH the whole ecosystem is not meant for end-user inference
| anyway.

| voz_ wrote:
| Sorry, no idea what you are talking about.
|
| I am talking about dynamic shapes in torch.compile.
|
| You seem to be talking about software packaging. You also make
| heavy use of the word "this" without it being clear what "this"
| is.

| brucethemoose2 wrote:
| The two most popular stable diffusion UIs (automatic1111 and
| comfy) have longstanding issues with a few known but poorly
| documented bugs, like the ADA performance issue.
|
| For instance, the torch.compile thing we are talking about is
| (last I checked) totally irrelevant for those UIs, because they
| are still using the Stability AI implementation rather than the
| Hugging Face diffusers package, which is what gets checked for
| graph breaks. This may extend to SDXL.
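For readers on the diffusers path, a minimal sketch of the
`torch.compile` dynamic-shapes usage discussed above; the toy model,
layer sizes, and resolutions here are illustrative stand-ins, not the
actual SD UNet:

    import torch
    import torch.nn as nn

    # Toy stand-in for a diffusion UNet; only the compile call
    # pattern matters here, not the architecture.
    model = nn.Sequential(
        nn.Conv2d(4, 64, 3, padding=1),
        nn.SiLU(),
        nn.Conv2d(64, 4, 3, padding=1),
    ).cuda().half()

    # dynamic=True asks the compiler to generate shape-polymorphic
    # kernels, so varying input resolutions don't each trigger a
    # fresh recompile.
    compiled = torch.compile(model, dynamic=True)

    with torch.inference_mode():
        for size in (64, 96, 128):  # e.g. different latent resolutions
            x = torch.randn(1, 4, size, size,
                            device="cuda", dtype=torch.half)
            y = compiled(x)

Per the comments above, release builds of PyTorch 2.0 may still
recompile on new shapes; the cu121 nightlies are reported to handle
dynamic input properly.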
| hospitalJail wrote:
| This was one of the reasons I skipped the 4090.
|
| So few people have the technology that I knew I'd be spending
| significant time figuring out solutions to problems.
|
| The other reason is that I'd rather wait a few years and get
| some 6090 with 4x the VRAM.

| yakorevivan wrote:
| [dead]

| VadimPR wrote:
| I can confirm that it's true on an RTX 4080 on Ubuntu 22.04 LTS.

| SekstiNi wrote:
| Surprised people don't know about this, as it has been common
| knowledge in the SD community [1] since October last year.
| Strictly speaking you don't even need CUDA 11.8+ to get the
| speedup; it's sufficient to use cuDNN 8.6+, though you should
| use the newest versions for other reasons.
|
| [1]: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issu...

| voz_ wrote:
| Always cool to see :)
|
| If you build from source, it should be even faster than the
| release builds, if only because we keep landing fixes and
| speedups regularly.
|
| If anyone tries this and runs into bugs or issues, feel free to
| respond here and I can take a look.

| alfalfasprout wrote:
| Oh man, I deal with CUDA version nuances all the time. ML
| dependency management in particular is always extra fun.
| Between all the different CUDA, cuDNN, and NCCL versions, plus
| the versions of TF frameworks, numpy dependencies, etc., it can
| quickly become a mess.
|
| We've started really investing in a better solution - it's
| always interesting to see just how big a difference getting the
| right CUDA version for a given build of, e.g., torch makes.

| WithinReason wrote:
| PyTorch has been listing this install option for months; just
| click the "CUDA 11.8" button:
|
| https://pytorch.org/get-started/locally/

| [deleted]

| baby_souffle wrote:
| Yes, but 11.7 has been the "stable" release:
| https://github.com/pytorch/pytorch/blob/main/RELEASE.md#rele...

| lostmsu wrote:
| Does it apply to Windows?

| boredumb wrote:
| Wow, if those benchmarks are true, that is amazing to read.

| valine wrote:
| It's true. I've been installing nightly builds of PyTorch for
| months specifically to access this fix. I have been getting
| 40 it/s outputting a 512x512 image on my 4090. Prior to the fix
| I would get around 19 it/s.

| photoGrant wrote:
| Why am I at 3 it/s with a 3090?
|
| Am I doing something heavily wrong? All through WSL2.

| valine wrote:
| it/s depends on resolution and other factors like batch size.
| What are you getting for a 512x512 image?

| photoGrant wrote:
| Fair, 12.3. My numbers are with the dev branch at 1024 with
| the XL model.

| valine wrote:
| Yeah, that'll do it. 3 it/s sounds normal then.

| capableweb wrote:
| Also the sampler and a bunch of other parameters.

| bilsbie wrote:
| ELI5?

| thangngoc89 wrote:
| Using a newer CUDA version with supported hardware and software
| boosts performance.
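As a practical coda to that summary: the speedup in this thread comes
down to the CUDA and cuDNN versions a given PyTorch build was
compiled against (cuDNN 8.6+ per SekstiNi's comment above). A quick
way to check, assuming PyTorch is installed on a CUDA-capable
machine:

    import torch

    # Report the versions the running PyTorch build ships with.
    print(torch.__version__)               # e.g. "2.0.1+cu118"
    print(torch.version.cuda)              # e.g. "11.8"
    print(torch.backends.cudnn.version())  # e.g. 8700; 8600+ = cuDNN 8.6+
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 4090"

At the time of the thread, the cu118 wheels came from the selector
linked above, e.g. `pip3 install torch --index-url
https://download.pytorch.org/whl/cu118`.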