[HN Gopher] High-performance image generation using Stable Diffusion in KerasCV
___________________________________________________________________
High-performance image generation using Stable Diffusion in KerasCV
Author : tosh
Score  : 317 points
Date   : 2022-09-28 08:28 UTC (14 hours ago)
Link   : keras.io
| ShamelessC wrote:
| Nice! I'll take anything over the huggingface version - the API
| design by huggingface, where CLIP is in transformers and
| everything else is in diffusers... not a great developer
| experience (unless you're the type of person who likes their
| Python to look like half-baked J2EE).
| capableweb wrote:
| Tried to get this running on my 2080 Ti (11GB VRAM) but hit OOM
| issues. So while performance seems better (I can't actually test
| this myself), I'm unable to verify it, as it doesn't run. Some of
| the PyTorch forks work on as little as 6GB of VRAM (or maybe even
| 4GB?), but it's always good to have implementations that optimize
| for different factors; this one seems to trade memory usage for
| raw generation speed.
|
| Edit: there seems to be a more "full" version of the same work
| available here, made by one of the authors of the submission
| article: https://github.com/divamgupta/stable-diffusion-tensorflow
| WithinReason wrote:
| Just breaking the attention matrix multiply into parts allows a
| significant reduction in memory consumption at minimal cost.
| There are variants out there that do that and more.
|
| Short version: attention works as a matrix multiply that looks
| like this: s(QK)V, where QK is a large matrix but Q, K, V and the
| result are all small. You can break Q into horizontal strips; the
| result is then the vertical concatenation of:
|     s(Q1*K)*V
|     s(Q2*K)*V
|     s(Q3*K)*V
|     ...
|     s(QN*K)*V
|
| Since you're reusing the memory for the computation of each
| block, you can get away with much less simultaneous RAM use.
| liuliu wrote:
| PyTorch doesn't offer an in-place softmax, which contributes
| about 1GiB of extra memory for inference (of Stable Diffusion).
| Although none of these are significant improvements compared to
| just switching to FlashAttention inside the UNet model.
| GistNoesis wrote:
| Yeah, the problem is indeed in the attention computation.
|
| You can do something like that, but it's far from optimal. From a
| memory-consumption perspective, the right way to do it is to
| never materialize the intermediate matrices.
|
| You can do that with a custom op that computes att =
| scaledAttention(Q,K,V), with the gradient given by dQ,dK,dV =
| scaledAttentionBackward(Q,K,V,att,datt). The memory needed for
| these ops is the memory to store Q,K,V,att,dQ,dK,dV,datt, plus
| some extra temporary memory.
|
| When you do the work to minimize memory consumption, this extra
| temporary memory is really small: 6 * attention_horizon^2 *
| number_of_cores_running_in_parallel numbers.
|
| But even though there is not much recomputation, this kernel
| won't run as fast due to the pattern of memory access, unless you
| spend some time manually optimizing it. The place to do it is at
| the level of the autodiff framework, i.e. TensorFlow or PyTorch,
| with low-level C++/CUDA code.
|
| Anybody can write custom kernels, but deploying, maintaining and
| distributing them is a nightmare. So the only people that could
| and should have done it are the TensorFlow and PyTorch guys. In
| fact they probably have, but it's considered a strategic
| advantage and reserved for internal use only.
|
| The mere mortals like us have to use workarounds (splitting
| matrices, KeOps, gradient checkpointing...) to not be too
| penalized by the limited ops of the out-of-the-box autodiff
| frameworks like TensorFlow or PyTorch.
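To make the strip trick WithinReason describes concrete, here is a
minimal NumPy sketch (function names and the chunk count are
illustrative, not from any of the linked repos). Splitting Q into row
strips means only an (n/n_chunks, n) slice of the score matrix exists
at any one time, and the concatenated result matches the unchunked
version:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(q, k, v):
        # Baseline: materializes the full (n, n) score matrix at once.
        return softmax(q @ k.T) @ v

    def chunked_attention(q, k, v, n_chunks=8):
        # Each strip of Q materializes only an (n/n_chunks, n) slice
        # of the score matrix, cutting peak memory roughly by n_chunks.
        strips = [softmax(q_i @ k.T) @ v
                  for q_i in np.array_split(q, n_chunks, axis=0)]
        return np.concatenate(strips, axis=0)

    q, k, v = (np.random.randn(1024, 64) for _ in range(3))
    assert np.allclose(attention(q, k, v), chunked_attention(q, k, v))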
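GistNoesis's custom-op idea can be approximated at the Python level
with a torch.autograd.Function that saves only Q, K and V, then
recomputes the attention matrix inside the backward pass so it is
never stored between the two passes. This is a sketch of the
recomputation workaround he mentions (gradient checkpointing in
spirit), not the fused low-level kernel he describes; ScaledAttention
is a hypothetical name:

    import torch

    class ScaledAttention(torch.autograd.Function):
        @staticmethod
        def forward(ctx, q, k, v):
            scale = q.shape[-1] ** -0.5
            att = torch.softmax((q @ k.transpose(-1, -2)) * scale, dim=-1)
            ctx.save_for_backward(q, k, v)  # keep only the small inputs
            ctx.scale = scale
            return att @ v

        @staticmethod
        def backward(ctx, dout):
            q, k, v = ctx.saved_tensors
            scale = ctx.scale
            # Recompute the big (n, n) matrix instead of having stored it.
            att = torch.softmax((q @ k.transpose(-1, -2)) * scale, dim=-1)
            dv = att.transpose(-1, -2) @ dout
            datt = dout @ v.transpose(-1, -2)
            # Softmax backward, applied row by row.
            ds = att * (datt - (datt * att).sum(dim=-1, keepdim=True))
            return (ds @ k) * scale, (ds.transpose(-1, -2) @ q) * scale, dv

    q, k, v = (torch.randn(512, 64, requires_grad=True) for _ in range(3))
    ScaledAttention.apply(q, k, v).sum().backward()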
| Karuma wrote:
| There are forks that work on even 1.8GB of VRAM! They work great
| on my GTX 1050 2GB.
|
| This is by far the most popular and active one right now:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui
| jtap wrote:
| Just as another point of reference: I followed the Windows
| install. I'm running this on my 1060 with 6GB of memory. With no
| setting changes it takes about 10 seconds to generate an image. I
| often run with sampling steps up to 50, and that takes about 40
| seconds to generate an image.
| rmurri wrote:
| What settings and repo are you using for the GTX 1050 with 2GB?
| Karuma wrote:
| I'm using the one I linked in my original post:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui
|
| The only command line argument I'm using is --lowvram, and I
| usually generate pictures at the default settings at 512x512
| image size.
|
| You can see all the command line arguments and what they do here:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...
| [deleted]
| extesy wrote:
| > This is by far the most popular and active right now:
| > https://github.com/AUTOMATIC1111/stable-diffusion-webui
|
| While technically the most popular, I wouldn't call it "by far".
| This one is a very close second (500 vs 580 forks):
| https://github.com/sd-webui/stable-diffusion-webui/tree/dev
| Karuma wrote:
| That's why I said "right now", since I feel that most people have
| moved from the one you linked to AUTOMATIC's fork by now. hlky's
| fork (the one you linked) was by far the most popular one until a
| couple of weeks ago, but some problems with the main developer's
| attitude and a never-ending migration from Gradio to Streamlit,
| filled with issues, made it lose its popularity.
|
| AUTOMATIC has the attention of most devs nowadays. When you see
| any new ideas come up, they usually appear in AUTOMATIC's fork
| first.
| jaggs wrote:
| This needs Windows 10/11 though?
| Karuma wrote:
| Nope. There are instructions for Windows, Linux and Apple Silicon
| in the readme:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui
|
| There's also this fork of AUTOMATIC1111's fork, which also has a
| Colab notebook ready to run, and it's way, way faster than the
| KerasCV version:
| https://github.com/TheLastBen/fast-stable-diffusion
|
| (It also has many, many more options and some nice, user-friendly
| GUIs. It's the best version for Google Colab!)
| jaggs wrote:
| Brilliant, thanks.
| sophrocyne wrote:
| While AUTOMATIC is certainly popular, calling it the most
| active/popular would be ignoring the community working on Invoke.
| Forks don't lie.
|
| https://github.com/invoke-ai/InvokeAI
| counttheforks wrote:
| > Forks don't lie.
|
| They sure do. InvokeAI is a fork of the original repo
| CompVis/stable-diffusion and thus shares its fork counter. Those
| 4.1k forks are coming from CompVis/stable-diffusion, not
| InvokeAI.
|
| Meanwhile, AUTOMATIC1111/stable-diffusion-webui is not itself a
| fork, and has 511 forks.
| pwillia7 wrote:
| Subjectively, AUTOMATIC has taken over -- I have not heard of
| Invoke yet but will check it out.
| toqy wrote:
| The only reason to use it, imo, has been if you need Mac/M1
| support, but that's probably in other forks by now.
| sophrocyne wrote:
| Welp - TIL.
|
| Thanks for the correction.
|
| Any idea on how to count forks of a downstream fork? If anyone
| would know... :)
| rcarmo wrote:
| This is _markedly_ faster than the PyTorch versions I've seen
| (nothing against the library, just categorizing the
| implementations). It would be nice to see this include the little
| quality-of-life additional models (eye fixes, upscaling, etc.),
| but I suspect the optimizations are transferable.
|
| Either way, getting 3 images for 25 iterations in under 10
| seconds (a quick Colab test, which is where I've taken to
| comparing these things) is just ridiculously faster.
| zone411 wrote:
| Which GPU did you test on Colab? Are you comparing with one of
| the fp16 PyTorch versions? Their test shows little improvement on
| the V100.
|
| PyTorch is now quite a bit more popular than Keras in
| research-type code (except when it comes from Google), so I don't
| know if these enhancements will get ported. This port was done by
| people working on Keras, which is kind of telling - there isn't a
| lot of outside interest.
| _ntka wrote:
| This is not true; the initial Keras port of the model was done by
| Divam Gupta, who is not affiliated with Keras or Google. He works
| at Meta.
|
| The benchmark in the article uses mixed precision (and equivalent
| generation settings) for both implementations; it's a fair
| benchmark.
|
| In the latest StackOverflow global developer survey, TensorFlow
| had 50% more users than PyTorch.
| zone411 wrote:
| Two Keras creators are listed as authors on this post. If they
| were not involved, this should be specified. I specifically
| talked about research, and StackOverflow is not in any way
| representative of what's used. Do you disagree that the majority
| of neural net research papers now only have PyTorch
| implementations, not TensorFlow? Also, according to Google
| Trends, PyTorch is more popular:
| https://trends.google.com/trends/explore?geo=US&q=pytorch,te...
| BTW, I would love it if TF made a strong comeback; it's always
| better to have two big competing frameworks, and I have some
| issues with PyTorch, including with its performance.
| polygamous_bat wrote:
| > In the latest StackOverflow global developer survey, TensorFlow
| > had 50% more users than PyTorch.
|
| It also doesn't help that PyTorch has its own discussion forum
| [1] where most PyTorch questions end up.
|
| [1]: https://discuss.pytorch.org/
| kgwgk wrote:
| Should we expect people not working on Keras to have the interest
| and ability to get it to work on Keras?
| zone411 wrote:
| If these people have existing Keras code they want to integrate,
| or they are interested in developing it further in Keras, then it
| shouldn't require any insider knowledge to create a Keras version
| of a small but popular open-source project like this. I am very
| sure we'd get a PyTorch version made by outsiders quickly if
| Stable Diffusion had originally been released in Keras/TF.
| kgwgk wrote:
| What is your definition of outsider?
|
| We got a Keras version made by Divam Gupta very quickly after
| Stable Diffusion was released.
|
| Is he not an outsider?
| zone411 wrote:
| From what I can tell, this Keras version was just released (the
| date on the post is Sep. 25) and the first author listed is the
| creator of Keras. Is this incorrect? I am not familiar with Divam
| Gupta, and I would consider outsiders to be people not paid by
| Google.
| kgwgk wrote:
| https://mobile.twitter.com/divamgupta/status/157123450432020...
|
| https://github.com/divamgupta/stable-diffusion-tensorflow
|
| Now they are working together. That may be "telling" to you, but
| I'm not sure why it should cast a negative light on Keras,
| really.
| zone411 wrote:
| I didn't say that it casts a negative light on Keras. Just on its
| popularity among outsiders. There are thousands of great
| libraries out there that are much less popular than Keras or
| PyTorch. And BTW, JAX is a useful Google-created framework that's
| growing in popularity among researchers and pushed PyTorch to
| improve (functorch), so I have nothing against Google projects.
| kgwgk wrote:
| The reason why we're having this discussion is that what you call
| a Keras outsider ported Stable Diffusion to Keras last week.
|
| It's hard to understand how that can say anything negative about
| the popularity of Keras among outsiders.
| zone411 wrote:
| So why are Keras creators listed as authors on this post, and why
| is it on Keras' official site? Compare this to the hundreds of
| PyTorch SD forks that have been thrown up on GitHub.
|
| The OP was wondering whether additional enhancements will also be
| ported, and that's what I was responding to. It's simply much
| less likely that a new paper will get a Keras implementation than
| a PyTorch implementation.
| nextaccountic wrote:
| Is this faster even after applying the optimizations that reduce
| VRAM usage? (some of which the Keras version seems to lack)
| labarilem wrote:
| Very interesting performance. Also a very good write-up. Can't
| wait to try this.
| gpderetta wrote:
| I have a mediocre GPU but a fast CPU (with a lot of RAM). Would I
| see improvements there?
|
| I guess I should give it a try.
| senthilnayagam wrote:
| Tried it yesterday; on an Intel i9 MacBook Pro it takes about 300
| seconds per image.
| gpderetta wrote:
| You mean the Keras version? How does it compare to the original
| one? Currently on my 10850K I get 2.4s/iteration, which is
| borderline usable. I haven't managed (nor tried very hard) to get
| the CUDA version working on my 1070; I expect it to be a little
| better, but I don't want to fight with RAM issues.
| ttflee wrote:
| How many steps did you perform?
|
| I tried some and found no major differences after 16 steps or so
| with a given random seed.
| ttflee wrote:
| On an Intel MacBook Pro 2020, CPU-only, the original one[1] using
| PyTorch utilized one core only. A TensorFlow implementation[2]
| with oneDNN support, which utilized most of the cores, ran at
| ~11 sec/iteration. Another OpenVINO-based implementation[3] ran
| at ~6.0 sec/iteration.
|
| [1] https://github.com/CompVis/stable-diffusion/
|
| [2] https://github.com/divamgupta/stable-diffusion-tensorflow/
|
| [3] https://github.com/bes-dev/stable_diffusion.openvino/
| gpderetta wrote:
| Yes, I use [3] and I get 2.4s/iter on my 10-core machine. I was
| wondering if Keras would give additional help here. I'll have to
| try, I guess.
| erwinh wrote:
| Not necessarily my expertise, but if, as stated by the article, 2
| lines of code can already get a 2x performance gain, what more
| can be done to improve performance in the coming years?
| londons_explore wrote:
| It's not two lines of code... It's 2 lines that enable tens of
| thousands of lines of library code by invoking a new optimizer...
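For context, the "2 lines" are presumably the mixed-precision switch
from the keras.io post, while XLA compilation is a constructor flag
on the KerasCV model. A minimal sketch following the API shown in
the article (flag and method names as of its publication; the
weights are fetched from Hugging Face as .h5 files on first
instantiation, as discussed further down the thread):

    import keras_cv
    from tensorflow import keras

    # The two lines: run compute in float16 while keeping float32
    # variables, which puts the big matmuls onto the GPU's tensor cores.
    keras.mixed_precision.set_global_policy("mixed_float16")

    # jit_compile=True asks Keras to compile the generation graph with
    # XLA, fusing kernels ahead of time; weights download on first use.
    model = keras_cv.models.StableDiffusion(
        img_width=512, img_height=512, jit_compile=True)
    images = model.text_to_image(
        "A cute otter in a rainbow whirlpool holding shells, watercolor",
        batch_size=3)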
| MintsJohn wrote:
| I'm curious whether this really is "the fastest model yet"; there
| are PyTorch optimizations as well.
|
| Something like global optimization has been done in PyTorch;
| here's a blog post about it:
| https://www.photoroom.com/tech/stable-diffusion-25-percent-f...
|
| Mixed precision seems pretty much the default, looking at a few
| Stable Diffusion notebooks.
|
| More intriguing, there's also a more local optimization that
| makes PyTorch faster:
| https://www.photoroom.com/tech/stable-diffusion-100-percent-...
|
| Unless it's already there, that last one would be interesting to
| add to Keras.
|
| All in all, this machine learning ecosystem is wild. As a
| software dev, things like cache locality and preferring
| computation over memory access are basic optimizations, yet in
| machine learning they seem wildly disregarded; I've seen models
| happily swapping between GPU and system memory to do numpy
| calculations.
|
| Hopefully Stable Diffusion changes things; the work towards
| optimization is there, it just seems often disregarded. Since
| Stable Diffusion is a popular open model that, when optimized,
| can be run locally (and not as SaaS, where you just add extra
| compute power, which seems cheaper than engineers), and has a lot
| of enthusiasm behind it, it might just be the spark that makes
| optimization sexy again.
| shadowgovt wrote:
| Bonus points for this article being one of the clearest
| explanations of how Stable Diffusion works that I've seen to
| date.
| unspecldn wrote:
| How do I deploy this? Can someone offer some guidance please?
| monkmartinez wrote:
| Is the H5 file type that much different from whatever the PyTorch
| versions are using?
|
| The model is loaded from Huggingface during the instantiation of
| the StableDiffusion class. It is loaded as an H5 file, which I
| believe is unique to Keras[0]. I don't have any experience with
| Keras, so I can't say if that is good or bad. I wanted to see
| where they were getting the weights, as the blog post didn't
| demonstrate an explicit loading function/call like PyTorch does.
|
| Gonna run it and see... although I have like 40GB of Stable
| Diffusion weights on my computer now.
|
| [0] https://github.com/keras-team/keras-cv/blob/master/keras_cv/...
| mikereen wrote:
| enhance
| xiphias2 wrote:
| "Note that when running on an M1 MacBook Pro, you should not
| enable mixed precision, as it is not yet well supported by
| Apple's Metal runtime."
|
| It is a bit sad if this is just a closed-software issue that
| cannot be fixed :(
| ribit wrote:
| Mixed precision won't do anything on Apple Silicon anyway, since
| there is no performance advantage to using FP16 (aside from
| decreasing register pressure and RAM bandwidth, which won't
| happen here as the data is FP32 to start with).
| capableweb wrote:
| Is it really that sad? Closed software/hardware won't get support
| (official or community) for things until the maintainer of the
| software adds it, and people who buy that kind of hardware are
| more than aware of this pitfall (and in fact sometimes see it as
| a benefit too).
| lynndotpy wrote:
| I'm a new macOS user and, while I did anticipate some of these
| issues, I do often find myself surprised when running into them.
| This was one such surprise I hit recently.
| nextaccountic wrote:
| Does this run on AMD?
|
| A problem I see is that a lot of the time everything works fine
| on ROCm+HIP, but since Nvidia dominates the machine learning
| market (and thus most researchers run Nvidia), most forks don't
| bother checking and just advertise compatibility with Nvidia and
| sometimes Apple M1.
|
| Problem is, AMD GPUs are much cheaper!
| mrtksn wrote:
| Well, high-end stuff is always on Nvidia, and Apple Silicon seems
| to get some love because of its unified memory, which makes it
| possible in the first place, plus its popularity among
| developers.
|
| AMD seems to be popular among gamers on a budget, and the budget
| cards often don't have the VRAM required by default. So AMD seems
| to be in this weird place where the people who could make it work
| don't care.
| mrtranscendence wrote:
| For what it's worth, at the consumer level AMD cards -- at least
| recently -- have tended to have more VRAM than Nvidia cards. My
| 3080 Ti, which I bought for $1400 (though it now goes for ~$1k),
| has less RAM (12GB) than a 6800 XT that you can get for $600
| (16GB).
| cypress66 wrote:
| > Problem is, AMD GPUs are much cheaper!
|
| Are they? I believe Nvidia (consumer) GPUs have better
| price/performance than AMD for AI.
| nextaccountic wrote:
| I don't know about AI performance (does this happen only because
| of the overhead of providing CUDA through ROCm+HIP?), but I was
| just checking, and at least in my country (Brazil), for any given
| memory size (12GB, 8GB, 4GB) I can find cheaper AMD GPUs than
| Nvidia GPUs.
|
| Here I'm considering that the main constraint is VRAM: while
| Stable Diffusion now runs even on GPUs with 2GB of RAM, there are
| always new developments that require more VRAM (for example,
| Dreambooth requires 12GB as of today).
| mrtranscendence wrote:
| Maybe for AI? For other tasks, especially gaming, they punch well
| above their weight relative to Nvidia (though they lack features
| in comparison). It's also possible to get a 16GB card for much
| cheaper from AMD than from Nvidia.
| gdubs wrote:
| Has anyone tried running this with an AMD card on a Mac? At first
| glance it's able to run on Metal (given the M1 compatibility)...
| mrtksn wrote:
| On a 16GB 8c8g MacBook Air M1, the PyTorch implementation takes
| about 3.6s/step, which is about 3 minutes per image with the
| default parameters. I wonder how much faster this would be. If
| there's anyone out there with a similar system who wants to
| compare, could you please write up your findings?
| thisisjasononhn wrote:
| Not an M1 comparison, but I'm working on testing various GPU vs
| M1 comparisons with a few accessible cloud providers. My
| impression is that times should be the same, but it's nice to
| hear other real-world stats for M1 with SD. Makes me really want
| to rent the Hetzner M1 now.
|
| Which repo or build are you using, BTW? Is it the one related to
| this readme?
|
| https://github.com/magnusviri/stable-diffusion/blob/main/REA...
| stared wrote:
| I would love to see it, but this file is not accessible.
| thisisjasononhn wrote:
| Sorry about that, web link rot sure is real, eh.
|
| This is an example of the original file:
| https://github.com/magnusviri/stable-diffusion/blob/79ac0f34...
|
| Which seems to have been renamed, and cleaned up a bit, here:
| https://github.com/magnusviri/stable-diffusion/blob/main/doc...
|
| However, per the note on the magnusviri repo, the following repo
| should be used for a stable set of this SD toolkit:
| https://github.com/invoke-ai/InvokeAI
|
| with instructions here:
| https://github.com/invoke-ai/InvokeAI/blob/main/docs/install...
| mrtksn wrote:
| > Which repo or build are you using, BTW? Is it the one related
| > to this readme?
| > https://github.com/magnusviri/stable-diffusion/blob/main/REA...
|
| Yes, this one. However, it was like a month ago I think, so
| speeds might have improved.
| I'm getting ~2.2s/step with another implementation:
| https://news.ycombinator.com/item?id=33006447
| thisisjasononhn wrote:
| Wow, that sounds like a good improvement.
|
| I am also wondering, do you follow the general advice of 1
| iteration and 1 sample? For example:
|     --n_samples 1 --n_iter 1 (when referencing commands using
|     txt2img.py)
|
| I figure you could wait a bit for things to process going
| further, but I'm curious whether you're getting results like that
| with higher sample/iter settings.
| mrtksn wrote:
| I usually go with the default parameters.
| mft_ wrote:
| I've not tried it, but this approach apparently takes 10-20s per
| image?
|
| https://reddit.com/r/StableDiffusion/comments/xbo3y7/oneclic...
| mrtksn wrote:
| I just gave it a spin; it took 1 min 52 sec for a 50-step image,
| and that is ~2.2s/step. It seems faster than my original
| installation (which might also have improved in speed, as it was
| at a very beta stage when I tried it), but definitely not 20
| seconds for a 50-step image at 512x512 resolution.
|
| Maybe they use lower parameters.
|
| Edit:
|
| 50 steps at 256x256 resolution took 55 seconds.
|
| 50 steps at 768x768 resolution took exactly 8 minutes.
|
| PS: my MacBook Air is modified with thermal pads, so it takes a
| bit longer to start throttling than usual. Either way, it's very
| dependent on the ambient temperature.
| WatchDog wrote:
| I don't quite understand the benefit of mixed precision.
|
| It seems like using high precision is useful for training, but if
| not training, why not just use float16 weights and save the
| memory?
| NavinF wrote:
| Converting weights to float16 after training will reduce
| quality/accuracy, whereas mixed precision has a negligible effect
| on quality/accuracy and dramatically improves performance.
|
| If you really just want to save memory, there's plenty of other
| low-hanging fruit. It's just not a priority for most devs, since
| mid-tier GPUs start at 10GB whereas a typical model only has
| 0.5GB of weights. Activations and intermediate calculations use
| way more memory.
| zone411 wrote:
| You usually can. But it can take some work if you're using any
| libraries that expect FP32, and it might be slower, depending on
| the GPU. The FP16 support isn't quite as good as FP32.
| dennisy wrote:
| This is amazing! I am more used to TF, so very happy to see this!
|
| Has anyone got a suggestion on how to fine-tune this model?
| itronitron wrote:
| someone should compare results with just doing a keyword search
| on deviantart
| JoeAltmaier wrote:
| The otter examples highlight something you can't control using
| these things: the 'eats, shoots and leaves' phenomenon.
|
| The prompt was "A cute otter in a rainbow whirlpool holding
| shells, watercolor".
|
| Seems like the otter should be holding shells, the way a normal
| human parses it. The tool showed the otter in 'holding-shells',
| which are shells that hold otters, apparently. Also some random
| shells strewn about, as the technique is sensitive to spurious
| detail sprouting up from single words.
|
| Until the tool permits some kind of syntactic diagramming or so
| forth, we'll not be able to control for this.
|
| Just the other day here, I saw a picture of a fork and some
| plastic mushrooms. The prompt was 'plastic eating mushrooms',
| which was ambiguous even to humans. The tool chose to illustrate
| the subclass of mushrooms 'eating-mushrooms' (as opposed to
| poison mushrooms or decorative mushrooms, I suppose) made of
| plastic.
|
| When we're playing around, this can seem whimsical and artistic.
| But a graphic designer might want some semblance of control over
| the process.
|
| Not sure how a solution would work.
| CuriouslyC wrote:
| Graphic designers lean on img2img in their workflows more than
| txt2img, as that gives you the control you speak of.
| UncleEntity wrote:
| My favorite is when you do "<whatever> bla, bla, bla, wearing a
| t-shirt by <artist>" and it gives you an image of <whatever>
| wearing a t-shirt with a print in the style of the artist. Which
| adds extra dimensions to play with, so it isn't all that bad.
| CrazyStat wrote:
| This is the compositionality problem -- the language model
| sometimes doesn't quite know how to put the words together.
| Better language models will help in the future; in the meantime
| you can give it a helping hand through prompt engineering or by
| using img2img.
| honksillet wrote:
| Can this be used to train your own model? I have a moderately
| large medical image dataset that I would like to try this with
| for data augmentation.
| jawadch93 wrote: