[HN Gopher] Run Stable Diffusion on Your M1 Mac's GPU ___________________________________________________________________ Run Stable Diffusion on Your M1 Mac's GPU Author : bfirsh Score : 619 points Date : 2022-09-01 16:19 UTC (6 hours ago) web link (replicate.com) | mrkstu wrote: | I consistently have items only partially in frame- | horses/fish/etc- any tips on getting the algo to keep specified | items fully in frame? | SirYandi wrote: | I struggle with this too. One keyword which helps is 'wide | angle'. Sometimes 'full body shot' works for generating humans. | [deleted] | ChildOfChaos wrote: | Is there any way to keep up with this stuff / a beginner's guide? I | really want to play around with it but it's kinda confusing to | me. | | I don't have an M1 Mac, I have an Intel one with an AMD GPU, not | sure if I can run it? I don't mind if it's a bit slow. Or what is | the best way of running it in the cloud? Anything that can | produce high res for free? | holoduke wrote: | Follow this guide: https://github.com/lstein/stable-diffusion/blob/main/README-... | | I am running it on my 2019 Intel MacBook Pro. 10 minutes per | picture. | Karuma wrote: | Yes, you can run it on your Intel CPU: https://github.com/bes-dev/stable_diffusion.openvino | | And this should work on an AMD GPU (I haven't tried it, I only | have NVIDIA): https://github.com/AshleyYakeley/stable-diffusion-rocm | | There are also many ways to run it in the cloud (and even more | coming every hour!). I think this one is the most popular: | https://colab.research.google.com/github/altryne/sd-webui-co... | EddySchauHai wrote: | https://beta.dreamstudio.ai/dream | | It's not free, but I've played with it a lot over the last two | days for around $10, generating the most complex photos I can | (1024x1024, 150 steps, 9 images, etc.) | code51 wrote: | Without k-diffusion support, I don't think this replicates the Stable | Diffusion experience: | | https://github.com/crowsonkb/k-diffusion | | Yes, running on M1/M2 (MPS device) was possible with | modifications. img2img and inpainting also work. | | However, you'll run into problems when you want k-diffusion | sampling or textual inversion support. | djhworld wrote: | Note that once you run the Python script for the first time it | seems to download a further ~2GB of data. | nonethewiser wrote: | Including a Rick Astley image for the first thing you gen -_- | ErneX wrote: | That's the NSFW filter :D | johnfn wrote: | Hm, when I run the example, I get this error: | | > expected scalar type BFloat16 but found Float | | Has anyone seen this error? It's pretty hard to google for. | bfirsh wrote: | Are you running macOS >=12.3? | johnfn wrote: | Oh, no I'm not, I'm on 12.0. Does this make a difference? | johnfn wrote: | Update: I solved this error more properly by upgrading to the | latest version. Thanks bfirsh. | wuyishan wrote: | I am having the same issue on macOS 12.2.1 (21D62); Python | 3.10.6. What did you upgrade to solve this? Thanks! (I can get | it working with `--precision full`) | johnfn wrote: | I'm now on 12.5.1 | nathas wrote: | Yeah. Try running with PYTORCH_ENABLE_MPS_FALLBACK=1 <script> | --full-precision | johnfn wrote: | This worked! | yboris wrote: | For me `--full-precision` kept erroring out, but `--precision | full` worked correctly. | omginternets wrote: | Is there a way to get it to run on an Intel-based Mac? I've | attempted several times, but quickly ran into dependency issues | and other quirks.
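The replies below all point at the same fix: the scripts as shipped hard-code the CUDA device, so on an Intel or Apple Silicon Mac you have to point PyTorch at a different device. A minimal sketch of that device selection, assuming a PyTorch build recent enough to include the MPS backend (the real scripts vary between forks):

      import torch

      # Prefer the Apple GPU (MPS) when the PyTorch build and macOS version
      # support it, then CUDA, then fall back to the plain CPU.
      if torch.backends.mps.is_available():
          device = torch.device("mps")
      elif torch.cuda.is_available():
          device = torch.device("cuda")
      else:
          device = torch.device("cpu")

      # Model weights and tensors then go to `device` instead of the
      # hard-coded .cuda() calls in the stock scripts.
      x = torch.ones(1, device=device)
      print(device, x.device)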
| vimy wrote: | Comment from github: "By the way, i confirmed to work on my | Intel 16-in MacBook Pro via mps. GPU (Radeon Pro 5500M 8GB) | usage is 70-80% and It takes 3 min where --n_samples 1 --n_iter | 1. My repo https://github.com/cruller0704/stable-diffusion- | intel-mac" | | For comparison, my RTX 2070 takes 10 seconds for one image | (512x512) | jw1224 wrote: | I believe the branch which adds support for Apple Silicon also | adds support for running on Intel chips (albeit extremely | slowly). I haven't tested it myself, but I've seen several | people in the GitHub issues saying this. | pja wrote: | The standard release works fine, if you tweak the code to use | the CPU pytorch device instead of CUDA. It does take about an | hour to generate a set of images with the standard options on | my AMD 2600 CPU though! | fossuser wrote: | Thanks for this - it's rare to see a setup guide that actually | works on each step! | | I did need to run the troubleshooting step too, could probably | just move that up as a required step in the guide. | bfirsh wrote: | It isn't required for some (most?) users. Weirdly sometimes pip | is picking up the wheel for `onnx`, sometimes it isn't, and we | can't figure out why. | | Any Python packaging experts know what's going on? all macOS | 12, arm64, Python 3.10. Can't think it wouldn't resolve the | wheel. | | But yes, good idea to move up. I'll stick it next to the `pip | install`. | amelius wrote: | I'd rather see someone implemented glue that allows you to run | arbitrary (deep learning) code on any platform. | | I mean, are we going to see X on M1 Mac, for any X now in the | future? | | Also, weren't torch and tensorflow supposed to be this glue? | nathas wrote: | Broadly speaking, it looks like they are. The implementation of | Stable Diffusion doesn't appear to be using all of those | features correctly (i.e. device selection fails if you don't | have CUDA enabled even though MPS | (https://pytorch.org/docs/stable/notes/mps.html) is supported | by PyTorch. | | Similar goes for quirks of Tensorflow that weren't taken | advantage of. That's largely the work that is on-going in the | OSX and M1 forks. | davedx wrote: | I got stuck on this roadblock, couldn't get CUDA to work on | my Mac, was very confusing | desindol wrote: | Didn't apple stop supporting Nvidia cards like 5 years ago? | How could it be confusing that Cuda wouldn't run? | root_axis wrote: | lol presumably the OP didn't know that... hence the | confusion. | cercatrova wrote: | That's because CUDA is only for Nvidia GPUs and Apple | doesn't support Nvidia GPUs, it has its own now. | dustingetz wrote: | (base) stable-diffusion git:(main) conda env create -f | environment.yaml Collecting package metadata | (repodata.json): done Solving environment: failed | ResolvePackageNotFound: - cudatoolkit=11.3 | | oh i was following the github fork readme, there is a | special macos blog post | calrizien wrote: | link? | scoopertrooper wrote: | If you look at the substance of the changes being made to | support Apple Silicon, they're essentially detecting an M* mac | and switching to PyTorch's Metal backend. | | So, yeah PyTorch is correctly serving as a 'glue'. | | https://github.com/CompVis/stable-diffusion/commit/0763d366e... | sxp wrote: | Is there a good set of benchmarks available for Stable Diffusion? | I was able to run a custom Stable Diffusion build on a GCE A100 | instance (~$1/hour) at around 1Mpix per 10 seconds. 
I.e, I could | create a 512x512 image in 2.5 seconds with some batching | optimizations. A consumer GPU like a 3090 runs at ~1Mpix per 20 | seconds. | | I'm wondering what the price floor of stock art will be when | someone can use https://lexica.art/ as a starting point, generate | variations of a prompt locally, and then spend a few minutes | sifting through the results. It should be possible to get most | stock art or concept art at a price of <$1 per image. | skybrian wrote: | So you're estimating over a thousand generated images an hour | and less than a tenth of a cent per image using the A100. If | that turns out to be accurate, it seems like some online image | generation will included in the price of the stock art. | | (DreamStudio is charging a bit over one cent per generated | image at default settings, depending on exchange rates.) | sowbug wrote: | Related: I wrote up instructions for running Stable Diffusion | on GCE. I used a Tesla T4, which is probably the cheapest that | can handle the original code. If you're spinning up an instance | to play with, rather than to batch-process, then cheaper makes | more sense because most of the machine's time is spent waiting | for you to type stuff and look at the results. | | https://sowbug.com/posts/stable-diffusion-on-google-cloud/ | fleddr wrote: | It can be even cheaper. | | Midjourney, in case you appreciate their output, has an | unlimited plan for 30$ a month. The only limitation is that if | you're an extremely heavy user, they may "relax" you, which | means results come in a bit slower. | | Note that they've been also experimenting with a --beta | parameter which basically means the algorithm uses | StableDiffusion's algorithm behind the scenes, or you can use | any of 4 versions of MidJourney's more stylistic algorithms. | | So if you don't want to tinker or don't have a high-end GPU, | it's a cheap way to play around. I have StableDiffusion running | locally but still prefer MidJourney. I enjoy the stylistic | output but it's also a highly social way to generate art. | Everybody is doing it in the open. | | Anyway, the stock art part is a hairy subject. You should | assume that you AI image is not copyrighted. Which begs the | question why they would pay at all. | TekMol wrote: | Does running it locally give you anything over using the web | version? | schleck8 wrote: | You can finetune the model with Textual Inversion. And you | don't have the safety mechanism for nudity. | bfirsh wrote: | You can hack on it, modify it, integrate it with other code, | etc! | johnfn wrote: | Also, of course, it's entirely free. The web version is | actually paid, though it's hard to tell because they're not | super transparent about the fact that you're steadily eating | through a quota of initial tokens. | joshstrange wrote: | It's insane to me how fast this is moving. I jumped through a | bunch of hoops 2-3 days ago to get this running on my M1 Mac's | GPU and now it's way easier. I imagine we will have a nice GUI | (I'm aware of the web-ui, I haven't set it up yet) packaged in an | mac .app by the end of next week. Really cool stuff. | dekervin wrote: | Just yesterday I read another comment on HN saying we will have | to wait another decade before being able train it in someone | "basement"( https://news.ycombinator.com/item?id=32658941 ). I | made a bookmark for myself ( | https://datum.alwaysdata.net/?explorer_view=quest&quest_id=q... | ) to look for data that help estimate when it will be feasible | to run Stable Diffusion "at home". 
I guess it's already | outdated! | squeaky-clean wrote: | To run stable diffusion at home you have to download the | model file, which took the equivalent of tens of thousands of | hours spread across cloud provided GPUs. | | If the model file just vanished from everyone's hard drive | one day, and cloud providers installed heuristics to detect | and ban image dataset training, retraining the model file | would actually take decades for any consumer, even an | enthusiast with a dozen powerful GPUs. The image dataset | alone is 240TB. | zone411 wrote: | Umm training is not the same as running it. | addandsubtract wrote: | I hope this kickstarts some kind of M1 migration. There are so | many ML projects I'd like to try, but they all depend on CUDA. | joshstrange wrote: | Yep, I was just thinking the same thing. M1/M2 appears to be | a huge untapped resource for ML stuff as this proves. I maxed | out my MBP Max and this is probably the first time I'm | actually fully using the GPU cores and it's pretty freaking | cool. Creating landscapes or fictional characters (think D&D) | is already super fun, I look forward to playing with img2img | some more as well. | zone411 wrote: | The performance gap to the top-end Nvidia cards will get | much larger as they release new cards later this year, | though. | jdminhbg wrote: | Maybe, but I can buy a Mac, you just order one from | Apple. | wincy wrote: | An RTX 3090ti with 24GB of VRAM is widely available now | that the crypto markets have crashed for $1150 or so. | They were $2500 a year ago if you could find them. | sethhochberg wrote: | A twist on the above comment: I _already own_ an M2 Mac, | but I'm never gonna buy a high-end GPU to play around | with this sort of tech. If the things people (who aren't | gamers, crypto miners, or ML researchers) already own can | be useful for some hobby-level work in the space, we'll | see a lot more work and experimentation in the space. Its | super exciting stuff. | quitit wrote: | Arguably the cost effective solution is to use cloud | services, since we're talking just a few seconds | difference (or you might be lucky like one HN reader who | got allocated an A100 today.) | | But to play devil's advocate there are clear strengths | available to the different platforms. PCs can readily | upgrade into high end GPUs, but the compromise is that | this becomes a requirement as basic GPUs don't feature | enough VRAM and CPU-only mode is woeful. | | On the mac side of things, the GPU is not going to be the | latest and greatest, but the M-series features unified | memory, so a relatively normal M-series mac is going to | have the necessary hardware to load the models. Not the | fastest (but still fast), and ready to go. (Also as it | stands the M-series can offer additional pathways to | optimisation.) | zone411 wrote: | It's not clear if the shortages will happen with this new | release as they did last time. Ethereum mining is going | away and not as many people are stuck at home because of | Covid. On the other hand, the performance increase looks | to be substantial, increasing the demand. | bee_rider wrote: | Do they depend on CUDA, or are they just much better tuned | for NVIDIA cards? I thought the whole ML ecosystem was based | on training models and then running them on frameworks, where | model was sorta like data and the framework handles the | hardware? 
(albeit with models that can be tweaked to run more | efficiently on different hardware) (I don't really know the | ecosystem so it is definitely possible that they are more | closely tied together than I thought). | upbeat_general wrote: | From my experience the bigger frameworks may have support | for non-CUDA devices (that is not just the CPU fallback) | but many smaller libraries and models will not, and will | only have a CUDA kernel for some specialized operation. | | I encounter this all the time in computer vision models. | joshvm wrote: | The latter. The major frameworks, at least, can be run in | CPU-only mode, with a hardware abstraction layer for other | devices (like CUDA-capable cards, TPUs etc). So practically | it means you need an Nvidia GPU to get anywhere in a | reasonable amount of time, but if you're not super | dependent on latency (for inference) then CPU is an option. | In principle, CPUs can run much bigger model inputs (at the | expense of even more latency) because RAM is an order of | magnitude more available typically. | bee_rider wrote: | I was thinking (as someone who knows nothing about this | really) that the Apple chips might be interesting | because, while they obviously don't have the GPGPU grunt | to compete with NVIDIA, they might have a more practical | memory:compute ratio... depending on the application of | course. | amilios wrote: | How long does it take to generate a single image? Is it in the 30 | min type range or a few mins? It's hypothetically "possible" to | run e.g. OPT175B on a consumer GPU via Huggingface Accelerate, | but in practice it takes like 30 mins to generate a single token. | LanternLight83 wrote: | Runs on my 2070S at 12s/image (no batch optimization) and on my | GTX1050 4GB at 90s/image | holoduke wrote: | On my late 2019 intel macbook pro with 32gb and a AMD 5550m it | takes about 7-10 minutes to generate an image. | keepquestioning wrote: | Gamechanger! | butUhmErm wrote: | Between this and efforts to add 3D dimension to 2D images, I | don't see much of a future for digital multimedia creator jobs. | | Even TikTok could be an endless stream of ML models. | | Fears of a tech dystopia may be overblown; the masses will just | shut off their gadgets and live simpler if labor markets implode | within the traditional political correct economic system we have. | | Open source AI is on the verge of upending the software industry | and copyright. I dig it. | jclardy wrote: | Is there a proper term to encapsulate M1/M2 Macs now that we have | the M2? IE Apple Silicon Macs works but is a bit long. MX Macs? | M-Series? ARM Macs? | fragmede wrote: | The backend, as contributed by Apple is called MPS. | pavlov wrote: | "uname -m" returns "arm64" on these computers, so you could say | macOS on arm64. | qayxc wrote: | M-series sounds great, IMHO. | sgt101 wrote: | might be easier to wait for Diffusers to merge the pull | request... | johnfn wrote: | For those as keen as I am to try this out, I ran these steps, | only to run into an error during the pip install phase: | | > ERROR: Failed building wheel for onnx | | I was able to resolve it by doing this: | | > brew install protobuf | | Then I ran pip install again, and it worked! | geerlingguy wrote: | In the troubleshooting section it mentions running: | brew install Cmake protobuf rust | | To fix onnx build errors. I had the same issue. | jonplackett wrote: | What kind of speed does this run at? Eg. How long to make a | 512x512 image at standard settings? 
| jw1224 wrote: | On my M1 Pro MBP with 16GB RAM, it takes ~3 minutes. | pwinnski wrote: | I haven't installed from this link specifically, but I used | one of the branches on which this is based a few days ago, so | the results should be similar. | | On a first-gen M1 Mac mini with 8GB RAM, it takes 70-90 | minutes for each image. | | Still feels like magic, but old-school magic. | antihero wrote: | On an M1 Pro 16GB it is taking a couple minutes for each | image. | hbn wrote: | Is that the difference in graphics performance between | the M1 and M1 Pro or did the other person do something | wrong? 70-90 minutes seems nuts | pwinnski wrote: | I have the M1 8GB I mentioned in my first comment, and | the M1 Pro 16GB I mentioned in my second component, side- | by-side. However, the first one was running a Stable | Diffusion branch from earlier in the week, so I replaced | using the same instructions. The only difference _now_ is | the physical hardware. | | The thing to understand is that the 8GB M1 has 8GB. When | I run txt2img.py, my Activity Monitor shows a Python | process with 9.42GB of memory, and the "Memory Pressure" | graph spends time in the red zone as the machine is | swapping. While the 16GB M1 Pro _immediately_ shows PLMS | Sampler progress, and consistently spends around 3 | seconds per iteration (e.g. "3.29s/it" and "2.97s/it"), | the 8GB M1 takes _several_ minutes before it jumps from | 0% to 2% progress, and it accurately reports | "326.24s/it" | | So yes, whether it's M1 vs M1 Pro, or 8GB vs 16GB, it | really is that stark a difference. | | Update: after the second iteration it is 208.44s/it, so | it is speeding up. It should drop to less than 120s/it | before it finishes, if it runs as quickly as my previous | install. And yes, 186.04s/it after the third iteration, | and 159.22s/it after the fourth. | smoldesu wrote: | Sounds entirely like a swap-constrained operation. You | need ~8gb of VRAM to load the uncompressed model into | memory, which obviously won't work well on a Macbook with | 8gb of memory. | nicoburns wrote: | Might be the RAM difference. RAM is shared between CPU | and GPU on the M1 series processors. | mattkevan wrote: | My 16gb M1 Air was initially taking 13 minutes for a 50 | step generation. But when I closed all the open tabs and | apps it went down to 3 minutes. | | Looks like RAM drastically affects the speed. | ralferoo wrote: | My first-gen M1 MacBook Air with 16GB takes just under 4 | minutes per image. Running top while it's generating | shows memory usage fluctuating between 10GB and 13GB, so | if you're running on 8GB it's probably swapping a lot. | pwinnski wrote: | Installed from this link on a MacBook Pro (16-inch, 2021) | with Apple M1 Pro and 16GB. First run downloads stuff, so I | omit that result. | | I had a YouTube video playing while I kicked off the exact | command in the install docs, and got: 16.84s user 99.43s | system 61% cpu 3:08.51 total | | Next attempt, python aborted 78 seconds in! Weird. | | Next attempt, with YouTube paused: 16.31s user 95.48s | system 65% cpu 2:49.45 total | | So around three minutes, I'd say. | Turing_Machine wrote: | A little over three minutes on a first-gen M1 iMac with | 16GB. | | It looks like memory is super-important for this (which | isn't all that surprising, really...). | johnfn wrote: | Looks like I'm getting around 4s per iteration on my M1 Max. | At 50 iterations, that's 200 seconds. | whywhywhywhy wrote: | M1 Max (32gb) is around 35 seconds per image. 
| moneycantbuy wrote: | For 512x512 on M1 MAX (32 core) with 64 GB RAM I'm getting | 1.67it/s so 30.59s with the default ddim_steps=50. | colaco wrote: | I've gotten 1.35it/s that corresponds to 38s, but I've the | M1 Max with the 24 cores GPU (the "lower end" one). | _ph_ wrote: | On my M2 Air, 16G, 10 CPU cores, the default command as in | the installing instructions takes like 2m20s. | chemeng wrote: | Getting around 4 minutes per image on M1 MacBook Air 16GB | matsemann wrote: | Python dependency hell in a nutshell. Impossible to distribute | ML projects that can easily be ran. | jw1224 wrote: | Are we being pranked? I just followed the steps but the image | output from my prompt is just a single frame of Rick Astley... | | EDIT: It was a false-positive (honest!) on the NSFW filter. To | disable it, edit txt2img.py around line 325. | | Comment this line out: x_checked_image, | has_nsfw_concept = check_safety(x_samples_ddim) | | And replace it with: x_checked_image = | x_samples_ddim | pja wrote: | That means the NSFW filter kicked in IIRC from reading the | code. | | Change your prompt, or remove the filter from the code. | [deleted] | johnfn wrote: | Haha, busted! | pja wrote: | To be fair, the reason the filter is there is that if you | ask for a picture of a woman, stable diffusion is pretty | likely to generate a naked one! | | If you tweak the prompt to explicitly mention clothing, you | should be OK though. | [deleted] | creddit wrote: | Same thing happened to me which is especially odd as I | literally just pasted the example command. | r3trohack3r wrote: | If you open up the script txt2img and img2img scripts, there is | a content filter. If your prompt generated anything that gets | detected as "inappropriate" the image is replaced with Rick | Astley. | | Removing the censor should be pretty straightforward, just | comment out those lines. | nonethewiser wrote: | It bothers me that this isn't just configurable. Why would | they not want to expose this as a feature? | joshmlewis wrote: | When the model detects NSFW content it replaces the output with | the frame of Rick Astley. | rhacker wrote: | It's kind of amazing that ML can now intelligently rick roll | people. | | I think it would be awesome to update the rickroll feature to | the following: | | Auto Re-run the img2img with some text prompt: "all of the | people are now Rick Astley" with low strength so it can | adjust the faces, but not change the nudity!!!1 | GordonS wrote: | Hah, it would be hilarious if it generated all the nudity | you wanted - but with Rick Astley's face on every naked | person! | wenbin wrote: | Thanks for the writeup! It works smoothly on my M1 Macbook Pro! | | A few days ago, I tried Stable Diffusion code and was not able to | get it work :( Then I gave up... | | Today, following steps in this blog post, it works for the very | first try. Happy! | adrianvoica wrote: | Tried "transparent dog", got rickrolled. Why is this NSFW? | ...anyway, I disabled the filter and... it's pretty neat! Calling | all AI Overlords, soon. :)) | _venkatasg wrote: | I keep running into issues, even after installing Rust in my | condo environment (using conda). Specifically the issue seems to | be building wheels for `tokenizers`: warning: | build failed, waiting for other jobs to finish... 
error: | build failed error: `cargo rustc --lib --message- | format=json-render-diagnostics --manifest-path Cargo.toml | --release -v --features pyo3/extension-module -- --crate-type | cdylib -C 'link-args=-undefined dynamic_lookup | -Wl,-install_name,@rpath/tokenizers.cpython-310-darwin.so'` | failed with code 101 [end of output] note: This error | originates from a subprocess, and is likely not a problem with | pip. ERROR: Failed building wheel for tokenizers | Failed to build tokenizers ERROR: Could not build wheels | for tokenizers, which is required to install pyproject.toml-based | projects | | Any suggestions? | benhalllondon wrote: | I played around a bit and found out dropping the tokenisers | version to 0.11.6 worked | | `pip install tokenizers==0.11.6` first | keepquestioning wrote: | One beautiful thing I realized about all this progress in AI. | | We will still need people to do the hard yards, and get dirt | between their fingernails. I am firmly in the camp of those | people. | | Fancy algorithms won't dig holes, or lay out rail tracks of over | hundreds of miles.. or build houses all across the world. | blagie wrote: | Are you following progress in robotics? | keepquestioning wrote: | Once they combine the progress in generative algorithms and | robotics, then we are truly done for. | dougmwne wrote: | School us! What's the latest in robotics that is going to | knock our socks off? | scoopertrooper wrote: | They've got death robots that can fly now. | | Why is nobody impressed by the future? | dougmwne wrote: | I believe those are human controlled, no? Robotics gets | really interesting when the robots can start driving and | building the roads as well. | neurostimulant wrote: | Not necessarily. Some just require human to turn it on | and it'll loiter and attack enemy autonomously (loitering | munition [1]). | | [1] https://en.wikipedia.org/wiki/Loitering_munition | imtemplain wrote: | I'm ready to pay for a Windows + AMD GPU guide at this point, why | is there no single blogpost on this, please help. | usehackernews wrote: | Magnusviri[0], the original author of the SD M1 repo credited in | this article, has merged his fork into the Lstein Stable | Diffusion fork. | | You can now run the Lstein fork[1] with M1 as of a few hours ago. | | This adds a ton of functionality - GUI, Upscaling & Facial | improvements, weighted subprompts etc. | | This has been a big undertaking over the last few days, and I | highly recommend checking it out. See the mac m1 readme [3] | | [0] https://github.com/magnusviri/stable-diffusion | | [1] https://github.com/lstein/stable-diffusion | | [2] https://github.com/lstein/stable- | diffusion/blob/main/README-... | yieldcrv wrote: | are there benchmarks? | | I was following the github issue and the CPU bound one was at | 4-5 minutes, the MDS one was at 30 seconds, then 18 seconds, | and people were still calling that slow. | | What is it currently at now? | | and I don't know what "fast" is, to compare | | What are the Windows 10 with nice Nvidia chips w/ CUDA getting? | Just curious whats comprehensive | squeaky-clean wrote: | > What are the Windows 10 with nice Nvidia chips w/ CUDA | getting? | | Are you referring to single iteration step times, or whole | images? Because obviously it depends on the number of | iteration steps used. | | Windows 10, RTX 2070 (laptop model), lstein repo. I get about | 3.2 iter/sec. A 50 step 512x512 image takes me 15 seconds. 
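The per-card numbers in this thread become comparable once normalized: wall time per image is roughly the number of sampling steps divided by the iteration rate, plus the one-off model load. A quick sanity check, using figures quoted above and elsewhere in the thread:

      # Wall time per image ~= steps / iteration rate, ignoring the one-off
      # model load (which adds tens of seconds to the first prompt).
      steps = 50

      iters_per_sec = 3.2            # RTX 2070 (laptop) figure quoted above
      print(steps / iters_per_sec)   # ~15.6 s, matching the reported 15 seconds

      secs_per_iter = 3.0            # ~3 s/it reported for a 16GB M1 Pro elsewhere
      print(steps * secs_per_iter)   # ~150 s, i.e. the "about 3 minutes" ballpark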
| yieldcrv wrote: | I'm referring to there being a community effort to | normalize performance metrics and results at all, with the | M1 devices being in that list as well, so that we dont have | to ask these questions to begin with | | Are you aware of any wiki or table like that? | Aeolun wrote: | Huh, that's the same speed I get on Collab. Pretty good. | zone411 wrote: | Around 6 seconds. | dmd wrote: | Wait, what? On my M1 imac I'm getting about 25 _minutes_. | What am i doing wrong? | BrentOzar wrote: | It's falling back to CPU. Follow the instructions to use a | GPU version - sometimes it's even a completely different | repo, depending on whose instructions you're following. | dmd wrote: | I followed https://replicate.com/blog/run-stable- | diffusion-on-m1-mac | bfirsh wrote: | Nice. We'll get this guide updated for this fork. Everything's | moving so fast it's hard to keep track! | | We struggled to get Conda working reliably for people, which it | looks like lstein's fork recommends. I'll see if we can get it | working with plain pip. | pugio wrote: | I really appreciate the use of pip > conda. Looking forward | to the update for the repo! | bfirsh wrote: | Running lstein's fork with these requirements[0] but seeing | this output[1]. Same steps as original guide otherwise. | | Anyone got any ideas? | | [0] https://github.com/bfirsh/stable- | diffusion/blob/392cda328a69... | | [1] https://gist.github.com/bfirsh/594c50fd9b2e6b173e31de753a | 842... | sork_hn wrote: | Same output for me also. | | EDIT: https://github.com/lstein/stable- | diffusion/issues/293#issuec... fixed it for me. | bfirsh wrote: | Boom - nice. Here's a fork with that: | https://github.com/bfirsh/stable-diffusion/tree/lstein | | Requirements are "requirements-mac.txt" which'll need | subbing in the guide. | | We're testing this out with a few people in Discord | before shipping to the blog post. | jw1224 wrote: | Check my comment alongside yours, I got Conda to work but it | did require the pre-requisite Homebrew packages you | originally recommended before it would cooperate :) | toinewx wrote: | Everything works excepts it only generates black images, | | did you run | | python scripts/preload_models.py | | python scripts/dream.py --full_precision ? | arthurcolle wrote: | Disable safety check | jw1224 wrote: | Brilliant, thank you! I just got OP's setup working, but this | seems much more user-friendly. Giving it a try now... | | EDIT: Got it working, with a couple of pre-requisite steps: | | 0. `rm` the existing `stable-diffusion` repo (assuming you | followed OP's original setup) | | 1. Install `conda`, if you don't already have it: | brew install --cask miniconda | | 2. Install the other build requirements referenced in OP's | setup: brew install Cmake protobuf rust | | 3. Follow the main installation instructions here: | https://github.com/lstein/stable-diffusion/blob/main/README-... | | Then you should be good to go! | | EDIT 2: After playing around with this repo, I've found: | | - It offers better UX for interacting with Stable Diffusion, | and seems to be a promising project. | | - Running txt2img.py from lstein's repo seems to run about 30% | faster than OP's. Not sure if that's a coincidence, or if | they've included extra optimisations. | | - I couldn't get the web UI to work. It kept throwing the | "leaked semaphor objects" error someone else reported (even | when rendering at 64x64). | | - Sometimes it rendered images just as a black canvas, other | times it worked. 
This is apparently a known issue and a fix is | being tested. | | I've reached the limits of my knowledge on this, but will | following closely as new PRs are merged in over the coming | days. Exciting! | pugio wrote: | Can you describe how you did (/ are doing) this? Do you now | need to use conda (as opposed to OPs pip only version)? | jw1224 wrote: | See my edit for more info. (Just ironing out a couple of | other issues I've found, so might update it again shortly) | toinewx wrote: | I was able not to have black images by using a different | sampler | | --sampler k_euler | | full command: | | "photography of a cat on the moon" -s 20 -n 3 --sampler | k_euler -W 384 -H 384 | jastanton wrote: | I tried that as well but resulted in an error: | | AttributeError: module 'torch._C' has no attribute | '_cuda_resetPeakMemoryStats' | | https://gist.github.com/JAStanton/73673d249927588c93ee530d0 | 8... | johnfn wrote: | I followed all these steps, but I got this error: | | > User specified autocast device_type must be 'cuda' or 'cpu' | | > Are you sure your system has an adequate NVIDIA GPU? | | I found the solution here: https://github.com/lstein/stable- | diffusion/issues/293#issuec... | toinewx wrote: | I only get black images. | arthurcolle wrote: | You have to disable the safety checker after creating the | pipe | [deleted] | mark_l_watson wrote: | Thanks for writing this up!! I enjoyed getting TensorFlow running | with the M1, although a multi-headed model I was working on | wouldn't run. | | I just made my Dad's 101 year old birthday card using OpenAI's | image generating service (he loved it) and when I get home from | travel I will use your instructions in the linked article. | | Any advice for running Stable Diffusion locally vs. Colab Pro or | Pro+? My M1 MacBook Pro only has 8G ram (I didn't want to wait a | month for a 16G model). Is that enough? I have a 1080 with 10G | graphics memory. Is that sufficient? | ErneX wrote: | From the comments here 8GB is not enough, it will swap a lot | and take way more time than a 16GB MacBook. | Razengan wrote: | 101 years! Congratulations!! Does he own a suspiciously plain | gold ring by any chance? | gzer0 wrote: | The difference between an M2 air (8gb/512gb) versus an M1 pro | (16gb/1tb) is much more than I expected. * M1 pro | (16gb/1tb) can run the model in around 3 minutes. * M2 air | (8gb/512gb) takes ~60 minutes for the same model. | | I knew there would be some throttling due to the m2 air's fanless | model, but I had no idea it would be a 20x difference (albeit, | the m1 pro does have double the RAM. I don't have any other | macbooks to test this on). | andybak wrote: | Unscientifically that puts the M1 Pro GPU at about 25% of the | performance of a RTX 3080. | | Not too shabby... | | EDIT - this comment implies it's _much_ faster: | https://news.ycombinator.com/item?id=32679518 | | If that's correct then it's close to matching my 3080 (mobile). | fassssst wrote: | img2img runs in 6 seconds on my GeForce 3080 12 GB. 6+ it\s | depending on how much GPU memory is available. If I have any | electron apps running it slows down dramatically. | andybak wrote: | Curious about: | | 1. Image size | | 2. Steps | | 3. What your numbers are for text2img | | 4. (most importantly) are you including the 30 seconds or | so it takes to load the model initially? i.e. if you were | to run 10 prompts and then divide the total time by 10, | what are your numbers? 
| cube2222 wrote: | Re 4 the lstein repo gives you an interactive repl, so | you don't have to reload the model on every prompt. | | I also have a 3080 and as far as I remember (not at my pc | right now) it was 3-10 secs for img2img 512px cfg13 50 | steps batch size 1 dimm sampler. | fragmede wrote: | what args are you passing to img2img? | valley_guy_12 wrote: | It's likely that a significant fraction of the perf | difference between Apple' GPUs and NVIDIA GPUs is due to | NVIDIA's CUDA being high optimized, and Pytorch being tuned | to work with CUDA. | | If Pytorch's metal support improves and Apple's Metal drivers | improve (big ifs), it's likely that Apple's GPUs will perform | better relatively to NVIDIA than they currently do. | DubiousPusher wrote: | > It's likely that a significant fraction of the perf | difference between Apple' GPUs and NVIDIA GPUs is due to | NVIDIA's CUDA being high optimized, and Pytorch being tuned | to work with CUDA. | | You really think the orders of magnitude more parallelism | in AMD and Nvidia's discrete GPUs has nothing to do with | it? | valley_guy_12 wrote: | That's probably due to swapping due to the 8GB of RAM. People | who have run Stable Diffusion on M2 airs with 16 GB of RAM seem | to get performance that is in line with their GPU core count. | bfirsh wrote: | Correct. We've been seeing 8GB is super slow, >=16GB is fast. | We'll add that to the prerequisites. | _ph_ wrote: | I would assume it is the memory. The test command from the | discussed link runs in slightly over 2 minutes on my M2 Air | (16gb). How long does it take for yours? | JimmyAustin wrote: | I suspect that the M2 air is thrashing the disk pretty | aggressively. Diffusion models rerun the same model once per | step, so for a generation with 50 steps, you copy the entire | model in and out of memory 50 times. That's going to kill | performance. | schleck8 wrote: | It's only copied to VRAM once when implemented correctly. | astrange wrote: | M1 is a unified memory system and doesn't have VRAM. | bm-rf wrote: | I believe the model is copied into ram once upon calling | StableDiffusionPipeline, unless the mac implementation | partially loads the model due to only having 8G of ram. | bm-rf wrote: | just water cool it! https://www.youtube.com/watch?v=9DyUitTVWlw | qayxc wrote: | I suspect the lack of RAM is the issue here. | moneycantbuy wrote: | Anyone know the largest possible image size > 512x512? I'm | getting the following error when trying 1024x1024 with 64 GB RAM | on M1 MAX: | | /opt/homebrew/Cellar/python@3.10/3.10.6_2/Frameworks/Python.frame | work/Versions/3.10/lib/python3.10/multiprocessing/resource_tracke | r.py:224: UserWarning: resource_tracker: There appear to be 1 | leaked semaphore objects to clean up at shutdown | warnings.warn('resource_tracker: There appear to be %d ' | enduser wrote: | I have the same problem with anything over 512x512 on my M1 | Ultra with 128GB. VRAM must be capped. | moneycantbuy wrote: | Thanks for the Ultra data point. | | I'm able to get 768x896 to run, but the output image is still | white noisy at 50 ddim steps, perhaps related to the | phenomena of being trained/windowed on 512x512 images as | sibling squeaky-clean described. | | RAM usage at various sizes: 512x512 14 GB, 768x768 26 GB, | 768x896 32 GB | squeaky-clean wrote: | Those values seem really high compared to my setup, | windows/nvidia/lstein repo. For me 512x512 uses 6.1GB. 
| | Random guess but I think your pipeline is running with | full-precision floats (32bit), while by default the repo | should be using autocast() which will try to use half- | precision floats wherever possible. | | I know an optimizedSD repo exists and one of the steps they | take is explicitly setting precision to half. (And other | changes that reduce memory usage but decrease iteration | speed). However I don't know how M1/Metal handles half- | precision, hopefully it doesn't just cast them back to | 32bit. | | Also white noisy images at 50 steps seems off to me. At 50 | steps in a large image I definitely get a visible product. | It's just often non-euclidean or very scattered bits of | organization and chaos. | squeaky-clean wrote: | Don't specifically know sorry, largest I can generate with my | (windows pc) vram size is 512x1024. But just wanted to comment | that SD is trained on 512x512 images and runs iterations in a | 512x512 window. | | This means anything larger than 512x512 tends to confuse it. | For example 1024x1024 will have 4 non-overlapping windows and | many overlapping windows. | | So if your prompt is "a cat wearing sunglasses", you may get 4 | separate cats as the 512x512 windows have no knowledge of each | other, and each window is trying to fulfill the goal. Even more | likely you'll get some sort of eldrtich horror 16-legged cat | being as the windows shuffle around. | | Sometimes it just works perfectly somehow, but 90% of the time | the non native resolution really screws it up. I'd suggest | generating 512x512 images and using a different AI to upscale | them in most cases. | | However it does lead to some amazing fantasy landscape art as | you get weird terrains and mountains shoved up against | eachother in fantastic/magical ways. | valley_guy_12 wrote: | Supposedly Stable Diffusion was trained on 512 x 512 images, so | it's not clear that it will work well for larger images even if | you had the RAM. | | To generate larger images, it is standard practice to generate | 512 x 512 and then use a separate tool to upscale, and maybe a | second separate tool to improve the face. The Windows versions | of SD environments are starting to incorporate these additional | tools, but the Apple Silicon versions of SD environments are | lagging behind due to Pytorch metal limitations.... It'll | hopefully sort itself out in the next few months. | capableweb wrote: | just 512x require something like 10GB VRAM on your GPU, 1024x | would need even more. How much VRAM does the M1 Max GPU have? | You're probably running out of memory. | tgtweak wrote: | M architecture is unified memory - so the system memory is | shared as GPU memory. There is likely a cap somewhere on how | much can be used by a single app (or collectively by all | apps). | deckeraa wrote: | Very nice to see this available for hardware I own. | | Now I can achieve my dream of a Corporate Memphis + Hieronymus | Bosch mashup. | avereveard wrote: | How fast is it on a m1? | ebiester wrote: | It takes 1-2 mins for a 512x512 image. It's been a lot of fun | since I did this last night. | smoldesu wrote: | For reference, inferencing the model on a 2070 takes 10-12 | seconds for the same size at max-precision, and the 3070 can | synthesize an image in almost 6 seconds. | | If you extrapolate the power consumption (3070 @~300w vs M1 | Pro GPU@~30-50w) the metrics make a lot of sense. | nathas wrote: | I haven't ran this fork yet, but about 1.3 sec/iter. Usually | ~30-50 iters/sample (image). 
| pwinnski wrote: | The answer depends VERY MUCH on RAM. | | My M1 with 8GB takes 70-90 minutes per image. | | My M1 Pro with 16GB takes 3 minutes per image. | fifafu wrote: | Another data point: M1 Max with 64GB takes ~40 seconds | astrange wrote: | Hopefully the library is intentionally going very slowly | trying to fit in RAM and that's not just your 8GB machine | totally falling over. | adamj9431 wrote: | How is Stable Diffusion on DreamStudio.ai so much faster than the | reports here? Seems to only take 5-10 seconds to generate an | image with the default settings. | | I.e. How are they providing access to GPU compute several orders | of magnitude more powerful than an M1, for free? | tgtweak wrote: | A100 devices in the cloud on preemptive/spot instances? | schleck8 wrote: | 1. Dreamstudio is paid, with 2 EUR worth of credit for free. | | 2. The M1 GPU is an iGPU. A good iGPU, sure, but it's not | anywhere near the performance of a dedicated, cooled GPU with | dedicated VRAM. | | On a 2060 Super with 8 GB of VRAM and with tensor cores it | takes 15 seconds to infer with the default settings. If | Dreamstudio uses deep learning GPUs then there is your answer | to why it is as fast. | fleddr wrote: | The guy behind it is an ex hedge fund manager. Using private | funds he's built a massive fleet of A1000s at AWS. | | So it's an enormous amount of compute created from private | funds that he considers to be "for humanity". Currently, he | funds it and is also the "GPU overlord", he exclusively decides | which applications gets to use it. | | His plan, or at least his claim, is to transform this situation | in it being more diversely funded (institutions, businesses, | even the UN) and for access to be decided by committee with | main criteria it being useful for humanity. | | Let's see if he sticks to his word, but I find it | inspirational. AI was on a trajectory to be solely in the hands | of a hand full of ultra rich companies that can afford to train | and run it, and us poor mortals being at the whims of | gatekeeper terms. | | This guy is on a trajectory to put AI in the hands of the | people. Not just for art, for everything. If he fully sees this | through, he's destined to be a tech icon. | preommr wrote: | The guy's name is Emad Mostaque btw | | There's a recent video interview he did that goes into his | vision: https://www.youtube.com/watch?v=YQ2QtKcK2dA | vvanirudh wrote: | Running into this error `RuntimeError: expected scalar type | BFloat16 but found Float` when I run `txt2img.py` | yboris wrote: | Confirming I'm stuck on the same error when running the | tutorial-instructed python scripts/txt2img.py command | RuntimeError: expected scalar type BFloat16 but found Float | ml_basics wrote: | Yes, me too! Please post here if you find a solution for all | the other people that come and find this by commmand-F'ing | this error | sytse wrote: | I'm stuck on 'RuntimeError: expected scalar type BFloat16 | but found Float' too. Most relevant links seems | https://github.com/CompVis/stable-diffusion/pull/47 but I'm | not sure. Please post when there is a solution. | alvb wrote: | That might have to do with your Mac OS version. Pre-12.4 | Mac OS does not allow the Torch backend to use the M1 | GPU, and so the script attempts to use the cpu, but then | the cpu does not support half-precision numbers. 
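To connect this error with the fix posted below: the stock txt2img.py switches between a torch.autocast context and a plain nullcontext depending on the --precision flag, and CPU autocast runs matrix ops in bfloat16, which is where the "expected scalar type BFloat16 but found Float" mismatch comes from. A rough, simplified sketch of that pattern (forks differ in the details):

      import torch
      from contextlib import nullcontext

      precision = "full"  # what the `--precision full` flag selects

      # "autocast" (the default) runs sampling ops in reduced precision,
      # which on the CPU means bfloat16; "full" keeps everything in float32.
      scope = torch.autocast("cpu") if precision == "autocast" else nullcontext()

      with scope:
          a = torch.randn(4, 4)
          b = torch.randn(4, 4)
          # bfloat16 under autocast("cpu"), float32 under "full"
          print((a @ b).dtype)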
| yboris wrote: | _SOLUTION_ - append the command with `--precision full` | sytse wrote: | Awesome, that works | | For reference the full command: | | `python scripts/txt2img.py \ --prompt "a red juicy apple | floating in outer space, like a planet" \ --n_samples 1 | --n_iter 1 --plms --precision full` | moneycantbuy wrote: | What's with the ~25% chance of an image being all black? Also, | seeds aren't replicating. | sroussey wrote: | This should be put into a docket image to avoid various potential | conflicts with locally installed libraries. | | Anyone do this for the M1? | schleck8 wrote: | Conda environment | hnarayanan wrote: | Do you want to add more layers to make it extra slow? | bfirsh wrote: | Unfortunately this can't run in Docker because Docker for Mac | can't access the M1 GPU. (Several layers of virtualization and | emulation!) | sroussey wrote: | Ah, thanks. Yes, that makes sense. | mdswanson wrote: | virtualenv isn't required. You can just use python -m venv venv | and get the same results with one fewer dependency. | simonebrunozzi wrote: | I don't want to sound lazy, but I would be expecting a .dmg for | Macs, and I don't seem to find it. Am I blind, or it simply | hasn't been prepared yet? | omarelbie wrote: | At the rate it's moving it doesn't seem too far off, but I | think it's just a tad too early. | cageface wrote: | After playing around with all of these ML image generators I've | found myself surprisingly disenchanted. The tech is extremely | impressive but I think it's just human psychology that when you | have an unlimited supply of something you tend to value each | instance of it less. | | Turns out I don't really want thousands of good images. I want a | handful of excellent ones. | andrethegiant wrote: | Yesssss I've been waiting for this! | sgt101 wrote: | also brew upgrade not brew update | sp332 wrote: | Any chance of this running on an M1 iPad Pro? | boppo1 wrote: | Seconding this | liuliu wrote: | Probably another week or two. running on M1 iPad Pro needs to | get out of PyTorch, possibly export the model either through | TorchScript and then do onnx conversion. From what I found so | far, not many of these conversions done (except the OpenVINO | one maybe?). | ebiester wrote: | Note: I ran this and haven't yet been able to get img2img working | yet. I borked it up trying to get conda working. | | It's been a lot of fun to play with so far though! | bravura wrote: | Apparently, this release should include a Dockerfile for easier | replicability. | bfirsh wrote: | Unfortunately this can't run in Docker because Docker for Mac | can't access the M1 GPU. (Several layers of virtualization | and emulation!) | nathas wrote: | Try the lstein fork: https://github.com/lstein/stable- | diffusion/tree/fix-cuda-res... | | You'll still need to play with modifying some of the code to | get it to run, but `dream.py` works for me. Funny enough, I got | _only_ img2img effectively working with the lstein branch; it | broke txt2img for me. | StapleHorse wrote: | Yesterday I thought I broke it too. In my case, the solution | was just to make sure that the input image (from the editor or | otherwise) was the same size as the output image. Hope it | helps. | gregsadetsky wrote: | Bananas. Thanks so much... to everyone involved. It works. | | 14 seconds to generate an image on an M1 Max with the given | instructions (`--n_samples 1 --n_iter 1`) | | Also, interesting/curious small note: images generated with this | script are "invisibly watermarked" i.e. steganographied! 
| | See https://github.com/bfirsh/stable- | diffusion/blob/main/scripts... | msoad wrote: | Please someone package all of this and the WebUI into an Electron | app so common people can also hack on it! | andrewmunsell wrote: | There are several packages that provide web UIs, like this one | for example: https://github.com/hlky/stable-diffusion-webui | | It's not quite the ease of setup of an Electron app, but once | setup it's pretty easy to use. | r3trohack3r wrote: | I've been playing with Stable Diffusion a lot the past few days | on a Dell R620 CPU (24 cores, 96 GB of RAM). With a little | fiddling (not knowing any python or anything about machine | learning) I was able to get img2img.py working by simply | comparing that script to the txt2img.py CPU patch. Was only a few | lines of tweaking. img2img takes ~2 minutes to generate an image | with 1 sample and 50 iterations, txt2img takes about 10 minutes | for 1 sample and 50 generations. | | The real bummer is that I can only get ddim and plms to run using | a CPU. All of the other diffusions crash and burn. ddim and plms | don't seem to do a great job of converging for hyper-realistic | scenes involving humans. I've seen other algorithms "shape up" | after 10 or so iterations from explorations people do online - | where increasing the step count just gives you a higher fidelity | and/or more realistic image. With ddim/plms on a CPU, every step | seems to give me a wildly different image. You wouldn't know that | steps 10 and steps 15 came from the same seed/sample they change | so much. | | I'm not sure if this is just because I'm running it on a CPU or | if ddim and plms are just inferior to the other diffusion models | - but I've mostly given up on generating anything worthwhile | until I can get my hands on an nvida GPU and experiment more with | faster turn arounds. | squeaky-clean wrote: | > You wouldn't know that steps 10 and steps 15 came from the | same seed/sample they change so much. | | I don't think this is CPU specific, this happens at these very | low number of samples, even on the GPU. Most guides recommend | starting with 45 steps as a useful minimum for quickly trialing | prompt and setting changes, and then increasing that number | once you've found values you like for your prompt and other | parameters. | | I've also noticed another big change sometimes happens between | 70-90 steps. It's not all the time and it doesn't drastically | change your image, but orientations may get rotated, colors | will change, the background may change completely. | | > img2img takes ~2 minutes to generate an image with 1 sample | and 50 iterations | | If you check the console logs you'll notice img2img doesn't | actually run the real number of steps. It's number of steps | multiplied by the Denoising Strength factor. So with a | denoising strength of 0.5 and 50 steps, you're actually running | 25 steps. | | Later edit: Oh and if you do end up liking an image from step | 10 or whatever, but iterating further completely changes the | image, one thing you can do is save your output at 10 steps, | and use that as your base image for the img2img script to do | further work. | schleck8 wrote: | With the 1.4 checkpoint, everything under 40 steps can't be | used basically and you only get good fidelity with >75 steps. I | usually use 100, that's a good middleground. | auggierose wrote: | How do you change these steps in the given script? Is it the | --ddim_steps parameter? Or --n_iter? Or ... ? 
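A small worked version of squeaky-clean's img2img observation above, since it trips people up when comparing timings: the sampler starts partway along the noise schedule rather than from pure noise, so the requested step count is scaled by the denoising strength.

      # img2img: effective work = requested steps * denoising strength.
      steps = 50
      denoising_strength = 0.5
      effective_steps = int(steps * denoising_strength)
      print(effective_steps)  # 25 actual sampling iterations, as noted above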
| caxco93 wrote: | could someone who has already done this please share how long it | takes for a 50 steps image to be generated? | nathas wrote: | 1.3 sec/iter on my M1 Mac, so ~39 seconds. | jw1224 wrote: | That was fast. I'm only getting 5.26s/iter on an M1 Pro MBP | with 16GB RAM. | | EDIT: Speed increased to 2.3s/iter after a reboot | nathas wrote: | Depends what fork you're running... Some seem to be using | CPU-based generation, others use the MPS device backend | correctly which is MUCH faster. I have another comment | floating around about lstein's fork, but it takes some | massaging to get it to run happily. | https://github.com/lstein/stable-diffusion/ | jw1224 wrote: | The fork linked by OP is MPS-based, I can see GPU usage | way up in Activity Monitor. Seems performance doubled | after a reboot though :) | geerlingguy wrote: | Weird, on M1 Max Mac Studio, only getting 1.42 it/s :/ | nathas wrote: | I got my units backwards :sweat: My bad! | yoyohello13 wrote: | Has anybody had success getting newer AMD cards working? | | ROCm support seems spotty at best, I have a 5700xt and I haven't | had much luck getting it working. | geerlingguy wrote: | I've tried using this set of steps [1], but have so far not had | luck, mostly because the ROCm driver setup is throwing me for a | loop. Tried it with an RX 6700 XT and first was going to test | on Ubuntu 22.04 but realized ROCm doesn't support that OS yet, | so tried again on 20.04 and ended up breaking my GPU driver! | | [1] | https://gist.github.com/geerlingguy/ff3c3cbcf4416be2c0c1e0f8... | my123 wrote: | Yes. That's expected. | | AMD market segmented their RDNA2 support in ROCm to the | Navi21 set only (6800/6800 XT/6900 XT). | | It is not officially supported in any way on other RDNA2 | GPUs. (Or even on the desktop RDNA2 range at all, that only | works because their top end Pro cards share the same die) | geerlingguy wrote: | Oh... had no clue! Thanks for letting me know so I wouldn't | have to spend hours banging my head against the wall. | yoyohello13 wrote: | Looks like I may be out of luck with NAVI 10. | my123 wrote: | As an aside, a totally unsupported hack to make it somewhat | work on Navi2x smaller dies which you use: | | HSA_OVERRIDE_GFX_VERSION=10.3.0 to force using the Navi21 | binary slice. | | This is totally unsupported and please don't complain if | something doesn't work when using that trick. | | But basic PyTorch use works using this, so you might get | away with it for this scenario. | | (TL;DR: AMD just doesn't care about GPGPU on the | mainstream, better to switch to another GPU vendor that | does next time...) | switchers wrote: | 6600XT reporting in. Spent a few hours on Windows and WSL2 | setup attempts, got no where. I don't run Ubuntu at home and | don't want to dual boot just for this. From looking around I | think I'd have a better chance on native Ubuntu. | my123 wrote: | Buy an NVIDIA card. ROCm isn't supported in any way on WSL2, | but CUDA is. | | AMD just doesn't invest in their developer ecosystem. Also as | you use a 6600 XT, no official ROCm support for the die that | you use. Only for navi21. | MintsJohn wrote: | Or wait, if its just about stable diffusion multiple people | try to create onnx and directml forks of the | models/scripts, which atleast in theory can work for AMD | gpus in windows and wsl2 | haxxorfreak wrote: | I have it working on an RX 6800, used the scripts from this | repo[0] to build a docker image that has ROCm drivers and | PyTorch installed. 
| | I'm running Ubuntu 22.04 LTS as the host OS, didn't have to | touch anything beyond the basic Docker install. Next step is | build a new Dockerfile that adds in the Stable Diffusion | WebUI.[1] | | [0] https://github.com/AshleyYakeley/stable-diffusion-rocm [1] | https://github.com/hlky/stable-diffusion-webui | tgtweak wrote: | The RX6800 seems like a great card for this - 16GB of | relatively fast VRAM for a good price. | | How long does it take to do 50 iterations on a 512x512? | ece wrote: | I tried getting pytorch vulkan inference working with radv, it | gives me a missing dtype error in vkformat. Fp16 or normal | precision have the same error. I think it's some bf16 thing. | dgreensp wrote: | I'm working on getting this running. Instead of | "venv/bin/activate" I had to run "source venv/bin/activate". And | I got an error installing the requirements, fixed by running "pip | install pyyaml" as a separate command. | valley_guy_12 wrote: | Having to use "source" means you have an older version of | conda. Python package management is kind of a mess. | bfirsh wrote: | There's a little "." before "venv/bin/activate" that's easy to | miss. I'll update it to "source" to make it more obvious. | dgreensp wrote: | Wow, I'm getting as low as 1.2 seconds per "step" (about a | minute for a 512x512 image with default settings) on my 32 GB | M1 laptop (2021, 16-inch). | blagie wrote: | How large an image will this handle (versus how much RAM you | have)? | | It seems the GPU memory requirements beyond 512x512 are obscene. | schleck8 wrote: | This model was mostly trained on 512x512 so you should stick to | approximately that size. | | Use external upscalers like RealESRGAN, SwinIR or BSRGAN or | GFPGAN (faces). | | Alternatively use hacks like txt2imghd to get it to natively | create 1 MP images. | michaelchisari wrote: | I'm on a iMac M1 16gb and I can handle up to 768x768 but since | it's shared memory I close out every other application and run | things overnight. | | The biggest issue with apple chips is that the --seed setting | doesn't work. I _should_ be able to set a seed to, for | instance, 1083958 and if I re-run a command at the same | resolution with that seed, I should get the same image every | time. This would allow me to test different steps so I could | generate a 100 images at 16 steps (which is quite fast) and | pick the ones that are most promising and re-render at 64 or | 128 steps. | | But currently you can't do that on apple hardware because of an | open issue in PyTorch. Genuinely hoping a fix comes soon, until | it is this is more of a novelty than a tool on Apple hardware. | fragmede wrote: | There's a partial fix for the seed issue on Reddit. | michaelchisari wrote: | I can't seem to find it, do you have a link? | pugio wrote: | Me at the end of last year: "Should I really go for the full | 64GB on this M1 Pro? What could I possibly use this for? mmbml | mumble... something about unified GPU... something Deep | Learning, one day..." | | Me now: "a red juicy apple floating in outer space, like a | planet" --H 768 --W 768 | | Uses about 27GB. 1.81s/it. | | Can't do 1024x1024 yet because of some hardcoded Metal issue | (https://github.com/pytorch/pytorch/issues/84039. | neurostimulant wrote: | Most people that want hires would just feed the resulting image | into AI upscaler like gigapixel. | rhacker wrote: | I wonder if this is going to be a huge boon to m1 sales. | ThrowawayTestr wrote: | That was fast. 
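On michaelchisari's --seed point above: the flag just seeds the RNGs before sampling (the CompVis scripts use pytorch_lightning's seed_everything for this); a rough equivalent is sketched below. The lack of reproducibility on Apple GPUs comes from the open PyTorch MPS issue mentioned, not from the seeding itself.

      import random

      import numpy as np
      import torch

      def seed_everything(seed: int) -> None:
          # Seed each RNG the sampler can draw from so a given --seed value
          # reproduces the same image. (Currently unreliable on the MPS
          # backend because of the open PyTorch issue mentioned above.)
          random.seed(seed)
          np.random.seed(seed)
          torch.manual_seed(seed)

      seed_everything(1083958)  # the example seed from the comment above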
___________________________________________________________________ (page generated 2022-09-01 23:00 UTC)