[HN Gopher] Run Stable Diffusion on Your M1 Mac's GPU
       ___________________________________________________________________
        
       Run Stable Diffusion on Your M1 Mac's GPU
        
       Author : bfirsh
       Score  : 619 points
       Date   : 2022-09-01 16:19 UTC (6 hours ago)
        
 (HTM) web link (replicate.com)
 (TXT) w3m dump (replicate.com)
        
       | mrkstu wrote:
       | I consistently have items only partially in frame-
       | horses/fish/etc- any tips on getting the algo to keep specified
       | items fully in frame?
        
         | SirYandi wrote:
         | Struggle with this too. One keyword which helps is 'wide
         | angle'. Sometimes 'full body shot' works for generating humans.
        
       | [deleted]
        
       | ChildOfChaos wrote:
        | Is there any way to keep up with this stuff, or a beginners'
        | guide? I really want to play around with it but it's kinda
        | confusing to me.
        | 
        | I don't have an M1 Mac, I have an Intel one with an AMD GPU. Not
        | sure if I can run it? I don't mind if it's a bit slow. Or what is
        | the best way of running it in the cloud? Anything that can
        | produce high-res for free?
        
         | holoduke wrote:
         | follow this guide: https://github.com/lstein/stable-
         | diffusion/blob/main/README-...
         | 
          | I am running it on my 2019 Intel MacBook Pro. 10 minutes per
          | picture.
        
         | Karuma wrote:
         | Yes, you can run it on your Intel CPU: https://github.com/bes-
         | dev/stable_diffusion.openvino
         | 
         | And this should work on an AMD GPU (I haven't tried it, I only
         | have NVIDIA): https://github.com/AshleyYakeley/stable-
         | diffusion-rocm
         | 
         | There are also many ways to run it in the cloud (and even more
         | coming every hour!) I think this one is the most popular:
         | https://colab.research.google.com/github/altryne/sd-webui-co...
        
         | EddySchauHai wrote:
         | https://beta.dreamstudio.ai/dream
         | 
         | It's not free but I've played with it a lot over the last two
         | days for around $10, generating the most complex photos I can
         | (1024x1024, 150 steps, 9 images, etc)
        
       | code51 wrote:
        | Without k-diffusion support, I don't think this replicates the
        | Stable Diffusion experience:
       | 
       | https://github.com/crowsonkb/k-diffusion
       | 
        | Yes, running on M1/M2 (MPS device) was possible with
        | modifications. img2img and inpainting also work.
       | 
       | However you'll run into problems when you want k-diffusion
       | sampling or textual inversion support.
        
       | djhworld wrote:
       | Note that once you run the python script for the first time it
       | seems to download a further ~2GB of data
        
         | nonethewiser wrote:
         | Including a rick astley image for the first thing you gen -_-
        
           | ErneX wrote:
           | That's the NSFW filter :D
        
       | johnfn wrote:
       | Hm, when I run the example, I get this error:
       | 
       | > expected scalar type BFloat16 but found Float
       | 
       | Has anyone seen this error? It's pretty hard to google for.
        
         | bfirsh wrote:
         | Are you running macOS >=12.3?
        
           | johnfn wrote:
           | Oh, no I'm not, I'm on 12.0. Does this make a difference?
        
         | johnfn wrote:
         | Update: I solved this error more properly by upgrading to the
         | latest version. Thanks bfirsh.
        
           | wuyishan wrote:
           | I am having the same issue on MacOS 12.2.1 (21D62); Python
           | 3.10.6 What did you upgrade to solve this? Thanks! (I can get
           | it working with `--precision full`)
        
             | johnfn wrote:
             | I'm now on 12.5.1
        
         | nathas wrote:
         | Yeah. Try running with PYTORCH_ENABLE_MPS_FALLBACK=1 <script>
         | --full-precision
        
           | johnfn wrote:
           | This worked!
        
           | yboris wrote:
           | For me `--full-precision` kept erroring out, but `--precision
           | full` worked correctly.
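
          A note on what that environment variable does: operators that lack
          an MPS (Metal) kernel fall back to the CPU instead of raising. A
          minimal sketch, assuming you set it from Python rather than the
          shell (it must happen before torch is imported):

          ```python
          import os

          # PYTORCH_ENABLE_MPS_FALLBACK is read by PyTorch at import time.
          # With it set, ops missing an MPS kernel run on the CPU instead
          # of erroring out.
          os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

          # import torch  # only import torch after the variable is set
          ```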
        
       | omginternets wrote:
        | Is there a way to get it to run on an Intel-based Mac? I've
       | attempted several times, but quickly ran into dependency issues
       | and other quirks.
        
         | vimy wrote:
         | Comment from github: "By the way, i confirmed to work on my
         | Intel 16-in MacBook Pro via mps. GPU (Radeon Pro 5500M 8GB)
         | usage is 70-80% and It takes 3 min where --n_samples 1 --n_iter
         | 1. My repo https://github.com/cruller0704/stable-diffusion-
         | intel-mac"
         | 
         | For comparison, my RTX 2070 takes 10 seconds for one image
         | (512x512)
        
         | jw1224 wrote:
         | I believe the branch which adds support for Apple Silicon also
         | adds support for running on Intel chips (albeit extremely
         | slowly). I haven't tested it myself, but I've seen several
         | people in the GitHub issues saying this.
        
           | pja wrote:
           | The standard release works fine, if you tweak the code to use
           | the CPU pytorch device instead of CUDA. It does take about an
           | hour to generate a set of images with the standard options on
           | my AMD 2600 CPU though!
        
       | fossuser wrote:
       | Thanks for this - it's rare to see a setup guide that actually
       | works on each step!
       | 
        | I did need to run the troubleshooting step too; you could
        | probably just move that up as a required step in the guide.
        
         | bfirsh wrote:
         | It isn't required for some (most?) users. Weirdly sometimes pip
         | is picking up the wheel for `onnx`, sometimes it isn't, and we
         | can't figure out why.
         | 
          | Do any Python packaging experts know what's going on? All macOS
          | 12, arm64, Python 3.10. Can't think why it wouldn't resolve the
          | wheel.
         | 
         | But yes, good idea to move up. I'll stick it next to the `pip
         | install`.
        
       | amelius wrote:
        | I'd rather see someone implement glue that allows you to run
        | arbitrary (deep learning) code on any platform.
       | 
       | I mean, are we going to see X on M1 Mac, for any X now in the
       | future?
       | 
       | Also, weren't torch and tensorflow supposed to be this glue?
        
         | nathas wrote:
          | Broadly speaking, it looks like they are. The implementation of
          | Stable Diffusion doesn't appear to be using all of those
          | features correctly (i.e. device selection fails if you don't
          | have CUDA enabled, even though MPS
          | (https://pytorch.org/docs/stable/notes/mps.html) is supported
          | by PyTorch).
          | 
          | The same goes for quirks of Tensorflow that weren't taken
          | advantage of. That's largely the work that is ongoing in the
          | macOS and M1 forks.
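
          A sketch of the kind of fallback the forks add. Backend
          availability is passed in as booleans here so the example doesn't
          require torch; in real code you would check
          `torch.cuda.is_available()` and `torch.backends.mps.is_available()`:

          ```python
          def pick_device(cuda_available: bool, mps_available: bool) -> str:
              """Choose a torch device string: prefer CUDA, then Apple's
              MPS (Metal) backend, then plain CPU. Mirrors the
              device-selection fallback the forks add in place of a
              hard-coded "cuda"."""
              if cuda_available:
                  return "cuda"
              if mps_available:
                  return "mps"
              return "cpu"
          ```

          The upstream repo hard-codes `cuda`, which is why it crashes
          outright on machines without an NVIDIA GPU.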
        
           | davedx wrote:
           | I got stuck on this roadblock, couldn't get CUDA to work on
           | my Mac, was very confusing
        
             | desindol wrote:
             | Didn't apple stop supporting Nvidia cards like 5 years ago?
             | How could it be confusing that Cuda wouldn't run?
        
               | root_axis wrote:
               | lol presumably the OP didn't know that... hence the
               | confusion.
        
             | cercatrova wrote:
             | That's because CUDA is only for Nvidia GPUs and Apple
             | doesn't support Nvidia GPUs, it has its own now.
        
             | dustingetz wrote:
              | (base) stable-diffusion git:(main) conda env create -f environment.yaml
              | Collecting package metadata (repodata.json): done
              | Solving environment: failed
              | ResolvePackageNotFound:
              |   - cudatoolkit=11.3
              | 
              | Oh, I was following the github fork readme; there is a
              | special macOS blog post.
        
               | calrizien wrote:
               | link?
        
         | scoopertrooper wrote:
         | If you look at the substance of the changes being made to
         | support Apple Silicon, they're essentially detecting an M* mac
         | and switching to PyTorch's Metal backend.
         | 
         | So, yeah PyTorch is correctly serving as a 'glue'.
         | 
         | https://github.com/CompVis/stable-diffusion/commit/0763d366e...
        
       | sxp wrote:
       | Is there a good set of benchmarks available for Stable Diffusion?
       | I was able to run a custom Stable Diffusion build on a GCE A100
       | instance (~$1/hour) at around 1Mpix per 10 seconds. I.e, I could
       | create a 512x512 image in 2.5 seconds with some batching
       | optimizations. A consumer GPU like a 3090 runs at ~1Mpix per 20
       | seconds.
       | 
       | I'm wondering what the price floor of stock art will be when
       | someone can use https://lexica.art/ as a starting point, generate
       | variations of a prompt locally, and then spend a few minutes
       | sifting through the results. It should be possible to get most
       | stock art or concept art at a price of <$1 per image.
        
         | skybrian wrote:
         | So you're estimating over a thousand generated images an hour
         | and less than a tenth of a cent per image using the A100. If
         | that turns out to be accurate, it seems like some online image
          | generation will be included in the price of the stock art.
         | 
         | (DreamStudio is charging a bit over one cent per generated
         | image at default settings, depending on exchange rates.)
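
          The arithmetic behind that estimate, as a small sketch (the
          $1/hour and 2.5 s/image figures are from the parent comment):

          ```python
          def cost_per_image(dollars_per_hour: float,
                             seconds_per_image: float) -> float:
              """Cloud cost per generated image, given instance price
              and generation speed."""
              images_per_hour = 3600 / seconds_per_image
              return dollars_per_hour / images_per_hour

          # A100 at ~$1/hour, 2.5 s per 512x512 image:
          # 1440 images/hour, so roughly $0.0007 per image.
          ```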
        
         | sowbug wrote:
         | Related: I wrote up instructions for running Stable Diffusion
         | on GCE. I used a Tesla T4, which is probably the cheapest that
         | can handle the original code. If you're spinning up an instance
         | to play with, rather than to batch-process, then cheaper makes
         | more sense because most of the machine's time is spent waiting
         | for you to type stuff and look at the results.
         | 
         | https://sowbug.com/posts/stable-diffusion-on-google-cloud/
        
         | fleddr wrote:
         | It can be even cheaper.
         | 
         | Midjourney, in case you appreciate their output, has an
          | unlimited plan for $30 a month. The only limitation is that if
         | you're an extremely heavy user, they may "relax" you, which
         | means results come in a bit slower.
         | 
         | Note that they've been also experimenting with a --beta
         | parameter which basically means the algorithm uses
         | StableDiffusion's algorithm behind the scenes, or you can use
         | any of 4 versions of MidJourney's more stylistic algorithms.
         | 
         | So if you don't want to tinker or don't have a high-end GPU,
         | it's a cheap way to play around. I have StableDiffusion running
         | locally but still prefer MidJourney. I enjoy the stylistic
         | output but it's also a highly social way to generate art.
         | Everybody is doing it in the open.
         | 
          | Anyway, the stock art part is a hairy subject. You should
          | assume that your AI image is not copyrighted, which raises the
          | question of why they would pay at all.
        
       | TekMol wrote:
       | Does running it locally give you anything over using the web
       | version?
        
         | schleck8 wrote:
         | You can finetune the model with Textual Inversion. And you
         | don't have the safety mechanism for nudity.
        
         | bfirsh wrote:
         | You can hack on it, modify it, integrate it with other code,
         | etc!
        
           | johnfn wrote:
           | Also, of course, it's entirely free. The web version is
           | actually paid, though it's hard to tell because they're not
           | super transparent about the fact that you're steadily eating
           | through a quota of initial tokens.
        
       | joshstrange wrote:
       | It's insane to me how fast this is moving. I jumped through a
       | bunch of hoops 2-3 days ago to get this running on my M1 Mac's
        | GPU and now it's way easier. I imagine we will have a nice GUI
        | (I'm aware of the web-ui, I haven't set it up yet) packaged in a
        | Mac .app by the end of next week. Really cool stuff.
        
         | dekervin wrote:
          | Just yesterday I read another comment on HN saying we will have
          | to wait another decade before being able to train it in
          | someone's "basement" ( https://news.ycombinator.com/item?id=32658941 ). I
         | made a bookmark for myself (
         | https://datum.alwaysdata.net/?explorer_view=quest&quest_id=q...
         | ) to look for data that help estimate when it will be feasible
         | to run Stable Diffusion "at home". I guess it's already
         | outdated!
        
           | squeaky-clean wrote:
           | To run stable diffusion at home you have to download the
           | model file, which took the equivalent of tens of thousands of
            | hours spread across cloud-provided GPUs.
           | 
           | If the model file just vanished from everyone's hard drive
           | one day, and cloud providers installed heuristics to detect
           | and ban image dataset training, retraining the model file
           | would actually take decades for any consumer, even an
           | enthusiast with a dozen powerful GPUs. The image dataset
           | alone is 240TB.
        
           | zone411 wrote:
            | Umm, training is not the same as running it.
        
         | addandsubtract wrote:
         | I hope this kickstarts some kind of M1 migration. There are so
         | many ML projects I'd like to try, but they all depend on CUDA.
        
           | joshstrange wrote:
           | Yep, I was just thinking the same thing. M1/M2 appears to be
           | a huge untapped resource for ML stuff as this proves. I maxed
           | out my MBP Max and this is probably the first time I'm
           | actually fully using the GPU cores and it's pretty freaking
           | cool. Creating landscapes or fictional characters (think D&D)
           | is already super fun, I look forward to playing with img2img
           | some more as well.
        
             | zone411 wrote:
             | The performance gap to the top-end Nvidia cards will get
             | much larger as they release new cards later this year,
             | though.
        
               | jdminhbg wrote:
               | Maybe, but I can buy a Mac, you just order one from
               | Apple.
        
               | wincy wrote:
               | An RTX 3090ti with 24GB of VRAM is widely available now
               | that the crypto markets have crashed for $1150 or so.
               | They were $2500 a year ago if you could find them.
        
               | sethhochberg wrote:
               | A twist on the above comment: I _already own_ an M2 Mac,
               | but I'm never gonna buy a high-end GPU to play around
               | with this sort of tech. If the things people (who aren't
               | gamers, crypto miners, or ML researchers) already own can
                | be useful for some hobby-level work in the space, we'll
                | see a lot more experimentation. It's super exciting
                | stuff.
        
               | quitit wrote:
               | Arguably the cost effective solution is to use cloud
               | services, since we're talking just a few seconds
               | difference (or you might be lucky like one HN reader who
               | got allocated an A100 today.)
               | 
               | But to play devil's advocate there are clear strengths
               | available to the different platforms. PCs can readily
               | upgrade into high end GPUs, but the compromise is that
               | this becomes a requirement as basic GPUs don't feature
               | enough VRAM and CPU-only mode is woeful.
               | 
               | On the mac side of things, the GPU is not going to be the
               | latest and greatest, but the M-series features unified
               | memory, so a relatively normal M-series mac is going to
               | have the necessary hardware to load the models. Not the
               | fastest (but still fast), and ready to go. (Also as it
               | stands the M-series can offer additional pathways to
               | optimisation.)
        
               | zone411 wrote:
               | It's not clear if the shortages will happen with this new
               | release as they did last time. Ethereum mining is going
               | away and not as many people are stuck at home because of
               | Covid. On the other hand, the performance increase looks
               | to be substantial, increasing the demand.
        
           | bee_rider wrote:
           | Do they depend on CUDA, or are they just much better tuned
           | for NVIDIA cards? I thought the whole ML ecosystem was based
           | on training models and then running them on frameworks, where
           | model was sorta like data and the framework handles the
           | hardware? (albeit with models that can be tweaked to run more
           | efficiently on different hardware) (I don't really know the
           | ecosystem so it is definitely possible that they are more
           | closely tied together than I thought).
        
             | upbeat_general wrote:
             | From my experience the bigger frameworks may have support
             | for non-CUDA devices (that is not just the CPU fallback)
             | but many smaller libraries and models will not, and will
             | only have a CUDA kernel for some specialized operation.
             | 
             | I encounter this all the time in computer vision models.
        
             | joshvm wrote:
             | The latter. The major frameworks, at least, can be run in
             | CPU-only mode, with a hardware abstraction layer for other
             | devices (like CUDA-capable cards, TPUs etc). So practically
             | it means you need an Nvidia GPU to get anywhere in a
             | reasonable amount of time, but if you're not super
             | dependent on latency (for inference) then CPU is an option.
             | In principle, CPUs can run much bigger model inputs (at the
             | expense of even more latency) because RAM is an order of
             | magnitude more available typically.
        
               | bee_rider wrote:
               | I was thinking (as someone who knows nothing about this
               | really) that the Apple chips might be interesting
               | because, while they obviously don't have the GPGPU grunt
               | to compete with NVIDIA, they might have a more practical
               | memory:compute ratio... depending on the application of
               | course.
        
       | amilios wrote:
       | How long does it take to generate a single image? Is it in the 30
       | min type range or a few mins? It's hypothetically "possible" to
       | run e.g. OPT175B on a consumer GPU via Huggingface Accelerate,
       | but in practice it takes like 30 mins to generate a single token.
        
         | LanternLight83 wrote:
         | Runs on my 2070S at 12s/image (no batch optimization) and on my
         | GTX1050 4GB at 90s/image
        
         | holoduke wrote:
         | On my late 2019 intel macbook pro with 32gb and a AMD 5550m it
         | takes about 7-10 minutes to generate an image.
        
       | keepquestioning wrote:
       | Gamechanger!
        
       | butUhmErm wrote:
        | Between this and efforts to add a 3D dimension to 2D images, I
       | don't see much of a future for digital multimedia creator jobs.
       | 
       | Even TikTok could be an endless stream of ML models.
       | 
        | Fears of a tech dystopia may be overblown; the masses will just
        | shut off their gadgets and live simpler if labor markets implode
        | within the traditional, politically correct economic system we
        | have.
       | 
       | Open source AI is on the verge of upending the software industry
       | and copyright. I dig it.
        
       | jclardy wrote:
       | Is there a proper term to encapsulate M1/M2 Macs now that we have
        | the M2? E.g. "Apple Silicon Macs" works but is a bit long. MX Macs?
       | M-Series? ARM Macs?
        
         | fragmede wrote:
          | The backend, as contributed by Apple, is called MPS.
        
         | pavlov wrote:
         | "uname -m" returns "arm64" on these computers, so you could say
         | macOS on arm64.
        
         | qayxc wrote:
         | M-series sounds great, IMHO.
        
       | sgt101 wrote:
       | might be easier to wait for Diffusers to merge the pull
       | request...
        
       | johnfn wrote:
       | For those as keen as I am to try this out, I ran these steps,
       | only to run into an error during the pip install phase:
       | 
       | > ERROR: Failed building wheel for onnx
       | 
       | I was able to resolve it by doing this:
       | 
       | > brew install protobuf
       | 
       | Then I ran pip install again, and it worked!
        
         | geerlingguy wrote:
         | In the troubleshooting section it mentions running:
         | brew install Cmake protobuf rust
         | 
         | To fix onnx build errors. I had the same issue.
        
         | jonplackett wrote:
         | What kind of speed does this run at? Eg. How long to make a
         | 512x512 image at standard settings?
        
           | jw1224 wrote:
           | On my M1 Pro MBP with 16GB RAM, it takes ~3 minutes.
        
           | pwinnski wrote:
           | I haven't installed from this link specifically, but I used
           | one of the branches on which this is based a few days ago, so
           | the results should be similar.
           | 
           | On a first-gen M1 Mac mini with 8GB RAM, it takes 70-90
           | minutes for each image.
           | 
           | Still feels like magic, but old-school magic.
        
             | antihero wrote:
             | On an M1 Pro 16GB it is taking a couple minutes for each
             | image.
        
               | hbn wrote:
               | Is that the difference in graphics performance between
               | the M1 and M1 Pro or did the other person do something
               | wrong? 70-90 minutes seems nuts
        
               | pwinnski wrote:
               | I have the M1 8GB I mentioned in my first comment, and
                | the M1 Pro 16GB I mentioned in my second comment, side-
               | by-side. However, the first one was running a Stable
               | Diffusion branch from earlier in the week, so I replaced
               | using the same instructions. The only difference _now_ is
               | the physical hardware.
               | 
               | The thing to understand is that the 8GB M1 has 8GB. When
               | I run txt2img.py, my Activity Monitor shows a Python
               | process with 9.42GB of memory, and the "Memory Pressure"
               | graph spends time in the red zone as the machine is
               | swapping. While the 16GB M1 Pro _immediately_ shows PLMS
               | Sampler progress, and consistently spends around 3
               | seconds per iteration (e.g.  "3.29s/it" and "2.97s/it"),
               | the 8GB M1 takes _several_ minutes before it jumps from
               | 0% to 2% progress, and it accurately reports
               | "326.24s/it"
               | 
               | So yes, whether it's M1 vs M1 Pro, or 8GB vs 16GB, it
               | really is that stark a difference.
               | 
               | Update: after the second iteration it is 208.44s/it, so
               | it is speeding up. It should drop to less than 120s/it
               | before it finishes, if it runs as quickly as my previous
               | install. And yes, 186.04s/it after the third iteration,
               | and 159.22s/it after the fourth.
        
               | smoldesu wrote:
               | Sounds entirely like a swap-constrained operation. You
               | need ~8gb of VRAM to load the uncompressed model into
               | memory, which obviously won't work well on a Macbook with
               | 8gb of memory.
        
               | nicoburns wrote:
               | Might be the RAM difference. RAM is shared between CPU
               | and GPU on the M1 series processors.
        
               | mattkevan wrote:
               | My 16gb M1 Air was initially taking 13 minutes for a 50
               | step generation. But when I closed all the open tabs and
               | apps it went down to 3 minutes.
               | 
               | Looks like RAM drastically affects the speed.
        
               | ralferoo wrote:
               | My first-gen M1 MacBook Air with 16GB takes just under 4
               | minutes per image. Running top while it's generating
               | shows memory usage fluctuating between 10GB and 13GB, so
               | if you're running on 8GB it's probably swapping a lot.
        
             | pwinnski wrote:
             | Installed from this link on a MacBook Pro (16-inch, 2021)
             | with Apple M1 Pro and 16GB. First run downloads stuff, so I
             | omit that result.
             | 
             | I had a YouTube video playing while I kicked off the exact
             | command in the install docs, and got: 16.84s user 99.43s
             | system 61% cpu 3:08.51 total
             | 
             | Next attempt, python aborted 78 seconds in! Weird.
             | 
             | Next attempt, with YouTube paused: 16.31s user 95.48s
             | system 65% cpu 2:49.45 total
             | 
             | So around three minutes, I'd say.
        
             | Turing_Machine wrote:
             | A little over three minutes on a first-gen M1 iMac with
             | 16GB.
             | 
             | It looks like memory is super-important for this (which
             | isn't all that surprising, really...).
        
           | johnfn wrote:
           | Looks like I'm getting around 4s per iteration on my M1 Max.
           | At 50 iterations, that's 200 seconds.
        
           | whywhywhywhy wrote:
           | M1 Max (32gb) is around 35 seconds per image.
        
           | moneycantbuy wrote:
           | For 512x512 on M1 MAX (32 core) with 64 GB RAM I'm getting
           | 1.67it/s so 30.59s with the default ddim_steps=50.
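
            For converting these progress-bar readouts: tqdm prints it/s
            when faster than one iteration per second and s/it when slower.
            A quick sketch:

            ```python
            def seconds_per_image(iters_per_sec: float, steps: int = 50) -> float:
                """Wall time per image from a tqdm `it/s` readout, assuming
                the default 50 ddim steps. For an `s/it` readout, pass its
                reciprocal."""
                return steps / iters_per_sec

            # 1.67 it/s at 50 steps -> about 30 s per image
            # 4 s/it (0.25 it/s)    -> 200 s per image
            ```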
        
             | colaco wrote:
              | I've gotten 1.35it/s, which corresponds to 38s, but I have
              | the M1 Max with the 24-core GPU (the "lower end" one).
        
           | _ph_ wrote:
           | On my M2 Air, 16G, 10 CPU cores, the default command as in
           | the installing instructions takes like 2m20s.
        
           | chemeng wrote:
           | Getting around 4 minutes per image on M1 MacBook Air 16GB
        
         | matsemann wrote:
          | Python dependency hell in a nutshell. It's impossible to
          | distribute ML projects that can easily be run.
        
       | jw1224 wrote:
       | Are we being pranked? I just followed the steps but the image
       | output from my prompt is just a single frame of Rick Astley...
       | 
       | EDIT: It was a false-positive (honest!) on the NSFW filter. To
       | disable it, edit txt2img.py around line 325.
       | 
       | Comment this line out:                   x_checked_image,
       | has_nsfw_concept = check_safety(x_samples_ddim)
       | 
       | And replace it with:                   x_checked_image =
       | x_samples_ddim
        
         | pja wrote:
         | That means the NSFW filter kicked in IIRC from reading the
         | code.
         | 
         | Change your prompt, or remove the filter from the code.
        
           | [deleted]
        
           | johnfn wrote:
           | Haha, busted!
        
             | pja wrote:
             | To be fair, the reason the filter is there is that if you
             | ask for a picture of a woman, stable diffusion is pretty
             | likely to generate a naked one!
             | 
             | If you tweak the prompt to explicitly mention clothing, you
             | should be OK though.
        
             | [deleted]
        
         | creddit wrote:
          | The same thing happened to me, which is especially odd as I
          | literally just pasted the example command.
        
         | r3trohack3r wrote:
          | If you open up the txt2img and img2img scripts, there is a
          | content filter. If your prompt generates anything that gets
          | detected as "inappropriate", the image is replaced with Rick
          | Astley.
         | 
         | Removing the censor should be pretty straightforward, just
         | comment out those lines.
        
           | nonethewiser wrote:
           | It bothers me that this isn't just configurable. Why would
           | they not want to expose this as a feature?
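
          A hypothetical sketch of what a configurable filter could look
          like — the helper and its `skip` switch are made up for
          illustration, not part of the actual txt2img.py, which hard-codes
          the check:

          ```python
          def maybe_check_safety(samples, check_safety, skip=False):
              """Wrap the safety check behind a switch.

              With skip=True the raw samples come back untouched (no Rick
              Astley substitution); otherwise defer to the real checker,
              which returns (possibly replaced images, per-image NSFW
              flags).
              """
              if skip:
                  return samples, [False] * len(samples)
              return check_safety(samples)
          ```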
        
         | joshmlewis wrote:
         | When the model detects NSFW content it replaces the output with
         | the frame of Rick Astley.
        
           | rhacker wrote:
           | It's kind of amazing that ML can now intelligently rick roll
           | people.
           | 
           | I think it would be awesome to update the rickroll feature to
           | the following:
           | 
           | Auto Re-run the img2img with some text prompt: "all of the
           | people are now Rick Astley" with low strength so it can
           | adjust the faces, but not change the nudity!!!1
        
             | GordonS wrote:
             | Hah, it would be hilarious if it generated all the nudity
             | you wanted - but with Rick Astley's face on every naked
             | person!
        
       | wenbin wrote:
       | Thanks for the writeup! It works smoothly on my M1 Macbook Pro!
       | 
        | A few days ago, I tried the Stable Diffusion code and was not
        | able to get it to work :( Then I gave up...
        | 
        | Today, following the steps in this blog post, it worked on the
        | very first try. Happy!
        
       | adrianvoica wrote:
       | Tried "transparent dog", got rickrolled. Why is this NSFW?
       | ...anyway, I disabled the filter and... it's pretty neat! Calling
       | all AI Overlords, soon. :))
        
       | _venkatasg wrote:
        | I keep running into issues, even after installing Rust in my
        | conda environment. Specifically the issue seems to be building
        | wheels for `tokenizers`:
        | 
        |     warning: build failed, waiting for other jobs to finish...
        |     error: build failed
        |     error: `cargo rustc --lib --message-format=json-render-diagnostics
        |       --manifest-path Cargo.toml --release -v --features
        |       pyo3/extension-module -- --crate-type cdylib -C
        |       'link-args=-undefined dynamic_lookup
        |       -Wl,-install_name,@rpath/tokenizers.cpython-310-darwin.so'`
        |       failed with code 101
        |     [end of output]
        | 
        |     note: This error originates from a subprocess, and is likely
        |     not a problem with pip.
        |     ERROR: Failed building wheel for tokenizers
        |     Failed to build tokenizers
        |     ERROR: Could not build wheels for tokenizers, which is
        |     required to install pyproject.toml-based projects
        | 
        | Any suggestions?
        
         | benhalllondon wrote:
         | I played around a bit and found out dropping the tokenisers
         | version to 0.11.6 worked
         | 
         | `pip install tokenizers==0.11.6` first
        
       | keepquestioning wrote:
       | One beautiful thing I realized about all this progress in AI.
       | 
       | We will still need people to do the hard yards, and get dirt
       | between their fingernails. I am firmly in the camp of those
       | people.
       | 
       | Fancy algorithms won't dig holes, lay rail tracks over hundreds
       | of miles, or build houses all across the world.
        
         | blagie wrote:
         | Are you following progress in robotics?
        
           | keepquestioning wrote:
           | Once they combine the progress in generative algorithms and
           | robotics, then we are truly done for.
        
           | dougmwne wrote:
           | School us! What's the latest in robotics that is going to
           | knock our socks off?
        
             | scoopertrooper wrote:
             | They've got death robots that can fly now.
             | 
             | Why is nobody impressed by the future?
        
               | dougmwne wrote:
               | I believe those are human controlled, no? Robotics gets
               | really interesting when the robots can start driving and
               | building the roads as well.
        
               | neurostimulant wrote:
                | Not necessarily. Some just require a human to turn them
                | on, and they'll loiter and attack enemies autonomously
                | (loitering munition [1]).
               | 
               | [1] https://en.wikipedia.org/wiki/Loitering_munition
        
       | imtemplain wrote:
       | I'm ready to pay for a Windows + AMD GPU guide at this point, why
       | is there no single blogpost on this, please help.
        
       | usehackernews wrote:
       | Magnusviri[0], the original author of the SD M1 repo credited in
       | this article, has merged his fork into the Lstein Stable
       | Diffusion fork.
       | 
       | You can now run the Lstein fork[1] with M1 as of a few hours ago.
       | 
       | This adds a ton of functionality - GUI, Upscaling & Facial
       | improvements, weighted subprompts etc.
       | 
       | This has been a big undertaking over the last few days, and I
       | highly recommend checking it out. See the Mac M1 readme [2]
       | 
       | [0] https://github.com/magnusviri/stable-diffusion
       | 
       | [1] https://github.com/lstein/stable-diffusion
       | 
       | [2] https://github.com/lstein/stable-
       | diffusion/blob/main/README-...
        
         | yieldcrv wrote:
         | are there benchmarks?
         | 
         | I was following the github issue and the CPU-bound one was at
         | 4-5 minutes, the MPS one was at 30 seconds, then 18 seconds,
         | and people were still calling that slow.
         | 
         | What is it currently at now?
         | 
         | and I don't know what "fast" is, to compare
         | 
         | What are Windows 10 machines with nice Nvidia chips w/ CUDA
         | getting? Just curious what's comparable.
        
           | squeaky-clean wrote:
           | > What are the Windows 10 with nice Nvidia chips w/ CUDA
           | getting?
           | 
           | Are you referring to single iteration step times, or whole
           | images? Because obviously it depends on the number of
           | iteration steps used.
           | 
           | Windows 10, RTX 2070 (laptop model), lstein repo. I get about
           | 3.2 iter/sec. A 50 step 512x512 image takes me 15 seconds.
        
             | yieldcrv wrote:
             | I'm referring to there being a community effort to
             | normalize performance metrics and results at all, with the
             | M1 devices being in that list as well, so that we dont have
             | to ask these questions to begin with
             | 
             | Are you aware of any wiki or table like that?
        
             | Aeolun wrote:
              | Huh, that's the same speed I get on Colab. Pretty good.
        
           | zone411 wrote:
           | Around 6 seconds.
        
           | dmd wrote:
            | Wait, what? On my M1 iMac I'm getting about 25 _minutes_.
            | What am I doing wrong?
        
             | BrentOzar wrote:
             | It's falling back to CPU. Follow the instructions to use a
             | GPU version - sometimes it's even a completely different
             | repo, depending on whose instructions you're following.
        
               | dmd wrote:
               | I followed https://replicate.com/blog/run-stable-
               | diffusion-on-m1-mac
        
         | bfirsh wrote:
         | Nice. We'll get this guide updated for this fork. Everything's
         | moving so fast it's hard to keep track!
         | 
         | We struggled to get Conda working reliably for people, which it
         | looks like lstein's fork recommends. I'll see if we can get it
         | working with plain pip.
        
           | pugio wrote:
           | I really appreciate the use of pip > conda. Looking forward
           | to the update for the repo!
        
           | bfirsh wrote:
           | Running lstein's fork with these requirements[0] but seeing
           | this output[1]. Same steps as original guide otherwise.
           | 
           | Anyone got any ideas?
           | 
           | [0] https://github.com/bfirsh/stable-
           | diffusion/blob/392cda328a69...
           | 
           | [1] https://gist.github.com/bfirsh/594c50fd9b2e6b173e31de753a
           | 842...
        
             | sork_hn wrote:
             | Same output for me also.
             | 
             | EDIT: https://github.com/lstein/stable-
             | diffusion/issues/293#issuec... fixed it for me.
        
               | bfirsh wrote:
               | Boom - nice. Here's a fork with that:
               | https://github.com/bfirsh/stable-diffusion/tree/lstein
               | 
               | Requirements are "requirements-mac.txt" which'll need
               | subbing in the guide.
               | 
               | We're testing this out with a few people in Discord
               | before shipping to the blog post.
        
           | jw1224 wrote:
           | Check my comment alongside yours, I got Conda to work but it
           | did require the pre-requisite Homebrew packages you
           | originally recommended before it would cooperate :)
        
         | toinewx wrote:
         | Everything works except it only generates black images.
         | 
         | did you run
         | 
         | python scripts/preload_models.py
         | 
         | python scripts/dream.py --full_precision ?
        
           | arthurcolle wrote:
           | Disable safety check
        
         | jw1224 wrote:
         | Brilliant, thank you! I just got OP's setup working, but this
         | seems much more user-friendly. Giving it a try now...
         | 
         | EDIT: Got it working, with a couple of pre-requisite steps:
         | 
         | 0. `rm` the existing `stable-diffusion` repo (assuming you
         | followed OP's original setup)
         | 
         | 1. Install `conda`, if you don't already have it:
         | brew install --cask miniconda
         | 
         | 2. Install the other build requirements referenced in OP's
         | setup:                   brew install cmake protobuf rust
         | 
         | 3. Follow the main installation instructions here:
         | https://github.com/lstein/stable-diffusion/blob/main/README-...
         | 
         | Then you should be good to go!
         | 
         | EDIT 2: After playing around with this repo, I've found:
         | 
         | - It offers better UX for interacting with Stable Diffusion,
         | and seems to be a promising project.
         | 
         | - Running txt2img.py from lstein's repo seems to run about 30%
         | faster than OP's. Not sure if that's a coincidence, or if
         | they've included extra optimisations.
         | 
         | - I couldn't get the web UI to work. It kept throwing the
         | "leaked semaphore objects" error someone else reported (even
         | when rendering at 64x64).
         | 
         | - Sometimes it rendered images just as a black canvas, other
         | times it worked. This is apparently a known issue and a fix is
         | being tested.
         | 
         | I've reached the limits of my knowledge on this, but will be
         | following closely as new PRs are merged in over the coming
         | days. Exciting!
        
           | pugio wrote:
           | Can you describe how you did (/ are doing) this? Do you now
           | need to use conda (as opposed to OPs pip only version)?
        
             | jw1224 wrote:
             | See my edit for more info. (Just ironing out a couple of
             | other issues I've found, so might update it again shortly)
        
           | toinewx wrote:
           | I was able to avoid black images by using a different
           | sampler:
           | 
           | --sampler k_euler
           | 
           | full command:
           | 
           | "photography of a cat on the moon" -s 20 -n 3 --sampler
           | k_euler -W 384 -H 384
        
             | jastanton wrote:
             | I tried that as well but resulted in an error:
             | 
             | AttributeError: module 'torch._C' has no attribute
             | '_cuda_resetPeakMemoryStats'
             | 
             | https://gist.github.com/JAStanton/73673d249927588c93ee530d0
             | 8...
        
           | johnfn wrote:
           | I followed all these steps, but I got this error:
           | 
           | > User specified autocast device_type must be 'cuda' or 'cpu'
           | 
           | > Are you sure your system has an adequate NVIDIA GPU?
           | 
           | I found the solution here: https://github.com/lstein/stable-
           | diffusion/issues/293#issuec...
        
           | toinewx wrote:
           | I only get black images.
        
             | arthurcolle wrote:
             | You have to disable the safety checker after creating the
             | pipe
        
           | [deleted]
        
       | mark_l_watson wrote:
       | Thanks for writing this up!! I enjoyed getting TensorFlow running
       | with the M1, although a multi-headed model I was working on
       | wouldn't run.
       | 
       | I just made my Dad's 101st birthday card using OpenAI's image
       | generating service (he loved it) and when I get home from
       | travel I will use your instructions in the linked article.
       | 
       | Any advice for running Stable Diffusion locally vs. Colab Pro or
       | Pro+? My M1 MacBook Pro only has 8G ram (I didn't want to wait a
       | month for a 16G model). Is that enough? I have a 1080 with 10G
       | graphics memory. Is that sufficient?
        
         | ErneX wrote:
         | From the comments here 8GB is not enough, it will swap a lot
         | and take way more time than a 16GB MacBook.
        
         | Razengan wrote:
         | 101 years! Congratulations!! Does he own a suspiciously plain
         | gold ring by any chance?
        
       | gzer0 wrote:
       | The difference between an M2 air (8gb/512gb) versus an M1 pro
       | (16gb/1tb) is much more than I expected.                 * M1 pro
       | (16gb/1tb) can run the model in around 3 minutes.       * M2 air
       | (8gb/512gb) takes ~60 minutes for the same model.
       | 
       | I knew there would be some throttling due to the M2 Air's
       | fanless design, but I had no idea it would be a 20x difference
       | (albeit the M1 Pro does have double the RAM; I don't have any
       | other MacBooks to test this on).
        
         | andybak wrote:
         | Unscientifically that puts the M1 Pro GPU at about 25% of the
         | performance of a RTX 3080.
         | 
         | Not too shabby...
         | 
         | EDIT - this comment implies it's _much_ faster:
         | https://news.ycombinator.com/item?id=32679518
         | 
         | If that's correct then it's close to matching my 3080 (mobile).
        
           | fassssst wrote:
           | img2img runs in 6 seconds on my GeForce 3080 12 GB. 6+ it/s
           | depending on how much GPU memory is available. If I have any
           | Electron apps running it slows down dramatically.
        
             | andybak wrote:
             | Curious about:
             | 
             | 1. Image size
             | 
             | 2. Steps
             | 
             | 3. What your numbers are for text2img
             | 
             | 4. (most importantly) are you including the 30 seconds or
             | so it takes to load the model initially? i.e. if you were
             | to run 10 prompts and then divide the total time by 10,
             | what are your numbers?
        
               | cube2222 wrote:
               | Re 4 the lstein repo gives you an interactive repl, so
               | you don't have to reload the model on every prompt.
               | 
                | I also have a 3080 and as far as I remember (not at my
                | PC right now) it was 3-10 secs for img2img, 512px,
                | cfg 13, 50 steps, batch size 1, ddim sampler.
        
             | fragmede wrote:
             | what args are you passing to img2img?
        
           | valley_guy_12 wrote:
           | It's likely that a significant fraction of the perf
           | difference between Apple's GPUs and NVIDIA GPUs is due to
           | NVIDIA's CUDA being highly optimized, and Pytorch being
           | tuned to work with CUDA.
           | 
           | If Pytorch's Metal support improves and Apple's Metal
           | drivers improve (big ifs), it's likely that Apple's GPUs
           | will perform better relative to NVIDIA than they currently
           | do.
        
             | DubiousPusher wrote:
             | > It's likely that a significant fraction of the perf
             | difference between Apple's GPUs and NVIDIA GPUs is due to
             | NVIDIA's CUDA being highly optimized, and Pytorch being
             | tuned to work with CUDA.
             | 
             | You really think the orders of magnitude more parallelism
             | in AMD and Nvidia's discrete GPUs has nothing to do with
             | it?
        
         | valley_guy_12 wrote:
         | That's probably due to swapping due to the 8GB of RAM. People
         | who have run Stable Diffusion on M2 airs with 16 GB of RAM seem
         | to get performance that is in line with their GPU core count.
        
           | bfirsh wrote:
           | Correct. We've been seeing 8GB is super slow, >=16GB is fast.
           | We'll add that to the prerequisites.
        
         | _ph_ wrote:
         | I would assume it is the memory. The test command from the
         | discussed link runs in slightly over 2 minutes on my M2 Air
         | (16gb). How long does it take for yours?
        
         | JimmyAustin wrote:
         | I suspect that the M2 air is thrashing the disk pretty
         | aggressively. Diffusion models rerun the same model once per
         | step, so for a generation with 50 steps, you copy the entire
         | model in and out of memory 50 times. That's going to kill
         | performance.
        
           | schleck8 wrote:
           | It's only copied to VRAM once when implemented correctly.
        
             | astrange wrote:
             | M1 is a unified memory system and doesn't have VRAM.
        
           | bm-rf wrote:
           | I believe the model is copied into ram once upon calling
           | StableDiffusionPipeline, unless the mac implementation
           | partially loads the model due to only having 8G of ram.
        
         | bm-rf wrote:
         | just water cool it! https://www.youtube.com/watch?v=9DyUitTVWlw
        
         | qayxc wrote:
         | I suspect the lack of RAM is the issue here.
        
       | moneycantbuy wrote:
       | Anyone know the largest possible image size > 512x512? I'm
       | getting the following error when trying 1024x1024 with 64 GB RAM
       | on M1 MAX:
       | 
       | /opt/homebrew/Cellar/python@3.10/3.10.6_2/Frameworks/Python.frame
       | work/Versions/3.10/lib/python3.10/multiprocessing/resource_tracke
       | r.py:224: UserWarning: resource_tracker: There appear to be 1
       | leaked semaphore objects to clean up at shutdown
       | warnings.warn('resource_tracker: There appear to be %d '
        
         | enduser wrote:
         | I have the same problem with anything over 512x512 on my M1
         | Ultra with 128GB. VRAM must be capped.
        
           | moneycantbuy wrote:
           | Thanks for the Ultra data point.
           | 
           | I'm able to get 768x896 to run, but the output image is
           | still white noise at 50 ddim steps, perhaps related to the
           | phenomenon of being trained/windowed on 512x512 images, as
           | sibling squeaky-clean described.
           | 
           | RAM usage at various sizes: 512x512 14 GB, 768x768 26 GB,
           | 768x896 32 GB
        
             | squeaky-clean wrote:
             | Those values seem really high compared to my setup,
             | windows/nvidia/lstein repo. For me 512x512 uses 6.1GB.
             | 
             | Random guess but I think your pipeline is running with
             | full-precision floats (32bit), while by default the repo
             | should be using autocast() which will try to use half-
             | precision floats wherever possible.
             | 
             | I know an optimizedSD repo exists and one of the steps they
             | take is explicitly setting precision to half. (And other
             | changes that reduce memory usage but decrease iteration
             | speed). However I don't know how M1/Metal handles half-
             | precision, hopefully it doesn't just cast them back to
             | 32bit.
             | 
             | Also white noisy images at 50 steps seems off to me. At 50
             | steps in a large image I definitely get a visible product.
             | It's just often non-euclidean or very scattered bits of
             | organization and chaos.
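       [Editor's note: the factor-of-two from precision alone is easy to
       check with back-of-envelope arithmetic. The parameter count below is
       an approximation, not a figure from this thread:]

```python
# Rough weight-memory arithmetic: fp32 stores 4 bytes per parameter,
# fp16 (half precision) stores 2. The parameter count is approximate.
UNET_PARAMS = 860_000_000  # SD v1 UNet, roughly


def weight_gb(n_params: int, bytes_per_param: int) -> float:
    """Gigabytes needed just to hold the weights."""
    return n_params * bytes_per_param / 1e9


fp32_gb = weight_gb(UNET_PARAMS, 4)  # ~3.4 GB
fp16_gb = weight_gb(UNET_PARAMS, 2)  # ~1.7 GB
```

       [Activations and attention buffers scale similarly, which is why a
       pipeline accidentally running at full precision can roughly double
       memory use compared to one using autocast/half precision.]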
        
         | squeaky-clean wrote:
         | Don't specifically know sorry, largest I can generate with my
         | (windows pc) vram size is 512x1024. But just wanted to comment
         | that SD is trained on 512x512 images and runs iterations in a
         | 512x512 window.
         | 
         | This means anything larger than 512x512 tends to confuse it.
         | For example 1024x1024 will have 4 non-overlapping windows and
         | many overlapping windows.
         | 
         | So if your prompt is "a cat wearing sunglasses", you may get 4
         | separate cats as the 512x512 windows have no knowledge of each
         | other, and each window is trying to fulfill the goal. Even more
         | likely you'll get some sort of eldritch horror 16-legged cat
         | being as the windows shuffle around.
         | 
         | Sometimes it just works perfectly somehow, but 90% of the time
         | the non native resolution really screws it up. I'd suggest
         | generating 512x512 images and using a different AI to upscale
         | them in most cases.
         | 
         | However it does lead to some amazing fantasy landscape art as
         | you get weird terrains and mountains shoved up against each
         | other in fantastic/magical ways.
        
         | valley_guy_12 wrote:
         | Supposedly Stable Diffusion was trained on 512 x 512 images, so
         | it's not clear that it will work well for larger images even if
         | you had the RAM.
         | 
         | To generate larger images, it is standard practice to generate
         | 512 x 512 and then use a separate tool to upscale, and maybe a
         | second separate tool to improve the face. The Windows versions
         | of SD environments are starting to incorporate these additional
         | tools, but the Apple Silicon versions of SD environments are
         | lagging behind due to Pytorch metal limitations.... It'll
         | hopefully sort itself out in the next few months.
        
         | capableweb wrote:
         | Just 512x512 requires something like 10GB of VRAM on your
         | GPU; 1024x1024 would need even more. How much VRAM does the
         | M1 Max GPU have? You're probably running out of memory.
        
           | tgtweak wrote:
           | The M-series architecture uses unified memory, so system
           | memory is shared as GPU memory. There is likely a cap
           | somewhere on how much can be used by a single app (or
           | collectively by all apps).
        
       | deckeraa wrote:
       | Very nice to see this available for hardware I own.
       | 
       | Now I can achieve my dream of a Corporate Memphis + Hieronymus
       | Bosch mashup.
        
       | avereveard wrote:
       | How fast is it on a m1?
        
         | ebiester wrote:
         | It takes 1-2 mins for a 512x512 image. It's been a lot of fun
         | since I did this last night.
        
           | smoldesu wrote:
           | For reference, inferencing the model on a 2070 takes 10-12
           | seconds for the same size at max-precision, and the 3070 can
           | synthesize an image in almost 6 seconds.
           | 
           | If you extrapolate the power consumption (3070 @~300w vs M1
           | Pro GPU@~30-50w) the metrics make a lot of sense.
        
         | nathas wrote:
         | I haven't run this fork yet, but about 1.3 sec/iter. Usually
         | ~30-50 iters/sample (image).
        
         | pwinnski wrote:
         | The answer depends VERY MUCH on RAM.
         | 
         | My M1 with 8GB takes 70-90 minutes per image.
         | 
         | My M1 Pro with 16GB takes 3 minutes per image.
        
           | fifafu wrote:
           | Another data point: M1 Max with 64GB takes ~40 seconds
        
           | astrange wrote:
           | Hopefully the library is intentionally going very slowly
           | trying to fit in RAM and that's not just your 8GB machine
           | totally falling over.
        
       | adamj9431 wrote:
       | How is Stable Diffusion on DreamStudio.ai so much faster than the
       | reports here? Seems to only take 5-10 seconds to generate an
       | image with the default settings.
       | 
       | I.e. How are they providing access to GPU compute several orders
       | of magnitude more powerful than an M1, for free?
        
         | tgtweak wrote:
         | A100 devices in the cloud on preemptible/spot instances?
        
         | schleck8 wrote:
         | 1. Dreamstudio is paid, with 2 EUR worth of credit for free.
         | 
         | 2. The M1 GPU is an iGPU. A good iGPU, sure, but it's not
         | anywhere near the performance of a dedicated, cooled GPU with
         | dedicated VRAM.
         | 
         | On a 2060 Super with 8 GB of VRAM and with tensor cores it
         | takes 15 seconds to infer with the default settings. If
         | Dreamstudio uses deep learning GPUs then there is your answer
         | to why it is as fast.
        
         | fleddr wrote:
         | The guy behind it is an ex hedge fund manager. Using private
         | funds he's built a massive fleet of A100s at AWS.
         | 
         | So it's an enormous amount of compute created from private
         | funds that he considers to be "for humanity". Currently, he
         | funds it and is also the "GPU overlord", he exclusively decides
         | which applications gets to use it.
         | 
         | His plan, or at least his claim, is to transform this situation
         | in it being more diversely funded (institutions, businesses,
         | even the UN) and for access to be decided by committee with
         | main criteria it being useful for humanity.
         | 
         | Let's see if he sticks to his word, but I find it
         | inspirational. AI was on a trajectory to be solely in the
         | hands of a handful of ultra rich companies that can afford to
         | train and run it, with us poor mortals at the whims of
         | gatekeepers' terms.
         | 
         | This guy is on a trajectory to put AI in the hands of the
         | people. Not just for art, for everything. If he fully sees this
         | through, he's destined to be a tech icon.
        
           | preommr wrote:
           | The guy's name is Emad Mostaque btw
           | 
           | There's a recent video interview he did that goes into his
           | vision: https://www.youtube.com/watch?v=YQ2QtKcK2dA
        
       | vvanirudh wrote:
       | Running into this error `RuntimeError: expected scalar type
       | BFloat16 but found Float` when I run `txt2img.py`
        
         | yboris wrote:
         | Confirming I'm stuck on the same error when running the
         | tutorial-instructed python scripts/txt2img.py command
         | RuntimeError: expected scalar type BFloat16 but found Float
        
           | ml_basics wrote:
            | Yes, me too! Please post here if you find a solution, for
            | all the other people that come and find this by
            | command-F'ing this error.
        
             | sytse wrote:
              | I'm stuck on 'RuntimeError: expected scalar type BFloat16
              | but found Float' too. The most relevant link seems to be
              | https://github.com/CompVis/stable-diffusion/pull/47 but
              | I'm not sure. Please post when there is a solution.
        
               | alvb wrote:
                | That might have to do with your macOS version. Pre-12.4
                | macOS does not allow the Torch backend to use the M1
                | GPU, so the script attempts to use the CPU, but the CPU
                | does not support half-precision numbers.
        
         | yboris wrote:
         | _SOLUTION_ - append `--precision full` to the command
        
           | sytse wrote:
           | Awesome, that works
           | 
           | For reference the full command:
           | 
           | python scripts/txt2img.py \
           |   --prompt "a red juicy apple floating in outer space, like
           |   a planet" \
           |   --n_samples 1 --n_iter 1 --plms --precision full
        
       | moneycantbuy wrote:
       | What's with the ~25% chance of an image being all black? Also,
       | seeds aren't replicating.
        
       | sroussey wrote:
       | This should be put into a Docker image to avoid various
       | potential conflicts with locally installed libraries.
       | 
       | Anyone do this for the M1?
        
         | schleck8 wrote:
         | Conda environment
        
         | hnarayanan wrote:
         | Do you want to add more layers to make it extra slow?
        
         | bfirsh wrote:
         | Unfortunately this can't run in Docker because Docker for Mac
         | can't access the M1 GPU. (Several layers of virtualization and
         | emulation!)
        
           | sroussey wrote:
           | Ah, thanks. Yes, that makes sense.
        
       | mdswanson wrote:
       | virtualenv isn't required. You can just use python -m venv venv
       | and get the same results with one fewer dependency.
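       [Editor's note: a minimal sketch of the pip-only flow using the
       stdlib venv module, as the comment suggests. The requirements file
       name is an assumption based on the typical guide:]

```shell
# Create and activate a virtual environment with the stdlib venv module
# (no third-party virtualenv needed).
python3 -m venv venv
. venv/bin/activate
# Then install dependencies as the guide describes, e.g.:
#   pip install -r requirements.txt
```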
        
       | simonebrunozzi wrote:
       | I don't want to sound lazy, but I would be expecting a .dmg for
       | Macs, and I don't seem to find it. Am I blind, or it simply
       | hasn't been prepared yet?
        
         | omarelbie wrote:
         | At the rate it's moving it doesn't seem too far off, but I
         | think it's just a tad too early.
        
       | cageface wrote:
       | After playing around with all of these ML image generators I've
       | found myself surprisingly disenchanted. The tech is extremely
       | impressive but I think it's just human psychology that when you
       | have an unlimited supply of something you tend to value each
       | instance of it less.
       | 
       | Turns out I don't really want thousands of good images. I want a
       | handful of excellent ones.
        
       | andrethegiant wrote:
       | Yesssss I've been waiting for this!
        
       | sgt101 wrote:
       | Also: brew upgrade, not brew update
        
       | sp332 wrote:
       | Any chance of this running on an M1 iPad Pro?
        
         | boppo1 wrote:
         | Seconding this
        
         | liuliu wrote:
         | Probably another week or two. Running on an M1 iPad Pro means
         | getting out of PyTorch, possibly exporting the model through
         | TorchScript and then doing an ONNX conversion. From what I've
         | found so far, not many of these conversions have been done
         | (except the OpenVINO one, maybe?).
        
       | ebiester wrote:
       | Note: I ran this but haven't yet been able to get img2img
       | working. I borked it up trying to get conda working.
       | 
       | It's been a lot of fun to play with so far though!
        
         | bravura wrote:
         | Apparently, this release should include a Dockerfile for easier
         | replicability.
        
           | bfirsh wrote:
           | Unfortunately this can't run in Docker because Docker for Mac
           | can't access the M1 GPU. (Several layers of virtualization
           | and emulation!)
        
         | nathas wrote:
         | Try the lstein fork: https://github.com/lstein/stable-
         | diffusion/tree/fix-cuda-res...
         | 
         | You'll still need to play with modifying some of the code to
         | get it to run, but `dream.py` works for me. Funny enough, I got
         | _only_ img2img effectively working with the lstein branch; it
         | broke txt2img for me.
        
         | StapleHorse wrote:
         | Yesterday I thought I broke it too. In my case, the solution
         | was just to make sure that the input image (from the editor or
         | otherwise) was the same size as the output image. Hope it
         | helps.
        
       | gregsadetsky wrote:
       | Bananas. Thanks so much... to everyone involved. It works.
       | 
       | 14 seconds to generate an image on an M1 Max with the given
       | instructions (`--n_samples 1 --n_iter 1`)
       | 
       | Also, interesting/curious small note: images generated with this
       | script are "invisibly watermarked" i.e. steganographied!
       | 
       | See https://github.com/bfirsh/stable-
       | diffusion/blob/main/scripts...
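       [Editor's note: for intuition on how a watermark can hide invisibly
       in pixels — the linked script uses the invisible-watermark package
       with a frequency-domain (DWT-DCT) scheme; the toy least-significant-
       bit encoder below only illustrates the general idea and is NOT SD's
       actual method:]

```python
# Toy LSB steganography: hide one bit in the least significant bit of
# each 8-bit channel value. Changing the low bit shifts the channel by
# at most 1 out of 255, which is imperceptible. Real schemes like the
# DWT-DCT method used by SD's scripts embed in the frequency domain so
# the mark survives resizing and compression, which raw LSB does not.

def embed_bit(channel: int, bit: int) -> int:
    """Overwrite the least significant bit of an 8-bit channel value."""
    return (channel & ~1) | (bit & 1)


def extract_bit(channel: int) -> int:
    """Read the hidden bit back out."""
    return channel & 1


def embed_bits(channels, bits):
    """Embed a bit sequence across a run of channel values."""
    return [embed_bit(c, b) for c, b in zip(channels, bits)]
```

       [Embedding [1, 0, 1] into channel values [200, 17, 64] yields
       [201, 16, 65] — each value moves by at most one step.]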
        
       | msoad wrote:
       | Please someone package all of this and the WebUI into an Electron
       | app so common people can also hack on it!
        
         | andrewmunsell wrote:
         | There are several packages that provide web UIs, like this one
         | for example: https://github.com/hlky/stable-diffusion-webui
         | 
         | It's not quite the ease of setup of an Electron app, but once
         | setup it's pretty easy to use.
        
       | r3trohack3r wrote:
       | I've been playing with Stable Diffusion a lot the past few days
       | on a Dell R620 CPU (24 cores, 96 GB of RAM). With a little
       | fiddling (not knowing any python or anything about machine
       | learning) I was able to get img2img.py working by simply
       | comparing that script to the txt2img.py CPU patch. Was only a few
       | lines of tweaking. img2img takes ~2 minutes to generate an
       | image with 1 sample and 50 steps; txt2img takes about 10
       | minutes for 1 sample and 50 steps.
       | 
       | The real bummer is that I can only get ddim and plms to run using
       | a CPU. All of the other diffusions crash and burn. ddim and plms
       | don't seem to do a great job of converging for hyper-realistic
       | scenes involving humans. I've seen other algorithms "shape up"
       | after 10 or so iterations from explorations people do online -
       | where increasing the step count just gives you a higher fidelity
       | and/or more realistic image. With ddim/plms on a CPU, every step
       | seems to give me a wildly different image. You wouldn't know that
       | steps 10 and steps 15 came from the same seed/sample they change
       | so much.
       | 
       | I'm not sure if this is just because I'm running it on a CPU
       | or if ddim and plms are just inferior to the other diffusion
       | models - but I've mostly given up on generating anything
       | worthwhile until I can get my hands on an NVIDIA GPU and
       | experiment more with faster turnarounds.
        
         | squeaky-clean wrote:
         | > You wouldn't know that steps 10 and steps 15 came from the
         | same seed/sample they change so much.
         | 
         | I don't think this is CPU-specific; it happens at very low
         | step counts even on the GPU. Most guides recommend
         | starting with 45 steps as a useful minimum for quickly trialing
         | prompt and setting changes, and then increasing that number
         | once you've found values you like for your prompt and other
         | parameters.
         | 
         | I've also noticed another big change sometimes happens between
         | 70-90 steps. It's not all the time and it doesn't drastically
         | change your image, but orientations may get rotated, colors
         | will change, the background may change completely.
         | 
         | > img2img takes ~2 minutes to generate an image with 1 sample
         | and 50 iterations
         | 
         | If you check the console logs you'll notice img2img doesn't
         | actually run the real number of steps. It's number of steps
         | multiplied by the Denoising Strength factor. So with a
         | denoising strength of 0.5 and 50 steps, you're actually running
         | 25 steps.
         | 
         | Later edit: Oh and if you do end up liking an image from step
         | 10 or whatever, but iterating further completely changes the
         | image, one thing you can do is save your output at 10 steps,
         | and use that as your base image for the img2img script to do
         | further work.
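The steps-times-denoising-strength arithmetic described above can be sketched in Python (a minimal illustration; `effective_img2img_steps` is a made-up name, but the formula mirrors the "number of steps multiplied by the Denoising Strength factor" truncation the comment describes):

```python
def effective_img2img_steps(steps: int, denoising_strength: float) -> int:
    """Steps img2img actually runs: the requested count scaled by
    denoising strength, truncated to an integer."""
    return int(denoising_strength * steps)

# 50 requested steps at denoising strength 0.5 -> 25 real steps
print(effective_img2img_steps(50, 0.5))  # 25
```

This also explains why img2img runs noticeably faster than txt2img at the same nominal step count.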
        
         | schleck8 wrote:
         | With the 1.4 checkpoint, basically everything under 40 steps
         | is unusable, and you only get good fidelity above 75 steps. I
         | usually use 100; that's a good middle ground.
        
           | auggierose wrote:
           | How do you change these steps in the given script? Is it the
           | --ddim_steps parameter? Or --n_iter? Or ... ?
        
       | caxco93 wrote:
       | could someone who has already done this please share how long it
       | takes for a 50 steps image to be generated?
        
         | nathas wrote:
         | 1.3 sec/iter on my M1 Mac, so ~39 seconds.
        
           | jw1224 wrote:
           | That was fast. I'm only getting 5.26s/iter on an M1 Pro MBP
           | with 16GB RAM.
           | 
           | EDIT: Speed increased to 2.3s/iter after a reboot
        
             | nathas wrote:
             | Depends what fork you're running... Some seem to be using
             | CPU-based generation, others use the MPS device backend
             | correctly which is MUCH faster. I have another comment
             | floating around about lstein's fork, but it takes some
             | massaging to get it to run happily.
             | https://github.com/lstein/stable-diffusion/
        
               | jw1224 wrote:
               | The fork linked by OP is MPS-based, I can see GPU usage
               | way up in Activity Monitor. Seems performance doubled
               | after a reboot though :)
        
             | geerlingguy wrote:
             | Weird, on M1 Max Mac Studio, only getting 1.42 it/s :/
        
               | nathas wrote:
               | I got my units backwards :sweat: My bad!
        
       | yoyohello13 wrote:
       | Has anybody had success getting newer AMD cards working?
       | 
       | ROCm support seems spotty at best, I have a 5700xt and I haven't
       | had much luck getting it working.
        
         | geerlingguy wrote:
         | I've tried using this set of steps [1], but have so far not had
         | luck, mostly because the ROCm driver setup is throwing me for a
         | loop. Tried it with an RX 6700 XT and first was going to test
         | on Ubuntu 22.04 but realized ROCm doesn't support that OS yet,
         | so tried again on 20.04 and ended up breaking my GPU driver!
         | 
         | [1]
         | https://gist.github.com/geerlingguy/ff3c3cbcf4416be2c0c1e0f8...
        
           | my123 wrote:
           | Yes. That's expected.
           | 
           | AMD market-segmented their RDNA2 support in ROCm to the
           | Navi21 dies only (6800/6800 XT/6900 XT).
           | 
           | It is not officially supported in any way on other RDNA2
           | GPUs. (Or even on the desktop RDNA2 range at all; that
           | only works because their top-end Pro cards share the same
           | die.)
        
             | geerlingguy wrote:
             | Oh... had no clue! Thanks for letting me know so I wouldn't
             | have to spend hours banging my head against the wall.
        
             | yoyohello13 wrote:
             | Looks like I may be out of luck with NAVI 10.
        
             | my123 wrote:
             | As an aside, a totally unsupported hack to make it
             | somewhat work on the smaller Navi2x dies like yours:
             | 
             | HSA_OVERRIDE_GFX_VERSION=10.3.0 to force using the Navi21
             | binary slice.
             | 
             | This is totally unsupported and please don't complain if
             | something doesn't work when using that trick.
             | 
             | But basic PyTorch use works using this, so you might get
             | away with it for this scenario.
             | 
             | (TL;DR: AMD just doesn't care about GPGPU on the
             | mainstream, better to switch to another GPU vendor that
             | does next time...)
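As a concrete sketch of the override above (the `export` is the trick described in the comment; the commented-out invocation is illustrative, not a specific fork's CLI):

```shell
# Totally unsupported: report the GPU as gfx1030 (Navi21) so ROCm
# loads the only RDNA2 binary slice AMD ships.
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Then run your fork's script as usual, e.g.:
#   python scripts/txt2img.py --prompt "..." --n_samples 1 --n_iter 1
echo "HSA_OVERRIDE_GFX_VERSION=$HSA_OVERRIDE_GFX_VERSION"
```

Per the comment, don't expect support if something breaks with this set.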
        
         | switchers wrote:
         | 6600XT reporting in. Spent a few hours on Windows and WSL2
         | setup attempts, got nowhere. I don't run Ubuntu at home and
         | don't want to dual boot just for this. From looking around I
         | think I'd have a better chance on native Ubuntu.
        
           | my123 wrote:
           | Buy an NVIDIA card. ROCm isn't supported in any way on WSL2,
           | but CUDA is.
           | 
           | AMD just doesn't invest in their developer ecosystem.
           | Also, since you use a 6600 XT, there's no official ROCm
           | support for your die - only for Navi21.
        
             | MintsJohn wrote:
             | Or wait, if it's just about Stable Diffusion: multiple
             | people are trying to create ONNX and DirectML forks of
             | the models/scripts, which at least in theory can work
             | for AMD GPUs on Windows and WSL2.
        
         | haxxorfreak wrote:
         | I have it working on an RX 6800, used the scripts from this
         | repo[0] to build a docker image that has ROCm drivers and
         | PyTorch installed.
         | 
         | I'm running Ubuntu 22.04 LTS as the host OS; I didn't have to
         | touch anything beyond the basic Docker install. Next step is
         | to build a new Dockerfile that adds in the Stable Diffusion
         | WebUI.[1]
         | 
         | [0] https://github.com/AshleyYakeley/stable-diffusion-rocm [1]
         | https://github.com/hlky/stable-diffusion-webui
        
           | tgtweak wrote:
           | The RX6800 seems like a great card for this - 16GB of
           | relatively fast VRAM for a good price.
           | 
           | How long does it take to do 50 iterations on a 512x512?
        
         | ece wrote:
         | I tried getting PyTorch Vulkan inference working with RADV;
         | it gives me a missing dtype error in VkFormat. FP16 and
         | normal precision give the same error. I think it's some bf16
         | thing.
        
       | dgreensp wrote:
       | I'm working on getting this running. Instead of
       | "venv/bin/activate" I had to run "source venv/bin/activate". And
       | I got an error installing the requirements, fixed by running "pip
       | install pyyaml" as a separate command.
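For reference, a sketch of that activation sequence (the pip lines are the workaround from the comment, shown commented out since they need the repo's requirements file present):

```shell
# "." and "source" are equivalent in bash/zsh; the README's leading
# "." is easy to miss.
python3 -m venv venv
. venv/bin/activate      # same as: source venv/bin/activate

# Workaround for the requirements error: install pyyaml first.
#   pip install pyyaml
#   pip install -r requirements.txt
echo "$VIRTUAL_ENV"
```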
        
         | valley_guy_12 wrote:
         | Having to use "source" means you have an older version of
         | conda. Python package management is kind of a mess.
        
         | bfirsh wrote:
         | There's a little "." before "venv/bin/activate" that's easy to
         | miss. I'll update it to "source" to make it more obvious.
        
         | dgreensp wrote:
         | Wow, I'm getting as low as 1.2 seconds per "step" (about a
         | minute for a 512x512 image with default settings) on my 32 GB
         | M1 laptop (2021, 16-inch).
        
       | blagie wrote:
       | How large an image will this handle (versus how much RAM you
       | have)?
       | 
       | It seems the GPU memory requirements beyond 512x512 are obscene.
        
         | schleck8 wrote:
         | This model was mostly trained on 512x512 so you should stick to
         | approximately that size.
         | 
         | Use external upscalers like RealESRGAN, SwinIR, or BSRGAN, or
         | GFPGAN for faces.
         | 
         | Alternatively use hacks like txt2imghd to get it to natively
         | create 1 MP images.
        
         | michaelchisari wrote:
         | I'm on an M1 iMac with 16 GB and can handle up to 768x768,
         | but since it's shared memory I close out every other
         | application and run things overnight.
         | 
         | The biggest issue with apple chips is that the --seed setting
         | doesn't work. I _should_ be able to set a seed to, for
         | instance, 1083958 and if I re-run a command at the same
         | resolution with that seed, I should get the same image every
         | time. This would allow me to test different step counts: I
         | could generate 100 images at 16 steps (which is quite fast),
         | pick the most promising ones, and re-render them at 64 or 128
         | steps.
         | 
         | But currently you can't do that on Apple hardware because of
         | an open issue in PyTorch. Genuinely hoping a fix comes soon;
         | until it does, this is more of a novelty than a tool on Apple
         | hardware.
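What the --seed flag is supposed to guarantee, illustrated with Python's stdlib RNG (a stand-in only: the real breakage is in torch's MPS random generator, per the open PyTorch issue the comment mentions, and `fake_sampler` is a made-up name):

```python
import random

def fake_sampler(seed: int, n: int = 4) -> list:
    """Stand-in for a diffusion sampler: seeding the RNG should make
    every run with the same seed produce identical noise."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Same seed, same "image" - the reproducibility that is currently
# broken for PyTorch on Apple's MPS backend.
assert fake_sampler(1083958) == fake_sampler(1083958)
assert fake_sampler(1083958) != fake_sampler(42)
```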
        
           | fragmede wrote:
           | There's a partial fix for the seed issue on Reddit.
        
             | michaelchisari wrote:
             | I can't seem to find it, do you have a link?
        
         | pugio wrote:
         | Me at the end of last year: "Should I really go for the full
         | 64GB on this M1 Pro? What could I possibly use this for?
         | Mumble mumble... something about unified GPU... something
         | Deep Learning, one day..."
         | 
         | Me now: "a red juicy apple floating in outer space, like a
         | planet" --H 768 --W 768
         | 
         | Uses about 27GB. 1.81s/it.
         | 
         | Can't do 1024x1024 yet because of some hardcoded Metal issue
         | (https://github.com/pytorch/pytorch/issues/84039).
        
         | neurostimulant wrote:
         | Most people who want high-res would just feed the resulting
         | image into an AI upscaler like Gigapixel.
        
       | rhacker wrote:
       | I wonder if this is going to be a huge boon to M1 sales.
        
       | ThrowawayTestr wrote:
       | That was fast.
        
       ___________________________________________________________________
       (page generated 2022-09-01 23:00 UTC)