[HN Gopher] Stable Diffusion with Core ML on Apple Silicon
       ___________________________________________________________________
        
       Stable Diffusion with Core ML on Apple Silicon
        
       Author : 2bit
       Score  : 247 points
       Date   : 2022-12-01 20:21 UTC (2 hours ago)
        
 (HTM) web link (machinelearning.apple.com)
 (TXT) w3m dump (machinelearning.apple.com)
        
       | [deleted]
        
       | zimpenfish wrote:
       | Man, this takes a ton of room to do the CoreML conversions - ran
       | out of space doing the unet conversion even though I started with
       | 25GB free. Going on a delete spree to get it up to 50GB free
       | before trying again.
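        | 
        | (If you hit the same wall: the converter takes per-component
        | flags, so you can convert one piece at a time and clean up in
        | between. Roughly, per the repo README:
        | 
        |   python -m python_coreml_stable_diffusion.torch2coreml \
        |       --convert-unet -o <output-dir>
        | 
        | and likewise for --convert-text-encoder and
        | --convert-vae-decoder.)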
        
       | mark_l_watson wrote:
        | Great stuff. I like that they give directions for both Swift
        | and Python.
        | 
        | This gets you from text descriptions to images.
        | 
        | I have seen models that, given a picture, generate similar
        | pictures. I want this because while I have many pictures of my
        | grandmothers, I only have a couple of pictures of my
        | grandfathers, and it would be nice to generate a few more.
       | 
       | Core ML is so well done. A year ago I wrote a book on Swift AI
       | and used Core ML in several examples.
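        | 
        | (Re: "given a picture, generate similar pictures": that's
        | roughly img2img. A sketch with Hugging Face diffusers, assuming
        | a version that ships the img2img pipeline; the argument name
        | varies by version:
        | 
        |   from PIL import Image
        |   from diffusers import StableDiffusionImg2ImgPipeline
        | 
        |   pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        |       "runwayml/stable-diffusion-v1-5").to("mps")
        |   init = Image.open("grandfather.jpg").convert("RGB")  # example file
        |   out = pipe("an old photograph of a man",
        |              image=init,     # may be init_image= in older versions
        |              strength=0.4).images[0]
        |   out.save("variation.png")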
        
       | tosh wrote:
       | Atila from Apple on the expected performance:
       | 
       | > For distilled StableDiffusion 2 which requires 1 to 4
       | iterations instead of 50, the same M2 device should generate an
       | image in <<1 second
       | 
       | https://twitter.com/atiorh/status/1598399408160342039
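        | 
        | Rough arithmetic on why that is plausible (a sketch; the ~25s
        | for 50 steps on M2 is a ballpark, not an official number):
        | 
        |   # Back-of-envelope: distilled SD needs 1-4 steps, not 50.
        |   full_run_s = 25.0             # assumed: ~25 s / 50 steps on M2
        |   per_step_s = full_run_s / 50  # ~0.5 s per denoising step
        |   for steps in (1, 4):
        |       print(steps, "steps ->", round(per_step_s * steps, 2), "s")
        |   # 1 step -> 0.5 s, 4 steps -> 2.0 s (plus fixed encode/decode
        |   # overhead), so "<<1 second" implies the 1-step end of the range.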
        
         | hbn wrote:
         | SD2 is the one that was neutered, right?
         | 
         | Maybe a dumb question but can the old model still be run?
        
           | [deleted]
        
           | qclibre22 wrote:
           | Also, can you not "upgrade" but still run new models?
        
             | astrange wrote:
             | You can do anything you want.
             | 
              | SD2 wasn't "neutered"; the piece of it from OpenAI that
              | knew a lot of artist names but wasn't reproducible was
              | replaced with a new one from Stability that doesn't. You
              | can fine-tune anything you want back in.
        
           | kyleyeats wrote:
           | It's less versatile out of the box. Give it a couple months
           | for the community to catch up. Everyone is still figuring out
           | what goes where, and SD 1.x was "everything goes in one
           | spot." It was cool and powerful, but limited.
        
           | minimaxir wrote:
           | You can still do nice things with SD2, it just requires a
           | different approach.
           | https://news.ycombinator.com/item?id=33780543
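            | 
            | E.g., negative prompts help a lot with SD2. A minimal sketch
            | with Hugging Face diffusers (model id and prompts are just
            | examples):
            | 
            |   from diffusers import StableDiffusionPipeline
            | 
            |   pipe = StableDiffusionPipeline.from_pretrained(
            |       "stabilityai/stable-diffusion-2")
            |   pipe = pipe.to("mps")  # Apple Silicon GPU via PyTorch MPS
            |   image = pipe(
            |       "a watercolor painting of a lighthouse at dawn",
            |       negative_prompt="blurry, low quality, deformed",
            |   ).images[0]
            |   image.save("lighthouse.png")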
        
         | cammikebrown wrote:
         | If you told me this was possible when I bought an M1 Pro less
         | than a year ago, I wouldn't believe you. This is insane.
        
         | peppertree wrote:
          | Last nail in the coffin for DALL·E.
        
           | m00dy wrote:
            | Yeah, finally we see the real OpenAI.
        
             | visarga wrote:
              | More open than open source; it's the open model age.
        
           | astrange wrote:
           | I think they can move upmarket just as well as anyone else.
        
           | mensetmanusman wrote:
           | Not really, everyone will have their own flavor on how to
           | rapidly train the model.
           | 
            | DALL·E et al. will still be able to bandwagon off of all the
            | free ecosystem being built around the $10M SD1.4 model that
            | is showing what is possible.
           | 
            | E.g., DALL·E could go straight to Hollywood if their model
            | training works better than SD's. The toolsets will work.
        
         | chasd00 wrote:
          | I'm very ignorant here, so forgive me, but if it can generate
          | images that fast, can it be used to generate a video?
        
           | valgaze wrote:
            | Video is really a series of frames; the framerate for
            | film/human perception can get away with 24 frames/second --
            | so maybe ~40ms/image for real-time, at least?
            | 
            | What's cool about the era in which we live is if you look at
            | high-performance graphics for games or simulations, for
            | instance, it may in fact be _faster_ to use a model to
            | "enhance" a low-resolution frame rather than trying to
            | render it fully on the machine.
            | 
            | ex. AMD's FSR vs NVIDIA DLSS
            | 
            | - AMD FSR (FidelityFX Super Resolution):
            | https://www.amd.com/en/technologies/fidelityfx-super-
            | resolut...
            | 
            | - NVIDIA DLSS (Deep Learning Super Sampling):
            | https://www.nvidia.com/en-us/geforce/technologies/dlss/
            | 
            | AMD's approach renders the game at a crummy, low-detail
            | resolution, then "upscales" each frame.
            | 
            | Both FSR and DLSS aim to improve frames-per-second in games
            | by rendering them below your monitor's native resolution,
            | then upscaling them to make up the difference in sharpness.
            | Currently, FSR uses spatial upscaling, meaning it only
            | applies its upscaling algorithm to one frame at a time.
            | Temporal upscalers, like DLSS, can compare multiple frames
            | at once to reconstruct a more finely-detailed image that
            | both more closely resembles native res and can better
            | handle motion. DLSS specifically uses the machine learning
            | capabilities of GeForce RTX graphics cards to process all
            | that data in (more or less) real time.
           | 
            | This is a different challenge than generating the content
            | from scratch.
            | 
            | I don't think this is possible in real-time yet, but someone
            | put a filter trained on the German countryside to produce
            | photorealistic Grand Theft Auto driving gameplay:
            | 
            | https://www.youtube.com/watch?v=P1IcaBn3ej0
            | 
            | Notice the mountains in the background go from Southern
            | California brown to lush green.
            | 
            | https://www.rockpapershotgun.com/amd-fsr-20-is-a-more-
            | demand....
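            | 
            | Back-of-envelope for the frame budget (pure arithmetic):
            | 
            |   fps = 24
            |   budget_ms = 1000 / fps       # ~41.7 ms per frame
            |   print(round(budget_ms, 1))   # 41.7
            | 
            | So real-time generation needs ~40ms/image, still well over
            | an order of magnitude away from even ~1s/image.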
        
           | vletal wrote:
            | Yeah, sure. The issue is with temporal consistency. Meta and
            | Google have had some success in that area.
           | 
           | https://mezha.media/en/2022/10/06/google-is-working-on-
           | image...
           | 
           | Give it some time and SD will be able to do the same.
        
           | gcanyon wrote:
           | There are different requirements for generating video -- at a
           | minimum, continuity is tough. There are models for producing
           | video, but (as far as I've seen) they're still a bit wobbly.
        
         | mrtksn wrote:
         | With the full 50 iterations it appears to be about 30s on M1.
         | 
         | They have some benchmarks on the github repo:
         | https://github.com/apple/ml-stable-diffusion
         | 
          | For reference, I was previously getting just under 3 minutes
          | for 50 iterations on my MacBook Air M1. I haven't yet tried
          | Apple's implementation, but it looks like a huge improvement.
          | It might take it from "possible" to "usable".
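          | 
          | (Rough math: ~180s -> ~30s for the same 50 iterations is
          | about a 6x speedup.)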
        
           | washadjeffmad wrote:
            | For comparison, it's also taking ~3min @ 50 iterations on my
            | 12c Threadripper using OpenVINO. It sounds like the
            | improvements bring the M1 performance roughly in line with a
            | GTX 1080.
        
           | liuliu wrote:
            | Yeah, it's just that the PyTorch MPS backend is not fully
            | baked and has some slowness. You should be able to get close
            | to that number with maple-diffusion (probably 10% slower) or
            | my app: https://drawthings.ai/ (probably around 20% slower,
            | but it supports samplers that take fewer steps (50 -> 30)).
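            | 
            | For the fewer-step samplers, e.g. with diffusers (a sketch,
            | assuming a version that ships the DPM-Solver scheduler):
            | 
            |   from diffusers import (StableDiffusionPipeline,
            |                          DPMSolverMultistepScheduler)
            | 
            |   pipe = StableDiffusionPipeline.from_pretrained(
            |       "runwayml/stable-diffusion-v1-5")
            |   # Swap the default scheduler for one that converges in
            |   # ~20-30 steps instead of 50.
            |   pipe.scheduler = DPMSolverMultistepScheduler.from_config(
            |       pipe.scheduler.config)
            |   image = pipe("a robot reading a book",
            |                num_inference_steps=25).images[0]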
        
         | minimaxir wrote:
          | Note that this is extrapolation for the _distilled_ model,
          | which isn't released quite yet (but it will be very exciting
          | when it is!).
        
       | neonate wrote:
       | https://github.com/apple/ml-stable-diffusion
        
         | christiangenco wrote:
         | Oh gosh that's an intimidating installation process. I'll be
         | much more interested when I can just `brew install` a binary.
        
           | artimaeis wrote:
            | A somewhat different take is DiffusionBee, if you're curious
            | to try it out in GUI form.
           | 
           | https://diffusionbee.com
        
             | aryamaan wrote:
                | Does it use the optimised model for Apple chips?
        
               | belthesar wrote:
                | Likely not yet, but the project is very active. I could
                | see it coming quite soon.
        
             | bredren wrote:
              | I've used this a fair amount but am not sure it's a much
              | better place to begin than automatic1111, especially for
              | the HN crowd.
        
           | thepasswordis wrote:
           | Where are you seeing the installation process?
        
           | MuffinFlavored wrote:
           | I could be wrong but I think part of the issue is this needs
           | some large files for the trained dataset?
        
             | [deleted]
        
           | gedy wrote:
           | > Oh gosh that's an intimidating installation process
           | 
           | I'm not seeing any installation instructions on either link -
           | what am I missing?
        
             | alexfromapex wrote:
             | All I had to do was:
             | 
             | - create a virtual environment
             | 
             | - upgrade pip
             | 
             | - install the nightly PyTorch (command on their website)
             | 
             | - pip install -r requirements.txt
             | 
             | - and then, python setup.py install
             | 
              | - Still trying to figure out the Swift part...
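              | 
              | After that, generation is one command (roughly, per the
              | repo README; paths are placeholders):
              | 
              |   python -m python_coreml_stable_diffusion.pipeline \
              |       --prompt "an astronaut riding a horse on mars" \
              |       -i <output-mlpackages-dir> -o <output-image-dir> \
              |       --compute-unit ALL --seed 93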
        
       | pkage wrote:
       | How does this compare with using the Hugging Face `diffusers`
       | package with MPS acceleration through PyTorch Nightly? I was
       | under the impression that that used CoreML under the hood as well
       | to convert the models so they ran on the Neural Engine.
        
         | [deleted]
        
         | liuliu wrote:
          | It doesn't. MPS largely runs on the GPU, and PyTorch's MPS
          | implementation was still incomplete as of a few weeks ago.
          | This is about 3x faster.
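          | 
          | Concretely, the diffusers path never touches Core ML; it's
          | the plain PyTorch pipeline moved to the Metal backend:
          | 
          |   pipe = pipe.to("mps")  # GPU via Metal, not the Neural Engine
          | 
          | whereas Apple's repo converts the model to Core ML first,
          | which can also schedule work onto the Neural Engine.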
        
       | behnamoh wrote:
        | This may sound naive, but what are some use cases for running
        | SD models locally? If free/cheap options exist (like running SD
        | on powerful servers), then what's the advantage of this new
        | method?
        
         | gjsman-1000 wrote:
          | Powerful servers with GPUs are expensive. Laptops you already
          | own aren't.
        
         | sofaygo wrote:
         | > There are a number of reasons why on-device deployment of
         | Stable Diffusion in an app is preferable to a server-based
         | approach. First, the privacy of the end user is protected
         | because any data the user provided as input to the model stays
         | on the user's device. Second, after initial download, users
         | don't require an internet connection to use the model. Finally,
         | locally deploying this model enables developers to reduce or
         | eliminate their server-related costs.
        
         | yazaddaruvala wrote:
         | "Hey Siri, draw me a purple duck" and it all happens without an
         | internet connection!
         | 
         | If you mean monetary usecases: Roughly something like
         | Photoshop/Blender/UnrealEngine with ML plugins that are low
         | latency, private, and $0 server hosting costs.
        
         | jwitthuhn wrote:
          | Even with the slower PyTorch implementation, my M1 Pro MBP,
          | which tops out at ~100W of power draw, can generate a decent
          | image in 30 seconds.
         | 
         | I'm not sure exactly what that costs me in terms of power, but
         | it is assuredly less than any of these services charge for a
         | single image generation.
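          | 
          | The arithmetic (a sketch; the $0.15/kWh rate is an assumed
          | electricity price):
          | 
          |   watts, seconds = 100, 30
          |   kwh = watts * seconds / 3600 / 1000   # ~0.00083 kWh
          |   print(kwh * 0.15)                     # ~$0.000125 per image
          | 
          | So on the order of a hundredth of a cent per image.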
        
         | tosh wrote:
         | Works offline, privacy, independent of SaaS (API stability,
         | longevity, ...). I'm sure there are more.
        
         | mensetmanusman wrote:
        | Soon you will be able to render home iMovies as if they were
        | edited by the team that made The Dark Knight (which costs
        | ~$100k/min if done professionally).
        
       ___________________________________________________________________
       (page generated 2022-12-01 23:00 UTC)