[HN Gopher] The Illustrated Stable Diffusion
       ___________________________________________________________________
        
       The Illustrated Stable Diffusion
        
       Author : mariuz
       Score  : 223 points
       Date   : 2022-10-04 17:59 UTC (5 hours ago)
        
 (HTM) web link (jalammar.github.io)
 (TXT) w3m dump (jalammar.github.io)
        
       | uptown wrote:
       | "We then compare the resulting embeddings using cosine
       | similarity. When we begin the training process, the similarity
       | will be low, even if the text describes the image correctly."
       | 
       | How is this training performed? How is accuracy rated?
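        | 
        | For reference, "compare using cosine similarity" is just the
        | normalized dot product of the two embeddings (a toy sketch,
        | not CLIP's actual training code):
        | 
        |     import numpy as np
        | 
        |     def cosine_similarity(a, b):
        |         # 1.0 = same direction, 0.0 = unrelated
        |         na = np.linalg.norm(a)
        |         nb = np.linalg.norm(b)
        |         return np.dot(a, b) / (na * nb)
        | 
        |     text_emb = np.random.randn(512)   # text encoder output
        |     img_emb = np.random.randn(512)    # image encoder output
        |     print(cosine_similarity(text_emb, img_emb))
        | 
        | Training then pushes matching image/caption pairs toward high
        | similarity and mismatched pairs toward low similarity, which
        | is effectively how "accuracy" gets scored.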
        
       | marshray wrote:
       | Nope, still don't understand it. :-/
        
       | torbTurret wrote:
       | Love the visual explainers for machine learning nowadays.
       | 
       | The author has more here: https://jalammar.github.io/
       | 
       | Amazon has some highly interactive ones here: https://mlu-
       | explain.github.io/
       | 
       | Google had: distill.pub
       | 
       | Hope to see education in this space grow more.
        
       | vanjajaja1 wrote:
       | This is the perfect level of description, thank you. Looking
       | forward to checking out more of your work.
        
       | minism wrote:
       | Great overview, I think the part for me which is still very
       | unintuitive is the denoising process.
       | 
       | If the diffusion process is removing noise by predicting a final
       | image and comparing it to the current one, why can't we just jump
        | to the final predicted image? Or is the point that because it's an
       | iterative process, each noise step results in a different "final
       | image" prediction?
        
         | mota7 wrote:
         | The problem is that predicting a pixel requires knowing what
          | the pixels around it look like. But if we start with lots of
         | noise, then the neighboring pixels are all just noise and have
         | no signal.
         | 
         | You could also think of this as: We start with a terrible
         | signal to noise ratio. So we need to average over very large
         | areas to get any reasonable signal. But as we increase the
         | signal, we can average over a smaller area to get the same
          | signal-to-noise ratio.
         | 
         | In the beginning, we're averaging over large areas, so all the
         | fine detail is lost. We just get 'might be a dog? maybe??'.
          | What the network is doing is saying "if this is a dog, there
         | should be a head somewhere over here. So let me make it more
         | like a head". Which improves the signal to noise ratio a bit.
         | 
         | After a few more steps, the signal is strong enough that we can
         | get sufficient signal from smaller areas, so it starts saying
         | 'head of a dog' in places. So the network will then start doing
         | "Well, if this is a dog's head, there should be some eyes.
         | Maybe two, but probably not three. And they'll be kinda
         | somewhere around here".
         | 
         | Why do it this way?
         | 
          | Doing it this way means the network doesn't need to learn
         | "Here are all the ways dogs can look". Instead, it can learn a
         | factored representation: A dog has a head and a body. The
         | network only needs to learn a very fuzzy representation at this
         | level. Then a head has some eyes and maybe a nose. Again, it
         | only needs to learn a very fuzzy representation and (very)
         | rough relative locations.
         | 
          | So it's only when it gets right down into fine detail that it
          | actually needs to learn a pixel-perfect representation. But this
          | is _way_ easier, because small areas of images have
          | surprisingly low entropy.
         | 
          | The 'text-to-image' bit is just a twist on the basic idea. At
         | the start when the network is going "dog? or it might be a
         | horse?", we fiddle with the probabilities a bit so that the
         | network starts out convinced there's a dog in there somewhere.
         | At which point it starts making the most likely places look a
         | little more like a dog.
        
         | astrange wrote:
          | Research is still ongoing here, but it seems like diffusion
          | models, despite being named after the noise addition/removal
          | process, don't actually work because of it.
         | 
         | There's a paper (which I can't remember the name of) that shows
         | the process still works with different information removal
         | operators, including one with a circle wipe, and one where it
         | blends the original picture with a cat photo.
         | 
         | Also, this article describes CLIP being trained on text-image
         | pairs, but Google's Imagen uses an off the shelf text model so
         | that part doesn't seem to be needed either.
        
           | krackers wrote:
            | I think it might be this paper [1] succinctly described by the
           | author in this twitter thread [2]
           | 
           | [1] https://arxiv.org/abs/2208.09392 [2] https://twitter.com/
           | tomgoldsteincs/status/156250381442263040...
        
         | jayalammar wrote:
         | Two diffusion processes are involved:
         | 
         | 1- Forward Diffusion (adding noise, and training the Unet to
         | predict how much noise is added in each step)
         | 
          | 2- Generating the image by denoising. This doesn't predict the
          | final image; each step only predicts a small slice of noise
          | (the removal of which leads to images similar to what the model
          | encountered in step 1).
         | 
          | So it is indeed an iterative process in that way, each step
          | taking one step towards the final image.
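          | 
          | A runnable toy sketch of step 1, where a plain conv layer
          | stands in for the real UNet (which also conditions on the
          | timestep and the text embedding) and the noise schedule is
          | made up:
          | 
          |     import torch
          |     import torch.nn as nn
          |     import torch.nn.functional as F
          | 
          |     unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)
          | 
          |     num_steps = 1000
          |     latent = torch.randn(1, 4, 64, 64)     # clean latent
          |     t = torch.randint(0, num_steps, (1,))  # random step
          |     noise = torch.randn_like(latent)
          |     a = 1.0 - t.float() / num_steps        # toy schedule
          |     noisy = a.sqrt() * latent + (1 - a).sqrt() * noise
          | 
          |     # train the net to predict the noise that was added
          |     loss = F.mse_loss(unet(noisy), noise)
          |     loss.backward()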
        
         | [deleted]
        
         | hanrelan wrote:
         | I was wondering the same and this video [1] helped me better
         | understand how the prediction is used. The original paper isn't
         | super clear about this either.
         | 
         | The diffusion process predicts the total noise that was added
         | to the image. But that prediction isn't great and applying it
         | immediately wouldn't result in a good output. So instead, the
         | noise is multiplied by a small epsilon and then subtracted from
         | the noisy image. That process is iterated to get to the final
         | result.
         | 
          | [1] https://www.youtube.com/watch?v=J87hffSMB60
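          | 
          | In loop form, it's roughly this (a toy, runnable sketch: the
          | conv layer stands in for the real UNet, and real schedulers
          | like DDIM use carefully derived coefficients rather than a
          | fixed step size):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     unet = nn.Conv2d(4, 4, 3, padding=1)  # stand-in model
          |     num_steps, step_size = 50, 0.02
          | 
          |     x = torch.randn(1, 4, 64, 64)   # start from pure noise
          |     with torch.no_grad():
          |         for t in reversed(range(num_steps)):
          |             pred_noise = unet(x)   # real UNet also sees t
          |             x = x - step_size * pred_noise  # small nudge
          |     # x would then go through the VAE decoder to pixels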
        
         | cgearhart wrote:
         | I'm pretty sure it's a stability issue. With small steps the
         | noise is correlated between steps; if you tried it in one big
         | jump then you would essentially just memorize the input data.
         | The maximum noise would act as a "key" and the model would
         | memorize the corresponding image as the "value". But if we do
         | it as a bunch of little steps then the nearby steps are
         | correlated and in the training set you'll find lots of groups
         | of noise that are similar which allows the model to generalize
         | instead of memorizing.
        
         | nullc wrote:
         | You can think of it like solving a differential equation
         | numerically. The diffusion model encodes the relationships
         | between values in sensible images (technically in the
         | compressed representations of sensible images). You can try to
         | jump directly to the solution but the result won't be very good
         | compared to taking small steps.
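          | 
          | Same intuition as Euler integration of a toy ODE (nothing to
          | do with the actual sampler, just the small-steps idea):
          | 
          |     import math
          |     # dy/dt = -y, y(0) = 1; exact answer is exp(-1) ~ 0.368
          |     y, n = 1.0, 1000
          |     for _ in range(n):
          |         y += -y * (1.0 / n)      # many small steps
          |     print(y, math.exp(-1))       # ~0.368 vs 0.368
          |     print(1.0 + (-1.0) * 1.0)    # one big step: 0.0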
        
       | Waterluvian wrote:
       | Closer. But I still get lost when words like "tensor" are used.
       | "structured lists of numbers" really doesn't seem to explain it
       | usefully.
       | 
       | This reminds me that explaining seemingly complex things in
       | simple terms is one of the most valuable and rarest skills in
        | engineering. Most people just can't. And often it's because they
        | no longer remember what's not general knowledge. You end up with a
       | recursive Feynmannian "now explain what that means" situation.
       | 
       | This is probably why I admire a whole bunch of engineering
       | YouTubers and other engineering "PR" people for their brilliance
       | at making complex stuff seem very very simple.
        
         | 6gvONxR4sf7o wrote:
         | You're talking like using jargon makes something a bad
         | explanation, but maybe you just aren't the audience? Why not
         | use words like that if it's a super basic concept to your
         | intended audience?
        
           | Waterluvian wrote:
           | I saw the scientific term, "Text Understander" and wrongly
           | thought I was the audience.
        
         | Waterluvian wrote:
         | I should have added that the images/figures really help. I
         | think I'm about there.
        
         | netruk44 wrote:
         | If it helps you to understand at all, assuming you have a CS
         | background, any time you see the word "tensor" you can replace
         | it with "array" and you'll be 95% of the way to understanding
         | it. Or "matrix" if you have a mathematical background.
         | 
          | Whereas CS arrays tend to be 1-dimensional, and sometimes
          | 2-dimensional, tensors can have as many dimensions as you need. A
         | 256x256 photo with RGB channels would be stored as a [256 x 256
         | x 3] tensor/array. If you want to store a bunch of them? Add a
         | dimension to store each image. Want rows and columns of images?
         | Make the dimensions [width x height x channels x rows x
         | columns].
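          | 
          | In NumPy terms (PyTorch tensors behave the same way as far
          | as shapes go):
          | 
          |     import numpy as np
          | 
          |     img = np.zeros((256, 256, 3))         # one RGB image
          |     batch = np.zeros((8, 256, 256, 3))    # eight of them
          |     grid = np.zeros((4, 5, 256, 256, 3))  # 4x5 grid
          |     print(img.ndim, batch.ndim, grid.ndim)   # 3 4 5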
        
           | Waterluvian wrote:
           | This helps. Thank you. Any advice on where to look to
           | understand why the word tensor was used?
        
             | pvarangot wrote:
             | A Tensor is a mathematical object for symbolic manipulation
             | of relationships between other objects that belong in
             | conceptually similar universes or spaces. Literature on
              | Deep Learning, like Goodfellow, calls for the CS-minded
             | reader to just assume it's a fancy word for a matrix of
             | more than two dimensions. That makes matters confusing
             | because mathematically you could have scalar or vectorial
             | tensors. The classic mathematical definition puts more
             | restrictions on the "shape" of the "matrix" by requiring
             | certain properties so that it can create a symbolic
             | language for a tensor calculus. Then you can study how the
             | relationships change as variables on the universes or
             | spaces change.
             | 
              | Understanding the inertia tensor in classical mechanics or
             | the stress tensor may illustrate where tensors come from,
             | and I understand that GR also makes use of a lot of tensor
             | calculus that came to be as mathematics developed
             | manipulating and talking about tensors. I have a kinda firm
             | grasp on some very rudimentary tensor calculus from trying
             | to leanr GR, and a pretty solid grasp on classical
             | mechanics. I've had hour long conversations with deep
             | learning popes and thought leaders in varying states of
             | mind and after that my understanding is that they use the
             | word tensor in an overreaching fashion as like you could
             | call a solar panel a nuclear fission reactor power source.
             | This thought leaders include people with books and 1M+ view
             | Youtube videos on the subject that use the word tensor and
             | I'm not saying their names because they off-the-record
             | admitted that it's a poor choice of term but it harms the
             | ego of many Google engineers to publicly admit that.
        
               | amelius wrote:
               | How are sum-types handled in deep learning?
               | 
               | E.g., a type that holds "a 3x2 tensor OR a 4x6x2x5
               | tensor".
        
             | sva_ wrote:
              | It should be noted that it doesn't have all that much to do
              | with the rigorous mathematical definition of a tensor.
        
             | jamessb wrote:
             | Very loosely, a number/vector/matrix/tensor can be
             | considered to be objects where specifying the values of
             | 0/1/2/3 indexes will give a number.
             | 
             | (A mathematician might object to this on several grounds,
             | such as that vectors/matrices/tensors are geometric objects
             | which need not be expressed numerically as coordinates in
             | any coordinate system)
        
               | amelius wrote:
               | A tensor can take any number of dimensions (not just 3).
        
             | jayalammar wrote:
             | I updated the post to say "multi-dimensional array".
             | 
             | In a context like this, we use tensor because it allows for
              | any number of dimensions (while a vector/array is only one,
              | a matrix is two). When you get into ML libraries, both
             | popular packages PyTorch and TensorFlow use the "tensor"
             | terminology.
             | 
             | It's a good point. Hope it's clearer for devs with "array"
             | terminology.
        
               | zestyping wrote:
               | > we use tensor because it allows for any number of
               | dimensions
               | 
               | "Vector" implies one dimension and "matrix" strongly
               | implies two. But an array can have any number of
               | dimensions, so "array" is the best word.
               | 
               | We don't need the word "tensor"; when the context is
               | programming, "tensor" is only confusing and doesn't
               | really add any useful meaning.
        
             | [deleted]
        
             | avereveard wrote:
                | To reduce it a little: a matrix holds numbers, a tensor
                | holds whatever. Numbers. Vectors. Operations.
        
               | yreg wrote:
               | So it is math lingo for array.
        
           | minimaxir wrote:
           | A more practical example of the added dimensionality of
            | tensors is the addition of a batch dimension, so an 8-image
            | batch per training step would be a (8, 256, 256, 3) tensor.
           | 
           | Tools such as PyTorch's DataLoader can efficiently collate
           | multiple inputs into a batch.
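            | 
            | A small sketch of that batching (the dataset here is just
            | dummy zeros):
            | 
            |     import torch
            |     from torch.utils.data import DataLoader, TensorDataset
            | 
            |     images = torch.zeros(100, 256, 256, 3)  # 100 images
            |     ds = TensorDataset(images)
            |     loader = DataLoader(ds, batch_size=8)
            |     (batch,) = next(iter(loader))
            |     print(batch.shape)   # torch.Size([8, 256, 256, 3])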
        
         | TigeriusKirk wrote:
         | One of my favorite moments in Geoffrey Hinton's otherwise
         | pretty info-dense Coursera neural network class was when he
         | said-
         | 
         | "To deal with a 14-dimensional space, visualize a 3-D space and
         | say 'fourteen' to yourself very loudly. Everyone does it."
        
         | [deleted]
        
       | renewiltord wrote:
       | Isn't it really cool? It's like the AI is asking itself what
       | shapes the clouds are making and whether the moon has a face,
       | over and over again.
        
       | coldcode wrote:
       | I find SD to be amazing technology, but it still (mostly) sucks
       | at producing "intelligent" images. It basically fancy math that
       | turns noise into images (from the opposite it trained on) but
       | still has no idea what it is producing. If you run it long enough
       | you eventually get lucky and find a gem. I like to try "George
       | Washington riding a Unicorn in Times Square"; I've so far never
       | gotten anything a first year art student can draw. I wonder how
       | long it will take before something more "AI" than "ML" will have
       | an understanding even close to what a simple human brain can
       | process.
       | 
       | In the meantime it's fun to play with it, plus I'd like to better
       | understand the noise training process.
        
         | jw1224 wrote:
         | > "George Washington riding a Unicorn in Times Square"
         | 
         | The "secret" to Stable Diffusion (and other CLIP-based models)
         | is being as descriptive as possible. This prompt, whilst easy
         | for humans to imagine, actually has a whole lot of ambiguity
         | baked in.
         | 
         | How high is the unicorn flying? Is the unicorn even flying, or
         | on the ground? How old is George Washington? What visual style
         | is the image in? Is the image from the perspective of a
         | pedestrian at ground level, or from up at skyscraper level?
         | 
         | The more ambiguous the prompt, the less cohesive the image.
         | 
         | To demonstrate, here's 4 renders from your original prompt:
         | https://imgur.com/a/Jo4qfOp
         | 
         | And here's 4 using the prompt "George Washington riding a
         | unicorn in Times Square, cinematic composition, concept art,
         | digital illustration, detailed": https://imgur.com/a/lB36JqC
         | 
         | Certainly not perfect, but for an additional 15 seconds of
         | effort, far better.
        
           | CrazyStat wrote:
            | I love how the unicorn horn got stuck on Washington's head
           | in the bottom right instead of on the unicorn.
        
         | minimaxir wrote:
          | With SD, you _have_ to use quality/positional/artist modifier
         | keywords, as vanilla inputs give the model too much freedom.
        
         | dr_dshiv wrote:
         | > I like to try "George Washington riding a Unicorn in Times
         | Square"; I've so far never gotten anything a first year art
         | student can draw.
         | 
         | Why the hell would a first year art student draw that? Flunk
         | their ass. God damn dumb ass prompts I have to deal with.
         | 
         | --Stable Diffusion
        
           | Psychoshy_bc1q wrote:
           | you might try "Pony Diffusion" for that :-)
           | https://huggingface.co/AstraliteHeart/pony-diffusion
        
         | l33tman wrote:
         | The reason you can't get the images you want from it is not
         | because of the noise diffusion process (after all, this is
         | probably the closest similarity to how a human gets a flash of
         | creativity) but the lack of a large language model in SD - it
         | was deliberately scaled down so the result could fit in
         | consumer GPUs.
         | 
         | DALLE-2 uses a much larger language model and you can explain
          | more complicated concepts to it. Google's Imagen likewise (not
         | released though).
         | 
         | It's mostly a matter of scaling to get this better.
        
           | astrange wrote:
           | It's not just size but also model architecture. DALLE mini
           | (craiyon.com) has the opposite priority because of its
           | different architecture; you can enter a complex prompt and it
           | will follow it, but it's much slower and the image quality is
           | a lot worse. SD prefers to make aesthetic pictures over
           | listening to everything you tell it.
           | 
           | You can improve this in SD by raising cfg_scale at the cost
           | of some weird "oversharpening" artifacts. Or, you can make a
           | crappy image in DallE mini and use that as the img2img prompt
           | with SD to make it prettier.
           | 
           | The real sign it's lacking intelligence is, if you ask it a
           | question it won't draw the answer, it'll just draw the
           | question. Of course, they could fix that too, it's got a GPT
           | in it, they just don't let it recurse...
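            | 
            | For reference, cfg_scale is the classifier-free guidance
            | weight; each sampling step mixes two noise predictions
            | roughly like this (dummy tensors for illustration):
            | 
            |     import torch
            |     uncond = torch.randn(1, 4, 64, 64)  # empty prompt
            |     cond = torch.randn(1, 4, 64, 64)    # your prompt
            |     cfg_scale = 7.5
            |     pred = uncond + cfg_scale * (cond - uncond)
            | 
            | A scale of 1 reduces to the plain conditioned prediction;
            | larger values push further along the (cond - uncond)
            | direction, following the prompt harder at the cost of the
            | oversharpening artifacts mentioned above.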
        
       | ilaksh wrote:
       | It says the final output before pixel space is 64x64x4? How can
       | that be enough information?
        
         | Imnimo wrote:
         | The way I think of it, we have a 512x512x3 target, so that's
         | 48x the information. I don't think it's unreasonable to say
         | that far less than 1/48th of the space of 512x512x3 outputs are
         | natural images (meaning an image that might actually exist,
         | rather than meaningless pixels). So if we think about that
         | 64x64x4 tensor as telling us what in the smaller space of
         | natural images we should draw, it seems like plenty of
            | information. Especially since we also have the information
            | stored in the weights of the output network.
        
           | zestyping wrote:
           | The amount of information in a 64x64x4 array would depend on
           | the precision of the numbers in it, right? For example, a
           | 512x512 image in 24-bit colour could be completely encoded in
           | a 64x64x4 array if each of the 64 x 64 x 4 = 16,384 values
           | had 384 bits of precision.
           | 
           | So, I wonder -- what's the minimum number of bits of
           | precision in the 64x64x4 array that would be sufficient for
           | this to work?
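            | 
            | For what it's worth, the 384 figure checks out:
            | 
            |     pixel_bits = 512 * 512 * 3 * 8   # 6,291,456 bits
            |     latent_vals = 64 * 64 * 4        # 16,384 values
            |     print(pixel_bits / latent_vals)  # 384.0
            |     # an fp32 latent stores 32 bits per value, i.e.
            |     # roughly 1/12 of the raw pixel information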
        
         | l33tman wrote:
         | The autoencoder that maps between that and the 512x512x3 RGB
         | space was trained together with the model, so it is specialized
         | in upscaling the 64x64x4 info to pixel space for this
         | particular purpose. It's "just" a factor of 48
          | (de)compression.
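          | 
          | A sketch of that decode step with Hugging Face diffusers
          | (names and arguments may differ between versions, and the
          | 0.18215 scaling factor is specific to the SD v1 VAE):
          | 
          |     import torch
          |     from diffusers import AutoencoderKL
          | 
          |     vae = AutoencoderKL.from_pretrained(
          |         "CompVis/stable-diffusion-v1-4", subfolder="vae")
          |     latents = torch.randn(1, 4, 64, 64)
          |     with torch.no_grad():
          |         img = vae.decode(latents / 0.18215).sample
          |     print(img.shape)   # torch.Size([1, 3, 512, 512])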
        
           | astrange wrote:
           | And remember we only care about 8 bits of the output
           | (actually less due to JPEG compression) but each latent value
           | is a 32-bit float.
        
       | ginger2016 wrote:
       | This is awesome ! Thank you for creating it. I have been wanting
       | to read about Stable Diffusion. I added this to my reading list
       | in Safari.
        
       | minimaxir wrote:
       | Hugging Face's diffusers library and explainer Colab notebook
       | (https://colab.research.google.com/github/huggingface/noteboo...)
       | are good resources on how diffusion works in practice codewise.
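        | 
        | The happy-path usage is only a few lines (a sketch; the exact
        | API has been shifting between diffusers releases, and you may
        | need a Hugging Face auth token to download the weights):
        | 
        |     from diffusers import StableDiffusionPipeline
        | 
        |     pipe = StableDiffusionPipeline.from_pretrained(
        |         "CompVis/stable-diffusion-v1-4").to("cuda")
        |     prompt = "a unicorn in Times Square, concept art"
        |     image = pipe(prompt, guidance_scale=7.5,
        |                  num_inference_steps=50).images[0]
        |     image.save("unicorn.png")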
        
         | jayalammar wrote:
         | Agreed. "Stable Diffusion with Diffusers" and "The Annotated
         | Diffusion Model" were excellent and are linked in the article.
         | The code in Diffusers was also a good reference.
        
       | yieldcrv wrote:
       | What are you guys currently using for Stable Diffusion on OSX
       | with M1?
       | 
       | There are so many variants and forks that I don't know which one
       | to install any more. Something that takes advantage of Metal and
       | the CPU cores.
       | 
       | Any that retains the "upload a sketch and then add a description"
       | feature?
        
         | pram wrote:
         | I'm using InvokeAI. Follow the instructions and it will work
         | flawlessly.
         | 
         | https://github.com/invoke-ai/InvokeAI
        
         | subdane wrote:
         | https://github.com/divamgupta/diffusionbee-stable-diffusion-...
        
       | jerpint wrote:
        | Every time I need a refresher on transformers, I read the same
       | author's post on transformers. Looking forward to this one!
        
       | jmartrican wrote:
        | So it's like the How to Draw an Owl meme.
        
         | minimaxir wrote:
         | SD has made this meme into a reality, given how easy it is to
         | take a sketch and use img2img to get something workable out of
         | it.
        
         | thunderbird120 wrote:
         | You can do that, yes https://0x0.st/oJVK.webm
        
       | swyx wrote:
        | I've been collecting other explanations of how SD works here:
       | https://github.com/sw-yx/prompt-eng#sd-model-values
        
       ___________________________________________________________________
       (page generated 2022-10-04 23:00 UTC)