[HN Gopher] The Illustrated Stable Diffusion
___________________________________________________________________

The Illustrated Stable Diffusion

Author : mariuz
Score  : 223 points
Date   : 2022-10-04 17:59 UTC (5 hours ago)

(HTM) web link (jalammar.github.io)
(TXT) w3m dump (jalammar.github.io)

| uptown wrote:
| "We then compare the resulting embeddings using cosine
| similarity. When we begin the training process, the similarity
| will be low, even if the text describes the image correctly."
|
| How is this training performed? How is accuracy rated?
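|
| If I had to guess from the article's description, it's the
| standard contrastive setup: embed a batch of (image, caption)
| pairs, score every image against every caption by cosine
| similarity, and train both encoders so the true pairs score
| highest. A sketch of that idea (made-up names, not the actual
| CLIP training code):
|
|     import torch
|     import torch.nn.functional as F
|
|     def clip_style_loss(image_emb, text_emb, temperature=0.07):
|         # Normalize so dot products become cosine similarities.
|         image_emb = F.normalize(image_emb, dim=-1)
|         text_emb = F.normalize(text_emb, dim=-1)
|         # logits[i][j] = similarity of image i and caption j.
|         logits = image_emb @ text_emb.T / temperature
|         # True pairs lie on the diagonal; push those up and
|         # every mismatched pair down, in both directions.
|         labels = torch.arange(len(logits))
|         return (F.cross_entropy(logits, labels)
|                 + F.cross_entropy(logits.T, labels)) / 2
|
|     # Toy usage: a batch of 8 image and 8 caption embeddings.
|     loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
|
| If it works like that, "accuracy" is implicit: the loss falls as
| each image ranks its own caption above the other captions in the
| batch, with no human rating involved.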
| marshray wrote:
| Nope, still don't understand it. :-/

| torbTurret wrote:
| Love the visual explainers for machine learning nowadays.
|
| The author has more here: https://jalammar.github.io/
|
| Amazon has some highly interactive ones here:
| https://mlu-explain.github.io/
|
| Google had: distill.pub
|
| Hope to see education in this space grow more.

| vanjajaja1 wrote:
| This is the perfect level of description, thank you. Looking
| forward to checking out more of your work.

| minism wrote:
| Great overview. I think the part that is still very unintuitive
| to me is the denoising process.
|
| If the diffusion process is removing noise by predicting a final
| image and comparing it to the current one, why can't we just
| jump to the final predicted image? Or is the point that, because
| it's an iterative process, each noise step results in a
| different "final image" prediction?

| mota7 wrote:
| The problem is that predicting a pixel requires knowing what the
| pixels around it look like. But if we start with lots of noise,
| then the neighboring pixels are all just noise and have no
| signal.
|
| You could also think of it this way: we start with a terrible
| signal-to-noise ratio, so we need to average over very large
| areas to get any reasonable signal. But as we increase the
| signal, we can average over a smaller area to get the same
| signal-to-noise ratio.
|
| In the beginning, we're averaging over large areas, so all the
| fine detail is lost. We just get 'might be a dog? maybe??'. What
| the network is doing is saying "if this is a dog, there should
| be a head somewhere over here, so let me make it more like a
| head". Which improves the signal-to-noise ratio a bit.
|
| After a few more steps, the signal is strong enough that we can
| get sufficient signal from smaller areas, so it starts saying
| 'head of a dog' in places. So the network will then start doing:
| "Well, if this is a dog's head, there should be some eyes. Maybe
| two, but probably not three. And they'll be kinda somewhere
| around here".
|
| Why do it this way?
|
| Doing it this way means the network doesn't need to learn "here
| are all the ways dogs can look". Instead, it can learn a
| factored representation: a dog has a head and a body. The
| network only needs to learn a very fuzzy representation at this
| level. Then a head has some eyes and maybe a nose. Again, it
| only needs to learn a very fuzzy representation and (very) rough
| relative locations.
|
| So it's only when it gets right down into fine detail that it
| actually needs to learn a pixel-perfect representation. But this
| is _way_ easier, because over small areas images have
| surprisingly low entropy.
|
| The 'text-to-image' bit is just a twist on the basic idea. At
| the start, when the network is going "dog? or it might be a
| horse?", we fiddle with the probabilities a bit so that the
| network starts out convinced there's a dog in there somewhere.
| At which point it starts making the most likely places look a
| little more like a dog.
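|
| If it helps to see the iteration, the sampling loop has roughly
| this shape (a toy sketch: made-up step sizes and a stand-in
| model, not any real library's API):
|
|     import torch
|
|     def step_size(t, steps):
|         # Hypothetical schedule: smaller corrections near the end.
|         return 0.1 * (t + 1) / steps
|
|     @torch.no_grad()
|     def sample(denoiser, text_emb, steps=50, shape=(1, 4, 64, 64)):
|         x = torch.randn(shape)  # start from pure noise
|         for t in reversed(range(steps)):
|             # Predict the noise currently in x, given the text.
|             predicted_noise = denoiser(x, t, text_emb)
|             # Remove only a small slice of it. The prediction is
|             # redone every step, so the implied "final image"
|             # keeps changing as structure emerges.
|             x = x - step_size(t, steps) * predicted_noise
|         return x
|
|     # Stand-in "denoiser" just to run the loop (the real one is
|     # a text-conditioned UNet):
|     latents = sample(lambda x, t, c: x * 0.5, text_emb=None)
|
| Real schedulers are fancier (they rescale and re-inject noise
| between steps), but the subtract-a-slice-per-step structure is
| the same.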
| astrange wrote:
| Research is still ongoing here, but it seems like diffusion
| models, despite being named after the noise addition/removal
| process, don't actually work because of the noise itself.
|
| There's a paper (which I can't remember the name of) that shows
| the process still works with different information-removal
| operators, including one with a circle wipe, and one where it
| blends the original picture with a cat photo.
|
| Also, this article describes CLIP being trained on text-image
| pairs, but Google's Imagen uses an off-the-shelf text model, so
| that part doesn't seem to be needed either.

| krackers wrote:
| I think it might be this paper [1], succinctly described by the
| author in this Twitter thread [2].
|
| [1] https://arxiv.org/abs/2208.09392
| [2] https://twitter.com/tomgoldsteincs/status/156250381442263040...

| jayalammar wrote:
| Two diffusion processes are involved:
|
| 1- Forward diffusion (adding noise, and training the Unet to
| predict how much noise is added in each step)
|
| 2- Generating the image by denoising. This doesn't predict the
| final image; each step only predicts a small slice of noise (the
| removal of which leads to images similar to what the model
| encountered in step 1).
|
| So it is indeed an iterative process in that way, each step
| taking one step towards the final image.

| [deleted]

| hanrelan wrote:
| I was wondering the same, and this video [1] helped me better
| understand how the prediction is used. The original paper isn't
| super clear about this either.
|
| The diffusion process predicts the total noise that was added to
| the image. But that prediction isn't great, and applying it
| immediately wouldn't result in a good output. So instead, the
| noise is multiplied by a small epsilon and then subtracted from
| the noisy image. That process is iterated to get to the final
| result.
|
| [1] https://www.youtube.com/watch?v=J87hffSMB60

| cgearhart wrote:
| I'm pretty sure it's a stability issue. With small steps the
| noise is correlated between steps; if you tried it in one big
| jump then you would essentially just memorize the input data.
| The maximum noise would act as a "key" and the model would
| memorize the corresponding image as the "value". But if we do it
| as a bunch of little steps, then the nearby steps are
| correlated, and in the training set you'll find lots of groups
| of noise that are similar, which allows the model to generalize
| instead of memorizing.

| nullc wrote:
| You can think of it like solving a differential equation
| numerically. The diffusion model encodes the relationships
| between values in sensible images (technically, in the
| compressed representations of sensible images). You can try to
| jump directly to the solution, but the result won't be very good
| compared to taking small steps.

| Waterluvian wrote:
| Closer. But I still get lost when words like "tensor" are used.
| "Structured lists of numbers" really doesn't seem to explain it
| usefully.
|
| This reminds me that explaining seemingly complex things in
| simple terms is one of the most valuable and rarest skills in
| engineering. Most people just can't. And often it's because they
| no longer remember what's not general knowledge. You end up with
| a recursive Feynmannian "now explain what that means" situation.
|
| This is probably why I admire a whole bunch of engineering
| YouTubers and other engineering "PR" people for their brilliance
| at making complex stuff seem very, very simple.

| 6gvONxR4sf7o wrote:
| You're talking like using jargon makes something a bad
| explanation, but maybe you just aren't the audience? Why not use
| words like that if it's a super basic concept to your intended
| audience?

| Waterluvian wrote:
| I saw the scientific term "Text Understander" and wrongly
| thought I was the audience.

| Waterluvian wrote:
| I should have added that the images/figures really help. I think
| I'm about there.

| netruk44 wrote:
| If it helps you to understand at all, assuming you have a CS
| background: any time you see the word "tensor" you can replace
| it with "array" and you'll be 95% of the way to understanding
| it. Or "matrix" if you have a mathematical background.
|
| Whereas CS arrays tend to be 1-dimensional, and sometimes
| 2-dimensional, tensors can have as many dimensions as you need.
| A 256x256 photo with RGB channels would be stored as a
| [256 x 256 x 3] tensor/array. If you want to store a bunch of
| them? Add a dimension to store each image. Want rows and columns
| of images? Make the dimensions
| [width x height x channels x rows x columns].
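|
| In PyTorch terms it really is just shapes (toy example, zeros
| standing in for pixel data):
|
|     import torch
|
|     # One 256x256 RGB image as a 3-dimensional tensor.
|     image = torch.zeros(256, 256, 3)
|
|     # A 4x5 contact sheet of such images: just add dimensions.
|     sheet = torch.zeros(4, 5, 256, 256, 3)
|
|     print(image.ndim, sheet.ndim)  # 3 5
|     print(sheet[2, 3].shape)       # torch.Size([256, 256, 3])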
| Waterluvian wrote:
| This helps. Thank you. Any advice on where to look to understand
| why the word tensor was used?

| pvarangot wrote:
| A tensor is a mathematical object for symbolic manipulation of
| relationships between other objects that belong to conceptually
| similar universes or spaces. Literature on deep learning, like
| Goodfellow, calls for the CS-minded reader to just assume it's a
| fancy word for a matrix of more than two dimensions. That makes
| matters confusing, because mathematically you could have scalar
| or vectorial tensors. The classic mathematical definition puts
| more restrictions on the "shape" of the "matrix" by requiring
| certain properties, so that it can create a symbolic language
| for a tensor calculus. Then you can study how the relationships
| change as variables in the universes or spaces change.
|
| Understanding the inertia tensor in classical mechanics or the
| stress tensor may illustrate where tensors come from, and I
| understand that GR also makes use of a lot of tensor calculus
| that came to be as mathematics developed for manipulating and
| talking about tensors. I have a kinda firm grasp on some very
| rudimentary tensor calculus from trying to learn GR, and a
| pretty solid grasp on classical mechanics. I've had hour-long
| conversations with deep learning popes and thought leaders in
| varying states of mind, and after that my understanding is that
| they use the word tensor in an overreaching fashion, like you
| could call a solar panel a nuclear fission reactor power source.
| These thought leaders include people with books and 1M+ view
| YouTube videos on the subject that use the word tensor, and I'm
| not saying their names because they admitted off the record that
| it's a poor choice of term, but it harms the ego of many Google
| engineers to publicly admit that.

| amelius wrote:
| How are sum-types handled in deep learning?
|
| E.g., a type that holds "a 3x2 tensor OR a 4x6x2x5 tensor".

| sva_ wrote:
| It should be noted that it doesn't have all that much to do with
| the rigorous mathematical definition of a tensor.

| jamessb wrote:
| Very loosely, a number/vector/matrix/tensor can be considered to
| be objects where specifying the values of 0/1/2/3 indexes will
| give a number.
|
| (A mathematician might object to this on several grounds, such
| as that vectors/matrices/tensors are geometric objects which
| need not be expressed numerically as coordinates in any
| coordinate system.)

| amelius wrote:
| A tensor can take any number of dimensions (not just 3).

| jayalammar wrote:
| I updated the post to say "multi-dimensional array".
|
| In a context like this, we use tensor because it allows for any
| number of dimensions (while a vector/array is only one and a
| matrix is two). When you get into ML libraries, both popular
| packages, PyTorch and TensorFlow, use the "tensor" terminology.
|
| It's a good point. Hope it's clearer for devs with "array"
| terminology.

| zestyping wrote:
| > we use tensor because it allows for any number of dimensions
|
| "Vector" implies one dimension and "matrix" strongly implies
| two. But an array can have any number of dimensions, so "array"
| is the best word.
|
| We don't need the word "tensor"; when the context is
| programming, "tensor" is only confusing and doesn't really add
| any useful meaning.

| [deleted]

| avereveard wrote:
| To reduce it a little: a matrix holds numbers; a tensor holds
| whatever. Numbers. Vectors. Operations.

| yreg wrote:
| So it is math lingo for array.

| minimaxir wrote:
| A more practical example of the added dimensionality of tensors
| is the addition of a batch dimension, so an 8-image batch per
| training step would be a (8, 256, 256, 3) tensor.
|
| Tools such as PyTorch's DataLoader can efficiently collate
| multiple inputs into a batch.
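|
| A minimal example (random tensors standing in for real images
| and labels):
|
|     import torch
|     from torch.utils.data import DataLoader, TensorDataset
|
|     # 100 fake 256x256 RGB "images" with integer class labels.
|     images = torch.rand(100, 256, 256, 3)
|     labels = torch.randint(0, 10, (100,))
|
|     loader = DataLoader(TensorDataset(images, labels),
|                         batch_size=8, shuffle=True)
|
|     batch, batch_labels = next(iter(loader))
|     print(batch.shape)  # torch.Size([8, 256, 256, 3])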
| TigeriusKirk wrote:
| One of my favorite moments in Geoffrey Hinton's otherwise pretty
| info-dense Coursera neural network class was when he said:
|
| "To deal with a 14-dimensional space, visualize a 3-D space and
| say 'fourteen' to yourself very loudly. Everyone does it."

| [deleted]

| renewiltord wrote:
| Isn't it really cool? It's like the AI is asking itself what
| shapes the clouds are making and whether the moon has a face,
| over and over again.

| coldcode wrote:
| I find SD to be amazing technology, but it still (mostly) sucks
| at producing "intelligent" images. It's basically fancy math
| that turns noise into images (the opposite of what it trained
| on), but it still has no idea what it is producing. If you run
| it long enough you eventually get lucky and find a gem. I like
| to try "George Washington riding a Unicorn in Times Square";
| I've so far never gotten anything as good as what a first-year
| art student could draw. I wonder how long it will take before
| something more "AI" than "ML" will have an understanding even
| close to what a simple human brain can process.
|
| In the meantime it's fun to play with, plus I'd like to better
| understand the noise training process.

| jw1224 wrote:
| > "George Washington riding a Unicorn in Times Square"
|
| The "secret" to Stable Diffusion (and other CLIP-based models)
| is being as descriptive as possible. This prompt, whilst easy
| for humans to imagine, actually has a whole lot of ambiguity
| baked in.
|
| How high is the unicorn flying? Is the unicorn even flying, or
| on the ground? How old is George Washington? What visual style
| is the image in? Is the image from the perspective of a
| pedestrian at ground level, or from up at skyscraper level?
|
| The more ambiguous the prompt, the less cohesive the image.
|
| To demonstrate, here are 4 renders from your original prompt:
| https://imgur.com/a/Jo4qfOp
|
| And here are 4 using the prompt "George Washington riding a
| unicorn in Times Square, cinematic composition, concept art,
| digital illustration, detailed": https://imgur.com/a/lB36JqC
|
| Certainly not perfect, but for an additional 15 seconds of
| effort, far better.

| CrazyStat wrote:
| I love how the unicorn horn got stuck on Washington's head in
| the bottom right instead of on the unicorn.

| minimaxir wrote:
| With SD, you _have_ to use quality/positional/artist modifier
| keywords, as vanilla inputs give the model too much freedom.

| dr_dshiv wrote:
| > I like to try "George Washington riding a Unicorn in Times
| Square"; I've so far never gotten anything as good as what a
| first-year art student could draw.
|
| Why the hell would a first-year art student draw that? Flunk
| their ass. God damn dumb-ass prompts I have to deal with.
|
| --Stable Diffusion

| Psychoshy_bc1q wrote:
| You might try "Pony Diffusion" for that :-)
| https://huggingface.co/AstraliteHeart/pony-diffusion

| l33tman wrote:
| The reason you can't get the images you want from it is not the
| noise diffusion process (after all, this is probably the closest
| similarity to how a human gets a flash of creativity) but the
| lack of a large language model in SD - it was deliberately
| scaled down so the result could fit in consumer GPUs.
|
| DALLE-2 uses a much larger language model, and you can explain
| more complicated concepts to it. Google's Imagen likewise (not
| released, though).
|
| It's mostly a matter of scaling to get this better.

| astrange wrote:
| It's not just size but also model architecture. DALLE mini
| (craiyon.com) has the opposite priority because of its different
| architecture; you can enter a complex prompt and it will follow
| it, but it's much slower and the image quality is a lot worse.
| SD prefers making aesthetic pictures over listening to
| everything you tell it.
|
| You can improve this in SD by raising cfg_scale, at the cost of
| some weird "oversharpening" artifacts. Or you can make a crappy
| image in DALLE mini and use that as the img2img input for SD to
| make it prettier.
|
| The real sign it's lacking intelligence is that if you ask it a
| question, it won't draw the answer - it'll just draw the
| question. Of course, they could fix that too; it's got a GPT in
| it, they just don't let it recurse...

| ilaksh wrote:
| It says the final output before pixel space is 64x64x4? How can
| that be enough information?

| Imnimo wrote:
| The way I think of it: we have a 512x512x3 target, so that's 48x
| the information. I don't think it's unreasonable to say that far
| less than 1/48th of the space of 512x512x3 outputs are natural
| images (meaning an image that might actually exist, rather than
| meaningless pixels). So if we think of that 64x64x4 tensor as
| telling us which image in the smaller space of natural images we
| should draw, it seems like plenty of information. Especially
| since we also have the information stored in the weights of the
| decoder network.

| zestyping wrote:
| The amount of information in a 64x64x4 array would depend on the
| precision of the numbers in it, right? For example, a 512x512
| image in 24-bit colour could be completely encoded in a 64x64x4
| array if each of the 64 x 64 x 4 = 16,384 values had 384 bits of
| precision.
|
| So, I wonder: what's the minimum number of bits of precision in
| the 64x64x4 array that would be sufficient for this to work?
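|
| Following that arithmetic (Python; float32 latents, as a comment
| below notes):
|
|     # Information budget: latent values vs. output pixels.
|     pixel_bits  = 512 * 512 * 3 * 8   # 24-bit colour image
|     latent_vals = 64 * 64 * 4         # 16,384 latent values
|
|     print(pixel_bits / latent_vals)   # 384.0 bits per value
|
|     # Each latent value is stored as a 32-bit float, so the
|     # latent carries 12x fewer raw bits than the pixels; the
|     # decoder's learned weights make up the difference.
|     print(pixel_bits / (latent_vals * 32))  # 12.0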
| l33tman wrote:
| The autoencoder that maps between that and the 512x512x3 RGB
| space was trained together with the model, so it is specialized
| in upscaling the 64x64x4 info to pixel space for this particular
| purpose. It's "just" a factor-of-48 (de)compression.

| astrange wrote:
| And remember we only care about 8 bits of the output (actually
| less, due to JPEG compression), but each latent value is a
| 32-bit float.

| ginger2016 wrote:
| This is awesome! Thank you for creating it. I have been wanting
| to read about Stable Diffusion. I added this to my reading list
| in Safari.

| minimaxir wrote:
| Hugging Face's diffusers library and explainer Colab notebook
| (https://colab.research.google.com/github/huggingface/noteboo...)
| are good resources on how diffusion works in practice,
| code-wise.

| jayalammar wrote:
| Agreed. "Stable Diffusion with Diffusers" and "The Annotated
| Diffusion Model" were excellent and are linked in the article.
| The code in Diffusers was also a good reference.

| yieldcrv wrote:
| What are you guys currently using for Stable Diffusion on OSX
| with M1?
|
| There are so many variants and forks that I don't know which one
| to install any more. Something that takes advantage of Metal and
| the CPU cores.
|
| Any that retain the "upload a sketch and then add a description"
| feature?

| pram wrote:
| I'm using InvokeAI. Follow the instructions and it will work
| flawlessly.
|
| https://github.com/invoke-ai/InvokeAI

| subdane wrote:
| https://github.com/divamgupta/diffusionbee-stable-diffusion-...

| jerpint wrote:
| Every time I need a refresher on transformers, I read the same
| author's post on transformers. Looking forward to this one!

| jmartrican wrote:
| So it's like the How to Draw an Owl meme.

| minimaxir wrote:
| SD has made this meme into a reality, given how easy it is to
| take a sketch and use img2img to get something workable out of
| it.

| thunderbird120 wrote:
| You can do that, yes: https://0x0.st/oJVK.webm

| swyx wrote:
| I've been collecting other explanations of how SD works here:
| https://github.com/sw-yx/prompt-eng#sd-model-values
___________________________________________________________________
(page generated 2022-10-04 23:00 UTC)