[HN Gopher] DeepFloyd IF: open-source text-to-image model
___________________________________________________________________
 
DeepFloyd IF: open-source text-to-image model
 
Author : ea016
Score  : 120 points
Date   : 2023-04-26 18:15 UTC (4 hours ago)
 
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
 
| 55555 wrote:
| So this one can create perfect text in images? If true, that's
| insane.
|
| GaggiX wrote:
| LDM-400M (a predecessor of Stable Diffusion) was already able to
| generate text, thanks to the fact that every token in the text
| encoder (trained from scratch) was available in the attention
| layer.
|
| flangola7 wrote:
| > thanks to the fact that every token in the text encoder
| > (trained from scratch) was available in the attention layer.
|
| > ChatGPT explain this like I'm 5
|
| coolspot wrote:
| "Every word in the text can be used to help create the image."
|
| GaggiX wrote:
| Interesting, there are different models:
| https://github.com/deep-floyd/IF#-model-zoo-
|
| I'm also very happy about the release of the two upscalers; I can
| use them to upscale the results of my small 64x64 DDIM models
| (maybe with some finetuning).
|
| jacob019 wrote:
| Any web-based front ends yet? I put together a system that runs a
| variety of web-based open-source AI image generation and editing
| tools on Vultr GPU instances. It spins up instances on demand,
| mounts an NFS filesystem with local caching and a COW layer,
| spawns the services, proxies the requests, and then spins down
| idle instances when I'm done. I'd love to add this; I suppose I
| could whip something up if none exists.
|
| ronsor wrote:
| It'll probably be in the Auto1111 WebUI within a week.
|
| Taek wrote:
| For anyone who doesn't know, DeepFloyd is a StableDiffusion style
| image model that more or less replaced CLIP with a full LLM (11B
| params). The result is that it is much better at responding to
| more complex prompts.
|
| In theory, it is also smarter at learning from its training data.
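An aside on GaggiX's point above: "every token ... available in the attention layer" refers to cross-attention, where each image position computes a weighted mix over per-token text embeddings, so word-level (even letter-level) information can steer individual image regions. A toy pure-Python sketch of the mechanism, illustrative only and not the model's actual code (all names here are made up):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(image_queries, text_keys, text_values):
    """Each image position attends over *every* text token: its output is
    a softmax-weighted mix of the per-token text embeddings."""
    dim = len(text_keys[0])
    out = []
    for q in image_queries:
        # Scaled dot-product scores of this image position against each token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in text_keys]
        weights = softmax(scores)
        # Weighted mix of token values.
        mixed = [sum(w * v[d] for w, v in zip(weights, text_values))
                 for d in range(len(text_values[0]))]
        out.append(mixed)
    return out

# Toy example: two text tokens with orthogonal keys. A query aligned with
# token 0's key pulls almost all of its output from token 0's value.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
result = cross_attend([[1.0, 0.0]], keys, values)
print(result)  # first component close to 1.0, second close to 0.0
```

The point of the thread's discussion is only the attention pattern: because every token of a large text encoder is visible here, spelling survives into the image, unlike with a pooled single-vector text embedding.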
| GaggiX wrote:
| > StableDiffusion style
|
| Not really; it's a cascaded diffusion model conditioned on the
| T5 encoder. There is nothing really in common, unless you mean
| that using a diffusion model at all is "SD style".
|
| tmabraham wrote:
| It isn't like Stable Diffusion; it's more like Google's Imagen
| model.
|
| epivosism wrote:
| Example of how much better it can do compared to Midjourney on a
| complex prompt:
| https://twitter.com/eb_french/status/1623823175170805760
|
| It is able to put people on the left/right and put the correct
| t-shirts and facial expressions on each one. This is compared to
| MJ, which just mixes together a soup of every word you use and
| plops it out into the image. Huge MJ fan of course, it's amazing,
| but having compositional power is another step up.
|
| epivosism wrote:
| Here are some play markets on Manifold Markets tracking its
| release:
| https://manifold.markets/markets?s=relevance&f=all&q=deepflo...
|
| 35% to full release by end of month, although it may not have
| adjusted.
|
| TheBlapse wrote:
| "Imagen free"
|
| TheBlapse wrote:
| Currently down on Hugging Face.
|
| zimpenfish wrote:
| 16GB VRAM minimum is a bit steep. Sadly that excludes my 3080,
| which is annoying because I'd like something better than Stable
| Diffusion locally.
|
| specproc wrote:
| There's a note which suggests you might be able to get by on
| less. My 3060 struggles with SD on the defaults, but works fine
| with float16.
|
| _There are multiple ways to speed up the inference time and
| lower the memory consumption even more with diffusers. To do
| so, please have a look at the Diffusers docs:
|
| Optimizing for inference time [1]
| Optimizing for low memory during inference [2]
|
| [1] https://huggingface.co/docs/diffusers/api/pipelines/if#optim...
|
| [2] https://huggingface.co/docs/diffusers/api/pipelines/if#optim..._
|
| thewataccount wrote:
| Once these are quantized (I assume they can be), they should be
| ~1/4th the size.
|
| Can anyone explain why it needs so much RAM in the first place,
| though? 4.3B params is only ~9GB at 16-bit (I'm not as familiar
| with image models).
|
| I'm really happy to see that it fits under 24GB - that's what I
| consider the limit for being able to run on "consumer hardware".
|
| GaggiX wrote:
| > Can anyone explain why it needs so much ram in the first
| > place though?
|
| The T5-XXL text encoder is really large. Also, we do not
| quantize the UNets: the UNet outputs 8-bit pixels, so quantizing
| the UNet to that precision would create pretty bad outputs.
|
| SekstiNi wrote:
| They took down the blog post, but from what I remember the model
| is composite and consists of a text encoder as well as 3
| "stages":
|
| 1. (11B) T5-XXL text encoder [1]
|
| 2. (4.3B) Stage 1 UNet
|
| 3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)
|
| 4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)
|
| Resolution numbers could be off, though. Also, the third stage
| can apparently use the existing Stable Diffusion x4 upscaler, or
| a new upscaler that they aren't releasing yet (ever?).
|
| > Once these are quantized (I assume they can be)
|
| Based on the success of LLaMA 4-bit quantization, I believe the
| text encoder could be. As for the other modules, I'm not sure.
|
| edit: the text encoder is 11B, not 4.5B as I initially wrote.
|
| [1]: https://huggingface.co/google/t5-v1_1-xxl
|
| gwern wrote:
| You'll be able to optimize it a lot to make it fit on small
| systems if you are willing to modify your workflow a bit:
| instead of 1 prompt -> 1 image _n_ times, do 1 prompt -> _n_
| images 1 time -> _m_ times... For a given prompt, run it through
| the T5 model and store the result; you can do that in CPU RAM if
| you have to, because you only need the embedding once, so you
| don't need a GPU which can run T5-XXL natively. Then you can get
| a large batch of samples from #2; 64px is enough to preview;
| only once you pick some do you run them through #3, and then
| from those through #4.
Your peak VRAM
| should be 1 image in #2 or #4, and that can be quantized or
| pruned down to something that will fit on many GPUs.
|
| TaylorAlexander wrote:
| If you don't mind the power consumption, I noticed that older
| Nvidia P6000s (24GB) are pretty cheap on eBay! My 16GB P5000 is
| pretty handy for this stuff.
|
| coolspot wrote:
| Looks like the P6000 24GB goes for $800-$1200, while you can get
| a superior 3090 24GB for $800-$1000.
|
| CamperBob2 wrote:
| 4090s are only $1600 or so now, for that matter.
|
| TaylorAlexander wrote:
| Oh! My mistake, thanks for letting me know.
|
| NBJack wrote:
| An M40 24GB is less than $200, if you don't mind the trouble of
| getting its drivers installed, cooling it, etc. It's also
| important to note your motherboard must support larger VRAM
| addressing; many older chipsets won't be able to boot with it
| (i.e. some, perhaps almost all, Zen 1 systems).
|
| connerruhl wrote:
| The full release will be soon!
|
| https://twitter.com/EMostaque/status/1651328161148174337
|
| atleastoptimal wrote:
| > Text
|
| > Hands
|
| Good god, it solves the two biggest meme issues with image
| models in one go. Will this be the new state of the art every
| other model is compared to?
|
| Taek wrote:
| There are good reasons to believe that this will be the new
| state of the art by a comfortable margin. Hard to know until we
| can actually play with it.
|
| gwern wrote:
| We already knew those were going to be solved by scale, like
| using T5 instead of the really small, bad text encoder SD used,
| because they _were_ solved by Imagen etc.
|
| orra wrote:
| Neither the source code nor the weights are open source... This
| is actually worse than Stability AI's previous offering, in
| that regard.
|
| connerruhl wrote:
| That'll change when the full non-research release occurs:
| https://twitter.com/EMostaque/status/1651328161148174337
|
| orra wrote:
| That tweet is vague. Besides, it says 'like SD', so I will be
| pleasantly shocked if the models are open source.
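An aside on the VRAM discussion upthread: the parameter counts quoted in the thread make the footprint easy to estimate. Bytes-per-parameter is the only assumption here; activation memory and the unknown stage-3 size are ignored, so treat these as lower bounds:

```python
GIB = 1024 ** 3

def weight_gib(params: float, bits: int) -> float:
    """Approximate weight footprint: parameter count x bits per parameter."""
    return params * bits / 8 / GIB

# Parameter counts quoted in the thread (stage 3 size unknown, omitted).
modules = {
    "T5-XXL text encoder": 11.0e9,
    "stage 1 UNet": 4.3e9,
    "stage 2 upscaler": 1.3e9,
}

# Keeping every module resident in fp16 blows well past a 16GB card.
fp16_total = sum(weight_gib(p, 16) for p in modules.values())
print(f"all modules resident in fp16: ~{fp16_total:.1f} GiB")  # ~30.9 GiB

# Loading one module at a time caps peak usage at the largest module;
# the T5 embedding can even be computed once in CPU RAM and cached.
peak = max(weight_gib(p, 16) for p in modules.values())
print(f"peak with sequential loading: ~{peak:.1f} GiB")  # ~20.5 GiB

# 4-bit quantizing the text encoder (a la LLaMA) shrinks the biggest piece.
print(f"T5-XXL at 4-bit: ~{weight_gib(11.0e9, 4):.1f} GiB")  # ~5.1 GiB
```

This is why the staged workflow described above helps: nothing forces all the modules to be resident at once, and the biggest one, the text encoder, only has to run once per prompt.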
| ilaksh wrote:
| They are technically open source. It's just that the model
| license prohibits commercial use and the code license prohibits
| bypassing the filters. So it's kind of worse than closed source
| in a way, because it's like a tease. With no API, apparently.
|
| Theoretically, large companies or rich people might be able to
| make a licensing agreement.
|
| rgbrgb wrote:
| > model license prohibits commercial use
|
| I thought that at first, but I think it only prohibits
| commercial use that breaks regional copyright or privacy laws.
|
| orra wrote:
| It prohibits both: commercial use, whether or not you break
| regional laws; and breaking certain laws. As another user said,
| encoding the law into a licence is pointless but makes it
| non-free.
|
| There are also problematic restrictions on your ability to
| modify the software under clause 2(c). And since you don't have
| the right to sublicense, it's not clear to me what rights
| somebody has if you give them a copy.
|
| yellowapple wrote:
| That's already prohibited by, you know, those very same
| copyright and privacy laws. Adding those same prohibitions to
| the license not only makes the software nonfree, but
| _pointlessly_ does so.
|
| dragonwriter wrote:
| It's not pointless: it means the model _licensor_ has a claim
| against you, as well as whoever would for violating the
| referenced laws; it also means, and this is probably more
| important, that in some jurisdictions the model licensor has a
| better defense against liability for contributory infringement
| if the licensee infringes.
|
| EDIT: That said, it's unambiguously not open source.
|
| jrm4 wrote:
| I am a lawyer, and as flimsy and wishy-washy as the term
| "open-source" already is, I can't even fathom what is meant by
| "open source" here.
|
| Are people suggesting that "look at the code but don't touch"
| actually fits what some people think of as open source?
|
| yellowapple wrote:
| > They are technically open source.
It's just that the model
| license prohibits commercial use and the code license prohibits
| bypassing the filters.
|
| Your second sentence contradicts the first. Prohibiting
| commercial use and prohibiting modification are each, in and of
| themselves, mutually exclusive with being "technically open
| source" (let alone both at the same time).
|
| kingcharles wrote:
| The examples in the README are extremely compelling; the state
| of the art has been raised yet again.
|
| lalaithion wrote:
| Has anyone tried the Scott Alexander AI bet prompts?
|
| 1. A stained glass picture of a woman in a library with a raven
| on her shoulder with a key in its mouth
|
| 2. An oil painting of a man in a factory looking at a cat
| wearing a top hat
|
| 3. A digital art picture of a child riding a llama with a bell
| on its tail through a desert
|
| 4. A 3D render of an astronaut in space holding a fox wearing
| lipstick
|
| 5. Pixel art of a farmer in a cathedral holding a red basketball
|
| swyx wrote:
| where are these prompts from?
|
| epivosism wrote:
| Yes, I tried them here on an earlier version of IF:
| https://twitter.com/eb_french/status/1618354180577714176
|
| epivosism wrote:
| I thought it was pretty definitive at the time, but when you
| look really closely (as Scott's opponent is likely to do), it
| didn't seem like a clear win yet. But that was 3 months ago, and
| hopefully DF is even better now.
|
| hunkins wrote:
| New restriction in their License suggests the software can't be
| modified:
|
| "2. All persons obtaining a copy or substantial portion of the
| Software, a modified version of the Software (or substantial
| portion thereof), or a derivative work based upon this Software
| (or substantial portion thereof) must not delete, remove,
| disable, diminish, or circumvent any inference filters or
| inference filter mechanisms in the Software, or any portion of
| the Software that implements any such filters or filter
| mechanisms."
| thewataccount wrote:
| > New restriction in their License suggests the software can't
| > be modified.
|
| It can be modified. That just says it can't be modified to
| bypass their filters.
|
| [deleted]
|
| GaggiX wrote:
| > New restriction in their License suggests the software can't
| > be modified.
|
| Only to remove the filters.
|
| "Permission is hereby granted, free of charge, to any person
| obtaining a copy of this software and associated documentation
| files (the "Software"), to deal in the Software without
| restriction, including without limitation the rights to use,
| copy, modify, merge, publish, distribute, sublicense, and/or
| sell copies of the Software, and to permit persons to whom the
| Software is furnished to do so, subject to the following
| conditions:"
|
| oh_sigh wrote:
| You can't remove the filters per the license, but the weights
| will be available soon, and so anyone can just reimplement this
| code using the weights.
|
| Jackson__ wrote:
| There is a similar license clause for the weights [0] as well,
| so I'm not sure this would apply unless you write the code
| yourself and train your model from scratch.
|
| [0] https://github.com/deep-floyd/IF/blob/main/LICENSE-MODEL#L54
|
| dragonwriter wrote:
| Or unless, as seems to be fairly widely expected but is
| untested, model weights are not actually copyrightable, so
| model licenses are superfluous.
|
| ronsor wrote:
| That's already been done:
| https://github.com/lucidrains/imagen-pytorch
|
| RobotToaster wrote:
| Then by definition it isn't open source, violating points 3, 4,
| and 6 of the open source definition:
| https://opensource.org/osd/
|
| yellowapple wrote:
| Yep. It's getting really exhausting seeing projects falsely
| advertising themselves as "open source". Either be FOSS or don't
| be; don't pretend to be while using some nonsense like the BSL
| or whatever ad-hocery is in play here.
| Mizza wrote:
| In the README they even call it "Modified MIT", the modification
| being where they turned it from a very permissive license into a
| fully proprietary one. Very cool model, though.
|
| kmeisthax wrote:
| As someone who's largely "OK" with morality clauses in otherwise
| liberal AI licenses, I think we should start calling these
| "weights-available" models to distinguish them from capital-F
| Free Software [1] ones.
|
| I'm starting to get irritated by all these 'non-commercial'
| licensed models, though, because there is no such thing as a
| non-commercial license. In copyright law, merely having the work
| in question is considered a commercial benefit. So you need to
| specify every single act you think is 'non-commercial', and
| users of the license have to read and understand that. Even
| Creative Commons' NC clause only specifies one; they say that
| filesharing is not commercial. So it's just a fancy covenant not
| to sue BitTorrent users.
|
| And then there's LLaMA, whose model weights were only ever
| shared privately with other researchers. Everyone using LLaMA
| publicly is likely pirating it. _Actual_ weights-available or
| Free models already exist, such as BLOOM, Dolly, StableLM [0],
| Pythia, GPT-J, GPT-NeoX, and CerebrasGPT.
|
| [0] Untuned only; the instruction-tuned models are frustratingly
| CC-BY-NC-SA, because apparently nobody has made an open dataset
| for instruction tuning.
|
| [1] Inasmuch as an AI model trained on copyrighted data can even
| be considered Free.
|
| simonw wrote:
| It looks like the model on Hugging Face either hasn't been
| published yet or was withdrawn.
I got this error in their Colab
| notebook:
|
| OSError: DeepFloyd/IF-I-IF-v1.0 is not a local folder and is not
| a valid model identifier listed on
| 'https://huggingface.co/models' If this is a private repository,
| make sure to pass a token having permission to this repo with
| `use_auth_token` or log in with `huggingface-cli login` and pass
| `use_auth_token=True`.
|
| Zetobal wrote:
| You need to accept the license on the Hugging Face model card.
|
| lerchmo wrote:
| It doesn't seem like they have anything published:
| https://huggingface.co/DeepFloyd
|
| thewataccount wrote:
| I swear I saw it a few minutes ago, but I might be crazy.
|
| Zetobal wrote:
| Same, got the weights on gdrive.
|
| og_kalu wrote:
| Could you link them?
|
| simonw wrote:
| https://huggingface.co/DeepFloyd/IF-I-IF-v1.0 is a 404
| currently.
___________________________________________________________________
 
(page generated 2023-04-26 23:01 UTC)