[HN Gopher] DeepFloyd IF: open-source text-to-image model
       ___________________________________________________________________
        
       DeepFloyd IF: open-source text-to-image model
        
       Author : ea016
       Score  : 120 points
       Date   : 2023-04-26 18:15 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | 55555 wrote:
       | So this one can create perfect text in images? If true, that's
       | insane
        
         | GaggiX wrote:
          | LDM-400M (the predecessor of Stable Diffusion) was already
          | able to generate text, thanks to the fact that every token
          | in the text encoder (trained from scratch) was available in
          | the attention layer.
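To make the mechanism concrete: the image features cross-attend over one embedding per text token, so individual words (including ones to be spelled out) can steer specific image regions. A toy numpy sketch of single-head cross-attention; the shapes and the use of raw token embeddings as values are simplifications for illustration, not the real architecture:

```python
import numpy as np

def cross_attention(image_queries, token_embeddings):
    """Toy single-head cross-attention: every image query can attend to
    every text-token embedding.  Real models add learned Q/K/V
    projections, multiple heads, and many layers."""
    d_k = token_embeddings.shape[-1]
    scores = image_queries @ token_embeddings.T / np.sqrt(d_k)  # (n_img, n_tok)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    # Values are just the token embeddings here, so every word stays "visible"
    return weights @ token_embeddings, weights

# 64 image-patch queries attending over 8 text tokens, embedding dim 16
rng = np.random.default_rng(0)
out, w = cross_attention(rng.normal(size=(64, 16)), rng.normal(size=(8, 16)))
```

Each row of `w` is a distribution over the text tokens, which is what lets a single token (say, a word to render) dominate a patch of the image.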
        
           | flangola7 wrote:
           | >thanks to the fact that every token in the text encoder
           | (trained from scratch) was available in the attention layer.
           | 
           | >ChatGPT explain this like I'm 5
        
             | coolspot wrote:
             | "Every word in the text can be used to help create the
             | image."
        
       | GaggiX wrote:
       | interesting there are different models: https://github.com/deep-
       | floyd/IF#-model-zoo-
       | 
       | I'm also very happy for the release of the two upscaler, I can
       | use them to upscale to result of my small 64x64 DDIM models
       | (maybe with some finetuning).
        
       | jacob019 wrote:
       | Any web based front ends yet? I put together a system that runs a
       | variety of web based open source AI image generation and editing
       | tools on Vultr GPU instances. It spins up instances on demand,
       | mounts an NFS filesystem with local caching and a COW layer,
       | spawns the services, proxies the requests, and then spins down
       | idle instances when I'm done. Would love to add this, suppose I
       | could whip something up if none exists.
        
         | ronsor wrote:
         | It'll probably be in the Auto1111 WebUI within a week.
        
       | Taek wrote:
       | For anyone who doesn't know, DeepFloyd is a StableDiffusion style
       | image model that more or less replaced CLIP with a full LLM (11b
       | params). The result is that it is much better at responding to
       | more complex prompts.
       | 
       | In theory, it is also smarter at learning from its training data.
        
         | GaggiX wrote:
         | >StableDiffusion style
         | 
          | Not really; it's a cascaded diffusion model conditioned on
          | the T5 encoder. There is nothing really in common, unless
          | you mean that using a diffusion model is "SD style".
        
         | tmabraham wrote:
         | It isn't like Stable Diffusion, it's more like Google's Imagen
         | model.
        
       | epivosism wrote:
       | Example of how much better it can do compared to midjourney, on a
       | complex prompt:
       | https://twitter.com/eb_french/status/1623823175170805760
       | 
       | It is able to put people on the left/right and put the correct
       | t-shirts and facial expressions on each one. This is compared to
       | mj which just mixes together a soup of every word you use and
       | plops it out into the image. Huge MJ fan of course, it's amazing,
       | but having compositional power is another step up.
        
       | epivosism wrote:
       | Here are some play markets on manifold markets tracking its
       | release:
       | https://manifold.markets/markets?s=relevance&f=all&q=deepflo...
       | 
        | 35% chance of a full release by end of month, although the
        | odds may not have adjusted yet.
        
       | TheBlapse wrote:
       | "Imagen free"
        
       | TheBlapse wrote:
       | Currently down on hugging face
        
       | zimpenfish wrote:
        | 16GB VRAM minimum is a bit steep. Sadly it excludes my 3080,
        | which is annoying because I'd like something better than
        | Stable Diffusion locally.
        
         | specproc wrote:
          | There's a note suggesting you might be able to get by with
          | less. My 3060 struggles with SD on the defaults, but works
          | fine with float16.
         | 
         |  _There are multiple ways to speed up the inference time and
         | lower the memory consumption even more with diffusers. To do
         | so, please have a look at the Diffusers docs:
          | Optimizing for inference time [1]
          | Optimizing for low memory during inference [2]
         | 
         | [1]
         | https://huggingface.co/docs/diffusers/api/pipelines/if#optim...
         | 
         | [2] https://huggingface.co/docs/diffusers/api/pipelines/if#opti
         | m..._
        
         | thewataccount wrote:
         | Once these are quantized (I assume they can be), they should be
         | ~1/4th the size.
         | 
         | Can anyone explain why it needs so much ram in the first place
         | though? 4.3B is only ~9GB at 16bit (I'm not as familiar with
         | image models).
         | 
         | I'm really happy to see that fits under 24GB - that's what I
         | consider the limit for being able to run on "consumer
         | hardware".
        
           | GaggiX wrote:
           | >Can anyone explain why it needs so much ram in the first
           | place though?
           | 
            | The T5-XXL text encoder is really large. Also, the UNets
            | are not quantized: the UNet outputs 8-bit pixels, so
            | quantizing its weights to that precision would create
            | pretty bad outputs.
        
           | SekstiNi wrote:
           | They took down the blogpost, but from what I remember the
           | model is composite and consists of a text encoder as well as
           | 3 "stages":
           | 
           | 1. (11B) T5-XXL text encoder [1]
           | 
           | 2. (4.3B) Stage 1 UNet
           | 
           | 3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)
           | 
           | 4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)
           | 
            | Resolution numbers could be off though. Also, the third
            | stage can apparently use the existing Stable Diffusion x4
            | upscaler, or a new upscaler that they aren't releasing yet
            | (ever?).
           | 
           | > Once these are quantized (I assume they can be)
           | 
           | Based on the success of LLaMA 4bit quantization, I believe
           | the text encoder could be. As for the other modules, I'm not
           | sure.
           | 
           | edit: the text encoder is 11B, not 4.5B as I initially wrote.
           | 
           | [1]: https://huggingface.co/google/t5-v1_1-xxl
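The parameter counts listed above convert to weight memory roughly as follows; a back-of-the-envelope sketch counting weights only (activations, attention buffers, and framework overhead come on top, and the byte widths are assumptions):

```python
def model_bytes(params_billion, bytes_per_param):
    """Weight memory in GiB for a model of the given size."""
    return params_billion * 1e9 * bytes_per_param / 2**30

t5_fp16   = model_bytes(11.0, 2)    # ~20.5 GiB: the encoder dominates VRAM
unet_fp16 = model_bytes(4.3, 2)     # ~8.0 GiB: matches the "~9GB at 16bit" estimate
t5_int4   = model_bytes(11.0, 0.5)  # ~5.1 GiB if 4-bit quantization works, as with LLaMA
```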
        
             | gwern wrote:
             | You'll be able to optimize it a lot to make it fit on small
             | systems if you are willing to modify your workflow a bit:
             | instead of 1 prompt -> 1 image _n_ times, do 1 prompt ->
             | _n_ images 1 time -> _m_ times... For a given prompt, run
             | it through the T5 model and store; you can do that in CPU
             | RAM if you have to because you only need the embedding once
             | so you don't need a GPU which can run T5-XXL naively. Then
             | you can get a large batch of samples from #2; 64px is
             | enough to preview; only once you pick some do you run
             | through #3, and then from those through #4. Your peak VRAM
             | should be 1 image in #2 or #4 and that can be quantized or
             | pruned down to something that will fit on many GPUs.
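A hypothetical sketch of this workflow in plain Python. The encoder and stage functions here are stand-in arguments, not any real API; the point is just that the embedding is computed once per prompt and the expensive upscaler stages run only on the previews you pick:

```python
def make_cached_encoder(encode_fn):
    """Wrap the text encoder so each prompt is embedded only once; the
    cached embedding can live in CPU RAM after T5 is unloaded."""
    cache = {}
    def encode(prompt):
        if prompt not in cache:
            cache[prompt] = encode_fn(prompt)
        return cache[prompt]
    return encode

def generate(prompt, encode, sample64, upscale256, upscale1024,
             n=8, picks=(0,)):
    emb = encode(prompt)                                  # text embedding, once per prompt
    previews = [sample64(emb, seed=i) for i in range(n)]  # cheap 64px batch to browse
    chosen = [previews[i] for i in picks]                 # keep only the good ones
    return [upscale1024(upscale256(p)) for p in chosen]   # expensive stages on picks only
```

Only one stage's weights need to be resident at a time, which is what keeps peak VRAM down to a single UNet or upscaler.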
        
         | TaylorAlexander wrote:
          | If you don't mind the power consumption, I noticed that
          | older Nvidia P6000s (24GB) are pretty cheap on eBay! My 16GB
          | P5000 is pretty handy for this stuff.
        
           | coolspot wrote:
            | Looks like the P6000 24GB goes for $800-$1200 while you
            | can get a superior 3090 24GB for $800-$1000.
        
             | CamperBob2 wrote:
             | 4090s are only $1600 or so now, for that matter.
        
             | TaylorAlexander wrote:
             | oh! My mistake thanks for letting me know.
        
           | NBJack wrote:
            | An M40 24GB is less than $200, if you don't mind the
            | trouble of getting its drivers installed, cooling it, etc.
            | It's also important to note your motherboard must support
            | large-VRAM addressing; many older chipsets won't be able
            | to boot with it (i.e. some, perhaps almost all, Zen 1
            | boards).
        
       | connerruhl wrote:
        | The full release will come soon!
       | 
       | https://twitter.com/EMostaque/status/1651328161148174337
        
       | atleastoptimal wrote:
       | > Text
       | 
       | > Hands
       | 
        | Good god, it solves the two biggest meme issues with image
        | models in one go. Will this be the new state of the art that
        | every other model is compared to?
        
         | Taek wrote:
         | There are good reasons to believe that this will be the new
         | state of the art by a comfortable margin. Hard to know until we
         | can actually play with it.
        
         | gwern wrote:
          | We already knew those were going to be solved by scale
          | (e.g. using T5 instead of the really small, bad text encoder
          | SD used), because they _were_ solved by Imagen etc.
        
       | orra wrote:
       | Neither the source code nor the weights are open source... This
       | is actually worse than Stability AI's previous offering, in that
       | regard.
        
         | connerruhl wrote:
         | That'll change when the full non-research release occurs...
         | https://twitter.com/EMostaque/status/1651328161148174337
        
           | orra wrote:
            | That tweet is vague. Besides, it says 'like SD', so I will
            | be pleasantly shocked if the models are open source.
        
         | ilaksh wrote:
         | They are technically open source. It's just that the model
         | license prohibits commercial use and the code license prohibits
         | bypassing the filters. So it's kind of worse than closed source
         | in a way because it's like a tease. With no API apparently.
         | 
         | Theoretically large companies or rich people might be able to
         | make a licensing agreement.
        
           | rgbrgb wrote:
           | > model license prohibits commercial use
           | 
           | I thought that at first, but I think it only prohibits
           | commercial use that breaks regional copyright or privacy
           | laws.
        
             | orra wrote:
              | It prohibits commercial use, whether or not you break
              | regional laws; and it separately prohibits breaking
              | certain laws. As another user said, encoding the law
              | into a licence is pointless but makes it non-free.
              | 
              | There are also problematic restrictions on your ability
              | to modify the software under clause 2(c). Nor do you
              | have the right to sublicence; it's not clear to me what
              | rights somebody has if you give them a copy.
        
             | yellowapple wrote:
             | That's already prohibited by, you know, those very same
             | copyright and privacy laws. Adding those same prohibitions
             | to the license not only makes the software nonfree, but
             | _pointlessly_ does so.
        
               | dragonwriter wrote:
                | It's not pointless: it means the model _licensor_ has
                | a claim against you, as well as whoever else would for
                | violating the referenced laws; it also means, and this
                | is probably more important, that in some jurisdictions
                | the model licensor has a better defense against
                | liability for contributory infringement if the
                | licensee infringes.
                | 
                | EDIT: That said, it's unambiguously not open source.
        
           | jrm4 wrote:
            | I am a lawyer, and as flimsy and wishy-washy as the term
            | "open-source" already is, I can't even fathom what is
            | meant by "open source" here.
            | 
            | Are people suggesting that "look at the code but don't
            | touch" actually fits what some people think of as open
            | source?
        
           | yellowapple wrote:
           | > They are technically open source. It's just that the model
           | license prohibits commercial use and the code license
           | prohibits bypassing the filters.
           | 
            | Your second sentence contradicts the first. Prohibiting
            | commercial use and prohibiting modification are each, in
            | and of themselves, mutually exclusive with being
            | "technically open source" (let alone both at the same
            | time).
        
       | kingcharles wrote:
       | The examples on the README are extremely compelling; the state of
       | the art has been raised yet again.
        
       | lalaithion wrote:
       | Has anyone tried the Scott Alexander AI bet prompts?
       | 
       | 1. A stained glass picture of a woman in a library with a raven
       | on her shoulder with a key in its mouth
       | 
       | 2. An oil painting of a man in a factory looking at a cat wearing
       | a top hat
       | 
       | 3. A digital art picture of a child riding a llama with a bell on
       | its tail through a desert
       | 
       | 4. A 3D render of an astronaut in space holding a fox wearing
       | lipstick
       | 
       | 5. Pixel art of a farmer in a cathedral holding a red basketball
        
         | swyx wrote:
         | where are these prompts from?
        
         | epivosism wrote:
         | Yes, I tried them here on an earlier version of IF:
         | https://twitter.com/eb_french/status/1618354180577714176
        
           | epivosism wrote:
           | I thought it was pretty definitive at the time, but when you
           | look really closely (as Scott's opponent is likely to do), it
           | didn't seem like a clear win yet. But that was 3 months ago,
           | and hopefully DF is even better now.
        
       | hunkins wrote:
       | New restriction in their License suggests the software can't be
       | modified.
       | 
       | "2. All persons obtaining a copy or substantial portion of the
       | Software, a modified version of the Software (or substantial
       | portion thereof), or a derivative work based upon this Software
       | (or substantial portion thereof) must not delete, remove,
       | disable, diminish, or circumvent any inference filters or
       | inference filter mechanisms in the Software, or any portion of
       | the Software that implements any such filters or filter
       | mechanisms."
        
         | thewataccount wrote:
         | > New restriction in their License suggests the software can't
         | be modified.
         | 
         | It can be modified. That just says it can't be modified to
         | bypass their filters.
        
           | [deleted]
        
         | GaggiX wrote:
         | >New restriction in their License suggests the software can't
         | be modified.
         | 
         | To remove filters.
         | 
         | "Permission is hereby granted, free of charge, to any person
         | obtaining a copy of this software and associated documentation
         | files (the "Software"), to deal in the Software without
         | restriction, including without limitation the rights to use,
         | copy, modify, merge, publish, distribute, sublicense, and/or
         | sell copies of the Software, and to permit persons to whom the
         | Software is furnished to do so, subject to the following
         | conditions:"
        
         | oh_sigh wrote:
         | You can't remove the filters per the license, but the weights
         | will be available soon and so anyone can just reimplement this
         | code using the weights
        
           | Jackson__ wrote:
           | There is a similar license clause for the weights[0] as well,
           | so I'm not sure this would apply unless you write the code
           | and train your model from scratch.
           | 
           | [0] https://github.com/deep-floyd/IF/blob/main/LICENSE-
           | MODEL#L54
        
             | dragonwriter wrote:
             | Or unless, as seems to be fairly widely expected but
             | untested, model weights are not actually copyrightable, so
             | model licenses are superfluous.
        
           | ronsor wrote:
           | That's already been done:
           | https://github.com/lucidrains/imagen-pytorch
        
         | RobotToaster wrote:
         | Then by definition it isn't open source, violating points 3, 4,
         | and 6 of the open source definition.
         | https://opensource.org/osd/
        
           | yellowapple wrote:
           | Yep. It's getting really exhausting seeing projects falsely
           | advertising themselves as "open source". Either be FOSS or
           | don't be; don't pretend to be while using some nonsense like
           | the BSL or whatever adhocery is in play here.
        
             | Mizza wrote:
             | In the README they even call it "Modified MIT", the
             | modification being where they turned it from a very
             | permissive license into a fully proprietary one. Very cool
             | model though.
        
         | kmeisthax wrote:
         | As someone who's largely "OK" with morality clauses in
         | otherwise liberal AI licenses, I think we should start calling
         | these "weights-available" models to distinguish from capital-F
         | Free Software[1] ones.
         | 
         | I'm starting to get irritated by all these 'non-commercial'
         | licensed models, though, because there is no such thing as a
         | non-commercial license. In copyright law, merely having the
         | work in question is considered a commercial benefit. So you
         | need to specify every single act you think is 'non-commercial',
         | and users of the license have to read and understand that. Even
         | Creative Commons' NC clause only specifies one; they say that
         | filesharing is not commercial. So it's just a fancy covenant
         | not to sue BitTorrent users.
         | 
         | And then there's LLaMA, whose model weights were only ever
         | shared privately with other researchers. Everyone using LLaMA
         | publicly is likely pirating it. _Actual_ weights-available or
         | Free models already exist, such as BLOOM, Dolly, StableLM[0],
         | Pythia, GPT-J, GPT-NeoX, and CerebrasGPT.
         | 
         | [0] Untuned only; the instruction-tuned models are
         | frustratingly CC-BY-NC-SA because apparently nobody made an
         | open dataset for instruction tuning.
         | 
          | [1] Inasmuch as an AI model trained on copyrighted data can
          | even be considered Free.
        
       | simonw wrote:
       | It looks like the model on Hugging Face either hasn't been
       | published yet or was withdrawn. I got this error in their Colab
       | notebook:
       | 
       | OSError: DeepFloyd/IF-I-IF-v1.0 is not a local folder and is not
       | a valid model identifier listed on
       | 'https://huggingface.co/models' If this is a private repository,
       | make sure to pass a token having permission to this repo with
       | `use_auth_token` or log in with `huggingface-cli login` and pass
       | `use_auth_token=True`.
        
         | Zetobal wrote:
         | You need to accept the license on the HuggingFace model card.
        
           | lerchmo wrote:
           | it doesn't seem like they have anything published
           | https://huggingface.co/DeepFloyd
        
             | thewataccount wrote:
             | I swear I saw it a few minutes ago but I might be crazy.
        
               | Zetobal wrote:
                | Same, got the weights on gdrive.
        
               | og_kalu wrote:
                | Could you link them?
        
           | simonw wrote:
           | https://huggingface.co/DeepFloyd/IF-I-IF-v1.0 is a 404
           | currently.
        
       ___________________________________________________________________
       (page generated 2023-04-26 23:01 UTC)