[HN Gopher] How DALL-E 2 Works
       ___________________________________________________________________
        
       How DALL-E 2 Works
        
       Author : SleekEagle
       Score  : 183 points
       Date   : 2022-04-19 15:19 UTC (7 hours ago)
        
 (HTM) web link (www.assemblyai.com)
 (TXT) w3m dump (www.assemblyai.com)
        
       | MengerSponge wrote:
       | How does DALL-E 2 handle stereotypes? For example, what kind of
       | output would you see for:
       | 
       | > A person being shot by a police officer
       | 
       | > A scientist emptying a dishwasher
       | 
       | > A nurse driving a minivan
       | 
       | AI training sets are famously biased, and I'm curious how
       | egregious the outputs are...
        
         | axg11 wrote:
         | The authors of the Dall-E 2 / unCLIP paper describe some of
         | their efforts to mitigate biases in the paper. ML models will
         | always exhibit the biases present in their training dataset,
         | without intervention. It's not really possible to remove bias
         | from an ML model, at least not completely. Some stereotypes,
         | but not all, are backed up by statistics. In those cases,
         | should we completely remove the bias in the training dataset?
         | Doing so would bias the model towards outputs that are not
         | representative of the real world.
         | 
         | When people say that they want to remove bias from ML models,
         | what they really mean is that they want to manipulate the
         | output distribution into something they deem acceptable. I'm
          | not arguing against this practice; there are plenty of
         | situations where the output of an ML model is very clearly
         | biased towards specific classes/samples. I'm merely arguing
         | that there is no such thing as an unbiased model, just as there
         | is no such thing as an unbiased human. Unbiased models would
         | produce no output.
         | 
          | To get around some of these problems, OpenAI restricted the
          | training dataset (e.g. filtering sexual and violent content)
          | and also prevents generating images with recognizable faces.
         | This doesn't prevent bias but it does reduce the number of
         | controversial outputs.
        
         | blamazon wrote:
         | One way to dodge this and other issues related to depiction of
         | human bodies is to trim the dataset such that humans are not
          | generally recognizable as realistic humans in the output.
          | OpenAI also currently explicitly forbids publicly sharing
          | realistic images of human faces generated by DE2.
         | 
         | Via LessWrong.com: [1]
         | 
         | > _" One place where DE2 clearly falls down is in generating
         | people. I generated an image for [four people playing poker in
         | a dark room, with the table brightly lit by an ornate
         | chandelier], and people didn't look human -- more like the
         | typical GAN-style images where you can see the concept but the
         | details are all wrong.
         | 
         | >Update: image removed because the guidelines specifically call
         | out not sharing realistic human faces.
         | 
         | >Anything involving people, small defined objects, and so on,
         | looks much more like the previous systems in this area. You can
         | tell that it has all the concepts, but can't translate them
         | into something realistic.
         | 
         | >This could be deliberate, for safety reasons -- realistic
         | images of people are much more open to abuse than other things.
         | Porn, deep fakes, violence, and so on are much more worrisome
         | with people. They also mentioned that they scrubbed out lots of
         | bad stuff from the training data; possibly one way they did
         | that was removing most images with people.
         | 
         | >Things look much better with animals, and better again with an
         | artistic style."_
         | 
         | [1]: https://www.lesswrong.com/posts/r99tazGiLgzqFX7ka/playing-
         | wi...
        
         | [deleted]
        
         | tiborsaas wrote:
          | I guess we will figure that out quite soon, but does it matter
          | that much? Your only job with DALL-E 2 is to prompt it
          | properly, so if you want a female scientist, just say so. If
          | it comes up with the "wrong" gender or ethnicity, then it
          | takes a second to fix it, which would probably take a bit less
          | time than ranting about it on Twitter :)
        
         | snovv_crash wrote:
         | It will, being a deterministic machine, generate any kind of
         | wrongthink that is in its training data. Ironically, all of the
         | media coverage of negative stereotypes by well intentioned
         | activists probably even makes it more likely to generate this
         | kind of data.
        
         | radu_floricica wrote:
         | I can't think of a way that would "fix" this that wouldn't also
         | make it less useful overall. If people are looking for people
         | being shot by police officers, they probably already have those
         | stereotypes and thus expectations of the end product. You can
         | argue that you want to insert a certain morality set in the
         | process, but that to me sounds a hell of a lot scarier than the
          | scientist emptying the dishwasher being a woman in 60% of the
         | pictures. Once you have the mechanism for morality bias, you
         | also have people with the capacity to change the settings.
        
         | SleekEagle wrote:
          | Great questions! I'd also be interested in this. I suppose the
         | generations would mimic the general distribution of information
         | that is on the internet, but what that would look like
         | specifically is hard to say without OpenAI releasing more
         | information.
        
       | achr2 wrote:
       | Over the next decade, ML advancements will erode the monetary
       | value of _countless_ professions. Hopefully AI research will be
       | turned towards solving the problems of society
       | /economics/civilization before it is too late to avoid major
       | disruptions in human wellbeing.
        
       | [deleted]
        
       | mmastrac wrote:
       | Are we finally past the AI winter? We seem to be seeing major
       | advances at least once a year. I recall there was a bit of a lull
       | after GPT3, but clearly the boundaries of AI are expanding
       | ridiculously fast.
        
         | mellosouls wrote:
          | In some ways (narrow AI), yes, it's been a fantastic few years
         | including tools like the one in context.
         | 
         | In the important way that the AI winter originally referred to
         | though, no, there doesn't seem to have been any progress
         | towards AGI.
        
           | SleekEagle wrote:
           | Was the original AI winter with reference to AGI? I thought
           | it was in reference to the resulting lack of research and
           | interest after the "bubble bust". If we're not close to AGI
           | now I can't imagine researchers 40 years ago really thought
           | AGI was around the corner, right? Just curious, I'm not an
           | expert on the history of ML!
        
             | mellosouls wrote:
             | I think there have been several really, and they tend to
             | follow hype periods, which over-promise.
             | 
             | I do think the last few years have been more productive
             | than previous periods in advancing narrow AI, and to be
             | fair to those researchers who just get on with the work, it
             | is not on them if the advances are over-sold by others.
        
             | visarga wrote:
             | > If we're not close to AGI now
             | 
             | I bet we're closer than most people think. Instruct GPT-3
             | can do semantic tasks just as efficiently as DALL-E 2 can
             | draw. NLP tasks that took whole teams multiple years can be
             | simply described in a few words and they work right away.
             | 
             | The entry barrier to implement new tasks will get very low.
             | The large models will be the new operating system. This
             | means more investments and data, leading to new
             | improvements.
             | 
             | I believe GPT-3 is already close to median human level on
             | most semantic tasks that fit in a 4000 token window. I'm
             | researching how to use it right now for a variety of tasks,
             | it just works from plain text requirements with no
             | training.
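
        To make "works right away" concrete, here is a minimal sketch
        using the OpenAI Python client as it existed in early 2022; the
        task, prompt, and key below are illustrative placeholders, not
        anything from this thread:

            import openai

            openai.api_key = "sk-..."  # placeholder

            # A zero-shot NLP task specified entirely in plain text; no
            # data labelling, architecture design, or training run.
            response = openai.Completion.create(
                engine="text-davinci-002",
                prompt="Classify the sentiment of this review as "
                       "positive or negative.\n"
                       "Review: The battery died after two days.\n"
                       "Sentiment:",
                max_tokens=1,
                temperature=0,
            )
            print(response.choices[0].text.strip())  # likely: "negative"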
        
         | redredrobot wrote:
         | There has not been an AI winter in at least a decade, arguably
         | more.
        
           | Tossrock wrote:
           | Indeed, and I called it 7 years ago:
           | https://news.ycombinator.com/item?id=9882217
        
           | Polygator wrote:
           | I guess he's referring to the fact that the glut of
           | investment around 2017-2018 was followed by disappointment
           | due to startups overpromising. I agree that from the
           | technical side (I mostly follow NLP, might be different in
           | other subfields) there's been no hint of a winter.
        
         | jollybean wrote:
         | At least the big public display of this tech seems to me that
         | it's mostly merging photos in interesting ways. That the 'seed'
         | comes from a word is not hugely interesting to me.
         | 
         | I'm actually more curious if we could parse the underlying
         | logic that ultimately it emulates to merge those images
         | together.
         | 
         | It 'looks like' something kind of sophisticated is being
         | modelled with AI but there's some nice algorithms hidden in
         | there.
        
           | ma2rten wrote:
           | It doesn't merge images. It generates them from scratch. Sure
           | it's trained on a corpus of existing images, but I don't
            | think it "merges" them any more than human artists do with
           | images they have seen in their lifetime.
        
             | jollybean wrote:
              | I don't believe 'creating them' is the right word.
             | 'Merging' them is probably a bad choice of words on my
             | part.
             | 
             | More like 'averaging them' and finding variations from vast
             | inputs.
             | 
              | Which is more along the lines of what I mean.
        
           | SleekEagle wrote:
           | Luckily the use of Transformer models makes what's going on
           | under the hood a bit more interpretable, but I think the
           | fundamental part at which ideas are merged is translating
           | from CLIP text embeddings to CLIP image embeddings.
           | 
           | The training principle of CLIP is very simple, but
           | intuitively understanding how the diffusion prior maps
           | between semantically similar textual and visual
           | representations is a bit more unclear (if that's even a well-
           | formulated question!)
        
           | alar44 wrote:
            | Well, that's absolutely not what's happening. It seems like
           | you haven't done any reading in this space, so I'm not even
           | sure what to link for you.
        
             | jollybean wrote:
             | 'Merging' was a poor choice of words on my part, but I'm
             | aware of what it does.
        
         | SleekEagle wrote:
         | GPT-3 was released 2 years ago, and in that time CLIP, GLIDE,
         | and DALL-Es 1 and 2 have been released. All of this is just
         | from OpenAI too! DL research is cranking along as quickly as
         | ever imo!
        
           | lurker619 wrote:
           | Just need a music one please.
        
             | gwern wrote:
             | Jukebox. If you listen to Jukebox samples, recall that that
             | was quite a while ago in dog/DL years, and imagine what the
             | DALL-E 2 equivalent would be for a Jukebox 2...
        
               | p1esk wrote:
               | I'm surprised no one has tried to launch a music
               | generation startup based on Jukebox. I'd be interested in
               | collaboration if anyone wants to work on it (and has
               | compute resources).
        
         | Der_Einzige wrote:
         | I resent this notion that AI doesn't advance if we aren't
         | making new larger and larger foundation models.
         | 
         | Even during that lull between GPT3 and DALL-E/CLIP, there was
         | tons of truly wonderful advances in AI...
        
       | nsxwolf wrote:
       | I'll just never understand how any of this works. I know it is
       | trained on millions of existing images, but when you say "... a
       | bowl ..." in your prompt, how does it decide what the bowl should
       | look like? Does it pick one of the bowls it's seen at random? It
       | doesn't ever quite draw the same bowl twice, does it? Is it
       | somehow "imagining" a bowl, the way a human would, and some all
       | new image of a bowl pops into its "head"?
        
         | simonw wrote:
         | The trick is to start with random "gaussian noise" - something
         | like https://opendatascience.com/wp-
         | content/uploads/2017/03/noise... - and then iteratively modify
         | that image until it starts to look like the concept you want it
         | to look like.
         | 
         | I find the concept of a GAN - a Generative Adversarial Network
         | - useful.
         | 
         | My high-level attempt at explaining how those work is that you
         | create two machine learning models, one that tries to create
         | fake images and one that tries to see if an image is fake or
         | not.
         | 
         | The first one says "here's an image", the second one says
         | "that's a fake", the first one learns from that and tries
         | again, then keep going until an image scores highly on the
         | test.
         | 
         | The networks are adversarial because they are trying to outwit
         | each other.
         | 
         | (I'm sure a ML researcher could provide a better explanation
         | than I can, but that's the way I think about it.)
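
        To make the adversarial loop concrete, here is a minimal sketch
        in PyTorch with toy networks (real_image_batches is an assumed
        iterable of flattened image tensors; as the reply below notes,
        DALL-E 2 itself uses diffusion rather than a GAN):

            import torch
            import torch.nn as nn

            # Toy generator: maps random noise to a flat 28x28 "image".
            G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                              nn.Linear(256, 784), nn.Tanh())
            # Toy discriminator: scores how "real" an image looks.
            D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))

            opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
            opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
            bce = nn.BCEWithLogitsLoss()

            for real in real_image_batches:  # assumed (B, 784) tensors
                B = real.size(0)
                fake = G(torch.randn(B, 64))

                # Discriminator step: real images -> 1, fakes -> 0.
                d_loss = (bce(D(real), torch.ones(B, 1)) +
                          bce(D(fake.detach()), torch.zeros(B, 1)))
                opt_d.zero_grad()
                d_loss.backward()
                opt_d.step()

                # Generator step: try to make D score fakes as real.
                g_loss = bce(D(fake), torch.ones(B, 1))
                opt_g.zero_grad()
                g_loss.backward()
                opt_g.step()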
        
           | goodside wrote:
           | I don't believe Dall-E 2 incorporates GANs at all, but I
           | haven't read the paper in detail. GANs were the best text-to-
           | image models maybe a year ago but lately diffusion techniques
           | are taking over.
        
             | simonw wrote:
             | Thanks for the keyword hint - this explanation looks good
             | for diffusion models:
             | https://ai.googleblog.com/2021/07/high-fidelity-image-
             | genera...
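
        The iterative refinement that article describes can be sketched
        as a DDPM-style sampling loop; eps_model is an assumed trained
        noise-prediction network, and this is a generic sketch, not
        DALL-E 2's exact sampler:

            import torch

            T = 1000
            betas = torch.linspace(1e-4, 0.02, T)  # noise schedule
            alphas = 1.0 - betas
            alpha_bars = torch.cumprod(alphas, dim=0)

            x = torch.randn(1, 3, 64, 64)  # start from pure Gaussian noise
            for t in reversed(range(T)):
                eps = eps_model(x, t)      # predict the noise present in x
                # Remove the predicted noise to estimate this step's mean.
                mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) \
                       / alphas[t].sqrt()
                noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
                x = mean + betas[t].sqrt() * noise  # one denoising step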
        
         | astrange wrote:
         | It's not trained on labeled data so it doesn't know bowls are a
         | specific concept necessarily. It's all statistical similarity
         | in the same way Google Image Search works. (from the original
         | CLIP paper, it seems to think an apple and the word "apple"
         | written on a piece of paper are the same thing)
         | 
         | The model in step 3 produces an image encoding (something like
         | a sketch of the output) from a text encoding (something like
         | what you typed), and the unCLIP model in step 2 produces images
         | from that encoding. How much variation you get inside a
         | specific input word varies a lot and is spread across those
         | models.
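
        As pseudocode, the flow described above looks roughly like this
        (the function names are hypothetical stand-ins for the trained
        models, not OpenAI's actual API):

            def generate(prompt: str):
                z_text = clip_text_encoder(prompt)  # text -> CLIP text embedding
                z_image = prior(z_text)    # step 3: text -> image embedding
                image = decoder(z_image)   # step 2 (unCLIP): embedding -> pixels
                return image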
        
         | SleekEagle wrote:
         | If you have a bit of background in math, I would encourage you
         | to read the CLIP paper: https://arxiv.org/abs/2103.00020
         | 
         | Ultimately, the link between words and their representations
         | comes from the CLIP training. The model generates encodings
         | (vectors) for both an image and its corresponding caption, and
         | then the parameters of these encoders (the functions that
         | generate the vectors) are tuned in order to minimize the angle
         | between the textual and visual encodings that represent the
         | same concept.
         | 
         | The core of your question is why minimizing the angle between
         | like vectors is equivalent to learning what the "Platonic
         | ideal" of a given object (in your example, a bowl) is, whether
         | appearing as a textual representation or a visual one. This
         | question is subtle and difficult to answer (if it's even a
         | well-formulated question), but I'd say that the easiest
         | interpretation is that the vector space is composed of a basis
         | of vectors that each represent a distinct feature (which the
         | model learns).
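
        The "minimize the angle" objective is CLIP's contrastive loss; a
        minimal sketch in PyTorch, given a batch of N matched
        image/caption embedding pairs (the paper learns the temperature;
        it is fixed here for brevity):

            import torch
            import torch.nn.functional as F

            def clip_loss(image_vecs, text_vecs, temperature=0.07):
                # Normalize so dot products equal cosine similarities
                # (i.e., they depend only on the angle between vectors).
                img = F.normalize(image_vecs, dim=-1)
                txt = F.normalize(text_vecs, dim=-1)
                logits = img @ txt.T / temperature  # (N, N) similarities
                targets = torch.arange(len(img))    # i-th image <-> i-th caption
                # Symmetric cross-entropy pulls matching pairs together
                # and pushes mismatched pairs apart.
                return (F.cross_entropy(logits, targets) +
                        F.cross_entropy(logits.T, targets)) / 2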
        
       | oofbey wrote:
       | One thing I find really interesting about the DALL-E-2 is that
       | the popular blog name ("DALL-E-2") never shows up in either of
       | the research papers that describe it. The paper commonly referred
       | to as DALL-E-2 calls its own algorithm "unCLIP". UnCLIP is
       | _heavily_ based on a paper from a few months earlier called GLIDE
       | - in fact you can't really understand the unCLIP paper by reading
       | it without first reading the GLIDE paper.
       | 
       | I suspect what's going on is that OpenAI has decoupled their PR
       | activities from their science activities. They told the
       | researchers to publish papers when they're ready, and then the PR
       | apparatus decides when one is good enough to be crowned
       | "DALL-E-2" and writes a blog post about it.
        
         | axg11 wrote:
         | This is only surprising if you're not familiar with product
         | launches that result from R&D. In this case DALL-E 2 is the
         | consumer-facing name, unCLIP is the name used during research
         | and in this case for publication. OpenAI may also have a
         | further internal codename that they used for the project.
         | Currently DALL-E 2 access is limited but there are lots of
         | reasons to believe that OpenAI will try to productize Dall-E 2
         | as an API. If you're selling a product, you need a product
         | name.
        
         | FrenchDevRemote wrote:
          | Maybe it's because there are more features on the OpenAI
          | website? For example with GPT, you get different models,
          | different templates, a playground, an API, etc...
        
         | KaoruAoiShiho wrote:
         | I don't get it, isn't this how literally every product launch
         | works.
        
           | radicaldreamer wrote:
           | I don't know why you're being voted down, internal/research
           | names are often way weirder and decided on ad-hoc by the
            | researchers themselves, and then a good PM comes in when
            | productizing, and part of this is deciding on a catchy
           | name for public use.
        
             | oofbey wrote:
             | The phrasing "I don't get it" is fairly rude - it implies
             | the post is obvious to the point of not being worth
             | mentioning. However obvious this might seem to somebody, I
              | would point out that turning AI research papers into
              | products is hardly commonplace.
        
         | SleekEagle wrote:
         | I noticed that as well! It confused me a bit at first. They say
         | that their "image generation stack" is referred to as unCLIP,
          | and I was trying to figure out how it's distinct from
          | DALL-E 2!
         | 
         | My only guess would be that unCLIP is the end-to-end image
         | generation model, but if the model is used for manipulation,
         | interpolation, or variations, then it is referred to as DALL-E
         | 2. So unCLIP is a subset of DALL-E 2.
        
         | tmabraham wrote:
         | This behavior is not exclusive to OpenAI. NVIDIA did this too.
         | Originally StyleGAN3 was published under the name "Alias-free
         | GAN" and the paper itself uses that terminology.
        
         | ShannonLimiter wrote:
         | DALL-E is mentioned in several places in the paper.
         | 
         | DALL-E 2 specifically is on page 18 and the system card:
         | https://github.com/openai/dalle-2-preview/blob/main/system-c...
         | 
         | DALL-E 2 = the stack of unCLIP and the image generator.
        
         | phailhaus wrote:
         | Unrelated: I've noticed the common use of underscores for
         | emphasis in HN comments. Why use that when italics are
         | supported via asterisks? _Like this_?
        
           | burke wrote:
           | HN's markup is idiosyncratic but similar enough to markdown
           | that it's hard for occasional commenters to remember the
           | details. It's also minimal enough that users are already used
           | to parsing extra-syntactic markup visually.
        
             | oofbey wrote:
             | Yeah. HN should just switch to markdown. ;)
        
           | bern4444 wrote:
            | In Markdown, a single underscore around words _like this_
            | renders them italicized.
        
             | phailhaus wrote:
             | Sure, but HN doesn't. I'm trying to understand why I see so
             | many comments using underscores when only asterisks work.
        
               | tingletech wrote:
               | this was also a common convention during the usenet news
               | era
        
               | LeifCarrotson wrote:
               | I think it's a combination of Markdown being completely
               | readable in plain text especially if you're familiar with
               | the syntax and even if you're not. Similarly, I see a lot
               | of people using TeX-style mathematics, it's not
               | particularly readable but "The quadratic formula is
               | $x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$" is a decent way of
               | representing the formula to people fluent in LaTeX even
               | in plaintext conditions. I suppose there's also likely to
               | be a bit of muscle memory where people accustomed to
               | typing in Github/Stack Overflow/Reddit markdown use it on
               | other systems, and even if they see it's not supported
               | it's good enough to not need editing.
               | 
               | I don't think it's particularly worthwhile to learn a new
               | comment format (one that's not even linked or described
               | in the comment editor, for that matter) for every site.
        
               | thomasahle wrote:
                | _Underscores_ still _work_: even if they don't get
                | converted into <u></u>, they convey the meaning just
                | fine.
               | 
               | Similar to how people use ">" to indicate quotes, even if
               | it doesn't get special treatment by the editor.
        
         | dwighttk wrote:
         | unCLIP as in "this algorithm is NOT going to be told to make
         | paper clips resulting in all the mass in the solar system
         | converted into paper clips"?
        
       | oofbey wrote:
       | Diffusion models seem like they're poised to completely replace
       | GANs. They obviously work super well, and you don't have this
       | super finicky minimax training problem.
        
         | SleekEagle wrote:
          | Yeah, I haven't seen any big advancements in GANs in a few years.
         | Have I missed anything big or is the research volume trending
         | down on them?
        
           | astrange wrote:
           | There's this but I don't know if it's been followed up on.
           | 
           | https://www.microsoft.com/en-
           | us/research/publication/manifol...
        
         | mdda wrote:
         | Or the two can be combined :
         | https://nvlabs.github.io/denoising-diffusion-gan/index.html
        
           | oofbey wrote:
           | Sounds like it gets the worst of both worlds? The difficult
           | training of a GAN with the slow runtime of a diffusion model.
        
             | mdda wrote:
             | Could be... Except their page (should you choose to believe
             | it, of course) specifically addresses the advantages:
             | 
             | """
             | 
             | "Advantages over Traditional GANs" : Thus, we observe that
             | our model exhibits _better training stability_ and mode
             | coverage.
             | 
             | "Why is Sampling from Denoising Diffusion Models so Slow?"
             | : After training, we generate novel instances by sampling
             | from noise and iteratively denoising it _in a few steps_
             | using our denoising diffusion GAN generator.
             | 
             | """
        
         | machinekob wrote:
          | The biggest problem for diffusion models was performance (as
          | you need to iterate even at inference). But I'm not up to
          | date with the newest architectures, maybe it's already
          | solved :P
        
           | johndough wrote:
            | I was wondering if it would be possible to train a neural
            | network to do multiple iterative steps at once. As it turns
           | out, it has already been done and it requires about 4 to 8
           | distilled iterations for comparable quality. If this pace
           | keeps up, we will probably see similar running time to GANs
           | in the near future.
           | 
           | https://arxiv.org/pdf/2202.00512.pdf
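
        The idea in that paper (progressive distillation) can be
        sketched as a student learning to match two of a teacher's
        sampler steps with a single step of its own; every function name
        below is a hypothetical stand-in:

            import torch

            def distill_one_round(teacher, student, data_loader, opt):
                for x0 in data_loader:
                    t = sample_timesteps(x0)  # random timesteps per batch
                    x_t = add_noise(x0, t)    # forward-diffuse clean images
                    with torch.no_grad():
                        x_mid = sampler_step(teacher, x_t, t)         # teacher step 1
                        target = sampler_step(teacher, x_mid, t - 1)  # teacher step 2
                    pred = sampler_step(student, x_t, t, stride=2)    # one student step
                    loss = ((pred - target) ** 2).mean()
                    opt.zero_grad()
                    loss.backward()
                    opt.step()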
        
       | ak391 wrote:
        | Open source alternative to DALL-E:
       | https://huggingface.co/spaces/multimodalart/latentdiffusion
        
       | madiator wrote:
        | I think DALL-E 3 will generate short clips. But I am curious to
        | know what HN thinks OpenAI will do with these technologies.
        
         | aabhay wrote:
         | Try to commercialize it, but fail to create much of a moat from
         | it. Just like their past commercialization efforts.
        
         | SleekEagle wrote:
          | GPT-3 was sold with exclusive usage rights to Microsoft, so
         | maybe something along those lines with a different company
         | (Meta?). As for what they will do with it, it's hard to say ...
        
           | astrange wrote:
           | You can use GPT-3 right now on OpenAI playground and there's
           | commercial apps running on it that as far as I know aren't on
           | Azure. It's not clear what they meant by exclusive.
        
       | aantix wrote:
       | How are objects differentiated from their background?
        
       | password54321 wrote:
       | Tech bros are high-fiving their way to the top in every field
       | with some neural nets. No one is safe.
        
         | ausbah wrote:
         | these are teams of PhD research scientists and research
          | engineers. I wouldn't characterize them as just tech bros.
        
         | SleekEagle wrote:
         | The rate of advancement over the past 10-15 years really has
          | been incredible. Now the question is: is this growth curve
          | logistic or exponential?
        
       | pupppet wrote:
       | If I have it generate a "bowl of soup" will I find an identical
       | bowl in some clip art collection somewhere? How much does it
       | deviate from the source images?
        
         | SleekEagle wrote:
         | You can try to reverse image search - from what I've seen of
         | other people doing this, the renditions are quite distinct. The
         | diffusion process is ultimately the root of the model's ability
         | to not just copy images. Variational methods truly allow for
         | the learning of a distribution, which is why VAEs can generate
         | new data and AEs can't.
         | 
         | Also, practically from a data point of view, the same object
         | can be represented in numerous ways (different artistic styles,
         | different filters, abstract paintings, etc.) and the model has
         | to optimize across all of these samples. What this means is
         | that the model truly is forced to learn the semantic meaning
         | behind a concept and not just rely on specific features.
         | 
         | Check out the dropdown under the "Significance of CLIP to
         | DALL-E 2" section in the article
        
         | corysama wrote:
         | I've played with tech like this for over a year now. You won't
         | find the bowl in the source images. It doesn't evolve the noise
         | into a source image. It slowly nudges the noise into feeling
         | more and more bowl-like. Do that enough, and you get something
         | that feels quite a bit like a bowl.
         | 
         | Put it this way: The model file is absurdly smaller than the
          | half a billion source image files. If it actually contained
          | the source images, it would be the greatest feat of image
         | compression ever. Instead it only contains the impression left
         | over by the images. A lot closer to a memory than a jpg.
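
        Back-of-envelope numbers for "absurdly smaller" (the per-image
        size is an illustrative assumption; the 3.5-billion-parameter
        figure is the one cited later in this thread):

            n_images = 650_000_000        # "hundreds of millions" of images
            avg_image_bytes = 100 * 1024  # assume ~100 KB per stored image
            dataset_bytes = n_images * avg_image_bytes  # ~65 TB

            params = 3.5e9                # key model size cited below
            model_bytes = params * 2      # 16-bit weights -> ~7 GB

            print(dataset_bytes / model_bytes)  # ~9,000x smaller than the data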
        
       | rglover wrote:
       | I've always been skeptical of AI stuff (for obvious reasons/long-
       | term implications), but I have to say this application has me
       | excited beyond belief. This is pure magic. Kudos to the OpenAI
       | team.
        
         | SleekEagle wrote:
         | A few years ago when photorealistic facial image generation
         | models started getting really good I had my first "holy crap"
         | moment. OpenAI expanding the domain from faces to essentially
          | _anything_ is absolutely mind blowing. A seminal step
          | forward, undoubtedly!
        
         | [deleted]
        
         | [deleted]
        
         | chrisco255 wrote:
         | I find myself oscillating between excitement and sheer terror,
         | sometimes several times a day.
        
           | SleekEagle wrote:
           | Sometimes at the same time!
        
       | breakfastduck wrote:
       | This is a truly horrifying piece of technology, destined to
       | destroy the livelihoods of countless artists. It's incredible in
       | terms of the technology, but... scary in equal measure.
       | 
       | I can't think of a single good reason for this to exist that
       | doesn't have huge negative impacts on our world.
       | 
       | Why pay an artist/graphic designer when this does what you need?
       | 
       | "Now those damned creatives can go and find real jobs"
        
         | alar44 wrote:
         | Only non-artists say this. Every graphic designer I know thinks
         | this is great.
        
         | chrisco255 wrote:
         | I'm less worried about the jobs angle, as this can be viewed as
         | a productivity tool. I'm more worried about the ability to use
         | this tech for deep fakes. It's going to erode trust in society
         | even further than it already has.
        
           | MartinCron wrote:
           | The cynic in me is wondering if that will make any
           | difference. It's not like people need deep fakes or even the
           | possibility of deep fakes to believe that the world is flat
           | or that Obama was born in Kenya or that lizard people are
           | running sex trafficking rings out the basements of pizza
           | parlors with no basements.
           | 
           | People look at the objective reality, provided by the sources
           | that should have the most credibility, and just shrug it off.
        
             | wormer wrote:
              | Right, people _should_, but this will only increase
              | people being deluded, because of its ease of use. And
              | it's not like any of us are immune to being deluded
              | either; I'm sure there are things I and others take as
              | truths because the facts we founded them upon were
              | carefully fabricated to have no holes.
             | 
              | If I saw a masterfully crafted video of vaccines
              | _actually_ being implanted with microchips, wouldn't I
              | believe it?
             | I'm not an expert on identifying deepfakes, nor should I be
             | just to consume media. I think this is a valid cause for
             | concern and will make things worse rather than keep it the
             | same.
        
         | sephlietz wrote:
         | How is this any different than any other technological
         | innovation which has made a job obsolete or otherwise allowed
         | fewer people to do more work?
        
           | wormer wrote:
           | I argue there is a difference because of the nature of the
           | work. Machines aiding in farming is only a good thing,
           | because it can maximize output and minimize input. People
           | (largely) don't care about the process of how it was grown,
           | but rather having the product to eat (Of course there's
           | cruelty free agriculture, organic, etc. but stay with me
           | here). But artistry is a personal thing, and maximizing the
           | output of art pieces isn't something that most are interested
           | in. Art is a uniquely unquantifiable subject, and we want it
           | to have a personal and emotional connection to both the
            | creator and the viewer, something that is lost when an AI
            | boils it down to its essential components and rebuilds
            | them in its image.
        
             | WillPostForFood wrote:
             | _Machines aiding in farming is only a good thing, because
             | it can maximize output and minimize input._
             | 
             | Machines aiding in art is only a good thing, because it can
             | maximize output and minimize input?
             | 
             | Makes art cheaper, more accessible, allows more people to
             | create?
             | 
             | It is like how digital filmmaking has cracked the Hollywood
             | monopoly on content.
        
               | wormer wrote:
               | I think that this doesn't really help artists as much as
               | just do it for them. Art, the way I see it, requires a
               | human to do because it is something that requires
               | emotion, something a robot could _replicate_ but not
               | feel. For example, a gut wrenching image of innocents
               | being beat by police is gut wrenching _because_ it 's
               | something that exists in the real world, and the artist
               | and the subjects are real and their emotion is real. But
               | a computer generated image only has a likeness; it
               | doesn't have actual emotion.
               | 
                | I also don't think that it makes it "cheaper, more
                | accessible, and allows more people to create". Digital
               | art supplies being something readily available and
                | relatively cheap compared to their classic counterparts
                | is what makes things more accessible, and to make it
                | more so
               | would be to drive the cost down or something. Having the
               | computer draw for you isn't exactly creating art.
               | 
               | And art isn't a commodity and I argue it shouldn't be a
               | commodity. It's something, again, personal and special.
               | 
               | And this doesn't end at the visual arts, I think it
               | applies too to writing. AI could write what's written in
               | my journal word for word but my journal would have more
               | value just by virtue of it being written by me.
        
         | madiator wrote:
         | Your argument is weak in that it could have been invoked for
         | several previous inventions: ATMs replacing cashiers, search
         | engines replacing librarians and so on.
        
         | password54321 wrote:
         | Well they won't be alone at least. Even us programmers are
         | eventually going to be replaced.
        
         | p1esk wrote:
         | Same was true many times throughout the history. Why do people
         | still pay musicians to play in live concerts when they could
         | listen to a recording? Why do people still watch other people
         | play chess when they could watch two AIs play much better
         | chess?
         | 
         | Think long term. Eventually AI will be able to do most of human
         | jobs. As a result, products and services will become cheaper.
         | As a result, people will have to work less for a living. As a
         | result, more people will be able to draw and paint for
         | pleasure, and not necessarily to make a buck.
        
           | Bud wrote:
           | There are a couple gigantic blind spots here:
           | 
           | 1) AI appears to have approximately zero chance of making
           | housing and food and other basic needs cheaper.
           | 
           | 2) Artists WANT to make money for creating art, music, etc.
        
             | emteycz wrote:
             | I'm working on applying this technology to housing as we
             | speak, you're very wrong IMHO.
             | 
              | Yeah, some people are going to lose jobs over this; it
              | happens all the time. People are not isolated from the
              | market; they function within it and need to take it into
              | account.
        
           | badRNG wrote:
           | > Think long term. Eventually AI will be able to do most of
           | human jobs. As a result, products and services will become
           | cheaper. As a result, people will have to work less for a
           | living. As a result, more people will be able to draw and
           | paint for pleasure, and not necessarily to make a buck.
           | 
           | This is ahistorical. The fact is that you must at least seem
           | to produce more market value than your total compensation in
           | order for a company to hire you. There will simply be less
           | people who make a "livable" wage while those who own these
           | automations will become increasingly wealthy. Depending on
           | how the market changes, there may also be increasing
           | unemployment. But why would that matter? Unless unemployment
           | gets too high, the market will continue to work as usual.
           | 
           | There's simply no reason for the owners and inheritors of an
           | increasingly automated economy to share the value increase
           | with their workers. The worker's wages will be market-
           | determined just as before. Perhaps if unemployment gets too
           | high it will be in their interests to offer something like
           | UBI, though no reason for anything beyond what's strictly
           | necessary for the economy to function, and the minimum
           | required to avoid excessive social turmoil.
        
             | gbasin wrote:
             | Your claim is very theoretical. In practice, everyone in
             | the world has grown increasingly wealthy, and unemployment
             | levels are lower than ever.
        
           | bckr wrote:
           | > As a result, products and services will become cheaper. As
           | a result, people will have to work less for a living.
           | 
           | I think we've seen this play out before and instead of
           | reducing work, our standards of living increase and people
           | keep working about the same amount. See e.g. the post
           | industrial world where homemakers had to scrub clothes, then
           | got machines to do the scrubbing, but subsequently had to
           | clean the clothes more frequently.
           | 
           | We might be able to reduce the overall amount of human work
           | only through extremely successful social/political reforms
           | similar to the ones that outlawed child labor and established
           | the 40 hour work week. Assuming the technology will cause it
           | to happen is bound to lead to disappointment.
        
         | Der_Einzige wrote:
         | The future is now old man.
        
         | macawfish wrote:
         | meanwhile artists are like the most curious about this
        
         | karmasimida wrote:
         | technology once invented is not going back
         | 
         | you can't demand some technology not to be used when it is not
         | a weapon
         | 
          | there isn't a reason to believe that our current world is in
          | a stage that is free from changes; in fact, our world became
          | what it is due to the invention of disruptive technologies,
          | whether you like it or not.
        
         | MarcoZavala wrote:
        
         | zackbrown wrote:
         | Last week at a birthday party, I met a 74-year-old career
         | visual artist who still creates with various media: paint,
         | colored pencils, sculpture, etc.
         | 
         | Curious for her thoughts on DALL-E, I pulled out my phone and
         | invited her to generate some imagery. (I have early access via
         | a family member at OpenAI.) She didn't skip a beat, and
         | immediately started _getting creative_ with it. We even did a
         | "collaborative piece" a la Mad Lib.
         | 
         | I asked her if she felt threatened by DALL-E. Surprised by the
         | question, she said: "No! I could see this really accelerating
         | my process. Sometimes I'm blocked on an idea and I could see
         | this being a great tool for finding inspiration. Can I get
         | access to this?"
         | 
         | My take-away was that art is not zero-sum: someone's art isn't
         | "less" because more entities are creating art. If computers can
         | do it too -- even if they're arguably more mechanical in the
         | recombination of existing ideas (note: humans do the same) --
         | nothing stops human art from being art.
        
           | izzygonzalez wrote:
           | Arguably, the biggest barrier to any creative domain is
           | technical capability.
           | 
           | An immediate thought is that locked-in people who can only
           | communicate by text would be able to share their thoughts
           | more expressively.
           | 
           | In terms of the creation loop, anyone can create a bunch of
           | AI-generated images. Wombo is huge right now. The
           | differentiating factors will be prompt design, commitment to
           | iteration, aesthetic-driven curation of generated works and
           | presentation.
           | 
           | Photographers take and process thousands of photos to create
           | just one masterpiece.
        
           | andreilys wrote:
           | _My take-away was that art is not zero-sum:_
           | 
           | Art is zero sum in that there are a limited number of artist
           | residencies, exhibitions and funds available.
           | 
           | In this case, we will likely see further contraction in the
           | number of artists able to support themselves. There will ofc
           | always be the super stars and hobbyists.
        
             | rictic wrote:
             | The amount of art that people want in their lives is much
             | larger than the amount that's there now.
             | 
             | Artists who are willing to direct their talents towards
             | satisfying others' desires for art will find the world is
             | very positive sum. Those that vie for a limited number of
             | spots in a prestige game may find that it's zero or even
             | negative sum, but those are not good games to play anyways.
        
         | mrfusion wrote:
         | Maybe an artist can make huge images or whole catalogs of
         | images with technology like this.
         | 
         | Maybe more people can be game developers with access to free
         | original artwork at their fingertips.
         | 
         | I don't see it as replacing artists, I see it as amplifying
         | artists.
        
       | billconan wrote:
        | Can I train DALL-E 2 on my personal computer with a fairly
        | decent GPU, or is it out of the question?
        
         | SleekEagle wrote:
          | Unfortunately, it is out of the question. OpenAI trains on
          | hundreds of thousands of dollars' worth of GPUs, and even
          | then the training takes two weeks. Also, as far as I know
          | their training data (400M image/caption pairs) is not
          | available to the public!
        
           | GaggiX wrote:
            | Fortunately there are even larger public datasets, like
            | LAION-5B.
        
           | manquer wrote:
            | Your estimate is off by 2 orders of magnitude; it is more
            | like $10M+ for a single run for the latest-generation
            | models [1]. This is the primary reason why not a lot of
            | models are out there.
            | 
            | Few groups have that kind of money to commit, and the
            | viability is not yet very clear, i.e. how much the model
            | will make if commercialized, so they can recoup the
            | investment.
            | 
            | There is also the cost of running the model on each API
            | call, and this is of course not factoring in any of the
            | employee costs.
           | 
           | [1] https://venturebeat.com/2020/06/01/ai-machine-learning-
           | opena...
        
         | axg11 wrote:
         | This is a cute question. Not today! I hope someone comes back
         | to read this question in 10-15 years time, when we will all
         | have the ability to train Dall-E quality models on our AR
         | glasses.
        
         | ShamelessC wrote:
         | Never gonna happen ha.
        
         | oofbey wrote:
         | Maybe possible with a fabulous GPU, but still likely not, and
         | if it did work it would take a horrendously long time. The real
         | blocker is gonna be GPU memory. With an RTX 3090 you have 24 GB
         | of GPU RAM and _might_ be able to try it, but I'm still not
         | sure it would fit. The key model has 3.5 billion parameters,
         | which at 16-bit requires 7GB of GPU-memory for each copy.
         | Training requires 3 or 4 copies of the model, depending on the
         | algorithm you use. And then you need memory for the data and
         | activations, which you can reduce with a small batch size. But
         | if it did fit, on a single GPU with a small batch size, you're
         | probably looking at years of training time.
         | 
         | Even an RTX 3080 is a complete non-starter.
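
        The memory arithmetic above, spelled out (the number of model
        copies depends on the optimizer; 3-4 is typical for Adam-style
        training, which keeps extra running averages per parameter):

            params = 3.5e9       # parameters in the key model
            bytes_per_param = 2  # 16-bit weights
            weights_gb = params * bytes_per_param / 1e9  # ~7 GB per copy

            copies = 4           # weights + gradients + optimizer state
            training_gb = weights_gb * copies            # ~28 GB
            print(weights_gb, training_gb)  # before activations and data,
                                            # already past a 24 GB RTX 3090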
        
           | manquer wrote:
           | Something like the Quadro RTX 8000 may theoretically work, it
           | does have 48GB of RAM [1].
           | 
           | [1] https://www.nvidia.com/content/dam/en-
           | zz/Solutions/design-vi...
        
         | simonw wrote:
         | I'm pretty confident that part of OpenAI's competitive edge is
         | that they can train these models on GIANT clusters of machines.
         | 
         | This article predicts that GPT-3 cost $10-$20m to train. I
         | imagine DALL-E could cost even more:
         | https://lastweekin.ai/p/gpt-3-is-no-longer-the-only-game?s=r
        
         | joshcryer wrote:
         | Nope, and you'll still need a pretty beefy computer to run the
          | trained model. Currently GPT-NeoX-20B, the "open source GPT-3,"
         | requires 42 GB of VRAM, so you're looking at minimum a $5-6k
         | graphics card (though a Quadro RTX 8000 is actually in stock so
         | there's that). Or use a service like GooseAI.
         | 
         | Eleuther.ai or some other open source / open research
         | developers will likely try to reproduce DALL-E 2 but it'll take
         | some time and a lot of donated hardware and cycles.
        
       | mokchira wrote:
       | From the article:
       | 
       | "CLIP is trained on hundreds of millions of images and their
       | associated captions..."
       | 
       | Does anyone have any insight as to which images were trained on?
       | Was it all open-domain stuff? And if not were the original
        | authors of those images made aware their work was being used to
       | train an AI that would likely put them out of work? Were they
       | compensated appropriately?
        
         | SleekEagle wrote:
         | As far as I know, OpenAI has not made this dataset publicly
         | available. IIRC the dataset is images scraped from instagram
         | and their corresponding captions. Check out the CLIP paper for
         | more details:
         | 
         | https://arxiv.org/abs/2103.00020
         | 
         | Theoretically, you could build a web-scraping tool to do
         | something like this, but even storing that data would take an
         | absolutely insane amount of storage.
         | 
         | I would assume OpenAI has some deal with Meta to make the
         | creation of datasets like this easier.
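
        A rough sense of that storage cost, under illustrative
        assumptions about average file sizes:

            pairs = 400_000_000           # image/caption pairs (per above)
            avg_image_bytes = 100 * 1024  # assume ~100 KB per image
            avg_caption_bytes = 200       # captions are comparatively free

            total_tb = pairs * (avg_image_bytes + avg_caption_bytes) / 1e12
            print(round(total_tb), "TB")  # ~41 TB: feasible for a lab,
                                          # painful for an individual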
        
           | mokchira wrote:
           | Thanks for the link. I hope they do make the data set
           | publicly available at some point so that the artists whose
           | work helped train this can know. I think, while it is
           | absolutely impressive on a technical level what the OpenAI
           | team has been able to do, it is also important to consider
           | what damage it will do to artists and their livelihood.
           | 
           | Many professional artists stake their career on one unique
           | style of art that they have honed and developed over many
           | years. It's this unique style that clients generally pay for,
           | and that now faces a very real threat of being stolen from
           | them by a technology that frankly no human can hope to
           | compete with. Without artist compensation, this can only lead
           | to artists terminating their careers early once the AI has
           | co-opted all work from them. Or future artists never
           | beginning their careers in the first place. This is a net
           | loss for humanity, as it will deprive us of works and styles
           | of art that have yet to be imagined.
           | 
           | I'm not saying AI like this needs to go away. There is no
           | putting that genie back in the bottle, of course. But it
           | needs to be something that artists opt into. If someone's
           | style is worth it for OpenAI to train on, then that style
           | obviously should have a price tag. And it ought to be up to
           | the artist whether they want to sell or not. Anything short
           | of that is theft in my eyes.
        
       | superasn wrote:
       | Gaming is going to get so interesting with these emerging
        | technologies. I played A.I. Dungeon some time ago and I was
        | amazed at how good it was at making up believable stories on
        | the fly.
       | 
        | Now imagine joining this with DALL-E and you truly have a game
        | which has never existed until now, with its own story and
        | graphics that you are creating on the spot.
       | 
        | Unlike adventure games like King's Quest where everything was
        | pre-programmed, this is a truly infinite, never-ending game
        | with a unique experience for every single player.
       | 
        | Like the guy from 'Two Minute Papers' says: what a time to be
        | alive. I feel so happy and excited just thinking about the
        | possibilities these techs are going to bring.
        
         | SleekEagle wrote:
         | Yes! So many exciting possibilities in so many industries.
         | Hopefully it won't displace artists though, we'll have to find
          | a way to balance the efficiency of AI with the curation of
         | artists!
        
         | smaudet wrote:
         | > I feel so happy and excited just thinking about the
         | possibilities these techs are going to bring.
         | 
         | Like what?
         | 
         | I should preface by saying I think art is great, I know a lot
         | of artists who struggle to make a living, and it is somewhat
          | heartbreaking to think of all the poor art students who, I
          | guess, we should pay for their educations and who will never
          | have careers now?
        
           | cercatrova wrote:
           | So we should halt progress just so people can keep their
           | jobs? I hate this argument whenever AI or automation is
           | brought up, it's probably one of the worst ways to deal with
           | it.
        
             | nightski wrote:
             | As long as it is progress. What usually happens though is
             | we get a watered down version of what we had before, but
             | since it is cheaper and far more profitable the big
             | companies exploit it to maximum effect. So in reality we
             | lose a lot.
             | 
             | I'm hopeful but skeptical at the same time.
        
           | exolymph wrote:
           | We will still need people with taste to drive the machines
           | and curate output.
           | 
           | Also, like, this is how the world works. To cite a hackneyed
           | example, people who worked with horses had to figure
           | something out when new tech displaced them. So will graphic
           | designers, illustrators, et al, if indeed AI is a more
           | competitive option for their services.
        
             | visarga wrote:
             | Not just graphic designers. In NLP, what used to take years
             | of data labelling, architecture design and model training
             | now is being done zero-shot by GPT-3.
             | 
             | Simple automations can be driven by GPT-3 as well. It needs
             | a representation of the screen and it will automate the
             | task described in natural language.
        
           | elhesuu wrote:
           | AI generated images are not art. They might use the same
           | medium visual arts do, but they lack a meaningful vision of
           | the world.
           | 
           | Of course defining art is a subject in itself, but I think
           | that being afraid of AI replacing artists is comparable to
            | thinking photography would replace them when it was
            | invented.
        
           | password54321 wrote:
           | If you wanted one original character and you wanted shots of
           | that character from multiple angles with a consistent look,
            | DALL-E 2 would already fail.
        
             | simonw wrote:
             | But a variant of Dall-E that output a textured 3D character
             | model would work fantastically well.
        
               | astrange wrote:
               | Only if you're sure it didn't memorize a copyrighted
               | input, and only until everyone gets bored with its style
               | or you want your assets to look the same in a predictable
               | way.
        
             | smaudet wrote:
             | True, but artists also sell artwork - my question could be
              | reframed as: if DALL-E 2 can produce a Rembrandt, is a
             | Rembrandt worth anything, even emotionally?
        
               | SleekEagle wrote:
               | I think it's worth pointing out that DALL-E 2 _mimics_
               | the style of famous artists. The artists had to come up
               | with the original style in the first place!
               | 
               | There are highly competent artists that can create highly
               | convincing copies (fabrications? forgeries?) of famous
               | paintings. Are these paintings worth anything? No,
               | because people find value in the specific contribution to
               | the field of art that the particular painting represents.
               | 
               | I think we should look at DALL-E 2 like a highly
               | competent artist that can produce convincing forgeries
               | and even mimic the style of famous artists, but cannot
               | replace the artists themselves.
        
         | mupuff1234 wrote:
         | I doubt it will beat a curated experience any time soon, but I
         | do see a future where it could assist in creating that curated
         | experience.
        
           | kromem wrote:
           | For the format, AI dungeon already does.
           | 
           | When I played with it, I started a quest as a wizard looking
           | for a book.
           | 
           | I was able to cast a tracking spell that led me to a giant
           | library.
           | 
           | I could have it read the titles of the books on a shelf in
           | front of me.
           | 
           | I could pick up any book and open to a random page and have
           | it tell me what was in it.
           | 
           | One was about a little half-elf that had a magic flute that
           | broke.
           | 
           | I cast a summoning spell to summon the half-elf and fix its
           | flute, after which it happily played a song opening a door to
           | another dimension filled with musical instruments.
           | 
           | Give me that level of emergent gameplay in a VR open world,
           | and then just take my money and all my free time, as I'm
           | never leaving.
           | 
           | We're simply very, very early on in what's arguably going to
           | be the most transformative tech since the Internet. People
           | predicted back then that the slow network which only offered
           | basic things like email wasn't going to significantly disrupt
           | things like retail.
           | 
           | They were only right in that it didn't remain slow and ended
           | up doing a lot more than email.
           | 
           | This stuff is getting better way faster than any tech I've
           | seen, and I used to consult for CEOs at Fortune 500s and sit
           | on advisory boards on the topic of emerging tech.
           | 
           | I wouldn't be so quick to bet against it. We really haven't
           | even started to see what these models can do in application.
        
         | kromem wrote:
         | Yeah, it's getting to the point I'm starting to see current
         | game design as getting long in the tooth by comparison to what
         | I know is ahead.
         | 
         | There's a great tech demo a dev did a year or two ago
         | showcasing GPT-3, speech-to-text, and text-to-speech to have
         | random NPCs in a VR open world respond to anything the guy said
         | if he walked up to them and talked to them.
         | 
         | Procedural generation has taken on almost a "dirty word"
         | reputation in the past few years in gaming, but as AI continues
         | to allow for exponential variety at increasingly high quality,
         | it's going to enable some truly mind boggling experiences.
         | 
         | Expect to see MMO models (subscription fee and server-oriented)
         | but for single-player instanced worlds dynamically generated
         | around your interactions in them.
         | 
         | I can't wait to have a party of friends to go on epic
         | adventures with that are all just AIs I picked up across a
         | world along the way.
         | 
         | Less than 20 years away, and possibly even less than 10.
        
       | seanwilson wrote:
       | > "DALL-E 2's works very simply: ... a model called the prior
       | maps the text encoding to a corresponding image encoding that
       | captures the semantic information of the prompt contained in the
       | text encoding. Finally, an image decoding model stochastically
       | generates an image which is a visual manifestation of this
       | semantic information."
       | 
       | > "The fundamental principles of training CLIP are quite simple:
       | First, all images and their associated captions are passed
       | through their respective encoders, mapping all objects into an
       | m-dimensional space."
       | 
       | Not scared to admit I don't find this simple at all haha. I'm
       | probably not in the target audience. I'd love a description that
       | doesn't assume machine learning basics. Is there one?
        
         | pas wrote:
         | https://ml.berkeley.edu/blog/posts/dalle2/
         | 
         | it's "simple" because how it works is "just" brute-fucking-
         | force. of course coming up with the architecture and making it
         | fast (so it scales up well) is the challenge.
         | 
         | and scaling works .. because .. well, no one knows why (but
         | likely because it's just a nice architecture for learning,
          | evolution also converged on it without knowing _why_)
         | 
         | see also: https://www.gwern.net/Scaling-hypothesis
        
       ___________________________________________________________________
       (page generated 2022-04-19 23:00 UTC)