[HN Gopher] No, DALL-E doesn't have a secret language
       ___________________________________________________________________
        
       No, DALL-E doesn't have a secret language
        
       Author : doener
       Score  : 111 points
       Date   : 2022-06-01 20:08 UTC (2 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | jawarner wrote:
       | The tweet is in response to a preliminary paper [1] [2] studying
       | text found in images generated by, e.g., "Two whales talking
       | about food, with subtitles." DALL-E doesn't generate meaningful
        | text strings in the images, but if you feed the gibberish text it
        | produces -- "Wa ch zod ahaakes rea." -- back into the system as a
        | prompt, you get semantically meaningful images, e.g., pictures
        | of fish and shrimp.
       | 
       | [1]
       | https://giannisdaras.github.io/publications/Discovering_the_...
       | 
       | [2] https://twitter.com/giannis_daras/status/1531693093040230402
        
         | dekhn wrote:
          | I think the tweeter is being a bit too pedantic. After seeing
          | this paper, I personally spent some time thinking about
          | embeddings, manifolds, the structure of language, scientific
          | naming, and what the decodings of points near the centers of
          | clusters in embedding spaces look like (archetypes). I think making
         | networks and asking them to explain themselves using their own
         | capabilities is a wonderful idea that will turn out to be a
         | fruitful area of research in its own right.
        
           | austinjp wrote:
           | > asking [neural networks] to explain themselves using their
           | own capabilities
           | 
           | Exactly. This could be profound. I'm looking forward to
           | further work here. Sure, the examples here are daft, but
           | developing this approach could be like understanding a
           | talking lion [0] only this time it's a lion of our making.
           | 
           | [0] https://tzal.org/understanding-the-lion-the-in-joke-of-
           | psych...
        
           | lotaezenwa wrote:
           | I concur that the tweeter is being pedantic.
           | 
           | This is largely some embedding of semantics that we currently
           | do not fully have a mapping for, precisely because it was
           | generated stochastically.
           | 
           | Saying it was "not true" seems like clickbait.
        
             | koboll wrote:
             | Especially since his results confirm _most_ of what the
             | original thread claimed. A couple of the inputs did not
             | reliably replicate, but  "for the most part, they're not
             | true" seems straightforwardly false. He even seems to
             | deliberately ignore this sometimes, such as when he says "I
             | don't see any bugs" when there is very obviously a bug in
             | the beak of all but two or three of the birds.
        
               | mannykannot wrote:
               | When I zoomed in, I felt only four in ten birds clearly
               | had anything in their beaks, and in each case it looked
               | like vegetable matter. In the original set, only one
               | clearly has an insect in its beak.
               | 
               | Are there higher-resolution images to be had?
        
               | ASalazarMX wrote:
               | Lower in the same thread he accepts that his main tweet
               | was clickbaity, and that actually there's consistency in
               | some of the results.
        
               | Jweb_Guru wrote:
                | Not really; he later says that he was mainly trying to
                | inject some humility. He really doesn't think this is
                | measuring anything of interest. For the birds result in
                | particular, see
                | https://twitter.com/BarneyFlames/status/1531736708903051265.
        
             | ASalazarMX wrote:
             | If DALL-E had a choice to output "Command not understood",
             | maybe we wouldn't be discussing this.
             | 
              | Like those AIs that guess what you draw [1], and recognize
              | random doodling as "clouds", DALL-E is probably using the
              | least unlikely route. That a gibberish word is drawn as a
              | bird may just be because the best matches were "bird (2%),
              | goat (1%), radish (1%)".
             | 
             | 1. https://quickdraw.withgoogle.com
        
       | wruza wrote:
        | Are this tweet and the previous one an ML-guys discussion? My
        | layman understanding of neural networks is that the core
        | operation is that you basically kick a figure down the hill and
        | see where it ends up, except both the figure and the hill are
        | N-dimensional objects, where N is too huge to comprehend. Of
        | course some nonsensical figures end up at valid locations, but
        | can you really expect a stable inner structure in the hill-figure
        | interaction? I think it's unlikely that there is anything in the
        | learning method that would produce one. NNs can give interesting
        | results, but they don't magically rewrite their own design yet.
       | 
        | It would still be interesting to see how the output changes with
        | small changes to these inputs. If my vague understanding is at
        | all close, this would reveal the "faces" that are more "noisy"
        | than the others. Not sure what that would give us, though.
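        | 
        | (For the curious, a toy sketch of that "kick a figure down the
        | hill" picture -- plain gradient descent on a made-up 2D loss
        | surface, nothing DALL-E-specific; real models do the same thing
        | in millions of dimensions, which is why the landscape is so hard
        | to reason about:)
        | 
        |     import numpy as np
        | 
        |     def loss(p):
        |         x, y = p
        |         return (x - 3.0) ** 2 + 0.5 * (y + 1.0) ** 2  # a simple bowl
        | 
        |     def grad(p):
        |         x, y = p
        |         return np.array([2.0 * (x - 3.0), (y + 1.0)])
        | 
        |     p = np.array([-5.0, 4.0])   # where we "drop the figure"
        |     for _ in range(200):
        |         p -= 0.1 * grad(p)      # take a small step downhill
        | 
        |     print(p, loss(p))           # ends up near the minimum at (3, -1)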
        
       | quink wrote:
       | If it smells like overfitting, it probably is overfitting.
        
       | danamit wrote:
        | I do not believe AI claims when I read them, even at the risk of
        | being a cynic and a disbeliever in the field. And I am not sure
        | whether that's bad for society or just bad for me.
       | 
       | I am more likely to believe celebrity gossip than AI news
       | articles.
        
       | aaron695 wrote:
        
       | peanut_worm wrote:
       | I think this guy is being a bit pedantic. It is returning semi-
        | consistent results for gibberish, which is interesting. That's all
        | the original poster meant.
        
       | KaoruAoiShiho wrote:
        | Am I dumb, or does that thread prove that DALL-E does in fact
        | have a secret language, just not with exactly the meaning
        | described in the paper?
        
         | deltaonefour wrote:
          | There's some form of language here... the correlations are
          | evidence enough. The grammar, I believe, is complex and likely
          | not a human grammar, so certain words, when paired with other
          | words, can negate the meaning of a word altogether or even
          | completely change it.
         | 
          | For example, "hedge" combined with "hog" is neither a "hedge"
          | nor a "hog", nor is it some sort of horrific hybrid mixture of
          | hedges and hogs. A hedgehog is a small animal in its own right.
          | Most likely this is what's going on here.
         | 
          | The domain is almost infinite, and the range is even greater.
          | Thus it's actually realistic to say that there must be hundreds
          | of input and output sets that form alternative languages.
        
           | Jweb_Guru wrote:
           | I don't think there is any evidence of a language here unless
           | you stress the definition to the point of absurdity. It will
           | not even reliably produce the same kinds of images that had
           | the text that it output, which was the original premise of
           | the claim. Obviously, probing some overconstrained high
           | dimensional space where it's never rewarded for uncertainty
           | has to produce _something_ ; that doesn't mean that something
           | is a language.
        
       | [deleted]
        
       | redredrobot wrote:
       | So his argument is that the text clearly maps to concepts in the
       | latent space, but when composing them the results are unexpected,
       | so it isn't language? Why isn't this better described as 'the
       | rules of composition are unknown'?
        
         | rcoveson wrote:
         | That framing is worse because it hides an assumed conclusion,
         | i.e. that there are rules of composition.
        
           | redredrobot wrote:
           | But don't we already know that composition exists in DALL-E?
           | Don't the points shown in the tweet indicate that some form
           | of composition exists? The 3D renders are clearly render-
            | like, and the paintings and cartoons are clearly in the
            | appropriate style.
        
             | rcoveson wrote:
             | "That there exist rules of composition of the hypothesized
             | secret DALL-E language" is a much stronger claim than that
             | it "understands" composition of text in the real languages
             | it was trained on.
             | 
             | Though I'll also point out that even evidence for that
             | weaker claim is tenuous. It definitely knows how to move an
             | image closer to "3D render" in concept-space, but it
             | doesn't seem to understand the linguistic composition of
             | your request. For example, you'd have an extremely hard
             | time getting it to generate an image of a person using 3D
             | rendering software, or a "person in any style that isn't 3D
             | render"; it would probably just make 3D renders of persons.
             | 
             | I haven't played around with it myself, I'm going off the
             | experiences of others. For example:
             | 
             | https://astralcodexten.substack.com/p/a-guide-to-asking-
             | robo...
        
       | joshmarlow wrote:
       | I found this analysis interesting
       | https://twitter.com/Plinz/status/1531711345585860609?t=Yinol...
        
       | SilverBirch wrote:
        | This just feels like one of those topics where you'd really want
        | a linguist: someone who really understands the construction and
        | evolution of language and can observe some of the underlying
        | _reasons_ why language is constructed the way it is. Because I
        | guess that's partly what DALL-E is; it's trying to approximate
        | that, and the interesting thing would be where it differs from
        | real language, rather than where it matches it. If I give it a
        | made-up word that looks like a Latin name for a species of bird,
        | then it behaving as if I'd given it a real Latin name for a
        | species of bird is pretty reasonable. If you said "Homo
        | heidelbergensis" to me, I wouldn't _know_ that was a species of
        | prehistoric human, but I would feel pretty comfortable making
        | that kind of leap.
       | 
       | I also think you could probably hire a team of linguists pretty
        | cheaply compared to a team of AI engineers.
        
         | masswerk wrote:
          | I don't think that this is related to language at all. First,
          | let's ask: is there a way for DALL-E to refuse an output (as
          | in, "this makes no sense")? Then, what would we expect the
          | output for gibberish to be like? Isn't this still subject to
          | filtering for best "clarity" and best signals? While I don't
          | think that these are collisions in the traditional sense of a
          | hash collision, any input must produce a signal, as there is no
          | null path, and what we see is sort of a result of "collisions"
          | with "legitimate" paths. Still, this may tell us something
          | about the inner structure.
         | 
          | Also, there is no way for vocabulary to exist on its own
          | without grammar, as these are two sides of the phenomenon we
          | call language. Some signs of grammar had to emerge together
          | with this, all at once. However...
         | 
         | ----
         | 
         | Edit: Let's imagine a typical movie scene. Our nondescript
         | individual points at himself and utters "Atuk" (yes, Ringo
         | Starr!) and then points at his counterpart in this
         | conversation, who utters "Carl Benjamin von Richterslohe". This
         | involves quite an elaborate system of grammar, where we already
         | know that we're asking for a designator, that this is not the
         | designator for the act of pointing, and that by decidedly
          | pointing at a specific object, we'd ask for a specific
          | designator, not a general one. Then C.B. von Richterslohe, our
         | fearless explorer, waves his hand over the backdrop of the
         | jungle, asking for "blittiri" in an attempt to verify that this
         | means "bird", for which Atuk readily points out a monkey. -
         | While only nouns have been exchanged, there's a ton of grammar
         | in this.
         | 
          | And we haven't even arrived at things like "a monkey sitting
         | at the foot of a tree". Which is mostly about the horizontal
         | and vertical axes of grammar, along which we align things and
         | where we can substitute one thing for another in a specific
          | position, which ultimately provides them with meaning (by which
          | combinations and substitutions are legitimate and which are
          | not).
         | 
          | Now, in light of this, the fact that specific compounds change
          | their alleged "meaning" radically when combined doesn't allow
          | for high hopes that this is a language.
        
           | runj__ wrote:
           | I was thinking about a system for pulling data from verbal
           | nonsense the other day, speaking in tongues or something
           | similar. I can create a bunch of noises that lack obvious
           | meaning for me, but obviously they have some meaning that can
           | be learned since humans are terrible at being truly random
           | (lol XD).
           | 
            | I wonder to what extent I would be able to share ideas I
            | lack the words for; my perceived bitrate when creating
            | "random" noise is certainly higher than when verbally
            | communicating an idea to another human. Will we even share a
            | common language in the future? Or will we each have our own
            | language that is translated for other people?
        
             | masswerk wrote:
             | Well, I can only answer with kind of a pun. With
             | Wittgenstein, language is a constant conversation about the
             | extent of the world, about what is and what is not. As
             | such, it is necessarily shared. In the _tractatus_ we find,
             | 
             | > 5.62 (...) For what the solipsist means is quite correct;
             | only it cannot be _said,_ but makes itself manifest. The
             | world is my world: this is manifest in the fact that the
             | limits of _language_ (of that language which alone I
             | understand) mean the limits of my world. [1]
             | 
              | So, something could become _apparent,_ but you still
              | wouldn't have _said_ anything (as it's not part of that
              | conversation). ;-)
             | 
             | [1] https://www.masswerk.at/digital-
             | library/catalog/wittgenstein...
             | 
             | (I deem this edition to be somewhat appropriate in
             | context.)
        
       | belugacat wrote:
       | Given that DALL-E is a giant matrix multiplication that
       | correlates fuzzy concepts in text to fuzzy concepts in images,
       | wouldn't one expect that there will be hotspots of nonsensical
        | (to us) correlations, e.g. between "apoploe vesrreaitais" and
        | "bird"? Intuitively it feels like an aspect of the no-free-lunch
        | theorem.
        
         | axg11 wrote:
         | Exactly this. At a high level, DALL-E is mapping text to a
         | (continuous) matrix and then mapping that matrix to an image
          | (another matrix). All text inputs will map to _something_.
          | DALL-E doesn't care whether that mapping makes sense; it has
          | been trained to produce high-quality outputs, not to ensure
          | the validity of mappings.
         | 
         | None of this makes DALL-E any less impressive to me. High
         | quality image generation is a truly amazing result. Results
         | from foundational models (GPT-3, PaLM, DALL-E, etc) are so
         | impressive that they're forcing us to reconsider the nature of
         | intelligence and raise the bar. That's a sign of a job well
         | done to me.
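          | 
          | (A minimal toy sketch of the "all inputs map to _something_"
          | point -- obviously not DALL-E's actual architecture, just a
          | made-up text-to-vector mapping with a nearest-"concept" lookup;
          | note that it has no way to say "no match":)
          | 
          |     import hashlib
          |     import numpy as np
          | 
          |     def embed(text, dim=8):
          |         # deterministic pseudo-embedding derived from a hash
          |         seed = int.from_bytes(
          |             hashlib.sha256(text.encode()).digest()[:8], "big")
          |         v = np.random.default_rng(seed).normal(size=dim)
          |         return v / np.linalg.norm(v)
          | 
          |     concepts = {c: embed(c)
          |                 for c in ["bird", "whale", "shrimp", "vegetable"]}
          | 
          |     def nearest_concept(prompt):
          |         v = embed(prompt)
          |         return max(concepts, key=lambda c: float(concepts[c] @ v))
          | 
          |     # gibberish still lands nearest to *some* concept
          |     print(nearest_concept("Apoploe vesrreaitais"))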
        
           | LoveMortuus wrote:
           | But if it's just mapping text to image then it would be fair
           | to assume that using the same text would result in the same
           | image. But does that actually happen?
        
             | Jweb_Guru wrote:
              | No, it does not. It also doesn't always generate the same
              | category of image. See
              | https://twitter.com/realmeatyhuman/status/153173861680386457....
             | 
             | As much as people would like there to be, there really does
             | not seem to be anything here. The original author doesn't
              | think so, either (I'd need to re-find the tweet).
        
         | tbalsam wrote:
         | I have seen many bad abuses of the NFL theorem's name.
         | 
         | This is by far the worst.
        
         | smeagull wrote:
         | Yeah. The problem here is that the network only has room for
         | concepts, and hasn't been trained to see meaningless crap. Nor
         | does it really have any way to respond with "This isn't a
          | sentence I know"; it just has to come up with an image that
         | best matches whatever prompt it has been fed.
        
       | skybrian wrote:
       | "Secret language" is clickbait, but it seems like systematically
       | exploring how it responds to gibberish might find something
       | interesting?
       | 
       | Also, I'm wondering if there is some way that these models could
       | have a decent error response rather than responding to every
       | input?
        
       | dang wrote:
       | Recent and related:
       | 
       |  _DALL-E 2 has a secret language_ -
       | https://news.ycombinator.com/item?id=31573282 - May 2022 (109
       | comments)
        
       | deltaonefour wrote:
       | It's OBVIOUS what's going on here. When you combine TWO different
       | languages you get stuff that appears as NONSENSE. You have to
       | stay in the same language!
       | 
       | There is for sure a set of consistent words that produce output
       | that makes sense to us. He just picked the wrong set!
        
       | sydthrowaway wrote:
        | Dumb question, but how are DALL-E's (and any other generative AI
        | algorithm's) results so... smooth?
        | 
        | For example, I could write a heuristic algorithm to produce the
        | same thing using a Google image search, but it would look like
        | MS Word clip art.
        
         | Enginerrrd wrote:
         | Lots of denoising steps after the initial attempt at forming a
         | connection to the prompt is made?
        
         | sillysaurusx wrote:
         | This is one of my favorite topics in all of AI. It was the most
         | surprising and mysterious discovery for me.
         | 
         | The answer is that the training process literally has to make
         | the results smooth. That's how training works.
         | 
          | Imagine you have 100 photos. Your job is to arrange them by
          | color. You can place them however you want, but similar colors
          | should be physically closer together.
          | 
          | You can imagine the result would look a lot like a Photoshop
          | RGB picker, which is smooth.
         | 
         | The surprise is, this works for any kind of input. Even text
         | paired with images.
         | 
         | The key is the loss function (a horrible name). In the color
         | picker example, the loss function would be how similar two
          | colors are. In the text-to-image example, it's how _dissimilar_
         | the input examples are from each other (Contrastive Loss). The
         | brilliance of that is, pushing dissimilar pairs apart is the
         | same thing as pulling similar pairs together, when you train
         | for a long time on millions of examples. Electrons are all
         | trying to push each other apart, but your body is still smooth.
         | 
         | The reason it's brilliant is because it's far easier to measure
         | dissimilar pairs than to come up with a good way of judging
         | "does this text describe this image?" -- you definitely know
         | that it isn't a bicycle, but you might not know whether a car
          | is a Corvette or a Tesla. But both the Corvette and the Tesla
         | will be pushed away from text that says it's a bicycle, and
         | toward text that says it's a car.
         | 
          | That means for a well-trained model, the input _by definition_
          | is smooth with respect to the output, the same way that a small
          | change in {latitude,longitude} in real life corresponds to a
          | small change in the culture of a given region of the world.
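          | 
          | If it helps, here's a heavily simplified numpy sketch of the
          | contrastive idea (not OpenAI's actual CLIP code): for a batch
          | of matching (text, image) embedding pairs, each text should
          | score highest against its own image, and pushing the
          | mismatched pairs down is the same pressure that pulls the
          | matched pairs together.
          | 
          |     import numpy as np
          | 
          |     def normalize(x):
          |         return x / np.linalg.norm(x, axis=-1, keepdims=True)
          | 
          |     def contrastive_loss(text_emb, image_emb, temperature=0.07):
          |         t, i = normalize(text_emb), normalize(image_emb)
          |         # similarity of every text to every image in the batch
          |         logits = t @ i.T / temperature
          |         # cross-entropy per row; the matching image sits on the diagonal
          |         log_probs = logits - np.log(
          |             np.exp(logits).sum(axis=1, keepdims=True))
          |         n = len(t)
          |         return -log_probs[np.arange(n), np.arange(n)].mean()
          | 
          |     rng = np.random.default_rng(0)
          |     text_emb = rng.normal(size=(4, 16))
          |     image_emb = text_emb + 0.1 * rng.normal(size=(4, 16))
          |     print(contrastive_loss(text_emb, image_emb))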
        
           | Michelangelo11 wrote:
           | Do you by any chance have a link to a paper or article that
           | explains this in detail? I'd love to understand it better.
        
             | jeabays wrote:
        
             | sillysaurusx wrote:
             | It doesn't exist. The above explanation is the result of me
             | spending almost all of my time immersing myself in ML for
             | the last three years.
             | 
             | gwern helped too. He has an intuition for ML that I'm still
             | jealous of.
             | 
             | Your best bet is to just start building things and worry
             | about explanations later. It's not far from the truth to
             | say that even the most detailed explanation is still a
             | longform way of saying "we don't really know." Some people
             | get upset and refuse to believe that fundamental truth, but
             | I've always been along for the ride more than the
             | destination.
             | 
             | It's never been easier to dive in. I've always wanted to
             | write detailed guides on how to start, and how to navigate
             | the AI space, but somehow I wound up writing an ML fanfic
             | instead: https://blog.gpt4.org/jaxtpu
             | 
             | (Fun fact: my blog runs on a TPU.)
             | 
             | I'm increasingly of the belief that all you need is a
             | strong desire to create things, and some resources to play
             | with. If you have both of those, it's just a matter of time
             | -- especially putting in the time.
             | 
             | That link explains how to get the resources. But I can't
             | help with how to get a desire to create things with ML.
             | Mine was just a fascination with how strange computers can
             | be when you wire them up with a small dose of calculus that
             | I didn't bother trying to understand until two years after
             | I started.
             | 
             | (If you mean contrastive loss specifically,
             | https://openai.com/blog/clip/ is decent. But it's just a
             | droplet in the pond of all the wonderful things there are
             | to learn about ML.)
        
               | Michelangelo11 wrote:
               | Thanks! Really appreciate the response.
        
           | snovv_crash wrote:
           | IMO the term "cost function" is much more intuitive than
           | "loss function" - it tells you the cost, which it attempts to
            | minimize by some iterative process (in this case, training).
        
           | hooande wrote:
           | this is a very intuitive analysis. well done and thanks
        
           | sizzle wrote:
           | thanks for sharing your hard fought knowledge to us curious
           | bystanders
        
           | NickNaraghi wrote:
           | Fantastic, this helped me a lot! Thanks for taking the time
           | to write this out.
        
           | deltaonefour wrote:
           | I actually completely lost interest once I found this out.
            | Simply taking some ML course like the old Andrew Ng courses
            | online is enough for you to get the general idea.
           | 
            | ML is simply curve fitting. It's an applied math problem
           | that's quite common. In fact I lost a lot of interest in
           | intelligence in general once I realized this was all that was
           | going on. The implications really say that all of
           | intelligence is really some form of curve fitting.
           | 
            | The simplest form of this is linear regression, which is used
            | to derive an equation for a line from a set of 2D points. All
            | of ML is basically a 10,000-dimensional (or much higher)
            | extension of that. The magic is lost.
           | 
            | Most of ML research is just about finding the most efficient
            | way to find the best-fitting curve given the least amount of
            | data points. An ML guy's knowledge is centered on a bunch of
            | tricks and techniques to achieve that goal with some N-D
            | template equation. And the general template equation is
            | always the same: a neural network. The answer to what
            | intelligence is seems to be quite simple and not profound at
            | all...
           | which makes sense given that we're able to create things like
           | DALL-E in such a short time frame.
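            | 
            | (To make the 2D case concrete, something like this -- ordinary
            | least-squares line fitting; the claim above is that the rest
            | of ML is this with far more parameters and a nonlinear
            | template:)
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(42)
            |     x = np.linspace(0, 10, 50)
            |     y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)
            | 
            |     # fit y = a*x + b by least squares
            |     A = np.stack([x, np.ones_like(x)], axis=1)
            |     (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
            |     print(a, b)  # close to the true slope 2.0 and intercept 1.0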
        
             | sillysaurusx wrote:
             | It's the other way around. ML is cool precisely because
             | it's a guitar for the mind, not a mind itself.
             | https://soundcloud.com/theshawwn/sets/ai-generated-
             | videogame...
             | 
             | I made that by using ML as a guitar. I chose instruments
              | and style the way a guitarist's fingers choose frets.
             | 
             | And saying "give me this style with these instruments" is
             | far easier than recording it yourself.
             | 
              | For what it's worth, I agree with you about AGI.
              | https://twitter.com/theshawwn/status/1446076902607888385?s=2...
             | 
             | But for me, that means it's far more interesting than AGI.
             | Everyone has their eye on AGI, and no one seems to be
             | taking ML at face value. That means the first companies to
             | do it will stand to make a fortune.
        
               | deltaonefour wrote:
               | Why do people use analogies to prove a point? It doesn't
               | prove anything.
               | 
               | What was your point here? ML is like a guitar? What you
                | said doesn't seem to contradict anything I said, other
                | than that you find curve fitting interesting and I don't.
               | 
               | Not trying to be offensive here, don't take it the wrong
               | way.
        
       | fnordpiglet wrote:
       | Boring. Give me emergent sentient AI, fact or fiction!
        
       ___________________________________________________________________
       (page generated 2022-06-01 23:00 UTC)