[HN Gopher] No, DALL-E doesn't have a secret language ___________________________________________________________________ No, DALL-E doesn't have a secret language Author : doener Score : 111 points Date : 2022-06-01 20:08 UTC (2 hours ago) (HTM) web link (twitter.com) (TXT) w3m dump (twitter.com) | jawarner wrote: | The tweet is in response to a preliminary paper [1] [2] studying text found in images generated by, e.g., "Two whales talking about food, with subtitles." DALL-E doesn't generate meaningful text strings in the images, but if you feed the gibberish text it produces -- "Wa ch zod ahaakes rea." -- back into the system as a prompt, you get semantically meaningful images, e.g., pictures of fish and shrimp. | | [1] https://giannisdaras.github.io/publications/Discovering_the_... | | [2] https://twitter.com/giannis_daras/status/1531693093040230402 | dekhn wrote: | I think the tweeter is being a bit too pedantic. Personally, I spent some time thinking about embeddings, manifolds, the structure of language, scientific naming, and what the decodings of the points near the centers of clusters in embedding spaces look like (archetypes), after seeing this paper. I think making networks and asking them to explain themselves using their own capabilities is a wonderful idea that will turn out to be a fruitful area of research in its own right. | austinjp wrote: | > asking [neural networks] to explain themselves using their own capabilities | | Exactly. This could be profound. I'm looking forward to further work here. Sure, the examples here are daft, but developing this approach could be like understanding a talking lion [0], only this time it's a lion of our making. | | [0] https://tzal.org/understanding-the-lion-the-in-joke-of-psych... | lotaezenwa wrote: | I concur that the tweeter is being pedantic. | | This is largely some embedding of semantics that we currently do not fully have a mapping for, precisely because it was generated stochastically. | | Saying it was "not true" seems like clickbait. | koboll wrote: | Especially since his results confirm _most_ of what the original thread claimed. A couple of the inputs did not reliably replicate, but "for the most part, they're not true" seems straightforwardly false. He even seems to deliberately ignore this sometimes, such as when he says "I don't see any bugs" when there is very obviously a bug in the beaks of all but two or three of the birds. | mannykannot wrote: | When I zoomed in, I felt only four in ten birds clearly had anything in their beaks, and in each case it looked like vegetable matter. In the original set, only one clearly has an insect in its beak. | | Are there higher-resolution images to be had? | ASalazarMX wrote: | Lower in the same thread he accepts that his main tweet was clickbaity, and that actually there's consistency in some of the results. | Jweb_Guru wrote: | Not really; he afterwards says that he was more trying to inject some humility. He really doesn't think this is measuring anything of interest. For the birds result in particular, see https://twitter.com/BarneyFlames/status/1531736708903051265. | ASalazarMX wrote: | If DALL-E had a choice to output "Command not understood", maybe we wouldn't be discussing this. | | Like those AIs that guess what you draw, and recognize random doodling as "clouds", DALL-E is probably using the least unlikely route.
That a gibberish word is drawn as a bird may be because it was "bird (2%), goat (1%), radish (1%)". | | 1. https://quickdraw.withgoogle.com | wruza wrote: | Are this and the previous tweet an ML-guys discussion? My layman understanding of neural networks is that the core operation is that you basically kick a figure down a hill and see where it ends up, but both the figure and the hill are N-dimensional objects, where N is too huge to comprehend. Of course some nonsensical figures end up at valid locations, but can you really expect some stable inner structure from the hill-figure interaction? I think it's unlikely that there is a place in a learning method to produce one. NNs can give interesting results, but they don't magically rewrite their own design yet. | | It would still be interesting to see how the output changes with little changes to these inputs. If my vague understanding is at all close, this would reveal the "faces" that are more "noisy" than the others. Not sure what that gives, though. | quink wrote: | If it smells like overfitting, it probably is overfitting. | danamit wrote: | I do not believe AI claims whenever I read them, even at the risk of being a cynic and a disbeliever in the field. And I am not sure if that's bad for society or bad just for me. | | I am more likely to believe celebrity gossip than AI news articles. | peanut_worm wrote: | I think this guy is being a bit pedantic. It is returning semi-consistent results for gibberish, which is interesting. That's all the original poster meant. | KaoruAoiShiho wrote: | Am I dumb, or does that thread prove that DALL-E does in fact have a secret language, just not with exactly the meaning described in the paper? | deltaonefour wrote: | There's some form of language here... the correlations are evidence enough. The grammar, I believe, is complex and likely not human grammar, thus certain words, when paired with other words, can negate the meaning of a word altogether or even completely change it. | | For example, "hedge" combined with "hog" is neither a "hedge" nor a "hog" nor some sort of horrific hybrid mixture of hedges and hogs. A hedgehog is a tiny rodent. Most likely this is what's going on here. | | The domain is almost infinite. And the range is even greater. Thus it's actually realistic to say that there must be hundreds of input and output sets that form alternative languages. | Jweb_Guru wrote: | I don't think there is any evidence of a language here unless you stress the definition to the point of absurdity. It will not even reliably reproduce the same kinds of images that contained the text it output, which was the original premise of the claim. Obviously, probing some overconstrained high-dimensional space where it's never rewarded for uncertainty has to produce _something_; that doesn't mean that something is a language. | [deleted] | redredrobot wrote: | So his argument is that the text clearly maps to concepts in the latent space, but when composing them the results are unexpected, so it isn't language? Why isn't this better described as "the rules of composition are unknown"? | rcoveson wrote: | That framing is worse because it hides an assumed conclusion, i.e. that there are rules of composition. | redredrobot wrote: | But don't we already know that composition exists in DALL-E? Don't the points shown in the tweet indicate that some form of composition exists?
The 3D renders are clearly render-like, and the paintings and cartoons are clearly in the appropriate style. | rcoveson wrote: | "That there exist rules of composition of the hypothesized secret DALL-E language" is a much stronger claim than that it "understands" composition of text in the real languages it was trained on. | | Though I'll also point out that even the evidence for that weaker claim is tenuous. It definitely knows how to move an image closer to "3D render" in concept-space, but it doesn't seem to understand the linguistic composition of your request. For example, you'd have an extremely hard time getting it to generate an image of a person using 3D rendering software, or a "person in any style that isn't 3D render"; it would probably just make 3D renders of persons. | | I haven't played around with it myself; I'm going off the experiences of others. For example: | | https://astralcodexten.substack.com/p/a-guide-to-asking-robo... | joshmarlow wrote: | I found this analysis interesting: https://twitter.com/Plinz/status/1531711345585860609?t=Yinol... | SilverBirch wrote: | This just feels like one of those topics where you'd really want a linguist, someone who really understands the construction and evolution of language, to observe some of the underlying _reasons_ why language is constructed the way it is. Because I guess that's partly what DALL-E is: it's trying to approximate that, and the interesting thing would be where it differs from real language, rather than where it matches it. If I give it a made-up word that looks like a Latin phrase for a species of bird, then it behaving as if I'd given it a Latin phrase that is a species of bird is pretty reasonable. If you said "Homo heidelbergensis" to me, I wouldn't _know_ that was a species of prehistoric human, but I would feel pretty comfortable making that kind of leap. | | I also think you could probably hire a team of linguists pretty cheaply compared to a team of AI engineers. | masswerk wrote: | I don't think that this is related to language at all. First, let's ask: is there a way for DALL-E to refuse an output (as in, "this makes no sense")? Then, what would we expect the output for gibberish to be like? Isn't this still subject to filtering for best "clarity" and best signals? While I don't think that these are collisions in the traditional sense of a hash collision, any input must produce a signal, as there is no null path, and what we see is sort of a result of "collisions" with "legitimate" paths. Still, this may tell us something about the inner structure. | | Also, there is no way for vocabulary to exist on its own without grammar, as these are two sides of the phenomenon we call language. Some signs of grammar had to emerge together with this, at once. However... | | ---- | | Edit: Let's imagine a typical movie scene. Our nondescript individual points at himself and utters "Atuk" (yes, Ringo Starr!) and then points at his counterpart in this conversation, who utters "Carl Benjamin von Richterslohe". This involves quite an elaborate system of grammar, where we already know that we're asking for a designator, that this is not the designator for the act of pointing, and that by decidedly pointing at a specific object, we'd ask for a specific designator, not a general one. Then C.B.
von Richterslohe, our fearless explorer, waves his hand over the backdrop of the jungle, asking for "blittiri" in an attempt to verify that this means "bird", for which Atuk readily points out a monkey. -- While only nouns have been exchanged, there's a ton of grammar in this. | | And we haven't even arrived at things like "a monkey sitting at the foot of a tree", which is mostly about the horizontal and vertical axes of grammar, along which we align things and where we can substitute one thing for another in a specific position, which ultimately provides them with meaning (by which combinations and substitutions are legitimate and which are not). | | Now, in light of this, the fact that specific compounds change their alleged "meaning" radically when aligned doesn't allow for high hopes of this being language. | runj__ wrote: | I was thinking about a system for pulling data from verbal nonsense the other day, speaking in tongues or something similar. I can create a bunch of noises that lack obvious meaning for me, but obviously they have some meaning that can be learned, since humans are terrible at being truly random (lol XD). | | I wonder at what level I would be able to share ideas I lack the words for; my perceived bitrate at creating "random" noise is certainly higher than when verbally communicating an idea to another human. Will we even share a common language in the future? Or will we each have our own language that is translated for other people? | masswerk wrote: | Well, I can only answer with kind of a pun. With Wittgenstein, language is a constant conversation about the extent of the world, about what is and what is not. As such, it is necessarily shared. In the _tractatus_ we find: | | > 5.62 (...) For what the solipsist means is quite correct; only it cannot be _said,_ but makes itself manifest. The world is my world: this is manifest in the fact that the limits of _language_ (of that language which alone I understand) mean the limits of my world. [1] | | So, something could become _apparent,_ but you still wouldn't have _said_ anything (as it's not part of that conversation). ;-) | | [1] https://www.masswerk.at/digital-library/catalog/wittgenstein... | | (I deem this edition to be somewhat appropriate in context.) | belugacat wrote: | Given that DALL-E is a giant matrix multiplication that correlates fuzzy concepts in text to fuzzy concepts in images, wouldn't one expect that there will be hotspots of nonsensical (to us) correlations, e.g. between "apoploe vesrreaitais" and "bird"? Intuitively, it feels like an aspect of the no-free-lunch theorem. | axg11 wrote: | Exactly this. At a high level, DALL-E is mapping text to a (continuous) matrix and then mapping that matrix to an image (another matrix). All text inputs will map to _something_. DALL-E doesn't care if that mapping makes sense; it has been trained to produce high-quality outputs, not to ensure the validity of mappings. | | None of this makes DALL-E any less impressive to me. High-quality image generation is a truly amazing result. Results from foundational models (GPT-3, PaLM, DALL-E, etc.) are so impressive that they're forcing us to reconsider the nature of intelligence and raise the bar. That's a sign of a job well done, to me. | LoveMortuus wrote: | But if it's just mapping text to image, then it would be fair to assume that using the same text would result in the same image. But does that actually happen?
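As a rough illustration of the mapping axg11 describes, and of why the answer to that last question can be "no": below is a toy sketch, not DALL-E's actual pipeline. The encoder is deterministic, but the decoder draws fresh noise on every call, standing in for the sampling steps of a real generative model; every name and number in it is invented for the example.

    import numpy as np

    rng = np.random.default_rng()

    def embed_text(prompt, dim=8):
        # Deterministic toy "embedding": bucket each token into a fixed slot.
        vec = np.zeros(dim)
        for token in prompt.lower().split():
            vec[sum(token.encode()) % dim] += 1.0  # stable across runs
        return vec / max(np.linalg.norm(vec), 1e-9)

    def decode_image(embedding, size=4):
        # Stochastic toy "decoder": the embedding fixes the signal, but
        # fresh noise is mixed in on every call, like a sampling step.
        signal = np.outer(embedding[:size], embedding[:size])
        return signal + 0.1 * rng.normal(size=(size, size))

    prompt = "Apoploe vesrreaitais"
    a = decode_image(embed_text(prompt))
    b = decode_image(embed_text(prompt))
    print(np.allclose(a, b))  # False: same text in, different "image" out

The same deterministic embedding lands in the same region every time, which is why a prompt can reliably evoke "bird" while still never producing the same image twice.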
| Jweb_Guru wrote: | No, it does not. It also doesn't always generate the same category of image. See https://twitter.com/realmeatyhuman/status/153173861680386457.... | | As much as people would like there to be, there really does not seem to be anything here. The original author doesn't think so either (I would need to refind the tweet). | tbalsam wrote: | I have seen many bad abuses of the NFL theorem's name. | | This is by far the worst. | smeagull wrote: | Yeah. The problem here is that the network only has room for concepts, and hasn't been trained to see meaningless crap. Nor does it really have any way to respond with "This isn't a sentence I know"; it just has to come up with an image that best matches whatever prompt it has been fed. | skybrian wrote: | "Secret language" is clickbait, but it seems like systematically exploring how it responds to gibberish might find something interesting? | | Also, I'm wondering if there is some way that these models could have a decent error response rather than responding to every input? | dang wrote: | Recent and related: | | _DALL-E 2 has a secret language_ - https://news.ycombinator.com/item?id=31573282 - May 2022 (109 comments) | deltaonefour wrote: | It's OBVIOUS what's going on here. When you combine TWO different languages you get stuff that appears as NONSENSE. You have to stay in the same language! | | There is for sure a set of consistent words that produce output that makes sense to us. He just picked the wrong set! | sydthrowaway wrote: | Dumb question, but how are DALL-E's (and any other generative AI algorithm's) results so... smooth? | | For example, I could write a heuristic algorithm to produce the same thing using a Google image search, but it would look like MS Word clip art. | Enginerrrd wrote: | Lots of denoising steps after the initial attempt at forming a connection to the prompt is made? | sillysaurusx wrote: | This is one of my favorite topics in all of AI. It was the most surprising and mysterious discovery for me. | | The answer is that the training process literally has to make the results smooth. That's how training works. | | Imagine you have 100 photos. Your job is to classify them by color. You can place them however you want, but similar colors should be physically closer together. | | You can imagine the result would look a lot like a Photoshop RGB picker, which is smooth. | | The surprise is, this works for any kind of input, even text paired with images. | | The key is the loss function (a horrible name). In the color-picker example, the loss function would be how similar two colors are. In the text-to-image example, it's how _dissimilar_ the input examples are from each other (contrastive loss). The brilliance of that is, pushing dissimilar pairs apart is the same thing as pulling similar pairs together, when you train for a long time on millions of examples. Electrons are all trying to push each other apart, but your body is still smooth. | | The reason it's brilliant is that it's far easier to measure dissimilar pairs than to come up with a good way of judging "does this text describe this image?" -- you definitely know that it isn't a bicycle, but you might not know whether a car is a Corvette or a Tesla. But both the Corvette and the Tesla will be pushed away from text that says it's a bicycle, and toward text that says it's a car.
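A minimal sketch of the contrastive objective described above, in the spirit of CLIP's symmetric loss; the random vectors below stand in for real text/image encoder outputs, and the function name, temperature, and batch setup are invented for the example.

    import numpy as np

    def contrastive_loss(text_emb, img_emb, temperature=0.07):
        # Normalize so the similarity score is cosine similarity.
        t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
        i = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
        logits = t @ i.T / temperature   # (batch, batch) similarity matrix
        labels = np.arange(len(logits))  # matching pairs sit on the diagonal

        def cross_entropy(lg, lb):
            lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
            log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
            return -log_probs[np.arange(len(lb)), lb].mean()

        # Symmetric: every mismatched pair in the batch is a "dissimilar"
        # example pushed away, which is the same operation as pulling the
        # matched diagonal pairs together.
        return (cross_entropy(logits, labels) +
                cross_entropy(logits.T, labels)) / 2

    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 16))                 # 4 toy text embeddings
    image = text + 0.01 * rng.normal(size=(4, 16))  # well-aligned images
    print(contrastive_loss(text, image))            # near-zero loss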
| | That means for a well-trained model, the input _by definition_ is smooth with respect to the output, the same way that a small change in {latitude, longitude} in real life corresponds to a small change in the culture of a given region of the world. | Michelangelo11 wrote: | Do you by any chance have a link to a paper or article that explains this in detail? I'd love to understand it better. | sillysaurusx wrote: | It doesn't exist. The above explanation is the result of me spending almost all of my time immersing myself in ML for the last three years. | | gwern helped too. He has an intuition for ML that I'm still jealous of. | | Your best bet is to just start building things and worry about explanations later. It's not far from the truth to say that even the most detailed explanation is still a longform way of saying "we don't really know." Some people get upset and refuse to believe that fundamental truth, but I've always been along for the ride more than the destination. | | It's never been easier to dive in. I've always wanted to write detailed guides on how to start, and how to navigate the AI space, but somehow I wound up writing an ML fanfic instead: https://blog.gpt4.org/jaxtpu | | (Fun fact: my blog runs on a TPU.) | | I'm increasingly of the belief that all you need is a strong desire to create things, and some resources to play with. If you have both of those, it's just a matter of time -- especially putting in the time. | | That link explains how to get the resources. But I can't help with how to get a desire to create things with ML. Mine was just a fascination with how strange computers can be when you wire them up with a small dose of calculus that I didn't bother trying to understand until two years after I started. | | (If you mean contrastive loss specifically, https://openai.com/blog/clip/ is decent. But it's just a droplet in the pond of all the wonderful things there are to learn about ML.) | Michelangelo11 wrote: | Thanks! Really appreciate the response. | snovv_crash wrote: | IMO the term "cost function" is much more intuitive than "loss function" - it tells you the cost, which it attempts to minimize by some iterative process (in this case, training). | hooande wrote: | This is a very intuitive analysis. Well done, and thanks. | sizzle wrote: | Thanks for sharing your hard-fought knowledge with us curious bystanders. | NickNaraghi wrote: | Fantastic, this helped me a lot! Thanks for taking the time to write this out. | deltaonefour wrote: | I actually completely lost interest once I found this out. Simply taking some ML course, like the old Andrew Ng courses online, is enough to get the general idea. | | ML is simply curve fitting. It's an applied math problem that's quite common. In fact, I lost a lot of interest in intelligence in general once I realized this was all that was going on. The implication really is that all of intelligence is some form of curve fitting. | | The simplest form of this is linear regression, which is used to derive the equation of a line from a set of 2D points. All of ML is basically a 10,000-dimensional (or much higher) extension of that. The magic is lost. | | Most ML research is just finding the most efficient way to find the best-fitting curve given the fewest data points. An ML guy's knowledge is centered on a bunch of tricks and techniques to achieve that goal with some N-D template equation.
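To make the curve-fitting picture concrete, here is a minimal sketch of its simplest case: fitting a line to noisy 2D points by gradient descent on squared error (the data, learning rate, and iteration count are arbitrary choices for the example). The same loop, with a neural network in place of w*x + b and millions of parameters, is what ML training is.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=100)
    y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)  # noisy line y = 3x + 0.5

    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(500):
        err = (w * x + b) - y           # prediction error on every point
        w -= lr * (2 * err * x).mean()  # gradient of mean squared error wrt w
        b -= lr * (2 * err).mean()      # ... and wrt b

    print(round(w, 2), round(b, 2))     # recovers roughly 3.0 and 0.5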
And the general template equation is all the same: a neural network. The answer to what intelligence is seems to be quite simple and not that profound at all... which makes sense, given that we were able to create things like DALL-E in such a short time frame. | sillysaurusx wrote: | It's the other way around. ML is cool precisely because it's a guitar for the mind, not a mind itself. https://soundcloud.com/theshawwn/sets/ai-generated-videogame... | | I made that by using ML as a guitar. I chose instruments and style the way a guitarist's fingers choose frets. | | And saying "give me this style with these instruments" is far easier than recording it yourself. | | For what it's worth, I agree with you about AGI. https://twitter.com/theshawwn/status/1446076902607888385?s=2... | | But for me, that means it's far more interesting than AGI. Everyone has their eye on AGI, and no one seems to be taking ML at face value. That means the first companies to do it will stand to make a fortune. | deltaonefour wrote: | Why do people use analogies to prove a point? They don't prove anything. | | What was your point here? ML is like a guitar? What you said doesn't seem to contradict anything I said, other than that you find curve fitting interesting and I don't. | | Not trying to be offensive here; don't take it the wrong way. | fnordpiglet wrote: | Boring. Give me emergent sentient AI, fact or fiction! ___________________________________________________________________ (page generated 2022-06-01 23:00 UTC)