[HN Gopher] DALL-E 2 has a secret language ___________________________________________________________________ DALL-E 2 has a secret language Author : smarx Score : 371 points Date : 2022-05-31 18:46 UTC (4 hours ago) (HTM) web link (twitter.com) (TXT) w3m dump (twitter.com) | kazinator wrote: | That's reminiscent of small children making up their own words | for things. Those words are stable in that you can converse with | the child using those words. | notimpotent wrote: | My first thought upon reading this: what if DALL-E (or a similar | AI) uncovers some kind of hidden universal language that is | somehow more "optimal" than any existing language? | | i.e. anything can be completely described in a more succinct | manner than any current spoken language. | | Or maybe some kind of universal language that naturally occurs | and any semi-intelligent life can understand it. | | Fun stuff! | extr wrote: | This is kind of already what's happening inside the NN. You can | think of intermediate layers in the network as talking to each | other in "NN-ease", that is, translating from one form of | representation (encoding) to another. At the final encoder | layer, the input is maximally compressed (for that given | dataset/model architecture/training regime). The picture | (millions of pixels) of the dog is reduced to a few bits of | information about what kind of dog it is and how it's posed, | what color the background is, etc. | | However, optimality of encoding is entirely relative to the | decoding scheme used and your purposes. Obviously a matrix of | numbers representing a summary of a paragraph can be in some | sense "more compressed" than the English equivalent, but it's | useless if you don't speak matrices. Similarly, you could | invent an encoding scheme with Latin characters that is more | compressed than English, but it's again useless if you don't | know it or want to take the time to learn it.
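The point above, that an encoding is only "optimal" relative to a decoder that shares the scheme, can be sketched with Python's zlib preset-dictionary support (a toy illustration; the dictionary and message strings here are invented for the example):

```python
import zlib

# A shared "decoding scheme": both sides agree on this dictionary up front.
# (The dictionary text here is made up purely for illustration.)
shared_dict = b"a photo of a bird sitting on a branch in the forest"
message = b"a photo of a bird in the forest"

# An encoder that assumes the shared dictionary...
co = zlib.compressobj(zdict=shared_dict)
with_dict = co.compress(message) + co.flush()

# ...beats one that assumes nothing about the decoder.
without_dict = zlib.compress(message)
print(len(with_dict), "<", len(without_dict))

# But the smaller payload is useless to a decoder lacking the scheme:
try:
    zlib.decompress(with_dict)
except zlib.error:
    print("undecodable without the shared dictionary")

# With the shared scheme, it round-trips fine.
do = zlib.decompressobj(zdict=shared_dict)
assert do.decompress(with_dict) == message
```

The dictionary-aware stream is smaller precisely because the decoder is assumed to already hold the dictionary; the saved bits haven't vanished, they've moved into the shared scheme.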
If we wanted we | could make English more regular and easier to learn/compress, | but we don't, for a whole bunch of practical/real life reasons. | There's no free lunch in information theory. You always have to | keep the decoder/reader in mind. | astrange wrote: | That's not possible - it's like asking for a compression system | that can compress any message. | | All human languages are about the same efficiency when spoken, | but of course this mainly depends on having short enough words | for the most common concepts in the specific thing you're | talking about. | | https://www.science.org/content/article/human-speech-may-hav... | | And there can't be a universal language because the symbols | (words) used are completely arbitrary even if the grammar has | universal concepts. | elil17 wrote: | There are a couple sci-fi short stories in the book "Stories of | Your Life and Others" by Ted Chiang which explore the idea that | highly advanced intelligences might create special languages | which accommodate special thoughts which we cannot easily | think. | jcims wrote: | I think something like this is actually quite likely. | | I've been wondering if there is a way to do psychological | experiments on these large language models that we couldn't do | with a person. | julianbuse wrote: | I imagine these would be very interesting, but not very | applicable to humans (which I presume is the intended | outcome). OTOH, since these language models are trained on | human language and media, they might have some value. I'm | quite split on which I think is more likely (I don't have any | experience in ai/ml nor in psychology so what do I know). | sbierwagen wrote: | Ithkuil (Ithkuil: Iţkuîl) is an experimental constructed | language created by John Quijada.[1] It is designed to express | more profound levels of human cognition briefly yet overtly and | clearly, particularly about human categorization.
| | Meaningful phrases or sentences can usually be expressed in | Ithkuil with fewer linguistic units than natural languages.[2] | For example, the two-word Ithkuil sentence "Tram-mloi | hhasmarptuktox" can be translated into English as "On the | contrary, I think it may turn out that this rugged mountain | range trails off at some point."[2] | | https://en.wikipedia.org/wiki/Ithkuil | jws wrote: | In short: DALLE-2 generates apparent gibberish for text in some | circumstances, but feeding the gibberish back in gets recognized | and you can tease out the meaning of words in this unknown | language. | carabiner wrote: | Science has gone too far. | astrange wrote: | It seems obvious this would happen (it's just adversarial inputs | again) - they didn't make DALL-E reject "nonsense" prompts, so it | doesn't try to, and indeed there's no reason you'd want to make | it do that. | | Seems like a useful enhancement would be to invert the text and | image prior stages, so it'd be able to explain what it thinks | your prompt meant along with making images of it. | [deleted] | schroeding wrote: | Interesting! I wonder if the model would "understand" the made-up | names from today's stained glass window post[1] like "Oila Whamm" | for William Ockham and output similar images. | | [1] https://astralcodexten.substack.com/p/a-guide-to-asking- | robo... | layer8 wrote: | Sounds like an effect similar to illegal opcodes: | https://en.m.wikipedia.org/wiki/Illegal_opcode | wongarsu wrote: | Link to the 5-page paper, for those who don't like twitter | threads: | | https://giannisdaras.github.io/publications/Discovering_the_... | TOMDM wrote: | Shouldn't this be expected to a certain extent? | | Gibberish has to map _somewhere_ in the model's concept space.
| | Whether it maps onto anything we'd recognise as consistent | doesn't mean that the AI wouldn't have some concept of where it | relates. As other people have noted, the gibberish breaks down | when you move it into another context, but who's to say that | Dall-E 2 isn't remaining consistent with some concept it | understands that isn't immediately recognisable to us. | | The interesting part is if you can trick it into spitting out gibberish | in targeted areas of that concept space using crafted queries. | gwern wrote: | > Shouldn't this be expected to a certain extent? | | Not really. It's a stochastic model, so after a bunch of random | denoising steps, it could easily just be mapping every bit of | gibberish to a random image, and it would be vanishingly unlikely for | any of them to be similar or the relationship to run in | reverse. | codeflo wrote: | I mean, everything is easy to predict in retrospect. :) | Personally, I'm a bit surprised that it has learned any | connection between the letters in the generated image and the | prompt text at all. I had assumed (somewhat wrongly, it seems) | that the gibberish means that the generator just thinks of text | as a "pretty pattern" that it fills in without meaning. For | example, a recent post on HN suggested that it likes the word | "Bay", simply because that appears so often on maps. | momojo wrote: | > Shouldn't this be expected to a certain extent? | | In hindsight, sure. Given enough time someone might have | predicted the phenomenon. But I don't think most of us did. | | What's more fascinating to me is how often this has happened in | this space in just the last few years. | | 1. Some phenomenon is discovered | | 2. I'm surprised | | 3. It makes sense in hindsight | jerf wrote: | Expected after the fact, somewhat.
Beforehand, it would not be | unreasonable to expect that the output text and the input text | aren't necessarily that kind of connected, though, especially | as, as I understand it, DALL-E was not given input labelling | explaining the text in various images. To it, text is just a | frequently-recurring set of shapes that relate to each other a | lot. This may yet be a false positive, based on other | discussion. | | That the model would have a consistent form of _some_ kind of | gibberish would be a given. Even humans have it: | https://en.wikipedia.org/wiki/Bouba/kiki_effect And I'm sure if | you asked native English speakers, "Hey, we know this isn't a | word, but if it _was_ a word, what would it be? 'Apoploe | vesrreaitars'" you would get something very far from a | uniformly random distribution of all nameable concepts. | EvgeniyZh wrote: | You could expect that gibberish is distributed uniformly in | latent space, disconnected from its linguistic counterpart -- | after all, those are textual inputs the model has never seen, | and it can't even properly map words it has seen many times to | their written form in images: the word "seafood" and a "seafood" | image are in the same place in latent space, but the word | "seafood" written in an image isn't. Yet some gibberish word | written in an image is, and so is the same gibberish word as | text. It's very counterintuitive for me. | TOMDM wrote: | A uniform distribution makes sense for gibberish, not | something I'd considered. | | A counterpoint I'd raise is I wonder how aggressive Dall-E 2 | is in making assumptions about words it hasn't seen before. | | Hard to do given that it's read essentially the entire | internet, however someone could make up some latin-esque | words that people would be able to guess the meaning of.
| | If the model is as good as people at assuming the meaning of | such made up words, it could stand to reason that if it were | aggressive enough in this it might be doing the same thing | with gibberish and thus ending up with its own | interpretation of the word, which would land it back in a | more targeted concept space. | | I'd love to see someone craft some words that most people | could guess the meaning of, and see how Dall-E 2 fares. | jamal-kumar wrote: | This is really interesting because I was just looking at | gibberish detection using GPT models. Mitigating AI with AI | doesn't sound like it's all that secure, since you can | probably mess with the gibberish detection similarly - or maybe | the 'secret language' as they're calling it here passes GPT | gibberish detection? [1] | | [1] https://arr.am/2020/07/25/gpt-3-uncertainty-prompts/ | [deleted] | GamerUncle wrote: | https://nitter.net/giannis_daras/status/1531693093040230402 | 726D7266 wrote: | Possibly related: In 2017 AI bots formed a derived shorthand that | allowed them to communicate faster: | https://www.facebook.com/dhruv.batra.dbatra/posts/1943791229... | | > While the idea of AI agents inventing their own language may | sound alarming/unexpected to people outside the field, it is a | well-established sub-field of AI, with publications dating back | decades. | | > Simply put, agents in environments attempting to solve a task | will often find unintuitive ways to maximize reward. | joshstrange wrote: | Which, to a lesser extent, isn't too terribly different from | humans if you think about it. We don't use a full new language, but | every profession has its own jargon. Some of it spans the | whole industry and some is company-specific. | gibolt wrote: | Unintuitive to biased humans.
The solutions may actually be | super intuitive/efficient, and we just can't wrap our heads | around it yet | neopallium wrote: | Would it be possible to build a rosetta stone for this secret | language with prompts asking for labeled pictures of different | categories of objects? Or prompts about teaching kids different | words? | MaxBorsch228 wrote: | What if we give it the same prompt but "with subtitles in French", | for example? | [deleted] | jsnell wrote: | One of the replies is a thread with a fairly convincing rebuttal, | with examples: | | https://twitter.com/Thomas_Woodside/status/15317102510150819... | dwallin wrote: | I'm not sure it's a convincing rebuttal; the examples shown all | seem to have some visible commonality. | | E.g. "Apoploe vesrreaitais" could refer to something along the | lines of a "fan / wedge" or "wing-like". | | If you look at the examples of cheese, when compared to the | "birds and cheese" the cheese tends to be laid out in a fan-like | pattern and shaped in sharp angled wedges. | sudosysgen wrote: | It seems to refer to "bird plant", which means birds on trees, | so it would make sense there would be cheese and plants if it | can't find how to fit a bird. | joshcryer wrote: | Yeah, and his example about bugs in the kitchen. Everything | is edible and 'wild' or 'heirloom', and "contarra ccetnxniams | luryca tanniounons" comes from the farmers talking about ... | vegetables. So there's a definite interrelationship between | the 'words' and the images. | | I'm unconvinced by the rebuttal as well. Not to say I am | convinced we have a fully formal language going on here, but | there's definitely some shared concepts with the generated | text. | | I wonder what Imagen would come up with, or if its 'language' | is more correlated to real language. | ericb wrote: | > Apoploe vesrreaitais" could refer to something along the | lines of a "fan / wedge" | | "feathered" maybe?
| f38zf5vdt wrote: | I'm curious what it generates when given randomly generated | strings of seemingly pronounceable words like "Fedlope | Dipeioreitcus". | jimhi wrote: | We don't know the rules or grammar of this "language". Maybe | nouns change based on how they are used | | https://en.wikipedia.org/wiki/Declension | lmc wrote: | A rebuttal to the rebuttal (without examples)... | | How many French people speak Breton? | goodside wrote: | My first reaction to this was, "It probably has to do with | tokenization. If there's a 'language' buried in here, its native | alphabet is GPT-3 tokens, and the text we see is a concatenation | of how it thinks those tokens map to Unicode text." | | Most randomly concatenated pairs of tokens simply do not occur in | any training text, because their translation to Unicode doesn't | correspond to any real word. There are also combinations that do | correspond to real words ("pres" + "ident" + "ial") but still | never occur in training because some other tokenization is | preferred to represent the same string ("president" + "ial"). | | Maybe DALL-E 2 is assigning some sort of isolated (as in, no | bound morphemes) meaning to tokens -- e.g., combinations of | letters that are statistically likely to mean "bird" in some | language when more letters are revealed. When a group of such | tokens are combined, you get a word that's more "birdlike" than | the word "bird" could ever be, because it's composed exclusively | of tokens that mean "bird": tokens that, unlike "bird" itself, | never describe non-birds (e.g., a Pontiac Firebird). The exact | tokens it uses to achieve this aren't directly accessible to us, | because all we get is poorly rendered roman text. | | I'm maybe not the ideal person to be speculating about this, but | it bothers me that the word "token" isn't even mentioned in the | article reporting this discovery (https://giannisdaras.github.io/ | publications/Discovering_the_...). 
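The tokenization point above can be illustrated with a toy greedy longest-match tokenizer. The vocabulary below is invented purely for illustration (the real GPT-3 BPE vocabulary has roughly 50k learned subword tokens with different contents), but the behavior is the same in spirit: a real word is consumed as a few large, frequently co-occurring pieces, while gibberish shatters into many short tokens that rarely appear together in training text.

```python
# Toy stand-in for a BPE-style token vocabulary (invented for illustration;
# not the real GPT-3 token set).
VOCAB = {"president", "pres", "ident", "ial", "bird",
         "apo", "plo", "ves", "rre", "ait", "ais",
         "a", "e", "i", "l", "o", "p", "r", "s", "t", "v"}

def tokenize(text):
    """Greedy longest-match tokenization: prefer the longest known piece,
    the way BPE merges tend to consume familiar words whole."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

# A real word maps to few, frequently co-occurring tokens:
print(tokenize("presidential"))  # ['president', 'ial']
# Gibberish shatters into many pieces that rarely co-occur in training text:
print(tokenize("apoploevesrreaitais"))
```

This is why "presidential" is represented as "president" + "ial" rather than "pres" + "ident" + "ial", even though both tokenizations cover the string.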
| normaldist wrote: | I'm seeing a lot more people experimenting with DALL-E 2. | | How does getting access work, do you need a referral? | mikequinlan wrote: | https://labs.openai.com/waitlist | minimaxir wrote: | There is a waitlist, but OpenAI just announced they are opening | access more widely from it. | Cloudef wrote: | I wonder why they call it "Open"AI | MatthiasPortzel wrote: | It's wild to see the discoveries being made in ML research. Like | most of these 'discoveries,' it makes a fair amount of sense | after thinking about it. Of course it's not just going to spit | out random noise for random input, it's been trained to generate | realistic looking images. | | But I think it is an interesting discovery because I don't think | anyone could have predicted this. | | One of my favorite examples is the classification model that will | identify an apple with a sticker on it that says "pear" as a pear | --it makes sense, but is still surprising when you first see it. | astrange wrote: | > One of my favorite examples is the classification model that | will identify an apple with a sticker on it that says "pear" as | a pear--it makes sense, but is still surprising when you first | see it. | | That classification model (CLIP) is the first stage of this | image generator (DALLE) - and actually this shows that it | doesn't think they're exactly the same thing, or at least | that's not the full story, because DALL-E doesn't confuse the | two. | | However, other CLIP guided image generation models do like to | start writing the prompt as text into the image if you push | them too hard. | wongarsu wrote: | Was DALL-E 2 trained on captions from multiple languages? If so, | this makes a lot of sense. Somewhere early in the model the words | "bird", "vogel", "oiseau" and "pajaro" have to be mapped to the | same concept. And "Apoploe vesrreaitais" happens to map to the | same concept. 
Or maybe "Apoploe vesrreaitais" is rather the | tokenization of that concept, since it also appears in the | output. So in a sense DALL-E is using an internal language to | make sense of our world. | link0ff wrote: | This looks like how the artificial language Lojban was constructed: | its words share parts from completely unrelated languages, to | the point where none of the original words are recognizable in | the result. | alxndr wrote: | The original words aren't recognizable at first glance, but | they do serve as potential mnemonics for remembering the | terms/definitions for any learners who speak one of those | source languages (English, Spanish, Mandarin, Arabic, | Russian, Hindi) | melony wrote: | But that's expected behavior for a language model (especially | VAEs), where's the novelty? In a VAE, the vectors are | probabilistic in the latent space, so this is basically the NLP | version of the classic VAE facial image generation where you | can tweak the parameters to emphasize or de-emphasize a | feature. | tomrod wrote: | Novel in the engineering together of multiple concepts, if | nothing else! | la64710 wrote: | Does Google Translate support this? | godelski wrote: | Interestingly, Google detects these words as Greek. I know they | are nonsensical and not actually Greek, but I'm wondering if any | Greek speakers might be able to provide some insights. Are these | gibberish words close to meaningful words? (clear shot in the | dark here) Maybe a linguist could find more meaning? | deckeraa wrote: | One could conjecture that "Apoploe" is similar to apo pouli, | "from bird". But I don't have much support for that conjecture. | PartiallyTyped wrote: | The word is apoplous, or apoploI | noizejoy wrote: | Or maybe it's a subtle joke by Google as a play on the idiom | "it's all Greek to me"? | PartiallyTyped wrote: | As a native Greek, no, they don't make any sense.. sort of. My | hunch is that they read significantly more like Latin than they | do Greek.
However, it tells us something about Google Translate. | | The reason "Apoploe vesrreaitais" is detected as Greek is | because the first "word" is "phonetically" similar to the word | apoplous, which means sailing/shipping and is rooted in | ancient Greek. If we were to write Apoplous using Roman | characters, we would write apoplous or apoploi (the plural in | Greek is apoploI). So I think that the model understands that the | "oe" suffix is used to represent the Greek suffix "oi" that is used | for plurals. The rest of the word is rather close phonetically, | so there is some model that maps phonetic representations to | the correct word. | | The other phrase seems to be combined of words classified as | Portuguese, Spanish, Lithuanian, and Luxembourgish. | stavros wrote: | I don't think that's how language detection works; they most | likely use the frequencies of n-grams to detect language | probability. It's still detected as Greek if you change it to | "Apoulon vesrreaitais", just because it kind of looks the way | Greek words look, not because it resembles any specific word. | PartiallyTyped wrote: | You are wrong. Had it been that simple I would __not__ have | suggested that, and for whatever reason I find your reply | borderline infuriating, but I can't pinpoint exactly why | that is. | | Regardless, here is me, a native speaker, disproving your | hypothesis. | | I tried the following words in Google Translate: elefantas, | ailaifantas, ailaiphantas, elaiphandas, elaiphandac. | | The suggested detections are elephantas, ailaiphantas, | ailaiphantas, elaiphantas, elaiphantas; however, the | translations are elephant, illuminated, illuminated, | elephant, elephant respectively. The first is correct. When | mapping the Roman characters back to Greek, there is loss | of information; this is seen in the umlaut above iota, which | makes the pronunciation from e [e]-like to ai [ai], and | the emphasis denoted via the mark above epsilon (e).
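The character-n-gram approach to language identification mentioned in this thread can be sketched in a few lines. The "profiles" below are tiny invented sample corpora, purely for illustration; a real detector is trained on large text samples per language, but the mechanism of scoring a word's trigrams against per-language frequency tables is the same.

```python
from collections import Counter

def trigrams(text):
    """Character trigrams with edge padding."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# Toy language profiles (invented sample text; a real detector would be
# trained on large corpora for each language).
PROFILES = {
    "greek-latinized": trigrams("apoplous apoploi oi polloi logos kai kalos"),
    "english": trigrams("the bird flies over the mountain and the sea"),
}

def detect(word):
    """Pick the profile whose trigram counts best overlap the word's."""
    grams = trigrams(word)
    def overlap(lang):
        return sum(n * PROFILES[lang][g] for g, n in grams.items())
    return max(PROFILES, key=overlap)

print(detect("apoploe"))   # leans toward the Greek-like profile
print(detect("mountain"))  # leans toward the English profile
```

Note that a detector like this can classify a word as "Greek-looking" without that word resembling any specific Greek word, which is exactly the disagreement in the thread above.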
| | Notice that all the words have an edit distance of >=4, | a soundex distance of at most 1, and a metaphone distance | of at most 1 [1]. The suggested words, as I said above, are | near homophones of the correct word bar a few minor | details. | | [1] http://www.ripelacunae.net/projects/levenshtein | stavros wrote: | > for whatever reason I find your reply borderline | infuriating but I can't pinpoint exactly why that is. | | I guess that says more about you than about my reply. | Also, I'm a native speaker as well. That doesn't really | have any bearing; my comment above comes from what I know | about common implementations of language detection | algorithms, not so much from looking at how Google | Translate behaves. | PartiallyTyped wrote: | And I was honest about how I felt given how you | structured it. | | It does have a lot of bearing actually. While I am a | native speaker, my spelling skills are atrocious, as | everything is a sequence of sounds in my head more so | than a sequence of letters. To get around my spelling | issues I frequently use homophones to find the correct | spelling of a word, which uses soundex or similar | algorithms to find the correct word, along with character | mappings between the two languages. | | Regardless, I believe I have proved the hypothesis to not | be true. | godelski wrote: | This is a great response (I also suspected we'd learn | something from the Google Translate black box). And I agree | with the idea of being closer to Latin gibberish. The | phonetic relationships are a great hint to what's actually | going on. | | My hypothesis here is more that these models are trained more | on Western languages than others, and thus the latent | representation of "language" is going to appear like Latin | gibberish due to a combination of the evolution of these | languages as well as human bias. ("It's all Greek to me") | PoignardAzur wrote: | Wait, how does that make any sense?
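The edit-distance figures quoted above are easy to reproduce with a standard dynamic-programming Levenshtein implementation (a generic sketch, not the specific tool linked in the comment):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))          # 3
# The near-homophone pair from the comment: spelled far apart, yet a
# soundex/metaphone distance of at most 1.
print(levenshtein("elefantas", "ailaiphantas"))
```

The contrast is the commenter's point: the character-level distance is large while the phonetic distance is tiny, so a match based on sound rather than spelling is the better explanation.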
| | I thought DALL-E's language model was tokenized, so it doesn't | understand that e.g. "car" is made up of the letters 'c', 'a' and | 'r'. | | So how could the generated pictures contain letters that form | words that are tokenized into DALL-E's internal "language"? | Shouldn't we expect that feeding those words to the model would | give the same result as feeding it random invented words? | | Actually, now that I think about it, how does DALL-E react when | given words made of completely random letters? | seydor wrote: | damn. i hope archaeologists can use that to decipher old scripts | ricardobeat wrote: | The paper is just as long as the twitter thread. | smusamashah wrote: | A few days ago I was wondering what DALL-E would generate if | given gibberish (I tried to request it, but it wasn't | entertained). This sounds like an answer to that to some extent. | | I think there will be multiple words for the same thing. Also, | unlike 'bird', the word 'Apoploe vesrreaitais' might actually mean | a specific kind of bird in a specific setting. | DonHopkins wrote: | Has anyone tried talking to it in Simlish? | | https://en.wikipedia.org/wiki/Simlish | | https://web.archive.org/web/20040722043906/http://thesims.ea... | | https://web.archive.org/web/20121102012431/http://bbs.thesim... | ml_basics wrote: | I find it really interesting how these new large models (DALLE, | GPT3, PaLM etc) are opening up new research areas that do not | require the massive resources needed to actually train the | models. | | This may act as a counter balance to the trends of the last few | years of all major research becoming concentrated in a few tech | companies. | YeGoblynQueenne wrote: | If I understand correctly from the twitter thread (I haven't read | the linked technical report) the author and a collaborator found | that DALL-E generated some gibberish in an image that showed two | men talking, one holding two ... cabbages?
They fed (some of) the | gibberish back to DALL-E and it generated images of birds, | pecking at things. | | Conclusion: the gibberish is the expression for birds eating | things in DALL-E's secret language. | | But, wait. Why is the same gibberish in the first image, that has | the two men and the cabbages(?), but no birds? | | Explanation: the two men are clearly talking about birds: | | >> We then feed the words: "Apoploe vesrreaitars" and we get | birds. It seems that the farmers are talking about birds, messing | with their vegetables! | | With apologies to my two compatriots, but that is circular | thinking to make my head spin. I'm reminded of nothing else as | much as the scene in the Knights of the Round Table where the | wise Sir Bedivere explains why witches are made of wood: | | https://youtu.be/zrzMhU_4m-g | throw457 wrote: | I bet it's just a form of copy protection. | ceejayoz wrote: | Like https://en.wikipedia.org/wiki/Trap_street? | 867-5309 wrote: | and Wagatha | Imnimo wrote: | I tried a few of these in one of the available CLIP-guided | diffusion notebooks, but wasn't able to get anything that looks | like DALL-E meanings. Not sure if DALL-E retrained CLIP (I don't | think they did?), but it maybe suggests that whatever weirdness | is going on here is on the decoder side? | | All the cool images that DALL-E spits out are fun to look at, but | this sort of thing is an even more interesting experiment in my | book. I've been patiently sitting on the waitlist for access, but | I can't wait to play around with it. | dpierce9 wrote: | Gavagai! | alxndr wrote: | (explaining the joke: | https://en.m.wikipedia.org/wiki/Indeterminacy_of_translation ) | ortusdux wrote: | I wonder if any linguists are training a neural network to | generate Esperanto 2.0. | Veedrac wrote: | Wow, I am totally going to need to wait for more experimentation | before believing any given thing here, but this seems like a big | deal. 
| | It's one thing if DALL-E 2 was trying to map words in the prompt | to their letter sequences and failing because of BPEs; that shows | an impressive amount of compositionality but it's still image- | model territory. It's another if DALL-E 2 was trying to map the | prompt to semantically meaningful content and then failing to | finish converting that content to language because it's too small | and diffusion is a poor fit for language generation. That makes | for worse images but it says terrifying things about how much | DALL-E 2 has understood the semantic structure of dialog in | images, and how this is likely to change with scale. Normally I'd | expect the physical representation to precede semantic | understanding, not follow it! | | That said I reiterate that a degree of skepticism seems warranted | at this point. | trebligdivad wrote: | Is this finally a need for a xenolinguist? ___________________________________________________________________ (page generated 2022-05-31 23:00 UTC)